Git is a distributed version control system used everywhere. Under the hood, it tracks the entire history of the repository as objects: blobs for file contents, trees for the directory structure, and commits for snapshot information. A blob holds the raw contents of a file; it is zlib compressed and stored under a hash derived from its contents (SHA-1 by default, with SHA-256 available as an alternative object format). To save space, many objects are compressed together into a single file called a pack. Objects that are no longer referenced by any other object are called dangling.
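To make the blob naming concrete, here is a minimal sketch of how an object ID is derived for a blob in a default (SHA-1) repository: git hashes a small header followed by the file contents, and the loose object on disk is the zlib-compressed form of those same bytes. The function names are my own, purely for illustration.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw file bytes.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """Return the zlib-compressed bytes a loose blob object would contain."""
    header = b"blob %d\0" % len(content)
    return zlib.compress(header + content)


if __name__ == "__main__":
    # Matches `echo 'hello world' | git hash-object --stdin`.
    print(git_blob_id(b"hello world\n"))
```

Because the ID depends only on the contents, two identical files anywhere in the history share a single blob.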
Each commit represents a snapshot of the repository at a point in time. It stores a reference to a tree object, pointers to its parent commits, and metadata such as the author and commit message.
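As a rough illustration of what a commit holds, the sketch below parses the decompressed text of a commit object into its tree, parents, and metadata. The `parse_commit` helper and the sample hashes are made up for this example; only the header names (`tree`, `parent`, `author`, `committer`) come from git's commit format.

```python
def parse_commit(payload: str) -> dict:
    """Split a commit object's text into its headers and message."""
    # Headers and message are separated by the first blank line.
    headers, _, message = payload.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            # Merge commits carry more than one parent line.
            commit["parents"].append(value)
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


# Hypothetical commit text; the hashes are placeholders, not real objects.
SAMPLE = (
    "tree " + "a" * 40 + "\n"
    "parent " + "b" * 40 + "\n"
    "author Jane Doe <jane@example.com> 1700000000 +0000\n"
    "committer Jane Doe <jane@example.com> 1700000000 +0000\n"
    "\n"
    "Remove config file\n"
)

if __name__ == "__main__":
    print(parse_commit(SAMPLE))
```

A root commit simply has no `parent` lines, so its parent list stays empty.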
When a file is removed via git rm, it can still be accessed because the existing history is not rewritten. The data of a commit stays in the .git/objects folder for as long as that commit is part of the history. Additionally, the pack files can contain objects that are no longer reachable by normal means.
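To show why removed content remains readable, here is a small sketch of how a loose object is laid out under .git/objects: the first two hex characters of the ID form a directory, the rest form the file name, and the file is the zlib-compressed header plus body. The example writes into a temporary directory that only imitates an object store, and the helper names are hypothetical.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def write_loose_blob(objects_dir: Path, content: bytes) -> str:
    """Store content as a loose blob, imitating git's on-disk layout."""
    raw = b"blob %d\0" % len(content) + content
    oid = hashlib.sha1(raw).hexdigest()
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return oid


def read_loose_object(objects_dir: Path, oid: str) -> Tuple[str, bytes]:
    """Read a loose object back and return (type, body)."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, body = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("object size does not match its header")
    return obj_type.decode(), body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        objects = Path(tmp) / ".git" / "objects"
        oid = write_loose_blob(objects, b"API_KEY=example\n")
        # Nothing in the working tree refers to this blob, yet it is readable.
        print(oid, read_loose_object(objects, oid))
```

Reading an object this way never consults the working tree, which is why deleting a file there does not touch the stored copy.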
The author wanted to target all deleted and dangling objects by walking every commit and comparing it with its parent commits. If a file had been deleted or left dangling, they dumped it to disk. From there, they ran TruffleHog to check the dumped files for secrets. TruffleHog supports over 800 different secret formats! It also has an option to report only verified secrets, meaning ones it has checked are still valid.
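The parent-versus-child comparison can be sketched on a toy, in-memory history. This is not the author's tool: the commit IDs, paths, and blob IDs below are invented, and a real run would read them from git before dumping each recovered blob to disk for scanning.

```python
from typing import Dict, List, Tuple

# Hypothetical history: commit id -> parent ids.
PARENTS: Dict[str, List[str]] = {"c1": [], "c2": ["c1"], "c3": ["c2"]}

# Hypothetical snapshots: commit id -> {path: blob id}.
SNAPSHOTS: Dict[str, Dict[str, str]] = {
    "c1": {"app.py": "b1"},
    "c2": {"app.py": "b1", ".env": "b2"},   # .env committed by mistake
    "c3": {"app.py": "b3"},                 # .env removed, app.py edited
}


def deleted_blobs(parents: Dict[str, List[str]],
                  snapshots: Dict[str, Dict[str, str]]
                  ) -> List[Tuple[str, str, str]]:
    """Return (commit, path, blob) for files a commit removed from a parent."""
    found = []
    for commit, parent_ids in parents.items():
        child_files = snapshots[commit]
        for parent in parent_ids:
            for path, blob in snapshots[parent].items():
                # Present in the parent but gone in the child: a deletion.
                if path not in child_files:
                    found.append((commit, path, blob))
    return found


if __name__ == "__main__":
    for commit, path, blob in deleted_blobs(PARENTS, SNAPSHOTS):
        print(f"{path} (blob {blob}) was deleted in {commit}")
```

Note that an edited file such as app.py is not reported, since its path still exists in the child; only the removed .env blob is flagged as a candidate to dump and scan.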
My main question, which they cover, is why not just use TruffleHog from the beginning? The answer is that it will often skip .pack files if they are too big. By unpacking these files ourselves and dumping the objects with the mechanism from above, TruffleHog can do its magic like normal.
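As a small aside on pack files, the sketch below reads the 12-byte header that every pack starts with: the signature `PACK`, a version number, and the number of objects inside, all in network byte order. It only inspects the header; the unpacking itself would typically be left to git's own plumbing (for example `git unpack-objects`, which reads a pack from standard input). The helper name is my own.

```python
import struct
from typing import Tuple


def read_pack_header(data: bytes) -> Tuple[int, int]:
    """Return (version, object_count) from the start of a pack file."""
    if len(data) < 12:
        raise ValueError("pack header is 12 bytes long")
    # 4-byte signature, then two big-endian 32-bit integers.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a git pack file")
    return version, count


if __name__ == "__main__":
    # A synthetic header for illustration: version 2, 1234 objects.
    fake = struct.pack(">4sII", b"PACK", 2, 1234)
    print(read_pack_header(fake))
```

Checking the object count up front gives a cheap sense of how much work unpacking a given pack will be.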
They scanned a crazy number of projects doing this. They built the list of target organizations from various GitHub repos that list names, from GitHub search, and directly from repos with over 5000 stars. All in all, they made $64K off of this research. This goes to show that novel research pays. There were also a large number of false positives; in particular, dummy users for testing and canaries were very common.
Why does this happen so much? The author claims that many developers just don't understand how git works with regard to deleting files. Additionally, bad .gitignore files that let .env and binary files get committed were common as well. Overall, great research!