Statistical analysis is used all the time in computer science for solving hard problems. In particular, machine learning has boomed lately. Sometimes, though, simple statistical analysis can solve a hard problem without the insanity of LLMs. This post covers one of those cases.
n-gram statistical analysis is common in linguistics. Simply put, it takes groups of n consecutive tokens, such as words, and measures how likely each group is to occur. Based on this, it's possible to predict the next word in a sentence by picking the one that most often follows the preceding words.
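To make the idea concrete, here is a minimal sketch of next-word prediction with 3-grams. The corpus and the function names (build_trigram_model, predict_next) are made up for illustration and are not from the original post.

```python
from collections import Counter, defaultdict


def build_trigram_model(tokens):
    """Map each pair of consecutive tokens to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)][c] += 1
    return model


def predict_next(model, a, b):
    """Return the most frequent follower of (a, b), or None if the pair was never seen."""
    followers = model.get((a, b))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat and the cat sat on the sofa and the cat ran".split()
model = build_trigram_model(corpus)
print(predict_next(model, "the", "cat"))  # "sat" follows "the cat" twice, "ran" once
```

The same counting works on any token stream, which is what makes the jump from words to instructions possible.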
The author has chosen to use this technique for binary analysis on machine code. From testing, they figured out that 3-grams work well without overfitting. I'm guessing they tried several different values of n before settling on this. Previous work has shown that n-grams can both identify anomalies in code and find patterns that help reverse engineer unknown ISAs.
To do this analysis, the author lifted the binary into a Binary Ninja intermediate language. Additionally, they removed registers and memory addresses to make the instructions more general. With this in place, they analyzed a large number of binaries to get a ground truth of what normal code looks like. Now, they can start analyzing new binaries to look for anomalies!
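The pipeline above can be sketched roughly as follows. This is only an approximation of the idea under my own assumptions: the instruction strings are a made-up textual format rather than real Binary Ninja IL output, the regexes treat any hex literal as an address, and the scoring (the fraction of 3-grams never seen in the baseline) is a simple stand-in, not necessarily what the author used.

```python
import re
from collections import Counter

# Hypothetical patterns for the made-up instruction format in this sketch.
# Any hex literal is treated as an address, which is a simplification.
ADDRESS = re.compile(r"0x[0-9a-fA-F]+")
REGISTER = re.compile(r"\b(?:r\d+|[re]?[abcd]x|[re]?[sd]i|[re]?[sb]p)\b")


def normalize(instruction):
    """Replace concrete addresses and registers with generic placeholders."""
    instruction = ADDRESS.sub("ADDR", instruction)
    return REGISTER.sub("REG", instruction)


def trigrams(instructions):
    """Return all runs of three consecutive normalized instructions."""
    tokens = [normalize(i) for i in instructions]
    return list(zip(tokens, tokens[1:], tokens[2:]))


def build_baseline(functions):
    """Count 3-grams across a set of known, ordinary functions."""
    counts = Counter()
    for function in functions:
        counts.update(trigrams(function))
    return counts


def anomaly_score(baseline, function):
    """Fraction of the function's 3-grams that never occur in the baseline."""
    grams = trigrams(function)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in baseline)
    return unseen / len(grams)


# Toy data, purely illustrative.
known = [
    ["rax = [0x1000]", "rax = rax + rbx", "[0x2000] = rax", "return rax"],
    ["rcx = [0x3000]", "rcx = rcx + rdx", "[0x4000] = rcx", "return rcx"],
]
baseline = build_baseline(known)

similar = ["rsi = [0x5000]", "rsi = rsi + rdi", "[0x6000] = rsi", "return rsi"]
odd = ["rax = rax ^ rbx", "rax = rax & rcx", "rax = rax ^ rdx", "rax = rax | rbx"]
print(anomaly_score(baseline, similar), anomaly_score(baseline, odd))
```

Because registers and addresses are stripped, the function using rsi and rdi matches the baseline exactly, while the run of bitwise operations stands out. A real baseline would need far more code, and probably a frequency threshold instead of a plain seen/unseen check.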
While looking into malware, they were able to identify control-flow flattening obfuscation. Every function flagged by the heuristic was either obfuscated or a helper function managing the obfuscated state. In the Windows kernel, they analyzed the Warbird virtual machine. By spotting an unusual pattern of code in the assembly, they were able to find the obfuscated VM handlers.
They also analyzed a mobile DRM system that plays encrypted multimedia content. Using the technique, they were able to identify areas where arithmetic was obfuscated with Mixed Boolean-Arithmetic (MBA), as well as uses of hardware encryption. This was enough to demonstrate they were looking in the proper area.
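For anyone who hasn't run into MBA before, the idea is to rewrite a simple operation as an equivalent mix of arithmetic and bitwise operators. The sketch below uses a well-known identity for addition as an illustration; it is not taken from the DRM code in the post, and the function names are my own.

```python
def plain_add(x, y):
    """Ordinary 8-bit addition."""
    return (x + y) & 0xFF


def mba_add(x, y):
    """The same addition, rewritten with the identity x + y == (x ^ y) + 2 * (x & y)."""
    return ((x ^ y) + 2 * (x & y)) & 0xFF


# Exhaustively confirm both versions agree for every pair of 8-bit inputs.
assert all(
    plain_add(x, y) == mba_add(x, y) for x in range(256) for y in range(256)
)
```

Real obfuscators stack many of these rewrites on top of each other, which produces long runs of bitwise and arithmetic instructions that ordinary compiled code rarely contains. That kind of unusual instruction mix is what a 3-gram baseline is good at noticing.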
Stats don't lie! Statistics is useful for many things, including binary analysis. Great post on using techniques from other disciplines in the realm of security.