Security research involves long hours of staring at code and has traditionally been done by a specialized group of people. With the rise of LLMs comes the ability to use AI tools to find vulnerabilities. The authors built a bot that thinks the way security engineers do:
- Identify suspicious behaviour
- Prove reachability. Can execution actually reach the suspicious code?
- Prove controllability. Can the attacker influence the relevant data/state?
- Determine real-world impact
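The four steps above can be sketched as a gated pipeline, where a candidate finding must pass every gate in order or be dropped. This is a minimal illustration, not the authors' implementation; the `Finding` structure and evidence keys are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    description: str
    evidence: dict = field(default_factory=dict)

# Each gate must pass before the next is attempted; one failure
# anywhere means the candidate bug is never reported.
GATES = [
    ("suspicious",   lambda f: f.evidence.get("suspicious_pattern", False)),
    ("reachable",    lambda f: f.evidence.get("call_path_found", False)),
    ("controllable", lambda f: f.evidence.get("attacker_input_flows", False)),
    ("impactful",    lambda f: f.evidence.get("impact_assessed", False)),
]

def triage(finding: Finding) -> str:
    for name, check in GATES:
        if not check(finding):
            return f"rejected at gate: {name}"
    return "confirmed"
```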
If any of these steps goes wrong, the bug won't be found, because this is long-form reasoning with compounding errors. Intuitive reasoning works locally but breaks down globally: precision decays as the chains get longer. The key insight is that you need checkpoints that enforce correctness, not just more tokens.
Instead of reaching for better prompts, they created harnesses: sets of constraints, scaffolding, and checks that force an agent to be systematic in its approach. A harness makes the agent:
- Generate hypotheses explicitly.
- Collect evidence before escalating confidence.
- Use deterministic tools when possible.
- Fail fast and prune dead ends.
- Produce artifacts a reviewer can trust.
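The harness steps above can be sketched as a loop in which confidence only rises when a deterministic check supplies evidence, dead ends are pruned early, and the evidence trail itself is the reviewable artifact. All names here are illustrative assumptions, not the authors' code.

```python
def run_harness(hypotheses, checks, evidence_threshold=2):
    """Return surviving hypotheses together with their evidence trail."""
    survivors = []
    for hyp in hypotheses:
        evidence = []
        for check in checks:
            result = check(hyp)
            if result is None:        # check contradicts the hypothesis
                evidence = None
                break                 # fail fast: prune this dead end
            evidence.append(result)
        # Confidence escalates only when enough evidence accumulated.
        if evidence is not None and len(evidence) >= evidence_threshold:
            survivors.append({"hypothesis": hyp, "evidence": evidence})
    return survivors                  # the artifact a reviewer can audit
```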
The post includes a great graph that explains this reasoning. One curve shows confidence decaying exponentially with reasoning length: the longer the chain, the worse it does. The other curve is a shark tooth: at each verifiable subtask, confidence is regained. After this, they share some good insights into what has worked for them.
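A toy model makes the graph concrete: per-step accuracy compounds multiplicatively over an unchecked chain, while a checkpoint every few steps resets confidence, producing the shark-tooth shape. The numbers are illustrative, not from the post.

```python
def chain_confidence(p: float, steps: int) -> float:
    """Confidence after an unchecked chain: per-step accuracy p compounds."""
    return p ** steps

def checkpointed_confidence(p: float, steps: int, every: int) -> float:
    """Confidence when a verifiable checkpoint resets certainty every
    `every` steps; only the steps since the last checkpoint compound."""
    since_checkpoint = steps % every
    return p ** (since_checkpoint if since_checkpoint else every)
```

With p = 0.9, a 10-step unchecked chain is right barely a third of the time, while checkpointing every 3 steps keeps confidence near the single-step level.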
First, use deterministic tools when possible. Using CodeQL to find sinks beats asking an LLM to spot them, because the query is deterministic and the LLM only has to drive CodeQL. Another point is that native tools work best with their home model: Claude Code, for instance, works best with Opus.
Scanners still fail in multiple ways, from multi-step flow identification to boundary issues. The authors say they push static analysis as far as it will go and then rely on agentic reasoning to bridge the gap, using LLMs only when necessary and keeping everything else deterministic.
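The "deterministic first, agent second" split can be sketched with a plain AST walk that finds candidate sinks mechanically, escalating to an LLM (stubbed out here) only for flows the static pass cannot resolve. The sink list and helper names are illustrative assumptions, not the authors' tooling.

```python
import ast

# Deterministic pass: locate calls to a few known-dangerous functions.
DANGEROUS_NAMES = {"eval", "exec", "system"}

def find_sinks(source: str):
    """Walk the AST and return (name, line) for each dangerous call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
            if name in DANGEROUS_NAMES:
                hits.append((name, node.lineno))
    return hits

def ask_agent_to_bridge(flow):
    """Placeholder for the agentic step: called only when the
    deterministic pass cannot tell whether a flow is exploitable."""
    raise NotImplementedError("escalate to the LLM only when needed")
```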
When reviewing code, not all lines pose equal threat. Some repos/components only need shallow checks, while others need deep investigation. By concentrating spend on the difficult and promising areas, costs stay lower and more bugs get found.
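Risk-weighted budgeting like this can be sketched as a simple proportional split: each component gets review spend in proportion to its risk score, so shallow-check targets cost almost nothing. The scores and names are made-up examples, not the authors' scheme.

```python
def allocate(components, budget):
    """Split a review budget proportionally to each component's risk score.

    components: list of (name, risk_score) pairs; budget: total spend.
    """
    total = sum(score for _, score in components)
    return {name: budget * score / total for name, score in components}
```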
The final major benefit is testing. If the code has a bug, this should be provable: run the simulation, execute the PoC, and check whether the expected outcome occurred. This removes most false positives and improves confidence in an issue. Not all tests are created equal, though; there's a major difference between an isolated unit test and a full simulation.
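PoC-based confirmation can be sketched as running the exploit in a subprocess and accepting the finding only when the observed output matches the prediction. The command and expected marker are illustrative assumptions; a real harness would also sandbox the run.

```python
import subprocess

def confirm(poc_cmd, expected_marker, timeout=30):
    """Run a proof-of-concept command and return True only when it
    reproduces the predicted outcome; no proof means no confidence bump."""
    try:
        result = subprocess.run(
            poc_cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return expected_marker in result.stdout
```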
This bot recently found a maximum-payout critical worth $250K on Immunefi. No word yet on what the bug is, but it's very interesting, and their profile lists other findings as well.