Why AI Code Review Still Produces False Positives — and What to Do About It

Ask any developer who has worked with a SAST tool for more than a few weeks, and you'll hear some version of the same story: the scanner fired on 600 things, nobody looked at any of them, and a real vulnerability shipped anyway. The problem is not that static analysis is wrong. The problem is that it is right about too many things that do not actually matter — and in doing so, it trains developers to ignore it entirely.

We've seen this pattern across dozens of early-stage teams. The tool gets installed, runs in CI, produces a wall of findings, and within a month the developers have stopped reading the output. That is not a tooling problem. That is a signal-to-noise problem.

Why Traditional SAST Produces So Much Noise

Classic static analysis works by pattern-matching source code against a rule set. If your code matches a pattern associated with, say, SQL injection or an unsafe deserialization path, the scanner flags it. The problem is that most of these rules fire on the syntactic shape of code, not on whether a vulnerable path is actually reachable from an attacker-controlled input at runtime.

Consider a simple example: a developer writes a string concatenation into a database query. The scanner flags it as a potential SQL injection — CWE-89. But if the string value comes from an internal configuration file read at startup, and the application never exposes that value to user input, there is no meaningful attack surface. The finding is technically correct and practically useless.
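
To make the distinction concrete, here is a minimal sketch of a shape-based rule, written in Python purely for illustration. The rule, the sample functions `from_config` and `from_request`, and the `AUDIT_TABLE` constant are all hypothetical; real scanners use far richer rule sets. The point is only that a rule keyed on syntax cannot tell the two call sites apart.

```python
import ast


def flag_concatenated_queries(source: str) -> list[int]:
    """Return line numbers where .execute() receives a concatenated string.

    This looks only at the syntactic shape of the call. It knows nothing
    about where the concatenated values come from.
    """
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
            and isinstance(node.args[0], ast.BinOp)
            and isinstance(node.args[0].op, ast.Add)
        ):
            flagged.append(node.lineno)
    return sorted(flagged)


# Hypothetical code under review. AUDIT_TABLE stands in for a value read
# from configuration at startup; `name` stands in for a request parameter.
SAMPLE = '''
def from_config(conn):
    conn.execute("SELECT COUNT(*) FROM " + AUDIT_TABLE)

def from_request(conn, name):
    conn.execute("SELECT id FROM users WHERE name = '" + name + "'")
'''

print(flag_concatenated_queries(SAMPLE))
```

Both calls are flagged, even though only the second one concatenates a value an attacker could plausibly control.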

In our experience building on top of AST-level analysis, the overwhelming majority of false positives fall into three buckets:

Most scanners catch none of these distinctions. They emit findings. Volume is the output.

What AI Tools Often Inherit

You might assume that adding AI to a code review tool would solve this. Partially right. The first generation of AI-enhanced SAST tools used language models to rerank findings or generate better explanations — but they still drew their finding list from the same underlying static analysis engine. They made the noise prettier. They did not remove the noise.

The deeper problem is that an LLM trained on general code does not, by default, understand the specific conventions your team uses. It does not know that your team always validates user input at the controller layer before it reaches service code. It does not know that your internal `safe_query()` wrapper parameterizes everything by design. So it still flags the same patterns that a pattern-matching scanner would flag — it just explains them in cleaner English.
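
As an illustration of the kind of convention a general-purpose model cannot see, here is a sketch of what a `safe_query()` wrapper might look like. The signature, the `orders` table, and the `find_orders` helper are assumptions made for this example, not a description of any particular team's code.

```python
import sqlite3
from typing import Any, Sequence


def safe_query(conn: sqlite3.Connection, sql: str, params: Sequence[Any] = ()) -> list:
    """Hypothetical team wrapper: values always travel as bound parameters."""
    return conn.execute(sql, tuple(params)).fetchall()


def find_orders(conn: sqlite3.Connection, customer: str) -> list:
    # The SQL text is built by concatenation, so a shape-based rule (or a
    # model without team context) may still flag this line. Only a fixed
    # column list is concatenated; the caller-supplied value is bound via "?".
    columns = ", ".join(["id", "total"])
    sql = "SELECT " + columns + " FROM orders WHERE customer = ?"
    return safe_query(conn, sql, (customer,))


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, total) VALUES (?, ?)",
        [("alice", 10.0), ("bob", 25.5)],
    )
    return conn


conn = make_demo_db()
print(find_orders(conn, "alice"))
# A classic injection payload is treated as a plain string and matches nothing.
print(find_orders(conn, "' OR '1'='1"))
```

A reviewer who knows the wrapper sees a safe call. A tool that only sees concatenation near a query does not.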

We tested this directly by feeding 50 known false positives from a mid-size TypeScript codebase into three AI-assisted code review tools. All three still surfaced 40 or more of those findings, which means each tool kept at least 80% of the noise. The explanations were better. The false-positive rate was not.

Context-Aware Ranking Changes the Equation

The approach that actually moves the needle combines two things: call-graph reachability analysis and data-flow tracing.

Call-graph reachability answers the question: is this vulnerable code path reachable from an entry point in the application? An entry point is anything that accepts external input — an HTTP handler, a message queue consumer, a WebSocket callback. If the vulnerable code path is not reachable from any entry point, the finding drops in priority immediately. It might still get surfaced as a hygiene note, but it does not get treated as an urgent security issue.
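
The reachability check can be sketched as a breadth-first walk over a call graph. The graph, the function names, and the "urgent" and "hygiene" labels below are invented for illustration; a production tool would build the graph from the codebase and handle dynamic dispatch, which this sketch ignores.

```python
from collections import deque

# Hypothetical call graph: each function maps to the functions it calls.
CALL_GRAPH = {
    "http_get_user": ["load_user"],
    "load_user": ["run_query"],
    "legacy_export": ["build_legacy_sql"],  # nothing calls legacy_export
}

# Hypothetical entry points: functions that accept external input.
ENTRY_POINTS = ["http_get_user"]


def reachable_from(entry_points, call_graph):
    """Return every function reachable from any entry point (BFS)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def prioritize(finding_locations, entry_points, call_graph):
    """Label each finding by whether its function is reachable."""
    reachable = reachable_from(entry_points, call_graph)
    return {
        fn: "urgent" if fn in reachable else "hygiene"
        for fn in finding_locations
    }


print(prioritize(["run_query", "build_legacy_sql"], ENTRY_POINTS, CALL_GRAPH))
```

Here the finding in `run_query` stays urgent because an HTTP handler leads to it, while the one in `build_legacy_sql` is demoted to a hygiene note.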

Data-flow tracing follows the actual movement of values through the code. Starting from a user-controlled source (a request parameter, a form field, a query string value), a taint analysis engine tracks every path that value can take through the application, looking for whether it reaches a dangerous sink (a SQL query, a shell exec, a deserialization call) without being sanitized or constrained along the way. If the taint trace shows that a value gets sanitized before it reaches the sink, the finding is suppressed.
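
A heavily simplified sketch of the idea follows. It assumes straight-line code with no branches, loops, or function calls, and it models a program as a list of hypothetical `Step` records rather than real source. Actual taint engines work on control-flow graphs and are considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str           # "source", "assign", "sanitize", or "sink"
    target: str         # variable written, or the sink's name for "sink"
    inputs: tuple = ()  # variables read by this step


def tainted_sinks(steps):
    """Return the sinks that receive a tainted value, in program order."""
    tainted = set()
    hits = []
    for step in steps:
        carries_taint = any(var in tainted for var in step.inputs)
        if step.kind == "source":
            tainted.add(step.target)       # user-controlled value enters
        elif step.kind == "sanitize":
            tainted.discard(step.target)   # sanitizer output is clean
        elif step.kind == "assign":
            if carries_taint:
                tainted.add(step.target)   # taint propagates to the target
            else:
                tainted.discard(step.target)
        elif step.kind == "sink" and carries_taint:
            hits.append(step.target)       # tainted value reached a sink
    return hits


# Request parameter flows straight into the query.
UNSANITIZED = [
    Step("source", "name"),
    Step("assign", "query", ("name",)),
    Step("sink", "db.execute", ("query",)),
]

# Same flow, but the value passes through a sanitizer first.
SANITIZED = [
    Step("source", "name"),
    Step("sanitize", "clean_name", ("name",)),
    Step("assign", "query", ("clean_name",)),
    Step("sink", "db.execute", ("query",)),
]

print(tainted_sinks(UNSANITIZED))
print(tainted_sinks(SANITIZED))
```

The first trace reports the sink. The second reports nothing, which is the case where the finding would be suppressed.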

Neither of these is new. The novelty in modern tools is doing both passes inside a PR review cycle — fast enough to block a merge — and then using historical reviewer behavior to calibrate future findings. When a developer dismisses a finding as a false positive, that signal feeds back into the ranking model. Within a few sprints, the tool learns which patterns your team intentionally accepts and pre-suppresses them in future scans.

The Learning Loop Is the Differentiator

Here is the part most vendors underemphasize: the initial false-positive rate matters less than the rate of improvement. A tool that starts at 60% false positives and drops to 12% within 90 days is more valuable than one that starts at 40% and stays there.
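
A rough back-of-the-envelope calculation shows why. The sketch below assumes a linear decline from the starting rate to the final rate over 90 days, a flat rate afterwards, and 10 findings per day. Both the shape of the curve and the daily volume are made-up assumptions chosen only to make the comparison concrete.

```python
def false_positive_rate(day: int, start: float, end: float, ramp_days: int = 90) -> float:
    """Assumed linear decline from `start` to `end`, flat after `ramp_days`."""
    if day >= ramp_days:
        return end
    return start + (end - start) * day / ramp_days


def total_false_positives(days: int, findings_per_day: int, start: float, end: float) -> float:
    """Sum the expected false positives over the given number of days."""
    return sum(
        findings_per_day * false_positive_rate(day, start, end)
        for day in range(days)
    )


# Tool A: starts at 60% and improves to 12%. Tool B: stays at 40%.
improving = total_false_positives(180, 10, start=0.60, end=0.12)
static = total_false_positives(180, 10, start=0.40, end=0.40)
print(round(improving, 1), round(static, 1))
```

Under these assumptions the improving tool generates noticeably fewer false positives over six months, and the gap keeps widening for as long as both tools stay in use.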

The mechanism is team-scoped fine-tuning. Each accept or dismiss action from a reviewer is a labeled training signal: "This pattern in this codebase, in this context, was not a real issue." After 200 to 400 such signals, typically achievable within three to six weeks of daily use, the model has learned your team's specific idioms and starts suppressing them proactively.
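
The feedback loop can be illustrated with something much simpler than model fine-tuning. The sketch below swaps in plain per-pattern dismissal counting as a stand-in. The class name, the pattern identifiers, and the thresholds are all hypothetical.

```python
from collections import defaultdict


class DismissalTracker:
    """Stand-in for a learned ranking model: counts reviewer decisions per pattern."""

    def __init__(self, min_signals: int = 5, suppress_above: float = 0.9):
        self.min_signals = min_signals        # decisions needed before acting
        self.suppress_above = suppress_above  # dismissal share that triggers suppression
        self._dismissed = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, pattern: str, dismissed: bool) -> None:
        """Store one reviewer decision as a labeled signal."""
        self._total[pattern] += 1
        if dismissed:
            self._dismissed[pattern] += 1

    def should_suppress(self, pattern: str) -> bool:
        """Suppress only when enough reviewers have consistently dismissed it."""
        total = self._total.get(pattern, 0)
        if total < self.min_signals:
            return False
        return self._dismissed.get(pattern, 0) / total >= self.suppress_above


tracker = DismissalTracker()
for _ in range(5):
    tracker.record("concat-into-safe_query", dismissed=True)
for dismissed in (True, True, True, True, False):
    tracker.record("concat-into-raw-execute", dismissed=dismissed)

print(tracker.should_suppress("concat-into-safe_query"))
print(tracker.should_suppress("concat-into-raw-execute"))
```

The consistently dismissed pattern gets suppressed, while the one a reviewer accepted even once stays visible. A pattern with no recorded decisions is never suppressed, which is also why the loop stalls when nobody engages with findings.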

The important caveat: this only works if developers are actually engaging with the findings. If false-positive rates are so high that developers stop looking, you lose the signal source. This is the failure mode we most commonly see in teams that come to us frustrated with existing tools. The tool produced noise, the team disengaged, the tool never learned, and the noise persisted indefinitely.

What This Means for Your Triage Workflow

Teams we work with typically arrive with a SAST workflow that looks like this: scanner runs in CI, output gets emailed to a security engineer, security engineer triages findings every two weeks, finds three real issues buried under 80 false positives, and files Jira tickets that take another two weeks to get prioritized.

The goal is not to eliminate that workflow — it is to compress it into the PR review cycle where the developer already has the code loaded in their head. An inline PR comment that says "this data-flow trace shows user input reaching this SQL concatenation without parameterization — here's a suggested patch" costs a developer 30 seconds to evaluate and fix. The same finding discovered two weeks later, in a Jira ticket, with no original context available, costs an hour.

The key metric to track is not false-positive rate in isolation. It is developer engagement rate with findings — what percentage of findings are actually read, evaluated, and acted on. That number tells you whether your security signal is reaching your engineering team or disappearing into noise.

A Practical Starting Point

If your team is currently running a SAST tool with a high false-positive rate, the first step is not to replace the tool. It is to instrument the engagement rate. Find out how many findings are being dismissed without review versus dismissed after review versus genuinely fixed. That breakdown tells you where the signal is leaking.
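
A minimal sketch of that breakdown, assuming each finding has been exported with one of three outcome labels. The label names are hypothetical, and counting "dismissed after review" as engagement is our assumption about how to read the metric.

```python
from collections import Counter

# Hypothetical outcome labels; adapt them to whatever your tool exports.
OUTCOMES = ("dismissed_without_review", "dismissed_after_review", "fixed")


def engagement_breakdown(outcomes):
    """Return each outcome's share of findings plus an overall engagement rate."""
    counts = Counter(outcomes)
    total = sum(counts[name] for name in OUTCOMES)
    if total == 0:
        shares = {name: 0.0 for name in OUTCOMES}
    else:
        shares = {name: counts[name] / total for name in OUTCOMES}
    # Assumption: a finding counts as engaged if someone reviewed it,
    # whether the result was a dismissal or a fix.
    shares["engagement_rate"] = shares["dismissed_after_review"] + shares["fixed"]
    return shares


sprint = (
    ["dismissed_without_review"] * 8
    + ["dismissed_after_review"]
    + ["fixed"]
)
print(engagement_breakdown(sprint))
```

In this made-up sprint, eight of ten findings were dismissed unread, which is the pattern that points to a trust problem rather than a rules problem.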

If 80% of findings are dismissed without review, you have a noise problem and no amount of improved tooling will fix it without first rebuilding developer trust. Start by manually triaging one sprint's worth of findings yourself, suppressing the obvious false positives, and presenting only the real issues to the team. That restores the trust baseline that automated learning requires to work.

The false-positive problem in SAST is solvable. It just requires treating it as a calibration problem, not a rules problem. The rules are often correct. What they are missing is context — and context is exactly what code-review-integrated, team-scoped analysis is built to provide.
