Rules February 24, 2025

Reducing False Positives Without Killing Signal

The most common reason engineers turn off their SAST tool is noise. Here's how to tune without losing the findings that matter.

Why false positives kill adoption faster than any other factor

A SAST tool that fires on every pull request with 40 findings — half of which the engineer knows immediately are irrelevant — doesn't produce 50% effective security coverage. It produces engineers who stop reading the findings. The tool stays in the pipeline because removing it requires a security team decision, but it functions as wall art: present, ignored, and providing a false sense of coverage.

This is the false positive problem stated plainly. False positives are not merely an inconvenience: above a certain threshold, they destroy the recall value of the tool entirely. The engineer who dismisses all 40 findings because half are noise is also dismissing the half that were real. You've lost precision and recall simultaneously.
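The arithmetic behind that claim can be sketched in a few lines. This is an illustrative model, not a measurement: read_fraction is an assumed stand-in for how much of the findings list engineers still read, and the 20/20 split reuses the 40-finding example above.

```python
def triage_outcome(real: int, noise: int, read_fraction: float) -> dict:
    """Toy model: precision of the report, and how many real findings get acted on."""
    total = real + noise
    return {
        "precision": real / total,
        # Assumption: engineers act only on the share of findings they still read.
        "real_findings_acted_on": real * read_fraction,
    }


# Same report, two levels of engineer attention.
attentive = triage_outcome(real=20, noise=20, read_fraction=1.0)
fatigued = triage_outcome(real=20, noise=20, read_fraction=0.0)
```

The report's precision is identical in both cases. What changes is the number of real findings anyone acts on, which is the quantity that matters for coverage.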

The threshold varies by team context. In a high-trust engineering culture where security is part of the review process, engineers may tolerate and triage 15-20% false positive rates if the tool is clearly surfacing important findings. In a team where security checks feel like compliance overhead, even a 10% false positive rate generates enough friction to erode engagement. None of this makes any particular false positive rate a target to aim for. The point is that the rate engineers will actually tolerate is lower than most SAST vendors acknowledge in their default configurations.

Where false positives actually come from

Understanding the generation mechanism matters because the fix is different depending on the cause. False positives in taint analysis typically arise from one of three sources.

Sanitizer blindness. The engine traces a dataflow path from a taint source to a sink and flags it as a potential injection or path traversal risk (CWE-89, CWE-79, CWE-22). But the actual code passes the value through a sanitization or validation function before it reaches the sink. If that sanitizer isn't in the engine's recognized sanitizer library (because it's an internal utility function, a framework wrapper, or a newer library version with a changed API signature), the engine doesn't know it exists and continues treating the value as tainted. The finding is a false positive: the flagged path can't be exploited because the sanitizer is in place.

Unreachable code paths. Dataflow analysis considers all possible execution paths through the code. Some of those paths are unreachable at runtime — dead code, conditions that can never be true given the type constraints on their inputs, functions that are called in exactly one place and that one callsite always passes a literal. The engine, being conservative, flags the path anyway. This is especially common in large codebases where utility functions have wide parameter types but narrow actual usage.

Rule scope mismatch. Generic rules written for "Python" or "JavaScript" fire on patterns that are idiomatic in one framework but not in another. A rule for CWE-79 that flags any string interpolation into an HTML template will fire on your template engine's auto-escape mechanism if the rule was written without knowing about that framework's escaping behavior. The vulnerability doesn't exist; the rule just doesn't have the context to know that.
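The three sources above can be sketched as small Python functions. Every name here (strip_to_alnum, build_query, _render, banner, render_comment) is hypothetical, and html.escape stands in for a template engine's auto-escape. Each function is safe as written, yet each has the shape a generic rule would plausibly flag.

```python
import html


def strip_to_alnum(value: str) -> str:
    """Hypothetical internal sanitizer that the engine has no model for."""
    return "".join(ch for ch in value if ch.isalnum())


def build_query(user_input: str) -> str:
    # 1. Sanitizer blindness: input reaches the query string, but only
    #    after strip_to_alnum() has removed every non-alphanumeric character.
    return "SELECT * FROM t_" + strip_to_alnum(user_input)


def _render(label: str) -> str:
    # 2. Unreachable path: the parameter accepts any str, so the engine
    #    assumes it may be tainted, but the only call site passes a literal.
    return "<b>" + label + "</b>"


def banner() -> str:
    return _render("Welcome")


def render_comment(comment: str) -> str:
    # 3. Rule scope mismatch: a generic "interpolation into HTML" rule fires
    #    here even though the value is escaped before it is interpolated.
    return "<p>{}</p>".format(html.escape(comment))
```

In each case the fix differs: model the sanitizer, narrow the rule to reachable call shapes, or teach the rule about the framework's escaping.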

Rule tuning: the right way and the wrong way

The instinctive response to false positives is to disable rules. This is the wrong way. Disabling a rule because it generates false positives also discards every real finding that rule would have caught. You've traded a precision problem for a recall problem.

The right approach is rule scoping. Most SAST frameworks that support YAML-based rule definitions allow you to specify path patterns, language constraints, and metavariable conditions that narrow a rule's applicability without eliminating it. A rule that fires on cursor.execute($QUERY) can be scoped to only flag that pattern when $QUERY was constructed using string concatenation rather than parameterized query syntax. That constraint eliminates the false positives from properly parameterized queries while retaining detection for the actual SQLi pattern.
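To make the scoping idea concrete, here is a minimal sketch using Python's standard ast module rather than any particular SAST engine. The function name flag_concatenated_execute and the sample snippet are assumptions for illustration. It only inspects the first argument inline, so a query concatenated on an earlier line and passed by variable would be missed; a real engine would follow the dataflow.

```python
import ast


def flag_concatenated_execute(source: str) -> list[int]:
    """Return line numbers of .execute() calls whose query is built dynamically."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match any <something>.execute(...) call.
        if not (isinstance(func, ast.Attribute) and func.attr == "execute"):
            continue
        if not node.args:
            continue
        query = node.args[0]
        # BinOp covers "..." + x and "..." % x; JoinedStr covers f-strings.
        # A plain string literal (parameterized style) is left alone.
        if isinstance(query, (ast.BinOp, ast.JoinedStr)):
            flagged.append(node.lineno)
    return sorted(flagged)


# The sample is only parsed, never executed.
SAMPLE = '''
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
cursor.execute("SELECT * FROM users WHERE id = " + user_id)
cursor.execute(f"SELECT * FROM users WHERE name = '{name}'")
'''

flagged_lines = flag_concatenated_execute(SAMPLE)
```

The parameterized call on line 2 of the sample is not reported, while the concatenated and f-string calls on lines 3 and 4 are. The rule still exists; it has simply stopped firing on the safe shape.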

For sanitizer blindness specifically, the fix is to register your internal sanitization functions in the rule's sanitizer list. In Semgrep's YAML rule syntax, taint-mode rules list sanitizers under the pattern-sanitizers key, while search-mode rules can exclude a known-safe call shape with pattern-not; other engines have their own equivalents. The rule continues to track taint from sources to sinks but excludes paths that pass through your registered sanitizers. This is not suppression. It's accurate modeling of your codebase's security controls.
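The effect of registering a sanitizer can be shown with a toy taint model. This is a deliberately simplified sketch, not how any real engine represents dataflow: TaintRule, its fields, and the function names in the example path (request.args.get, validate_input, cursor.execute) are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TaintRule:
    sources: set
    sinks: set
    sanitizers: set = field(default_factory=set)

    def is_finding(self, call_path: list) -> bool:
        """call_path is the ordered list of functions a value flows through."""
        tainted = False
        for fn in call_path:
            if fn in self.sources:
                tainted = True
            elif fn in self.sanitizers:
                tainted = False  # registered sanitizer clears the taint
            elif fn in self.sinks and tainted:
                return True
        return False


rule = TaintRule(sources={"request.args.get"}, sinks={"cursor.execute"})
sanitized_path = ["request.args.get", "validate_input", "cursor.execute"]
raw_path = ["request.args.get", "cursor.execute"]

before = rule.is_finding(sanitized_path)   # engine does not know validate_input
rule.sanitizers.add("validate_input")      # register the internal sanitizer
after = rule.is_finding(sanitized_path)
still_flagged = rule.is_finding(raw_path)  # unsanitized flow is unaffected
```

Registration removes the finding only for paths that actually pass through the sanitizer. The direct source-to-sink path is still reported, which is the difference between modeling and suppression.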

Consider a growing SaaS platform that integrated SAST into its Python monorepo and saw an initial false positive rate near 35%. The primary driver was a custom input validation decorator that wrapped all API endpoint handlers; the engine didn't recognize it as a sanitizer. After registering the decorator in the rule's sanitizer configuration and narrowing three generic string-handling rules to their actual framework's ORM patterns, the false positive rate dropped to around 8% over the course of a two-week tuning cycle. Real findings, including two SSRF patterns (CWE-918) in an image proxy endpoint, remained flagged.

Suppression files vs. baseline drift

Suppression is the mechanism for handling false positives that you can't eliminate through rule tuning — cases where a specific finding is genuinely not exploitable in your deployment context and you want to record that decision auditably. A suppression entry should include a justification string, the engineer who reviewed it, and a review date so it doesn't accumulate indefinitely without reconsideration.
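A suppression entry with those fields might look like the following sketch. The class and field names are assumptions, not the schema of any specific tool; the point is that the review date is data the tooling can act on.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Suppression:
    rule_id: str
    path: str
    justification: str   # why the finding is not exploitable here
    reviewed_by: str     # engineer who made the call
    next_review: date    # date on which the decision must be revisited

    def needs_review(self, today: date) -> bool:
        """True once the scheduled review date has been reached."""
        return today >= self.next_review


entry = Suppression(
    rule_id="generic-sqli",                 # hypothetical rule ID
    path="db/reports.py",                   # hypothetical file
    justification="Query input is a compile-time constant",
    reviewed_by="a.reviewer",
    next_review=date(2025, 6, 1),
)
```

A scheduled job can then list every entry for which needs_review returns True, so suppressions come back up for reconsideration instead of accumulating silently.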

The risk with suppression is baseline drift: the state where so many findings have been suppressed that the tool is no longer scanning meaningfully. Baseline drift is subtle — the finding count goes down, which looks like progress, but it's actually coverage erosion. A useful operational check is to periodically audit suppressions against the actual code they reference. If the code that caused a suppressed finding has been significantly refactored, the suppression may no longer be accurate — in either direction.
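One way to automate that audit, sketched under the assumption that each suppression stores a hash of the code it was reviewed against. The function names and suppression IDs are hypothetical. A content hash is a crude proxy for "significantly refactored": it treats any non-whitespace change as a reason to look again.

```python
import hashlib


def fingerprint(snippet: str) -> str:
    # Collapse whitespace so re-indentation alone does not trigger a review.
    normalised = " ".join(snippet.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def stale_suppressions(recorded: dict, current_code: dict) -> list:
    """recorded: suppression ID -> fingerprint at review time.
    current_code: suppression ID -> the code at that location today."""
    return sorted(
        sid
        for sid, fp in recorded.items()
        if fingerprint(current_code.get(sid, "")) != fp
    )


recorded = {
    "SUP-1": fingerprint("query = build(q)\ncursor.execute(query)"),
    "SUP-2": fingerprint("open(base + name)"),
}
current = {
    "SUP-1": "query = build(q)\n    cursor.execute(query)",  # re-indented only
    "SUP-2": "open(safe_join(base, name))",                  # logic changed
}
stale = stale_suppressions(recorded, current)
```

Only SUP-2 is returned here. Whether the change made the suppressed code safer or riskier is still a human judgment; the check just decides which suppressions deserve that judgment.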

Many SAST tools can export results in the SARIF format, where each result may carry a suppressions array whose entries have kind (inSource or external), justification, and status fields. Storing this in source control means suppression decisions are subject to the same review process as code changes: a staff engineer can't silently suppress a finding without a PR that makes the decision visible.
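As a rough illustration, the sketch below builds a minimal SARIF-shaped log and reads the suppressions back out. The field names follow the SARIF 2.1.0 schema as we understand it (kind, status, justification), so check them against your tool's actual output. The tool name, rule ID, and justification text are invented.

```python
import json

# Hypothetical log; real exports contain many more fields.
sarif_log = {
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "example-scanner"}},
        "results": [{
            "ruleId": "python-sqli-string-concat",
            "message": {"text": "Query built by string concatenation"},
            "suppressions": [{
                "kind": "external",
                "status": "accepted",
                "justification": "Input is a server-side constant",
            }],
        }],
    }],
}


def accepted_suppressions(log: dict) -> list:
    """Collect (ruleId, justification) for every suppression marked accepted."""
    found = []
    for run in log.get("runs", []):
        for result in run.get("results", []):
            for supp in result.get("suppressions", []):
                if supp.get("status") == "accepted":
                    found.append((result.get("ruleId"), supp.get("justification")))
    return found


# Round-trip through JSON, as the file would be stored in source control.
restored = json.loads(json.dumps(sarif_log))
accepted = accepted_suppressions(restored)
```

Because the log is plain JSON, a diff in a pull request shows exactly which finding was suppressed and the stated reason.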

Team feedback loops as a tuning mechanism

Rule tuning performed by the security team in isolation from the engineers writing the code will always lag behind the codebase. The most accurate signal about whether a finding is a false positive comes from the engineer who just wrote the code in question — they know whether the sanitization logic is in place, whether the path is reachable, whether the library version in use has the vulnerability the rule assumes.

A structured feedback loop means that when an engineer marks a finding as a false positive, that action generates an audit record that the security team reviews periodically. Not to approve or reject individual suppression decisions — that would be a bottleneck — but to identify patterns. If the same rule generates 12 "not applicable" dismissals in a month, all from the same code path, that's a tuning signal: the rule needs a scope constraint or a sanitizer registration, not 12 individual suppressions.
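The pattern detection described here is a simple aggregation. The sketch below assumes each dismissal is recorded as a dictionary with rule_id and path keys, which is an invented record shape, and uses the article's figure of 12 dismissals as the default threshold.

```python
from collections import Counter


def tuning_candidates(dismissals: list, threshold: int = 12) -> list:
    """Return (rule_id, path) pairs dismissed at least `threshold` times."""
    counts = Counter((d["rule_id"], d["path"]) for d in dismissals)
    return sorted(key for key, n in counts.items() if n >= threshold)


# One month of hypothetical "not applicable" dismissals.
monthly_log = (
    [{"rule_id": "generic-xss", "path": "web/templates.py"}] * 12
    + [{"rule_id": "generic-sqli", "path": "db/reports.py"}] * 2
)
candidates = tuning_candidates(monthly_log)
```

The output is a short list of rule and path pairs for the security team to tune, rather than a queue of individual dismissals to approve.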

This is the distinction between suppression as a coping mechanism and suppression as a data collection system. When suppression data feeds back into rule quality, false positive rates tend to fall over time. When suppression is just a way to make findings go away, they accumulate and the tool's effective coverage decays.

What precision actually looks like in practice

A mature SAST deployment — one that's been through a tuning cycle — looks different from the default out-of-the-box configuration. Rule counts are typically lower: not because rules have been disabled, but because generic rules have been replaced with framework-specific rules that are more precise. Suppression files exist but are audited quarterly. False positive rates sit in the 8-15% range across the scanned language set, with higher rates for dynamic languages (Python, Ruby, JavaScript) and lower for statically typed ones (Go, Java, Rust).

Importantly, recall hasn't been sacrificed to achieve this. The real findings that matter — injection vulnerabilities, insecure deserialization, hardcoded credentials, path traversal — still fire. The difference is that engineers see a smaller, higher-confidence set of findings per PR, which means the findings that do appear receive genuine attention rather than reflexive dismissal.

The precision/recall tradeoff in SAST isn't fixed. It's a function of how well the rule set models your actual codebase. The default configuration is calibrated for the average codebase in its language, not yours. Tuning it to your codebase is engineering work, but it's bounded work — a realistic tuning cycle runs two to four weeks for a medium-complexity codebase — and the compound return is a tool that engineers actually use.