AppSec November 11, 2024

AppSec Metrics That Engineering Teams Actually Track

Not every security metric tells you something useful. The ones that survive in practice tend to connect directly to review behavior, not to finding counts.

The metric proliferation problem

Security programs generate an enormous amount of data. Finding counts, severity distributions, time-to-remediation by severity tier, MTTR by team, false positive rates by rule, scanning coverage by repository, open vulnerability counts by CWE category. Most of this data is available from the tools without any instrumentation effort — it comes out of SARIF exports, dashboard APIs, ticket integrations.

The problem isn't data scarcity. It's that most of these metrics become vanity metrics once they are stripped of context. A declining count of open HIGH findings could mean the security program is working, or it could mean engineers are suppressing findings more aggressively to hit a metric target. A rising false positive rate could signal poor rule tuning, or it could signal that new code uses novel patterns that existing rules flag too conservatively. The number alone doesn't tell you which.

This article isn't about measuring everything available. It's about the small set of metrics that actually correlate with whether the security program is working — the ones that engineering teams with functioning AppSec programs tend to look at in practice, versus the ones that appear in compliance reports but don't drive behavior.

Mean time to detection (MTTD) at the PR stage

MTTD is typically framed as the time between a vulnerability being introduced and being detected. In a SAST-integrated PR workflow, the relevant question is tighter: how long after a vulnerable commit is pushed does a finding appear in the review interface?

For PR-stage SAST, MTTD should be measured in minutes — the time from push to finding appearance in the PR. If that number is above 10-15 minutes, engineers are waiting for scan results and the review process is blocked. If it's under 5 minutes, findings arrive before reviewers have typically started reading the code, which keeps the feedback loop tight.

MTTD at the PR stage is worth tracking because it degrades subtly. As a codebase grows, scan times increase unless incremental scanning (scanning only changed files rather than the full repository) is configured correctly. A scan that took 3 minutes on a 50k LOC codebase may take 18 minutes on a 400k LOC codebase if the incremental scan configuration hasn't been revisited. Watching MTTD trend over time catches this degradation before it materially impacts review behavior.
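As a rough illustration of the measurement, the sketch below computes push-to-finding latency in minutes and buckets the median against the thresholds discussed above. The event tuples, the function names, and the choice of 15 minutes as the upper cutoff are assumptions made for this example; real timestamps would come from your CI system and your PR comment history.

```python
from datetime import datetime
from statistics import median

def mttd_minutes(events):
    """Return push-to-finding latency in minutes for each (pushed_at, found_at) pair."""
    return [(found - pushed).total_seconds() / 60 for pushed, found in events]

def classify_mttd(minutes, fast=5.0, slow=15.0):
    """Bucket a latency value; the cutoffs are illustrative defaults, not a standard."""
    if minutes < fast:
        return "tight"      # findings arrive before reviewers start reading
    if minutes > slow:
        return "blocking"   # reviewers are waiting on the scan
    return "watch"

# Hypothetical sample: (push timestamp, first finding comment timestamp)
events = [
    (datetime(2024, 11, 4, 9, 0), datetime(2024, 11, 4, 9, 3)),
    (datetime(2024, 11, 5, 14, 10), datetime(2024, 11, 5, 14, 14)),
    (datetime(2024, 11, 6, 11, 30), datetime(2024, 11, 6, 11, 48)),
]

latencies = mttd_minutes(events)
median_latency = median(latencies)
print(latencies, median_latency, classify_mttd(median_latency))
```

Computing the same median per week or per month, and comparing it across periods, is one simple way to see the slow degradation described above.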

False positive rate per rule category

The aggregate false positive rate — total dismissed findings divided by total findings — is a useful headline metric, but it masks important signal. False positive rates vary significantly by rule category and by language. A taint analysis rule for CWE-89 in Go might have a 4% false positive rate. The same class of rule for CWE-79 in JavaScript might run at 22% because dynamic typing forces the engine to be more conservative about what's actually reachable.

Tracking false positive rate by rule category (and ideally by language) tells you where tuning effort has the highest return. A category with a 25% false positive rate and 40 findings per month puts 10 false findings per month into the PR review stream. Even at one minute of triage per finding, that is at least 10 reviewer-minutes wasted per month, and the compounding cost in reviewer trust is harder to quantify but more significant.
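A minimal sketch of the per-category breakdown, assuming each triaged finding has been exported as a record with hypothetical `category`, `language`, and `disposition` fields (your tool's field names will differ):

```python
from collections import defaultdict

def fp_rate_by_category(findings):
    """Return {(category, language): false positive rate} for triaged findings."""
    totals = defaultdict(int)
    false_pos = defaultdict(int)
    for f in findings:
        key = (f["category"], f["language"])
        totals[key] += 1
        if f["disposition"] == "false_positive":
            false_pos[key] += 1
    return {key: false_pos[key] / totals[key] for key in totals}

# Hypothetical month of triaged findings
findings = (
    [{"category": "CWE-79", "language": "javascript", "disposition": "false_positive"}] * 10
    + [{"category": "CWE-79", "language": "javascript", "disposition": "true_positive"}] * 30
    + [{"category": "CWE-89", "language": "go", "disposition": "false_positive"}] * 1
    + [{"category": "CWE-89", "language": "go", "disposition": "true_positive"}] * 24
)

rates = fp_rate_by_category(findings)
# Sort worst-first to see where tuning effort is likely to pay off most
worst_first = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
for (category, language), rate in worst_first:
    print(f"{category} / {language}: {rate:.0%}")
```

The aggregate rate for this sample is 11 of 65, roughly 17%, which hides the fact that nearly all of the noise comes from one category.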

In pilot deployments, teams that track this metric at category granularity identify their top-three highest-false-positive rule categories within the first 60 days of operation. Tuning those three categories — typically by registering internal sanitizers, narrowing pattern scope to specific framework APIs, or splitting a broad generic rule into framework-specific variants — produces disproportionate precision improvements relative to the total rule count touched.

Fix rate at review time vs. post-merge

This metric is the most direct indicator of whether shift-left security is actually functioning. A finding that appears in a PR comment and gets fixed before merge costs almost nothing extra to remediate: the engineer is already working in that code, and the fix is made in context. A finding that merges unaddressed and gets remediated two weeks later, when the security team reports it, requires context switching back to code that has already shipped, a hotfix process, a re-deploy, and potentially a disclosure if the vulnerability was HIGH severity.

Fix rate at review time is defined as: (findings fixed in the PR that introduced them) / (total actionable findings introduced over the same period). "Actionable" means findings that are neither suppressed nor dismissed as false positives. The target for a well-functioning program is above 70%. Below 50% suggests that findings are not being addressed during review, whether because they're not visible enough, because the review culture doesn't treat security findings as blocking, or because the false positive rate is high enough that engineers have stopped trusting the findings.
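The definition above can be sketched as follows. The record fields (`suppressed`, `false_positive`, `fixed_in_introducing_pr`) and the helper names are hypothetical, and the band between 50% and 70% is labeled "watch" here only as an assumption, since the definition does not characterize it.

```python
def fix_rate_at_review(findings):
    """Share of actionable findings fixed in the PR that introduced them."""
    actionable = [
        f for f in findings
        if not f["suppressed"] and not f["false_positive"]
    ]
    if not actionable:
        return None  # no actionable findings in the period
    fixed = sum(1 for f in actionable if f["fixed_in_introducing_pr"])
    return fixed / len(actionable)

def interpret(rate, target=0.70, floor=0.50):
    """Map a rate onto the thresholds from the text."""
    if rate is None:
        return "no_data"
    if rate > target:
        return "on_target"
    if rate < floor:
        return "investigate"
    return "watch"

def finding(fixed, suppressed=False, false_positive=False):
    """Build a hypothetical finding record for the example."""
    return {
        "fixed_in_introducing_pr": fixed,
        "suppressed": suppressed,
        "false_positive": false_positive,
    }

# Hypothetical period: 8 actionable findings (6 fixed in-PR), 2 excluded
sample = (
    [finding(True)] * 6
    + [finding(False)] * 2
    + [finding(False, suppressed=True), finding(False, false_positive=True)]
)

rate = fix_rate_at_review(sample)
print(rate, interpret(rate))
```

Excluding suppressed and false positive findings from the denominator matters: counting them would push the rate down for reasons that have nothing to do with review behavior.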

This metric requires integration between your SAST tool and your version control system to track which findings were in an open PR versus merged. SARIF exports can record the commit SHA of the scanned revision (for example, in a run's versionControlProvenance), which allows post-processing to classify findings by lifecycle stage, provided the scanner populates it. Some tools expose this directly in their API; others require constructing it from SARIF data and git history.

Suppression growth rate

A suppression file that grows without bound is a coverage erosion signal. Suppressions are a legitimate tool — some findings are genuinely not exploitable in a specific deployment context and recording that decision auditably is the right approach. But a suppression file that adds 15-20 new entries per month without corresponding rule tuning is more likely a symptom of unaddressed false positives being routed through suppression rather than through rule improvement.

Suppression growth rate should be tracked alongside the false positive rate. If the false positive rate is stable and the suppression file is growing, real findings may be getting suppressed, so review the suppression justification strings to understand whether the decisions are sound. If both are growing together, the suppression growth is likely absorbing false positives that should be addressed through rule tuning.
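The two cases above can be expressed as a small decision helper. This is only a sketch: the function name, the returned labels, and the one-percentage-point tolerance used to decide that the false positive rate is "stable" are all assumptions chosen for illustration.

```python
def suppression_signal(fp_rate_prev, fp_rate_now,
                       suppressions_prev, suppressions_now,
                       fp_tolerance=0.01):
    """Compare two periods and suggest which follow-up the trend points to."""
    if suppressions_now <= suppressions_prev:
        return "no_growth"
    fp_delta = fp_rate_now - fp_rate_prev
    if fp_delta > fp_tolerance:
        # Both growing: suppressions are likely absorbing false positives
        return "tune_rules"
    if abs(fp_delta) <= fp_tolerance:
        # Stable FP rate but more suppressions: check the justifications
        return "review_justifications"
    # FP rate is falling while suppressions grow; not covered by the text
    return "fp_rate_falling"

# Hypothetical month-over-month comparisons
print(suppression_signal(0.10, 0.10, 100, 118))  # stable FP rate, file growing
print(suppression_signal(0.10, 0.18, 100, 118))  # both growing
```

The helper only points at a next step. Whether a given suppression is sound still has to be judged by reading its justification.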

We're not suggesting suppression is bad. We're saying suppression growth rate without accompanying rule tuning is a warning sign that the tuning work isn't keeping pace with the codebase's evolution.

OWASP category coverage depth vs. breadth

A common metric in AppSec dashboards is OWASP Top 10 coverage: what percentage of the top 10 categories does the tool have rules for? This metric is nearly useless for operational decision-making because it's a breadth measure, not a depth measure.

A more useful metric is coverage depth per category. For CWE-89 (SQL injection), how many of the language-framework combinations active in your codebase have specific rules? Python/Django? Python/SQLAlchemy raw? Python/psycopg2 direct cursor? JavaScript/pg? JavaScript/mysql2? If three of those five combinations have rules and two don't, your CWE-89 coverage is 60% in practice, not 100%, regardless of what the "OWASP Top 10 covered" checkbox shows.
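The 60% figure falls out of a simple set comparison. In the sketch below, the stack labels mirror the five combinations listed above, but which of them have rules is invented for the example, and the function name is hypothetical.

```python
def coverage_depth(active_stacks, stacks_with_rules):
    """Return (fraction of active stacks with a rule, list of uncovered stacks)."""
    gaps = [s for s in active_stacks if s not in stacks_with_rules]
    covered = len(active_stacks) - len(gaps)
    return covered / len(active_stacks), gaps

# Language-framework combinations active in the codebase for CWE-89
active = [
    "python/django",
    "python/sqlalchemy-raw",
    "python/psycopg2",
    "javascript/pg",
    "javascript/mysql2",
]

# Assumption for illustration: only these three have specific rules
with_rules = {"python/django", "python/sqlalchemy-raw", "javascript/pg"}

depth, gaps = coverage_depth(active, with_rules)
print(f"CWE-89 depth: {depth:.0%}, gaps: {gaps}")
```

The hard part in practice is maintaining the `active` list, since it has to be updated whenever a service adopts a new driver or framework.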

This metric is harder to generate automatically because it requires mapping your active language-framework combinations to the rule set. But it's the only framing that tells you where actual gaps exist. A growing SaaS backend that added a new data service in Go/pgx will have zero coverage for SQLi in that service until a rule is added — even if the dashboard shows "CWE-89: covered."

What not to track: finding counts as a headline metric

Finding counts — open HIGH findings, total findings resolved this month, findings per 1,000 lines of code — are the metrics most commonly requested in compliance contexts and the least useful for operational decision-making. They measure scanner output, not security outcomes.

A team that disables three noisy rules will show a dramatic drop in finding counts with no improvement in security. A team that adds a new service with 20,000 lines of Python will show a spike in finding counts with no degradation in security. The count moves for reasons unrelated to whether the security program is functioning.

If leadership or compliance requires headline metrics, time-to-fix by severity tier (derived from ticket creation to resolution timestamps) is a more defensible proxy than raw finding counts. It at least reflects the remediation cadence, which is a behavior rather than a tool configuration artifact. But the operational metrics — MTTD, fix-at-review-time rate, false positive rate per category — are the ones that drive week-to-week decisions about where to invest tuning and rule development effort.
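For the headline proxy, a minimal sketch of time-to-fix by severity tier, assuming tickets have been exported as records with hypothetical `severity`, `created_at`, and `resolved_at` fields. It reports the median rather than the mean so that a single long-lived ticket does not dominate the figure; that choice is ours, not a requirement.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def time_to_fix_days(tickets):
    """Return {severity: median days from ticket creation to resolution}."""
    by_severity = defaultdict(list)
    for t in tickets:
        if t["resolved_at"] is None:
            continue  # still open; excluded from time-to-fix
        elapsed = t["resolved_at"] - t["created_at"]
        by_severity[t["severity"]].append(elapsed.total_seconds() / 86400)
    return {sev: median(days) for sev, days in by_severity.items()}

# Hypothetical ticket export
tickets = [
    {"severity": "HIGH", "created_at": datetime(2024, 11, 1), "resolved_at": datetime(2024, 11, 3)},
    {"severity": "HIGH", "created_at": datetime(2024, 11, 2), "resolved_at": datetime(2024, 11, 6)},
    {"severity": "MEDIUM", "created_at": datetime(2024, 11, 1), "resolved_at": datetime(2024, 11, 11)},
    {"severity": "MEDIUM", "created_at": datetime(2024, 11, 5), "resolved_at": None},
]

ttf = time_to_fix_days(tickets)
print(ttf)
```

Open tickets are skipped here, so a report built on this should also state how many tickets per tier are still unresolved.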