Code Review January 29, 2025

Three Code Review Automation Pitfalls (and How to Avoid Them)

Automation should reduce cognitive load on reviewers. When it adds to it instead, you've hit one of three common implementation mistakes.

The promise and the failure mode

Adding automated checks to code review is supposed to reduce the cognitive load on human reviewers — offload the mechanical work (did anyone forget to sanitize this input? does this dependency have a known CVE?) so that the human's attention goes toward the judgment-requiring work (is this the right abstraction? does this error handling actually cover the failure modes?).

When automation achieves that, the review process gets faster and more thorough simultaneously. When it fails — and it fails in predictable, recurring ways — it adds cognitive load instead of removing it. Reviewers develop reflexes for ignoring automated comments. Security findings get dismissed unread. The tool stays in the pipeline but functions as noise generation rather than signal delivery.

Three failure modes account for the majority of cases we've observed where code review automation made the review process worse rather than better. None of them require unusual circumstances to trigger; they appear in standard configurations of standard tools. Understanding them is the prerequisite to avoiding them.

Pitfall one: comment volume without triage

A SAST scanner configured to run on every push and post its full output as PR comments will, on a modestly complex codebase, generate 20-50 comments on the first scan. Some of those are real findings. Some are false positives from rules that don't model the codebase's sanitization patterns correctly. Some are informational notices that don't require any action. And they all look the same in the PR comment thread.

The reviewer's experience is a wall of automated comments to read, classify, and respond to before they can get to the comments written by humans. In practice, they don't do that. They scroll past them. They approve the PR anyway. And the automated tool achieves exactly the opposite of its purpose: by trying to show everything, it communicates nothing.

The design fix here is severity-gated posting. Not every finding belongs in the PR comment thread. HIGH severity findings — confirmed taint paths to injection sinks, hardcoded credentials (which map to CWE-798), insecure deserialization — deserve a blocking PR comment. MEDIUM findings warrant a non-blocking annotation in the diff view. LOW and informational findings belong in a dashboard that engineers can consult, not in the review thread where they compete for attention with the actual code review.

The threshold for what constitutes a blocking comment should be calibrated to the false positive rate of the rules that produce it. A HIGH severity rule with a 30% false positive rate shouldn't block a merge, because it will train reviewers to dismiss HIGH findings. A HIGH severity rule with a 5% false positive rate probably should. The severity label and the merge-blocking behavior are separate configuration decisions that need to be made together.
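A minimal sketch of severity-gated routing, assuming each finding carries a false positive rate measured for its rule on your own codebase. The names here (Finding, Channel, route) are hypothetical and do not come from any particular tool. The 0.15 default cutoff is an illustrative value between the 5% and 30% examples above, and demoting a noisy HIGH rule to a non-blocking annotation (rather than sending it to the dashboard) is an assumption about one reasonable policy.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    BLOCKING_COMMENT = "blocking_comment"  # blocks the merge
    DIFF_ANNOTATION = "diff_annotation"    # non-blocking, inline in the diff
    DASHBOARD = "dashboard"                # kept out of the review thread


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: str        # "HIGH", "MEDIUM", "LOW", or "INFO"
    rule_fp_rate: float  # measured false positive rate for this rule, 0.0 to 1.0


def route(finding: Finding, max_blocking_fp_rate: float = 0.15) -> Channel:
    """Decide where a finding surfaces.

    Severity and merge-blocking are separate inputs: a HIGH finding only
    blocks when its rule is precise enough to be trusted.
    """
    if finding.severity == "HIGH":
        if finding.rule_fp_rate < max_blocking_fp_rate:
            return Channel.BLOCKING_COMMENT
        # Assumption: a noisy HIGH rule is demoted instead of blocking, so it
        # does not train reviewers to dismiss HIGH findings.
        return Channel.DIFF_ANNOTATION
    if finding.severity == "MEDIUM":
        return Channel.DIFF_ANNOTATION
    # LOW and informational findings stay in the dashboard.
    return Channel.DASHBOARD
```

With this shape, tightening or loosening the blocking behavior is a one-parameter change that never touches the severity labels themselves.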

Pitfall two: findings without context

A PR comment that says GRCD-0089: Potential SQL injection at line 47 is structurally correct and operationally useless. The engineer reading it needs to know: what's the taint source? What's the sink? Is there a sanitizer in the path that the tool missed? What's the fix?

Without that context, the engineer has two options: spend 10 minutes investigating the finding themselves, or mark it as "won't fix" and move on. Most choose the second option, not because they're careless, but because a PR review is already a time-constrained task and an unexplained warning isn't a specific enough signal to justify investigation time.

Effective automated security comments include the dataflow trace: source → path → sink. They show which variable is tainted, where it enters the function, and what function call it reaches untransformed. They provide a concrete remediation suggestion — "use parameterized queries" with a code example — not a generic "sanitize your inputs" instruction. And they link to the rule definition so an engineer who wants to understand the detection logic can read it.
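As a rough illustration of what such a comment could carry, the sketch below renders one finding as one comment containing the trace, a concrete fix, and a rule link. Everything in it is hypothetical: the Finding and TraceStep structures, the file paths, the variable names, and the example.com rule URL are invented for illustration. The placeholder style in the suggested fix (%s) is the one used by some Python database drivers; others use a different marker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceStep:
    location: str     # file and line, e.g. "handlers/user.py:41"
    description: str  # what happens to the tainted value at this step


@dataclass(frozen=True)
class Finding:
    rule_id: str
    title: str
    tainted_variable: str
    source: TraceStep
    path: tuple[TraceStep, ...]
    sink: TraceStep
    remediation: str
    fix_example: str
    rule_url: str


def render_comment(f: Finding) -> str:
    """Render a single finding as a single comment: trace, fix, rule link."""
    lines = [
        f"{f.rule_id}: {f.title}",
        "",
        f"Tainted variable: {f.tainted_variable}",
        "Dataflow:",
    ]
    steps = [("source", f.source), *(("path", s) for s in f.path), ("sink", f.sink)]
    for role, step in steps:
        lines.append(f"  [{role}] {step.location}: {step.description}")
    lines += [
        "",
        f"Suggested fix: {f.remediation}",
        f"  {f.fix_example}",
        "",
        f"Rule definition: {f.rule_url}",
    ]
    return "\n".join(lines)


# Hypothetical finding; paths, names, and URL are made up for the example.
example = Finding(
    rule_id="GRCD-0089",
    title="Potential SQL injection",
    tainted_variable="user_id",
    source=TraceStep("handlers/user.py:41", "read from the request query string"),
    path=(TraceStep("handlers/user.py:44", "passed to build_query() unchanged"),),
    sink=TraceStep("handlers/user.py:47", "concatenated into the SQL passed to cursor.execute()"),
    remediation="use a parameterized query",
    fix_example='cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))',
    rule_url="https://example.com/rules/GRCD-0089",
)

if __name__ == "__main__":
    print(render_comment(example))
```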

This is a design requirement for the tool, not just good practice. In our experience, a review comment that provides a dataflow trace gets acted on at a meaningfully higher rate than one that provides just a rule ID and severity. The additional context doesn't increase the comment volume (it's still one comment per finding), but it changes the action rate because the finding is now understandable without external research.

Pitfall three: automation that replaces rather than augments human judgment

The third failure mode is subtler and in some ways more damaging than the first two. It happens when the presence of automated security checks creates an implicit expectation that automation handles security, and human reviewers stop applying security judgment of their own.

Consider a team that introduces SAST and properly configures it to block on HIGH findings. Within a few months, reviewers' mental model of "this PR has no security issues" becomes "this PR has no HIGH SAST findings." These are not the same thing. SAST has bounded recall — it can't find authorization logic flaws, business logic vulnerabilities, or design-level issues like storing session tokens in localStorage (a CWE-922 concern) because those require understanding what the application is supposed to do, not just what the code literally does.

We're not saying automation should be held back to preserve human engagement. We're saying that the framing of automation matters. A SAST tool should be introduced as "this catches the class of issues that follow patterns we can describe statically — injection, path traversal, insecure deserialization" rather than "this handles security review." The human reviewer's job is the complement: architectural review, authorization modeling, threat modeling against the feature's actual attack surface.

Practically, this means the review checklist — if the team has one — should still include security items that automation doesn't cover. Authentication boundary changes, privilege escalation paths, new external data inputs, changes to cryptographic operations. These are engineer judgment items. The automated tool tells you about the code patterns; the engineer tells you about the design.

A design pattern that avoids all three

The common thread in all three pitfalls is configuration that treats the tool as a broadcast mechanism rather than a triage mechanism. The tool has findings to report; it reports all of them; the reviewer is left to sort them out. This puts the human/automation boundary in the wrong place: the mechanical sorting lands on the human.

A better design: the tool has findings; the tool triages them; only the findings that meet the "worth the reviewer's attention right now" threshold surface in the review interface. Everything else is available in a dashboard for audit and trend analysis, but it doesn't appear in the PR thread.

For SAST specifically, this means: post inline PR comments only for HIGH severity findings from rules with a false positive rate below 15% (measured against your codebase, not the tool vendor's benchmark). MEDIUM severity findings appear as non-blocking annotations in the diff view. Everything else exports to SARIF and appears in the security dashboard. Engineers can pull up the full finding list from the dashboard if they want to triage more comprehensively, but the default PR review experience is uncluttered.

This approach requires more upfront configuration than the default out-of-the-box setup. It requires knowing your false positive rate per rule category, which requires running the tool for a few weeks and tracking dismissal behavior. That's real work. But the alternative — deploying the tool in its default configuration and watching engineers learn to ignore it — means doing all that work plus the cost of rebuilding trust in a tool that spent weeks generating noise. Starting quiet and becoming more verbose as confidence in rule quality increases is operationally easier than starting loud and trying to dial it back.
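One way to turn tracked dismissals into a per-rule false positive rate is sketched below. It assumes each triaged finding is logged with its rule ID and whether the engineer dismissed it specifically as a false positive; the Triage record and the min_samples cutoff of 20 are hypothetical choices, not features of any tool. A dismissal is only a proxy for a false positive, so the resulting numbers are estimates.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Triage(NamedTuple):
    rule_id: str
    dismissed_as_false_positive: bool


def false_positive_rates(
    records: Iterable[Triage], min_samples: int = 20
) -> dict[str, float]:
    """Per-rule share of triaged findings dismissed as false positives.

    Rules with fewer than min_samples triaged findings are omitted, since a
    rate computed from a handful of findings is too noisy to gate merges on.
    """
    totals: dict[str, int] = defaultdict(int)
    dismissed: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.rule_id] += 1
        if record.dismissed_as_false_positive:
            dismissed[record.rule_id] += 1
    return {
        rule: dismissed[rule] / count
        for rule, count in totals.items()
        if count >= min_samples
    }
```

Rules that are missing from the result have not accumulated enough triage history yet; keeping them non-blocking until they do is consistent with starting quiet.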

On the GitHub Checks API and review integration

One technical detail worth flagging: how findings surface in the PR interface depends substantially on whether the tool uses the GitHub Checks API (or the equivalent inline mechanism for GitLab merge requests) to post findings as diff annotations rather than as standalone PR comments.

Diff annotations — which appear inline in the diff view at the exact changed line — are less disruptive than top-level PR comments because they're scoped to the changed code and collapsed by default. A reviewer can expand them selectively. Top-level PR comments, by contrast, appear at the same level as human reviewer comments and create the wall-of-text problem directly.

Tools that post to the Checks API and use the annotations array in the check run result can place findings inline without cluttering the conversation thread. This is a concrete usability difference that affects whether engineers read findings or skip them. It's worth verifying in the tool evaluation process, not just assuming from the marketing description.
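For orientation, here is a sketch of the request body a tool might build for GitHub's create-check-run endpoint (POST /repos/{owner}/{repo}/check-runs). The field names follow the GitHub REST API; the Finding structure, the check name "sast-scan", and the severity-to-level mapping are assumptions made for illustration. The sketch only builds the payload and makes no network call. Whether a failing check actually blocks a merge depends on the repository's branch protection settings.

```python
import json
from dataclasses import dataclass

# GitHub accepts at most 50 annotations per request; more need follow-up updates.
MAX_ANNOTATIONS_PER_REQUEST = 50

# Assumed mapping from tool severity to GitHub annotation levels.
_LEVELS = {"HIGH": "failure", "MEDIUM": "warning"}  # anything else: "notice"


@dataclass(frozen=True)
class Finding:
    path: str      # file path relative to the repository root
    line: int      # line in the changed file
    severity: str  # "HIGH", "MEDIUM", "LOW", or "INFO"
    rule_id: str
    message: str


def build_check_run(head_sha: str, findings: list[Finding]) -> dict:
    """Build a check-run payload that places findings inline in the diff."""
    annotations = [
        {
            "path": f.path,
            "start_line": f.line,
            "end_line": f.line,
            "annotation_level": _LEVELS.get(f.severity, "notice"),
            "title": f.rule_id,
            "message": f.message,
        }
        for f in findings[:MAX_ANNOTATIONS_PER_REQUEST]
    ]
    has_high = any(f.severity == "HIGH" for f in findings)
    return {
        "name": "sast-scan",  # hypothetical check name
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "failure" if has_high else "neutral",
        "output": {
            "title": "SAST findings",
            "summary": f"{len(findings)} finding(s), {len(annotations)} annotated inline.",
            "annotations": annotations,
        },
    }


if __name__ == "__main__":
    demo = [Finding("app/db.py", 47, "HIGH", "GRCD-0089", "Potential SQL injection")]
    print(json.dumps(build_check_run("0123abc", demo), indent=2))
```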