Problem
Today the only published validation is Semgrep parity (`tests/semgrep_parity.rs`): "foxguard finds what Semgrep finds on this fixed corpus." That is a correctness check against another tool, not a measurement of precision against real code. There is no public FP rate, no labeled corpus, and no per-rule numbers.
Users evaluating the scanner need to know: for rule X, out of N findings on real code, how many are true positives?
Proposed approach
- Labeled corpus. Assemble a small set of real OSS repos (pinned by SHA). For each rule that fires, label every finding TP / FP / unsure with a one-line justification, and store the labels as JSON alongside the corpus (a possible record format is sketched after this list).
- Methodology doc. Write `docs/false-positive-methodology.md` explaining corpus selection, labeling criteria, and how to reproduce the numbers.
- Per-rule precision table. Generate a `rule_id | findings | TP | FP | precision` table from the labeled data, publish it on the docs site, and link it from the README.
- Regression harness. Re-run labeling (or at minimum re-check per-rule finding counts) in CI so precision doesn't silently regress when rules change (see the count-drift check sketched below).
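A minimal sketch of what the label records and the precision rollup could look like, assuming serde/serde_json are available and a hypothetical `benchmarks/precision/labels.json` with one record per finding. The field names (`rule_id`, `repo`, `sha`, `path`, `line`, `label`, `note`) are illustrative, not a committed format.

```rust
use std::collections::BTreeMap;
use std::fs;

use serde::Deserialize;

#[derive(Deserialize)]
#[serde(rename_all = "lowercase")]
enum Label {
    Tp,
    Fp,
    Unsure,
}

// Fields beyond rule_id/label are kept to document the proposed schema.
#[allow(dead_code)]
#[derive(Deserialize)]
struct LabeledFinding {
    rule_id: String,
    repo: String, // OSS repo the finding came from
    sha: String,  // pinned commit the finding was labeled against
    path: String,
    line: u32,
    label: Label,
    note: String, // one-line justification for the label
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = fs::read_to_string("benchmarks/precision/labels.json")?;
    let findings: Vec<LabeledFinding> = serde_json::from_str(&data)?;

    // rule_id -> (total findings, TP, FP); "unsure" counts toward total only.
    let mut per_rule: BTreeMap<String, (u32, u32, u32)> = BTreeMap::new();
    for f in &findings {
        let entry = per_rule.entry(f.rule_id.clone()).or_default();
        entry.0 += 1;
        match f.label {
            Label::Tp => entry.1 += 1,
            Label::Fp => entry.2 += 1,
            Label::Unsure => {}
        }
    }

    // Emit the per-rule table as markdown for the docs site.
    println!("| rule_id | findings | TP | FP | precision |");
    println!("|---------|----------|----|----|-----------|");
    for (rule, (total, tp, fp)) in &per_rule {
        // Precision over definitively labeled findings only: TP / (TP + FP).
        let precision = if tp + fp > 0 {
            *tp as f64 / (tp + fp) as f64
        } else {
            f64::NAN
        };
        println!("| {rule} | {total} | {tp} | {fp} | {precision:.2} |");
    }
    Ok(())
}
```

Whether "unsure" findings should be excluded from the denominator (as above) or counted against precision is a methodology decision that belongs in `docs/false-positive-methodology.md`.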
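For the regression harness, one cheap option is a count-drift check rather than full re-labeling on every PR. The sketch below assumes a prior CI step has already scanned the pinned corpus and written its output to a hypothetical `target/precision-findings.json` containing a JSON array of findings with a `rule_id` field; the actual invocation and output format are still to be decided.

```rust
use std::collections::BTreeMap;
use std::fs;

use serde::Deserialize;

// Only rule_id is needed for the drift check; other fields are ignored.
#[derive(Deserialize)]
struct Finding {
    rule_id: String,
}

fn counts(path: &str) -> BTreeMap<String, usize> {
    let data = fs::read_to_string(path).expect("missing findings file");
    let findings: Vec<Finding> = serde_json::from_str(&data).expect("bad findings JSON");
    let mut map = BTreeMap::new();
    for f in findings {
        *map.entry(f.rule_id).or_insert(0) += 1;
    }
    map
}

#[test]
fn precision_counts_do_not_drift() {
    let baseline = counts("benchmarks/precision/labels.json");
    let current = counts("target/precision-findings.json");

    let mut drifted: Vec<String> = Vec::new();
    // Rules whose count changed need their new findings (re)labeled before merge.
    for (rule, n) in &baseline {
        if current.get(rule) != Some(n) {
            drifted.push(rule.clone());
        }
    }
    // Rules that fire now but were never labeled also need attention.
    for rule in current.keys() {
        if !baseline.contains_key(rule) {
            drifted.push(rule.clone());
        }
    }
    assert!(drifted.is_empty(), "per-rule finding counts drifted: {drifted:?}");
}
```

This keeps CI fast (no human labeling in the loop) while forcing a labeling pass whenever a rule change alters what fires on the pinned corpus.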
Non-goals
- Recall measurement (needs ground-truth vuln datasets; separate effort).
- Benchmarking against Semgrep/CodeQL on precision — we can't publish their rules' numbers.
Acceptance
- First version of the labeled corpus committed under `benchmarks/precision/`.
- Methodology doc merged.
- Per-rule precision table published for at least the top 20 most-triggered rules.