Benchmark Results
We benchmarked 10 models on real repositories to understand which ones actually find real issues vs. generate noise. Total spend: $2.13.
Methodology
- Repos: python-dotenv (34 files, ~52K tokens) and noxaudit (88 files, ~126K tokens)
- Focus: All 7 areas (security, docs, patterns, testing, hygiene, dependencies, performance)
- Method: Batch API on all providers (50% discount), 1 run per model per repo
- Quality validation: Cross-model consensus — issues found by 4+ models are considered "real"
Scorecard
| Model | dotenv | noxaudit | Total | Cost | $/finding |
|---|
| gpt-5-nano | 4 | 6 | 10 | $0.01 | $0.001 |
| gpt-5-mini | 15 | 24 | 39 | $0.03 | $0.001 |
| gemini-2.5-flash | 18 | 16 | 34 | $0.07 | $0.002 |
| gemini-3-flash-preview | 8 | 10 | 18 | $0.10 | $0.005 |
| claude-haiku-4-5 | 24 | 15 | 39 | $0.11 | $0.003 |
| o4-mini | 8 | 6 | 14 | $0.20 | $0.014 |
| gpt-5.4 | 32 | 52 | 84 | $0.26 | $0.003 |
| gemini-2.5-pro | 17 | 21 | 38 | $0.33 | $0.009 |
| claude-sonnet-4-6 | 30 | 48 | 78 | $0.38 | $0.005 |
| claude-opus-4-6 | 40 | 51 | 91 | $0.65 | $0.007 |
Quality Validation
python-dotenv served as a canary — 6 confirmed real issues found by 4+ models out of 10.
| Issue | Models | Verdict |
|---|
| get_cli_string shell injection risk | 8/10 | Genuine security concern |
| test_list uses builtin format instead of output_format | 6/10 | Actual code bug |
| Duplicate files (README/CHANGELOG/CONTRIBUTING in docs/) | 6/10 | Maintenance burden |
| Broken mkdocs link (empty href) | 5/10 | Broken documentation |
| Unpinned dev dependencies | 5/10 | Reproducibility issue |
| Incorrect pre-commit command (precommit vs pre-commit) | 4/10 | Wrong package name |
Per-Model Quality
| Model | Consensus | Noise | Cost | Verdict |
|---|
| claude-sonnet-4-6 | 6/6 | Low | $0.38 | Best precision |
| gpt-5.4 | 5/6 | Low | $0.26 | Best mid-tier |
| gpt-5-mini | 5/6 | Low | $0.03 | Best daily value |
| claude-opus-4-6 | 6/6 | Moderate | $0.65 | Most findings overall |
| claude-haiku-4-5 | 4/6 | Moderate | $0.11 | Decent but pads with nits |
| gemini-2.5-pro | 3/6 | Low | $0.33 | Poor value vs gpt-5.4 |
| o4-mini | 3/6 | Moderate | $0.20 | Reasoning tokens wasted |
| gemini-2.5-flash | 2/6 | Moderate | $0.07 | Misses too much |
| gemini-3-flash-preview | 2/6 | Low | $0.10 | Preview quality |
| gpt-5-nano | 2/6 | Low | $0.01 | Too shallow |
Recommended Tiers
| Tier | Model | Cost/Run | Rationale |
|---|
| Daily | gpt-5-mini | $0.03 | 5/6 consensus issues, minimal noise, cheapest viable model |
| Deep dive | gpt-5.4 | $0.26 | 84 findings total, beats Sonnet quality at 68% the cost |
| Premium | claude-opus-4-6 | $0.65 | Most findings overall, best for maximum depth |
Dropped Models
- o3: 0 findings on python-dotenv, 7 on noxaudit at $0.33. Reasoning tokens wasted on non-reasoning task. Removed.
- gemini-2.0-flash: Deprecated. Returns errors in batch API.
All costs include 50% batch API discount. Different models genuinely find different things — only 6 issues had cross-model consensus.