Benchmark Results

We benchmarked 10 models on real repositories to see which ones find real issues and which mostly generate noise. Total spend: $2.13.

Methodology

Repos: python-dotenv (34 files, ~52K tokens) and noxaudit (88 files, ~126K tokens)
Focus: All 7 areas (security, docs, patterns, testing, hygiene, dependencies, performance)
Method: Batch API on all providers (50% discount), 1 run per model per repo
Quality validation: Cross-model consensus — issues found by 4+ models are considered "real"
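
The consensus rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual tooling: the function name, the input shape, and the issue keys are all hypothetical, and it assumes equivalent findings from different models have already been normalized to the same key, a step the methodology above does not describe.

```python
from collections import defaultdict

CONSENSUS_THRESHOLD = 4  # "4+ models" from the methodology above


def consensus_issues(findings_by_model, threshold=CONSENSUS_THRESHOLD):
    """Map each consensus issue key to the sorted list of models that reported it.

    findings_by_model: dict of model name -> iterable of issue keys.
    Assumes equivalent findings were already normalized to the same key.
    """
    models_by_issue = defaultdict(set)
    for model, issue_keys in findings_by_model.items():
        for key in issue_keys:
            # Using a set means a model counts at most once per issue.
            models_by_issue[key].add(model)
    return {
        key: sorted(models)
        for key, models in models_by_issue.items()
        if len(models) >= threshold
    }


# Made-up data for illustration only (not benchmark output).
example = {
    "model-a": ["shell-injection", "unused-import"],
    "model-b": ["shell-injection"],
    "model-c": ["shell-injection", "unused-import"],
    "model-d": ["shell-injection"],
}
print(consensus_issues(example))
```

In this made-up example only "shell-injection" clears the threshold, since "unused-import" is reported by just two models.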

Results

Model	python-dotenv findings	noxaudit findings	Total findings	Cost	$/finding
gpt-5-nano	4	6	10	$0.01	$0.001
gpt-5-mini	15	24	39	$0.03	$0.001
gemini-2.5-flash	18	16	34	$0.07	$0.002
gemini-3-flash-preview	8	10	18	$0.10	$0.005
claude-haiku-4-5	24	15	39	$0.11	$0.003
o4-mini	8	6	14	$0.20	$0.014
gpt-5.4	32	52	84	$0.26	$0.003
gemini-2.5-pro	17	21	38	$0.33	$0.009
claude-sonnet-4-6	30	48	78	$0.38	$0.005
claude-opus-4-6	40	51	91	$0.65	$0.007
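
The $/finding column is the cost divided by the total findings across both repos. The sketch below recomputes it for three rows of the table; the function name and data layout are illustrative assumptions. Because the published costs are rounded to the cent, a recomputed value may differ from the table in the last digit for some rows.

```python
def cost_per_finding(cost_usd, findings_per_repo):
    """Return cost divided by total findings, or None when there are no findings."""
    total = sum(findings_per_repo)
    if total == 0:
        return None  # avoid dividing by zero for a model that found nothing
    return cost_usd / total


# (cost, (python-dotenv findings, noxaudit findings)) copied from the table above.
rows = {
    "gpt-5-mini": (0.03, (15, 24)),
    "gpt-5.4": (0.26, (32, 52)),
    "claude-opus-4-6": (0.65, (40, 51)),
}

for model, (cost, counts) in rows.items():
    print(f"{model}: ${cost_per_finding(cost, counts):.3f} per finding")
```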

Consensus issues

python-dotenv served as the canary: 6 issues were reported by at least 4 of the 10 models and are treated as confirmed real issues.

Issue	Models	Verdict
get_cli_string shell injection risk	8/10	Genuine security concern
test_list uses builtin format instead of output_format	6/10	Actual code bug
Duplicate files (README/CHANGELOG/CONTRIBUTING in docs/)	6/10	Maintenance burden
Broken mkdocs link (empty href)	5/10	Broken documentation
Unpinned dev dependencies	5/10	Reproducibility issue
Incorrect pre-commit command (precommit vs pre-commit)	4/10	Wrong package name

Model quality

Model	Consensus	Noise	Cost	Verdict
claude-sonnet-4-6	6/6	Low	$0.38	Best precision
gpt-5.4	5/6	Low	$0.26	Best mid-tier
gpt-5-mini	5/6	Low	$0.03	Best daily value
claude-opus-4-6	6/6	Moderate	$0.65	Most findings overall
claude-haiku-4-5	4/6	Moderate	$0.11	Decent but pads with nits
gemini-2.5-pro	3/6	Low	$0.33	Poor value vs gpt-5.4
o4-mini	3/6	Moderate	$0.20	Reasoning tokens wasted
gemini-2.5-flash	2/6	Moderate	$0.07	Misses too much
gemini-3-flash-preview	2/6	Low	$0.10	Preview quality
gpt-5-nano	2/6	Low	$0.01	Too shallow
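
The Consensus column counts how many of the six python-dotenv consensus issues a model reported. A minimal sketch of that calculation follows. The issue keys are hypothetical labels for the six issues listed earlier, and the example findings are made up rather than taken from any benchmarked model.

```python
# Hypothetical keys for the six python-dotenv consensus issues.
CONSENSUS_KEYS = frozenset({
    "cli-shell-injection",
    "test-list-format-builtin",
    "duplicate-doc-files",
    "broken-mkdocs-link",
    "unpinned-dev-deps",
    "pre-commit-command-typo",
})


def consensus_coverage(model_issue_keys, consensus_keys=CONSENSUS_KEYS):
    """Return coverage as 'hits/total', e.g. '5/6'."""
    hits = set(model_issue_keys) & set(consensus_keys)
    return f"{len(hits)}/{len(consensus_keys)}"


# Made-up findings for illustration: two consensus issues plus one non-consensus nit.
example_findings = ["cli-shell-injection", "broken-mkdocs-link", "style-nit"]
print(consensus_coverage(example_findings))  # 2/6
```

Coverage measures only how many consensus issues a model caught. The Noise column is a separate judgment about its remaining findings and is not derived here.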

Recommendations

Tier	Model	Cost/Run	Rationale
Daily	gpt-5-mini	$0.03	5/6 consensus issues, minimal noise, cheapest viable model
Deep dive	gpt-5.4	$0.26	84 findings total (vs. 78 for Sonnet) and 5/6 consensus issues, at 68% of Sonnet's cost
Premium	claude-opus-4-6	$0.65	Most findings overall, best for maximum depth
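
The relative-cost comparison between tiers is simple arithmetic on the Cost column. The sketch below reproduces the 68% figure for gpt-5.4 against claude-sonnet-4-6; the dictionary and function name are illustrative assumptions, and the costs are the rounded values from the tables above.

```python
# Costs copied from the tables above (USD, batch discount included).
COST_PER_RUN = {
    "gpt-5-mini": 0.03,
    "gpt-5.4": 0.26,
    "claude-sonnet-4-6": 0.38,
    "claude-opus-4-6": 0.65,
}


def relative_cost(model, baseline, costs=COST_PER_RUN):
    """Return the cost of `model` as a fraction of the cost of `baseline`."""
    return costs[model] / costs[baseline]


print(f"{relative_cost('gpt-5.4', 'claude-sonnet-4-6'):.0%}")  # 68%
```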

Excluded models

o3: 0 findings on python-dotenv and 7 on noxaudit, at $0.33. Reasoning tokens were wasted on a non-reasoning task, so it was removed.
gemini-2.0-flash: Deprecated; returns errors in the batch API.

All costs include the 50% batch API discount. Different models find different things: on python-dotenv, only 6 issues reached the 4-model consensus threshold.