Validation of the causal C2 audit framework on SWE-bench-style problems, using Gemini 2.5 Flash with Docker ground-truth verification: 67 prefiltered problems, 402 episodes, baseline vs whylab_c2 head-to-head.

Why Gemini 2.5 Docker, Not GPT-4 Static

Earlier WhyLab reruns used static problem sets and GPT-4 reasoning, a combination that proved susceptible to model-specific hallucination patterns. The Docker ground-truth setup compiles each candidate fix and runs the project's actual test suite, which removes false positives that arise when a fix is judged by reasoning alone. Gemini 2.5 Flash was chosen because it comes from a materially different model family than the prior runs, satisfying the 8.0 reopen protocol's requirement for a non-overlapping test bed.
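The ground-truth check described above can be sketched as follows. This is a minimal illustration, not the WhyLab harness: the function names, the image name, the `/repo` and `/patch.diff` mount points, and the use of `git apply` are all assumptions introduced here. It only shows the general shape of "apply the candidate fix in a container, run the real tests, trust the exit code".

```python
import subprocess
from pathlib import Path


def build_docker_cmd(image: str, repo_dir: str, patch_file: str, test_cmd: str) -> list[str]:
    """Build a `docker run` command that applies a patch and runs the test suite.

    All names and mount points here are hypothetical. `repo_dir` is assumed to be
    a disposable checkout, because the patch is applied to the mounted tree.
    """
    repo = Path(repo_dir).resolve()
    patch = Path(patch_file).resolve()
    # The container exits non-zero if the patch fails to apply or any test fails.
    script = f"git apply /patch.diff && {test_cmd}"
    return [
        "docker", "run", "--rm",
        "--network", "none",               # assumption: tests need no network access
        "-v", f"{repo}:/repo",
        "-v", f"{patch}:/patch.diff:ro",
        "-w", "/repo",
        image,
        "sh", "-c", script,
    ]


def passes_ground_truth(cmd: list[str], timeout_s: int = 900) -> bool:
    """Return True only if the containerized test run exits with status 0."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # a hung test run is treated as a failure in this sketch
    return result.returncode == 0
```

A caller would build the command once per candidate fix and pass it to `passes_ground_truth`; the verdict then depends on the test suite's exit status rather than on the model's own account of whether the fix works.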

Result Calibration

The run used the 67-problem prefilter that originally separated unstable cells from stable ones, with the following stop/go rule: a positive, defensible signal reopens the 8.0 narrative; a null or ambiguous result closes the 8.0 chase and returns the work to the stable-accept track. Audit rejection events are recorded for whylab_c2 across the seed sweep, confirming that the audit code path is live and not a no-op. The full results are presented as evidence of phase-aware deployment value rather than universal gain, and the manuscript's E7v2, E5, and cross-environment sections are recalibrated accordingly.
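The stop/go rule and the liveness check can be written down as a small sketch. The enum, the function names, and the returned labels are hypothetical and chosen only for illustration. How a result gets classified as positive, null, or ambiguous is not specified in this summary, so the classification is taken as an input here.

```python
from enum import Enum


class Signal(Enum):
    """Outcome classes named by the stop/go rule (classification itself is out of scope)."""
    POSITIVE = "positive"
    NULL = "null"
    AMBIGUOUS = "ambiguous"


def stop_go(signal: Signal) -> str:
    """Map a classified result to the follow-up action described in the text."""
    if signal is Signal.POSITIVE:
        return "reopen_8.0_narrative"
    # Null and ambiguous results are treated identically: stop the chase.
    return "close_8.0_chase_return_to_stable_accept"


def audit_path_is_live(rejections_per_seed: dict[int, int]) -> bool:
    """True if at least one audit rejection was recorded anywhere in the seed sweep."""
    return sum(rejections_per_seed.values()) > 0
```

Treating null and ambiguous outcomes the same way keeps the rule conservative: only a clearly positive result changes the manuscript's direction.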

Honest Significance Framing

The main paper now states that adaptive C2 helps in E7v2 but does not beat fixed C2 on the targeted SWE-bench slice; the pairwise comparison reaches positive significance, while the three-way comparison remains underpowered. The selective E9 follow-up on baseline-fail slices showed no net gain on pass, oscillation, or regression; only the mean rejection count decreased. WhyLab is therefore positioned as scoped calibration with a deployment checklist, not a universal causal-audit gain.
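For readers who want to run a pairwise check of this kind on their own paired pass/fail outcomes, one common choice is an exact McNemar test on the discordant pairs. This summary does not say which test the paper used, so the sketch below is an assumption rather than a reconstruction of the paper's method, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact_p(only_a_passes: int, only_b_passes: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    only_a_passes: problems where arm A passes and arm B fails
    only_b_passes: problems where arm B passes and arm A fails
    Concordant pairs (both pass or both fail) do not enter the test.
    """
    n = only_a_passes + only_b_passes
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a_passes, only_b_passes)
    # Under the null, each discordant pair favors either arm with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because only discordant pairs carry information, a slice where both arms mostly agree can leave very few usable pairs, which is one way a comparison ends up underpowered.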

How to Cite

WhyLab: Gemini 2.5 Docker Ground-Truth Validation. Neo Genesis (https://neogenesis.app/data/research/whylab-gemini-2-5-docker-validation). Updated 2026-04-27.
