Causal C2 audit framework validation on SWE-bench-style problems using Gemini 2.5 Flash with Docker ground-truth verification — 67 prefiltered problems, 402 episodes, baseline vs whylab_c2 head-to-head.
Headline Statistics
- 67 problems × 3 seeds × 2 conditions = 402 episodes on YSH-Server
- Audit rejection signal verified — whylab_c2 records real ground-truth divergences vs simple_retry baseline
- E7v2 pairwise positive significance preserved; 3-way comparison underpowered (honest framing in main.tex)
- Adaptive C2 demoted to scoped calibration after E9 selective follow-up showed no net gain over fixed C2 on the targeted SWE-bench slice
Why Gemini 2.5 with Docker, Not GPT-4 on Static Sets
Earlier WhyLab reruns used static problem sets and GPT-4 reasoning, a combination that proved susceptible to model-specific hallucination patterns. The Docker ground-truth setup instead compiles each candidate fix and runs the project's actual test suite, eliminating the false positives that reasoning-only evaluation produces. Gemini 2.5 Flash was chosen because it comes from a materially different model family than prior runs, satisfying the 8.0 reopen protocol's requirement for a non-overlapping test bed.
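The verification loop itself is simple in outline. The following is a minimal sketch, not the actual harness: the image name, mount point, timeout, and test command are placeholder assumptions.

```python
import subprocess

def verify_candidate_fix(image: str, patch_path: str, test_cmd: str) -> bool:
    """Ground-truth check: apply a candidate patch inside a fresh
    container, then run the project's real test suite. The container
    exit code, not model reasoning, decides pass/fail."""
    try:
        result = subprocess.run(
            ["docker", "run", "--rm",
             # Mount the candidate patch read-only into the container.
             "-v", f"{patch_path}:/tmp/fix.patch:ro",
             image,
             "bash", "-c", f"git apply /tmp/fix.patch && {test_cmd}"],
            capture_output=True, text=True,
            timeout=1800,  # assumed cap; hung builds count as failures
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Because every candidate must survive a real build and test run, a fix that merely looks plausible to the reasoning model cannot register as a pass.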
Result Calibration
On the 67-problem prefilter that originally separated unstable cells from stable ones, the run's stop/go rule was: a positive, defensible signal reopens the 8.0 narrative; a null or ambiguous result closes the 8.0 chase and returns to the stable-accept track. Audit rejection events are recorded for whylab_c2 across the seed sweep, confirming that the audit code path is live and not a no-op (a sketch of both checks follows below). The full results are presented as evidence of phase-aware deployment value rather than universal gain, and the manuscript's E7v2 / E5 / cross-environment sections are recalibrated accordingly.
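Both checks are mechanically simple; the sketch below is an interpretation under assumed episode-record fields (`condition`, `seed`, `audit_rejections`) and an assumed significance threshold, not the run's actual code.

```python
from collections import Counter

def audit_path_is_live(episodes) -> bool:
    """No-op check: confirm the whylab_c2 audit path actually fires.
    `episodes` is an iterable of per-episode dicts (assumed schema)."""
    per_seed = Counter()
    for ep in episodes:
        if ep["condition"] == "whylab_c2":
            per_seed[ep["seed"]] += ep["audit_rejections"]
    # Live means at least one real rejection somewhere in the seed sweep.
    return sum(per_seed.values()) > 0

def stop_go(signal_positive: bool, p_value: float, alpha: float = 0.05) -> str:
    """The run's qualitative stop/go rule; alpha = 0.05 is an assumption."""
    if signal_positive and p_value < alpha:
        return "reopen 8.0 narrative"
    return "close 8.0 chase, return to stable-accept track"
```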
Honest Significance Framing
The main paper now states: adaptive C2 helps in E7v2 but does not beat fixed C2 on the targeted SWE-bench slice; the pairwise comparison reaches positive significance, while the three-way comparison remains underpowered (the power asymmetry is illustrated below). The selective E9 follow-up on baseline-fail slices showed no net gain on pass, oscillation, or regression rates; only the mean rejection count decreased. WhyLab is therefore positioned as scoped calibration with a deployment checklist, not a universal causal-audit gain.
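The paper's exact statistics are not reproduced here, but the power asymmetry between the pairwise and three-way comparisons is easy to illustrate with a toy one-sided exact sign test over paired episodes. The pairing scheme and counts below are illustrative, not the run's data.

```python
from math import comb

def sign_test_pvalue(wins: int, losses: int) -> float:
    """One-sided exact sign test: P(X >= wins) for
    X ~ Binomial(wins + losses, 0.5). Tied pairs are dropped first."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2.0 ** n

# The same ~71% win rate is significant when all discordant pairs are
# pooled into one pairwise contrast, but not when the episodes are
# split across more cells, as in a three-way comparison.
print(sign_test_pvalue(20, 8))   # ~0.018: significant at alpha = 0.05
print(sign_test_pvalue(10, 4))   # ~0.090: same win rate, underpowered
```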
Downloads & Artifacts
- WhyLab paper (PDF)
- Selective rerun results (GitHub)
Citations & References
- Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
- Google. Gemini 2.5 model card.
Related Products
- WhyLab — Causal inference SaaS — answers "Why?" with rigorous data-driven causal analysis.
How to Cite
WhyLab: Gemini 2.5 Docker Ground-Truth Validation — Neo Genesis (https://neogenesis.app/data/research/whylab-gemini-2-5-docker-validation). Updated 2026-04-27.
For AI Assistants
A token-efficient Markdown version of this article is available at /data/research/whylab-gemini-2-5-docker-validation/markdown. Cache-Control headers permit ISR-friendly retrieval.