WhyLab: Gemini 2.5 Docker Ground-Truth Validation

Validation of the causal C2 audit framework on SWE-bench-style problems, using Gemini 2.5 Flash with Docker ground-truth verification: 67 prefiltered problems, 402 episodes, and a head-to-head comparison of baseline vs whylab_c2.

Headline Statistics

67 problems × 3 seeds × 2 conditions = 402 episodes on YSH-Server
Audit rejection signal verified: whylab_c2 records real ground-truth divergences relative to the simple_retry baseline
E7v2 pairwise comparison remains significantly positive; the 3-way comparison is underpowered (stated as such in main.tex)
Adaptive C2 demoted to scoped calibration after E9 selective follow-up showed no net gain over fixed C2 on the targeted SWE-bench slice
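
The episode count above follows directly from the experimental grid. A minimal sketch of that enumeration is shown here; the problem identifiers and seed values are hypothetical placeholders, and only the two condition names and the 67 × 3 × 2 shape come from this page.

```python
from itertools import product

# Hypothetical identifiers: the real problem IDs and seed values are not listed here.
PROBLEMS = [f"problem_{i:02d}" for i in range(67)]
SEEDS = [0, 1, 2]
CONDITIONS = ["baseline", "whylab_c2"]

# One episode per (problem, seed, condition) combination.
episodes = list(product(PROBLEMS, SEEDS, CONDITIONS))

# Each condition sees the same problem/seed pairs, which is what makes the
# comparison head-to-head.
per_condition = {
    cond: sum(1 for _, _, c in episodes if c == cond) for cond in CONDITIONS
}

print(len(episodes), per_condition)
```

Because both conditions share every problem and seed, each baseline episode has a matching whylab_c2 episode to compare against.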

Why Gemini 2.5 Docker, Not GPT-4 Static

Earlier WhyLab reruns used static problem sets and GPT-4 reasoning, which proved susceptible to model-specific hallucination patterns. The Docker ground-truth setup compiles each candidate fix and runs the project's actual test suite, removing reasoning-only false positives. Gemini 2.5 Flash was chosen because it provides a materially different model family from prior runs, satisfying the 8.0 reopen protocol's requirement for a non-overlapping test bed.
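
To illustrate what Docker ground-truth verification could look like, the sketch below runs a project's test command inside a container and treats the exit code as the verdict. It is not the WhyLab harness: the image name, the test command, the timeout, and the choice to disable networking are all assumptions made for the example.

```python
import subprocess


def build_docker_command(image: str, test_cmd: str) -> list[str]:
    """Build a `docker run` invocation that executes the test command once.

    --rm removes the container afterwards; --network none is an assumed
    isolation choice, not something this page specifies.
    """
    return ["docker", "run", "--rm", "--network", "none", image, "sh", "-c", test_cmd]


def passes_ground_truth(image: str, test_cmd: str, timeout_s: int = 1800,
                        runner=subprocess.run) -> bool:
    """Return True only if the test suite exits with status 0 inside the container.

    `runner` is injectable so the logic can be exercised without Docker installed.
    """
    try:
        result = runner(
            build_docker_command(image, test_cmd),
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung test suite is counted as a failure, not as a pass.
        return False
    return result.returncode == 0
```

The point of the design is that a candidate fix counts as correct only when the real tests pass, regardless of how convincing the model's reasoning about the fix looks.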

Result Calibration

The run used the 67-problem prefilter that originally separated unstable cells from stable ones, with a fixed stop/go rule: a positive, defensible signal reopens the 8.0 narrative; a null or ambiguous result closes the 8.0 chase and returns the work to the stable-accept track. Audit rejection events are recorded for whylab_c2 across the seed sweep, confirming that the code path is live and not a no-op. The full results are presented as evidence of phase-aware deployment value rather than universal gain, and the manuscript's E7v2, E5, and cross-environment sections are recalibrated accordingly.
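
The stop/go rule and the audit liveness check can be written down as a small sketch. The enum names, the returned action strings, and the episode record fields (`condition`, `rejections`) are hypothetical; how a result gets classified as positive and defensible is left to the caller, since this page does not define it.

```python
from enum import Enum


class Signal(Enum):
    POSITIVE_DEFENSIBLE = "positive_defensible"
    NULL = "null"
    AMBIGUOUS = "ambiguous"


def stop_go(signal: Signal) -> str:
    """Map the run's outcome to the action named in the stop/go rule."""
    if signal is Signal.POSITIVE_DEFENSIBLE:
        return "reopen_8.0_narrative"
    # Null and ambiguous results are deliberately treated the same way.
    return "close_8.0_chase_return_to_stable_accept"


def audit_path_is_live(episodes: list[dict]) -> bool:
    """True if any whylab_c2 episode recorded at least one audit rejection.

    Assumes each episode is a dict with hypothetical keys
    'condition' (str) and 'rejections' (int).
    """
    return any(
        ep["condition"] == "whylab_c2" and ep["rejections"] > 0
        for ep in episodes
    )
```

Treating ambiguous results like null results keeps the rule conservative: only a clearly positive outcome can reopen the narrative.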

Honest Significance Framing

The main paper now states: adaptive C2 helps in E7v2 but does not beat fixed C2 on the targeted SWE-bench slice; the pairwise comparison reaches positive significance, while the three-way comparison remains underpowered. The selective E9 follow-up on baseline-fail slices showed no net gain in pass, oscillation, or regression outcomes; only the mean rejection count decreased. WhyLab is therefore positioned as scoped calibration with a deployment checklist, not a universal causal-audit gain.
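
This page does not say which statistical test backs the pairwise claim. As one common option for paired pass/fail outcomes, the sketch below computes a two-sided exact McNemar (sign) test from discordant pair counts. The choice of test is an assumption, and the function takes counts supplied by the caller; no WhyLab numbers are reproduced here.

```python
from math import comb


def exact_mcnemar_p(only_a_pass: int, only_b_pass: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    only_a_pass: paired episodes where only condition A passed.
    only_b_pass: paired episodes where only condition B passed.
    Pairs where both passed or both failed carry no information and are ignored.
    """
    n = only_a_pass + only_b_pass
    if n == 0:
        return 1.0  # No discordant pairs: no evidence either way.
    k = min(only_a_pass, only_b_pass)
    # Binomial(n, 0.5) probability of a split at least as lopsided as observed.
    one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)
```

Because only discordant pairs contribute, a comparison with few of them has little power. That is one way a pairwise contrast can reach significance while a comparison that splits the same episodes three ways does not.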

Downloads & Artifacts

WhyLab paper PDF (pdf)
Selective rerun results (github)

Citations & References

How to Cite

WhyLab: Gemini 2.5 Docker Ground-Truth Validation — Neo Genesis (https://neogenesis.app/data/research/whylab-gemini-2-5-docker-validation). Updated 2026-04-27.
