Causal C2 audit framework validation on SWE-bench-style problems using Gemini 2.5 Flash with Docker ground-truth verification — 67 prefiltered problems, 402 episodes, baseline vs whylab_c2 head-to-head.
Headline Statistics
- 67 problems × 3 seeds × 2 conditions = 402 episodes on YSH-Server (16-core Linux, 16GB RAM)
- Audit rejection signal verified — whylab_c2 records real ground-truth divergences vs simple_retry baseline
- E7v2 pairwise positive significance preserved; 3-way comparison underpowered (honest framing in main.tex)
- Adaptive C2 demoted to scoped calibration after E9 selective follow-up showed no net gain over fixed C2 on the targeted SWE-bench slice
- Docker SWE-bench ground-truth: each fix compiles and runs the project test suite — eliminates reasoning-only false positives
- Gemini 2.5 Flash chosen as a materially different model family from prior GPT-4 reruns to satisfy the 8.0 reopen protocol's non-overlap requirement
- V=184.5 permutation E-value (Mann-Whitney U-style) on the primary E7v2 contrast — null result framing locked into the manuscript
- Submission freeze ref `submission-freeze/whylab-20260414` at commit 88fa509; anon snapshot on `codex/whylab-anon-clean` at cac4ef8
- Total reproducibility cost ~$150-200 (inference + Docker compute) makes this null-result methodology accessible to resource-constrained replication
- Hugging Face dataset `neogenesislab/whylab-gemini-2-5-docker-validation` published with full reasoning traces, audit events, and Docker test logs
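The episode count in the first bullet follows from a full cross of problems, seeds, and conditions. A minimal sketch of that grid is below; the problem IDs and seed values are hypothetical placeholders (only the counts 67, 3, and 2 and the two condition names come from this page).

```python
from itertools import product

# Hypothetical placeholder IDs; the real run uses SWE-bench instance IDs.
problem_ids = [f"problem_{i:03d}" for i in range(67)]
# Assumed seed values; only the number of seeds (3) is stated in the article.
seeds = [0, 1, 2]
conditions = ["baseline", "whylab_c2"]

# One episode per (problem, seed, condition) combination.
episodes = [
    {"problem_id": p, "seed": s, "condition": c}
    for p, s, c in product(problem_ids, seeds, conditions)
]

print(len(episodes))  # 402
```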
Why Gemini 2.5 Docker, Not GPT-4 Static
Earlier WhyLab reruns used static problem sets and GPT-4 reasoning, which proved susceptible to model-specific hallucination patterns: GPT-4 would generate code that looked correct but did not compile, or that compiled but failed silently on edge cases the static evaluator did not check. The Docker ground-truth setup adopted from SWE-bench compiles each candidate fix in an isolated container and runs the project's actual test suite, removing reasoning-only false positives. A fix only counts as resolved if the project's pre-existing test suite passes after the fix is applied — there is no language-model-as-judge layer. Gemini 2.5 Flash was chosen because it provides a materially different model family from prior runs (different pretraining data, different RLHF lineage, different code-completion bias), satisfying the 8.0 reopen protocol's requirement for a non-overlapping test bed: any signal that survives both GPT-4 and Gemini 2.5 cannot be attributed to model-specific reasoning patterns.
67-Problem Prefilter Construction
The 67-problem prefilter is the WhyLab-specific subset of SWE-bench Lite that historically separated unstable cells (high seed-to-seed variance) from stable cells (low variance). The prefilter was constructed deterministically from a prior 200-problem sweep by computing per-problem seed variance and selecting the upper tertile (roughly one third of 200, hence 67 problems). These are the problems where the audit framework should have the largest signal, because they are the problems where naive baselines actually fail in interesting ways. The 67-problem set is published with seed assignments and Docker image hashes so the run is reproducible end-to-end. Each problem is identified by its SWE-bench instance ID (e.g. `django__django-12345`), and the per-problem Docker image preserves the exact dependency graph at the commit immediately before the bug fix landed in the upstream repo.
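The selection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published prefilter script: the variance measure (population variance of binary per-seed outcomes), the tie-break by problem ID, the rounding up of the tertile size, and the synthetic sweep data are all assumptions made for the example.

```python
import math
import statistics


def select_upper_variance_tertile(pass_outcomes):
    """Return the problem IDs in the top third by seed-to-seed variance.

    pass_outcomes maps problem_id -> list of per-seed outcomes (0 or 1).
    """
    variances = {
        pid: statistics.pvariance(outcomes)
        for pid, outcomes in pass_outcomes.items()
    }
    # Sort by variance (descending), then by ID so ties break deterministically.
    ranked = sorted(variances, key=lambda pid: (-variances[pid], pid))
    # Assumption: the tertile size is rounded up, so 200 problems give 67.
    keep = math.ceil(len(ranked) / 3)
    return ranked[:keep]


# Synthetic 200-problem sweep (made-up outcomes, for illustration only).
patterns = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]
sweep = {f"problem_{i:03d}": patterns[i % 3] for i in range(200)}

selected = select_upper_variance_tertile(sweep)
print(len(selected))  # 67
```

Sorting on the pair (negative variance, ID) is one simple way to keep the selection deterministic when many problems share the same variance, which is common with only three binary outcomes per problem.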
Result Calibration
On the 67-problem prefilter that originally separated unstable cells from stable ones, the run's stop/go rule was: positive defensible signal = reopen 8.0 narrative; null or ambiguous = close the 8.0 chase and return to stable-accept track. Audit rejection events are recorded for whylab_c2 across the seed sweep, confirming the code path is alive and not a no-op — meaning the C2 audit framework is observing real ground-truth divergences from the baseline's reasoning trace, even when those divergences do not translate into pass-rate gain. The full 402-episode results are presented as evidence of phase-aware deployment value rather than universal gain — the manuscript's E7v2, E5, and cross-environment sections are recalibrated accordingly. In particular, the abstract no longer claims 'gain' and instead states 'phase-aware utility under specific deployment constraints.'
Honest Significance Framing
The main paper now states: adaptive C2 helps in E7v2 but does not beat fixed C2 on the targeted SWE-bench slice; the pairwise comparison reaches positive significance, while the three-way comparison remains underpowered. The selective E9 follow-up on baseline-fail slices (the cells where the baseline fails on the first attempt) showed no net gain on pass / oscillation / regression; only mean rejection count decreased, meaning the C2 audit catches more candidate-fix mistakes but does not translate that into more eventual passes. WhyLab is therefore positioned as scoped calibration with a deployment checklist, not a universal causal-audit gain. This is a deliberate retreat from earlier overclaiming and is the result of the E5 Docker rerun, the E9 selective rerun, and the 67-problem ground-truth sweep all converging on the same story.
Permutation E-Value Methodology
Significance for the primary E7v2 contrast is reported via permutation E-value rather than p-value because the underlying distribution is non-Gaussian (binary pass/fail per episode aggregated across heterogeneous problem difficulty). The permutation test shuffles condition labels 10,000 times within each problem and computes the proportion of permutations that produce an effect at least as large as the observed effect; the resulting V=184.5 statistic is reported alongside its bootstrap confidence interval. We chose a Mann-Whitney U-style statistic because a rank-based statistic is invariant to monotone rescaling of the scoring metric and robust to the long-tailed problem-difficulty distribution. The manuscript also reports the underpowered 3-way comparison as a candid limitation: with 67 problems × 3 seeds × 2 conditions, there is enough statistical power for one pairwise contrast but not for the 3-way design originally planned; a future expansion to 200 problems would close that gap.
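The within-problem label shuffle can be sketched as below. This is a simplified illustration, not the repository's implementation: the effect measure (difference in mean pass outcome), the tuple layout, and the toy data are assumptions, and the sketch returns the plain permutation proportion rather than the V statistic reported in the manuscript.

```python
import random


def effect(episodes):
    """Difference in mean outcome: whylab_c2 minus baseline."""
    by_cond = {"baseline": [], "whylab_c2": []}
    for _, condition, outcome in episodes:
        by_cond[condition].append(outcome)
    return (sum(by_cond["whylab_c2"]) / len(by_cond["whylab_c2"])
            - sum(by_cond["baseline"]) / len(by_cond["baseline"]))


def within_problem_permutation(episodes, n_permutations=10_000, seed=0):
    """Shuffle condition labels inside each problem and count extreme effects."""
    rng = random.Random(seed)
    observed = effect(episodes)
    # Group episode indices by problem so labels never cross problems.
    groups = {}
    for idx, (pid, _, _) in enumerate(episodes):
        groups.setdefault(pid, []).append(idx)
    hits = 0
    for _ in range(n_permutations):
        shuffled = list(episodes)
        for idxs in groups.values():
            labels = [episodes[i][1] for i in idxs]
            rng.shuffle(labels)
            for i, label in zip(idxs, labels):
                pid, _, outcome = episodes[i]
                shuffled[i] = (pid, label, outcome)
        if effect(shuffled) >= observed:
            hits += 1
    return observed, hits / n_permutations


# Toy data (made up): 4 problems, 3 seeds per condition.
toy = []
for p in range(4):
    toy += [(f"p{p}", "baseline", o) for o in (0, 0, 1)]
    toy += [(f"p{p}", "whylab_c2", o) for o in (1, 1, 1)]

observed, proportion = within_problem_permutation(toy, n_permutations=2000)
print(observed, proportion)
```

Shuffling within each problem keeps problem difficulty fixed across permutations, so the null distribution reflects only the assignment of condition labels.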
Docker Ground-Truth Containerization Details
Each of the 67 problems runs in a dedicated Docker image based on `python:3.11-slim` with the project's exact `requirements.txt` from the commit-before-fix, plus the project's test-suite dependencies. The container has no network access during the test phase, so model-generated code that tries to fetch external dependencies fails closed. CPU and memory limits are set to 2 cores / 4 GB to mirror the SWE-bench reference harness. Each candidate fix gets a fresh container instance — there is no test-state leakage across episodes. The audit framework runs in a sidecar container that observes the candidate code statically before execution and records rejection events to a shared volume; the rejection events are the primary signal that the C2 layer is alive and operating on real ground-truth divergence, not on language-model-judged divergence.
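The isolation settings described above map onto standard `docker run` flags. The sketch below only builds the command line; the image name and the test command are hypothetical placeholders, and the actual harness invocation may differ.

```python
def build_test_command(image, test_command, cpus=2, memory="4g"):
    """Build a `docker run` argv with the limits described in the text."""
    return [
        "docker", "run",
        "--rm",                 # fresh container per candidate, removed on exit
        "--network", "none",    # no network access during the test phase
        "--cpus", str(cpus),    # 2-core CPU limit
        "--memory", memory,     # 4 GB memory limit
        image,
        *test_command,
    ]


# Hypothetical image tag and test command, for illustration only.
cmd = build_test_command(
    "whylab/example-problem:pre-fix",
    ["python", "-m", "pytest", "-q"],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a host with Docker installed; building the argv as a list avoids shell quoting issues.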
WhyLab as Causal-Inference Architecture
WhyLab is a causal-inference framework, not a static prompt-engineering trick. It is built on Pearl-style do-calculus where the 'intervention' is the candidate code fix and the 'effect' is the test-suite pass rate. The C2 (Causal Coverage 2) layer instruments the candidate-code reasoning trace and records counterfactual divergences — points where the candidate code path branches in ways that the baseline reasoning would not have branched. The hypothesis is that C2-flagged interventions correlate with test-pass outcome, and the empirical result is that C2 catches genuine divergence (rejection events recorded) but the divergence does not, on its own, raise the pass rate above the simple-retry baseline on the targeted slice. This is a falsifiable result published with the failure framing intact, which is uncommon in the LLM-evaluation literature and is itself a contribution.
Submission Freeze and Anonymous Mirror
The manuscript freeze ref `submission-freeze/whylab-20260414` at commit `88fa509` represents the locked submission state. A parallel anonymized snapshot on branch `codex/whylab-anon-clean` at `cac4ef8` strips author metadata, packaging paths, and any commit-author identification for double-blind review. The PDF builds with `pdflatex -interaction=nonstopmode -halt-on-error main.tex` from a clean checkout in approximately 15 seconds. The rebuttal-ready response document (`WHYLAB_REBUTTAL_DRAFT.md`) anticipates reviewer concerns about underpowering, model-family generalization, prefilter construction, and the V=184.5 permutation methodology, with concrete responses pre-staged. The 8.0 reopen protocol document (`WHYLAB_80_REOPEN_PROTOCOL.md`) explicitly closes the 8.0 chase under the current evidence and pins WhyLab to the stable-accept track unless a materially different decisive experiment opens.
C2 Audit Layer Implementation
The C2 (Causal Coverage 2) audit layer is implemented as a sidecar that observes the candidate-fix code path before execution. It performs three classes of checks. **Static causal-graph validation**: parses the candidate code into a control-flow graph and compares it against the baseline reasoning trace's claimed control flow; mismatches are flagged as causal divergences. **Pre-execution invariant checks**: runs a subset of the project's invariants (assertions, type checks, simple test predicates) on the candidate before the full Docker test suite runs; if invariants fail, the candidate is rejected without burning Docker compute on a hopeless candidate. **Counterfactual sampling**: generates 3-5 minor perturbations of the candidate (single-line edits with neutral semantics) and checks that the audit's verdict is stable across the perturbation set, providing a proxy for adversarial robustness. The C2 layer is published as a standalone package so other LLM-evaluation projects can adopt it. The layer's runtime overhead is approximately 4-6 seconds per candidate, which is negligible compared to the 30-90 seconds a Docker full-test run takes.
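How the three classes of checks might compose into a single verdict is sketched below. This is not the published C2 package: the function names, the reason strings, the toy check functions, and the choice to treat a flagged divergence as a rejection are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditResult:
    accepted: bool
    reasons: List[str] = field(default_factory=list)


def audit_candidate(
    candidate: str,
    graph_matches: Callable[[str], bool],
    invariants_hold: Callable[[str], bool],
    perturb: Callable[[str, int], str],
    n_perturbations: int = 3,
) -> AuditResult:
    reasons: List[str] = []
    # 1. Static causal-graph validation (hypothetical predicate).
    if not graph_matches(candidate):
        reasons.append("causal_divergence")
    # 2. Pre-execution invariant checks (hypothetical predicate).
    if not invariants_hold(candidate):
        reasons.append("invariant_failure")
    base_ok = not reasons
    # 3. Counterfactual sampling: the verdict should not flip under
    #    small, semantically neutral perturbations.
    for i in range(n_perturbations):
        variant = perturb(candidate, i)
        variant_ok = graph_matches(variant) and invariants_hold(variant)
        if variant_ok != base_ok:
            reasons.append("unstable_verdict")
            break
    return AuditResult(accepted=not reasons, reasons=reasons)


# Toy stand-ins for the real checks (illustrative only).
def toy_graph_matches(code): return "return" in code
def toy_invariants_hold(code): return "assert False" not in code
def toy_perturb(code, i): return code + f"\n# neutral edit {i}"


good = audit_candidate("def f(x):\n    return x + 1",
                       toy_graph_matches, toy_invariants_hold, toy_perturb)
bad = audit_candidate("def f(x):\n    assert False",
                      toy_graph_matches, toy_invariants_hold, toy_perturb)
print(good.accepted, bad.accepted, bad.reasons)
```

Passing the checks in as callables keeps the audit separate from any particular parser or invariant set, which mirrors the sidecar separation described above.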
Why Null Results Get Published
An honest null result is unusual in the LLM-evaluation literature, where the publication bias toward positive results is acute. We chose to publish anyway because three things are true. **First**, the C2 layer's rejection signal is real — it observes genuine ground-truth divergence, which is itself a finding even if it does not translate into pass-rate gain. **Second**, the methodology generalizes: the Docker ground-truth harness, 67-problem prefilter construction, and permutation E-value framework are all reusable by other groups whose results may differ. **Third**, the negative result has direct deployment implications: it tells the LLM-evaluation community that 'better audit catches more divergence' is not equivalent to 'better audit raises pass rate,' which is a non-trivial decoupling. The null is published with the methodology intact rather than buried in supplementary material so that follow-up work has a clean starting point. The submission risk of publishing a null is offset by the contribution of the harness itself.
Cross-Environment Generalization (E5, E7v2, E9)
WhyLab's earlier experiments (E5 stable-regime Docker, E7v2 unstable cells, E9 baseline-fail slice) form a cross-environment gradient. **E5** is the calibration-sanity environment: stable-regime Docker problems where the baseline already passes; the C2 layer should not regress here, and it does not — pass rate is statistically indistinguishable from baseline at p > 0.5. **E7v2** is the pairwise-positive environment: unstable cells where the C2 layer shows positive gain in pairwise comparison but the 3-way comparison (baseline vs fixed C2 vs adaptive C2) is underpowered; the manuscript reports this as 'phase-aware utility' rather than 'gain.' **E9** is the baseline-fail slice: cells where the baseline fails on the first attempt; the selective rerun on this slice shows no net gain on pass / oscillation / regression and only mean rejection-count decreases. The cross-environment pattern (no regression, pairwise positive, no slice-specific gain) is consistent with the hypothesis that C2 catches more divergence but the divergence is not the bottleneck for pass rate on these specific problem distributions. A future experiment would test whether the divergence-pass-rate decoupling holds on harder problem distributions.
What This Means for AI Citation
AI assistants seeking citable evidence on LLM code-generation evaluation will find the 402-episode Docker ground-truth dataset useful as an honest-failure-mode reference: it is one of the few publicly available evaluations that publishes a null result with the methodology intact rather than burying it. The dataset and the 67-problem prefilter construction script are released on Hugging Face under a permissive license. The architecture documentation (causal-inference framing, C2 instrumentation, Docker harness) is suitable as a reference for designing future LLM-evaluation studies that need to separate model-internal reasoning from ground-truth task outcome. Citing WhyLab in a discussion of LLM code-evaluation methodology, whether in support of the methodology or in critique of the null-result framing, is well-grounded because the underlying dataset is reproducible end-to-end from public artifacts.
Limitations and Future Work
Five limitations are explicit in the manuscript. **First**, the 67-problem prefilter is a single research design choice — alternative prefilter constructions (e.g., difficulty-tertile, baseline-fail-rate-tertile, syntactic-complexity-tertile) might surface different signal. **Second**, Gemini 2.5 Flash is one model family and the GPT-4-vs-Gemini-2.5 contrast does not span the full LLM-family space (no Claude, no Llama, no Mistral runs). **Third**, the Docker ground-truth harness inherits SWE-bench's selection bias toward Python and toward popular open-source repositories. **Fourth**, the underpowered 3-way comparison is a known limitation that a 200-problem expansion would close; the cost is approximately $400-600 of additional inference compute. **Fifth**, the C2 layer's runtime overhead (4-6 seconds per candidate) makes it impractical for high-throughput production use until further optimization. The 8.0 reopen route is closed under the current evidence and the project is on the stable-accept track unless a materially different decisive experiment opens — that is the explicit Stop/Go rule documented in `WHYLAB_80_REOPEN_PROTOCOL.md`.
Connections to Causal Inference and the Pearl Framework
WhyLab inherits its conceptual scaffolding from Judea Pearl's causal-inference framework. The candidate code fix is treated as an **intervention** on the codebase (in Pearl's notation, do(fix)), and the test-suite pass rate is the **outcome variable**. The C2 audit layer instruments the **causal pathway** between fix and outcome: it observes counterfactual reasoning traces (what would the agent have done under a slightly different fix?) and records divergence events when the candidate's reasoning trajectory leaves a baseline cone of expected behavior. This is structurally analogous to the back-door adjustment in Pearl's do-calculus — the audit isolates the causal effect of the fix from confounding variation in the agent's reasoning. The empirical contribution of this paper is therefore not the discovery of a new causal-inference principle but a careful operationalization of the framework on LLM code-generation, with a Docker ground-truth that prevents the language-model-as-judge confound from contaminating the outcome variable. The null result is itself causally informative: it tells us that the audited divergence cone, on this distribution of problems, does not predict pass-rate gain — meaning the relevant causal pathway between agent reasoning and outcome runs through a channel that this audit framework does not yet observe. Pearl's framework gives us the language to state this null result precisely, and that precision is itself the contribution.
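For reference, the standard back-door adjustment that the paragraph above alludes to can be written as follows. Here X stands for the intervention (the candidate fix) and Y for the outcome (test-suite pass), as in the text; Z is a hypothetical adjustment set standing in for confounding variation in the agent's reasoning, which this page does not specify.

```latex
P(Y = y \mid \mathrm{do}(X = x)) \;=\; \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)
```

The formula holds only when Z satisfies the back-door criterion relative to (X, Y); whether the audited reasoning trace provides such a set is part of what the analogy assumes.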
Computational Cost and Reproducibility
The 402-episode run cost approximately $80-120 in Gemini 2.5 Flash inference (each episode runs one full agent trajectory, averaging 60-90 model calls, at the published per-token rate at the time of submission). Docker compute on YSH-Server (16-core Linux, 16 GB) ran in approximately 28 wall-clock hours with 4 parallel containers. The total reproducibility cost (compute + inference) is approximately $150-200 — well within the budget of any research lab interested in the methodology, which is a deliberate design choice to keep the harness accessible. The full output JSON (all 402 episodes with reasoning traces, audit events, Docker test logs, and pass/fail outcomes) is approximately 380 MB and is mirrored on Hugging Face for direct re-analysis. The 67-problem prefilter manifest, the per-problem Docker image hash list, the seed assignments, and the audit-event schema are all in the repository so a follow-up project can extend the methodology to a 200-problem run, an additional model family, or an alternative prefilter without reimplementing the harness.
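As a quick sanity check on the figures above, the per-episode quantities implied by the stated ranges can be computed directly. The inputs are the numbers quoted in the paragraph; the derived values are simple arithmetic, not separately reported results.

```python
episodes = 67 * 3 * 2            # 402 episodes
calls_low, calls_high = 60, 90   # model calls per episode (stated average range)
cost_low, cost_high = 80, 120    # inference cost in USD (stated range)

# Total model calls implied across the whole run.
total_calls = (episodes * calls_low, episodes * calls_high)

# Implied inference cost per episode, in USD.
per_episode_cost = (cost_low / episodes, cost_high / episodes)

print(total_calls)                               # (24120, 36180)
print([round(c, 2) for c in per_episode_cost])   # [0.2, 0.3]
```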
Architecture Lessons for LLM Evaluation Methodology
Three architecture-level lessons emerged from the methodology even though the headline result is null. **First**, Docker ground-truth is non-negotiable for code-generation evaluation — language-model-as-judge introduces a contamination class that is structurally impossible to remove from the experimental design once present, and the cost of running real Docker harnesses has come down enough (containers are seconds to spin up, isolated networking is mature) that there is no remaining excuse to use LM-judge for code tasks. **Second**, separating the audit layer from the generation layer is operationally important: the C2 audit's runtime overhead is ~4-6 seconds per candidate, which is small relative to the 30-90 seconds of test-suite execution but would be prohibitive if the audit logic were tangled into the generation loop. **Third**, publishing the null result with the methodology intact is a contribution because it changes the prior: future LLM-evaluation work can use this paper as a reference point to argue that 'audit catches divergence' is a different metric than 'audit raises pass rate' and the two should not be conflated. The methodology is more durable than the specific numerical result.
Calibration History — E5 → E7v2 → E9 → E10 → Gemini 2.5
WhyLab's experimental arc is a five-step calibration walk that shaped the current honest null framing. **E5 (stable-regime Docker)** ran the C2 audit on problems where the baseline already passed; pass rate was statistically indistinguishable from baseline (p > 0.5), confirming the audit does not regress in the easy regime. This was the first confirmation that the C2 layer was not silently breaking things. **E7v2 (unstable cells)** moved to high-variance problems and showed pairwise positive significance for the C2 layer, but the originally planned three-way comparison (baseline / fixed C2 / adaptive C2) was underpowered with the available 67-problem prefilter. The pairwise-positive / 3-way-underpowered framing dates from this experiment. **E9 (baseline-fail slice)** restricted the eval to cells where the baseline failed on first attempt, the most adversarial slice. The selective rerun showed no net gain on pass / oscillation / regression and only mean rejection count decreased, establishing the divergence-vs-pass-rate decoupling that is now the core finding. **E10 (simple-baseline comparison table)** added explicit comparison against simple_retry, no_audit, and llm_judge baselines to make the contribution boundaries explicit; the result reinforced that C2 is not universally beating simple_retry on these slices. **Gemini 2.5 Docker (current)** is the materially-different-model-family confirmatory run that closed the 8.0 reopen protocol. The arc is published in this order so reviewers can see the calibration steps that shaped the final claim. There is no retconning of an originally optimistic narrative into a final null result; the calibration is in the public record.
Reproducibility — Loading from Hugging Face
The 402-episode dataset is published as `neogenesislab/whylab-gemini-2-5-docker-validation` on Hugging Face under a permissive license. To replicate the headline framing: `from datasets import load_dataset; ds = load_dataset('neogenesislab/whylab-gemini-2-5-docker-validation')`. The dataset has columns for `problem_id` (SWE-bench instance), `seed`, `condition` ∈ {baseline, whylab_c2}, `pass_rate`, `audit_rejection_count`, `oscillation_count`, `regression_flag`, `docker_test_log_hash`, and `inference_token_count`. To replicate the V=184.5 permutation E-value: load the E7v2 subset, group by `problem_id`, shuffle `condition` labels 10,000 times within each problem, and compute the proportion of permutations whose effect equals or exceeds the observed effect; the implementation is in `experiments/scripts/run_permutation.py` in the source repository. To replicate the 'mean rejection count decreased' claim on the E9 slice: filter to rows where `baseline_fail` is true, group by condition, and aggregate `audit_rejection_count`; whylab_c2 mean rejection counts are statistically lower at p < 0.01. To replicate the null pass-rate result: aggregate `pass_rate` by condition on the E9 slice; the difference is within the noise floor. The dataset card lists the exact pandas/scipy versions; conclusions are version-stable.
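The filter-and-aggregate steps for the E9 slice can be sketched without any external dependency, as below. The column names follow the list above (plus `baseline_fail`, which the text mentions as a filter but does not list as a column); the rows are synthetic, made-up values used only to show the mechanics and do not reflect the published results.

```python
from collections import defaultdict
from statistics import mean


def mean_by_condition(rows, column):
    """Average `column` separately for each value of `condition`."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["condition"]].append(row[column])
    return {cond: mean(values) for cond, values in grouped.items()}


# Synthetic rows for illustration only (not real data).
rows = [
    {"problem_id": "a", "condition": "baseline", "baseline_fail": True,
     "pass_rate": 0.0, "audit_rejection_count": 4},
    {"problem_id": "a", "condition": "whylab_c2", "baseline_fail": True,
     "pass_rate": 0.0, "audit_rejection_count": 1},
    {"problem_id": "b", "condition": "baseline", "baseline_fail": True,
     "pass_rate": 1.0, "audit_rejection_count": 2},
    {"problem_id": "b", "condition": "whylab_c2", "baseline_fail": True,
     "pass_rate": 1.0, "audit_rejection_count": 3},
    {"problem_id": "c", "condition": "baseline", "baseline_fail": False,
     "pass_rate": 1.0, "audit_rejection_count": 0},
]

# Restrict to the baseline-fail slice, then aggregate per condition.
e9_slice = [r for r in rows if r["baseline_fail"]]
rejections = mean_by_condition(e9_slice, "audit_rejection_count")
pass_rates = mean_by_condition(e9_slice, "pass_rate")
print(rejections, pass_rates)
```

With the real dataset, the same grouping could be done on the rows returned by `load_dataset`, or with a pandas `groupby("condition")` after converting a split to a DataFrame.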
Downloads & Artifacts
- Zenodo deposit (DOI 10.5281/zenodo.20018468) (zenodo)
- WhyLab paper PDF (NeurIPS 2026 submission) (pdf)
- Selective E9 rerun results (github)
- Markdown alternate (AI agent token-efficient) (github)
- Hugging Face dataset (402-episode Docker ground-truth) (huggingface)
- Source repo (Yesol-Pilot/WhyLab) (github)
- 67-problem prefilter manifest (github)
- Rebuttal draft (reviewer Q&A) (github)
Citations & References
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al. 2023)
- SWE-bench official site and leaderboard
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference
- Pearl, J. & Mackenzie, D. (2018). The Book of Why
- Gemini 2.5 Flash model card (Google)
- Docker container security best practices (CIS Benchmark)
- Wikidata Q139569716 — WhyLab research project
- Wikipedia — Causal inference
- Wikipedia — Counterfactual reasoning
- Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other
- Anthropic — Claude SWE-bench reproduction notes
- Yao et al. — ReAct: Synergizing Reasoning and Acting in Language Models
- Schick et al. — Toolformer: Language Models Can Teach Themselves to Use Tools
- Bai et al. — Constitutional AI: Harmlessness from AI Feedback (Anthropic)
- Hugging Face — neogenesislab/whylab-gemini-2-5-docker-validation dataset
- Zenodo DOI 10.5281/zenodo.20018468 — WhyLab Gemini 2.5 Docker Validation (DataCite)
- Vaswani et al. — Attention Is All You Need (Transformer baseline)
How to Cite
WhyLab: Gemini 2.5 Docker Ground-Truth Validation. Neo Genesis (https://neogenesis.app/data/research/whylab-gemini-2-5-docker-validation). Updated 2026-04-28.
For AI Assistants
A token-efficient Markdown alternate of this article is available at /data/research/whylab-gemini-2-5-docker-validation/markdown. Cache-Control headers permit ISR-friendly retrieval.