---
title: WhyLab: Gemini 2.5 Docker Ground-Truth Validation
url: https://neogenesis.app/data/research/whylab-gemini-2-5-docker-validation
category: Causal Inference
publishedAt: 2026-04-08
updatedAt: 2026-04-27
author: Yesol Heo
publisher: Neo Genesis
canonical: https://neogenesis.app/data/research/whylab-gemini-2-5-docker-validation
---

# WhyLab: Gemini 2.5 Docker Ground-Truth Validation

> Causal C2 audit framework validation on SWE-bench-style problems using Gemini 2.5 Flash with Docker ground-truth verification — 67 prefiltered problems, 402 episodes, baseline vs whylab_c2 head-to-head.

**Category**: Causal Inference
**Published**: 2026-04-08
**Last updated**: 2026-04-27
**Author**: Yesol Heo
**Publisher**: Neo Genesis
**Canonical URL**: https://neogenesis.app/data/research/whylab-gemini-2-5-docker-validation

## Headline Statistics

- 67 problems × 3 seeds × 2 conditions = 402 episodes on YSH-Server
- Audit rejection signal verified — whylab_c2 records real ground-truth divergences vs simple_retry baseline
- E7v2 pairwise positive significance preserved; 3-way comparison underpowered (honest framing in main.tex)
- Adaptive C2 demoted to scoped calibration after E9 selective follow-up showed no net gain over fixed C2 on the targeted SWE-bench slice
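The episode count above is just the full factorial over problems, seeds, and conditions. A minimal sketch of that grid (the seed values and condition labels are illustrative, not the actual run configuration):

```python
# Enumerate the episode grid: 67 problems x 3 seeds x 2 conditions.
# Seed values and condition names are illustrative placeholders.
from itertools import product

N_PROBLEMS = 67
SEEDS = [0, 1, 2]                       # 3 seeds per problem
CONDITIONS = ["baseline", "whylab_c2"]  # simple_retry vs. causal C2 audit

episodes = [
    {"problem": p, "seed": s, "condition": c}
    for p, s, c in product(range(N_PROBLEMS), SEEDS, CONDITIONS)
]

assert len(episodes) == 67 * 3 * 2 == 402
```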

## Why Gemini 2.5 Docker, Not GPT-4 Static

Earlier WhyLab reruns used static problem sets and GPT-4 reasoning, which proved susceptible to model-specific hallucination patterns. The Docker ground-truth setup compiles each candidate fix and runs the project's actual test suite, removing reasoning-only false positives. Gemini 2.5 Flash was chosen because it provides a materially different model family from prior runs, satisfying the 8.0 reopen protocol's requirement for a non-overlapping test bed.
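The verification step this implies can be sketched as follows. This is a hedged illustration, not the actual harness: the image name, mount path, and test command are hypothetical, and the real pipeline may stage patches and collect results differently. The only claim taken from the source is that a fix counts only if the project's real test suite passes inside the container.

```python
# Sketch of a Docker ground-truth check: run the project's test suite
# inside the task container and treat exit code 0 as a verified fix.
# Image name, mount path, and test command are hypothetical.
import subprocess

def docker_test_cmd(image: str, repo_dir: str, test_cmd: str) -> list[str]:
    """Build the `docker run` invocation that executes the test suite
    against a patched checkout mounted into the container."""
    return [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/workspace",  # mount the patched repo
        "-w", "/workspace",
        image,
        "sh", "-c", test_cmd,
    ]

def ground_truth_pass(image: str, repo_dir: str, test_cmd: str) -> bool:
    """A candidate fix is accepted only if the real tests exit 0,
    which removes reasoning-only false positives."""
    proc = subprocess.run(docker_test_cmd(image, repo_dir, test_cmd))
    return proc.returncode == 0
```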

## Result Calibration

On the 67-problem prefilter that originally separated unstable cells from stable ones, the run's stop/go rule was fixed in advance: a positive, defensible signal reopens the 8.0 narrative; a null or ambiguous result closes the 8.0 chase and returns to the stable-accept track. Audit rejection events are recorded for whylab_c2 across the seed sweep, confirming the code path is live rather than a no-op. The full results are presented as evidence of phase-aware deployment value rather than universal gain, and the manuscript's E7v2 / E5 / cross-environment sections are recalibrated accordingly.
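The pre-registered stop/go rule above can be written as a small decision function. The signal and outcome labels are illustrative names, not identifiers from the WhyLab codebase:

```python
# Sketch of the pre-registered stop/go rule. Labels are illustrative:
# only "positive defensible" reopens the 8.0 narrative; a null or
# ambiguous signal closes the 8.0 chase.
def stop_go(signal: str) -> str:
    """Map the run outcome to the 8.0 reopen decision."""
    if signal == "positive_defensible":
        return "reopen_8.0_narrative"
    # "null" or "ambiguous" both fall through to the same decision.
    return "close_8.0_return_to_stable_accept"
```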

## Honest Significance Framing

The main paper now states: adaptive C2 helps in E7v2 but does not beat fixed C2 on the targeted SWE-bench slice; pairwise comparison reaches positive significance, three-way comparison remains underpowered. The selective E9 follow-up on baseline-fail slices showed no net gain on pass / oscillation / regression — only mean rejection count decreased. WhyLab is therefore positioned as scoped calibration with a deployment checklist, not a universal causal-audit gain.
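As a rough intuition for why a pairwise comparison can reach significance while a three-way comparison stays underpowered, consider an exact two-sided sign test on per-problem wins. This is a simple stand-in for illustration only; it is not the test used in main.tex:

```python
# Exact two-sided binomial sign test on per-problem wins vs. losses
# (ties excluded). Illustrative only; not the paper's actual test.
from math import comb

def sign_test_p(wins: int, losses: int) -> float:
    """P-value under H0: win probability = 0.5."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided, capped at 1
```

A pairwise split like 8 wins to 2 losses yields p ≈ 0.109 on its own, which shows how quickly power erodes once the same episode budget is divided across three conditions instead of two.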

## Downloads & Artifacts

- [WhyLab paper PDF](https://github.com/Yesol-Pilot/WhyLab/blob/main/paper/main.pdf) — PDF
- [Selective rerun results](https://github.com/Yesol-Pilot/WhyLab/blob/main/experiments/results/why85_path.json) — JSON (GitHub)


## Citations & References

- [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770)
- [Pearl, J. (2009). Causality: Models, Reasoning, and Inference](https://www.cambridge.org/core/books/causality/B0046844FAE10CBF274D4ACBDAEB5F5B)
- [Gemini 2.5 model card (Google)](https://ai.google.dev/gemini-api/docs/models/gemini)

## How to Cite

`WhyLab: Gemini 2.5 Docker Ground-Truth Validation — Neo Genesis (https://neogenesis.app/data/research/whylab-gemini-2-5-docker-validation). Updated 2026-04-27.`

---

© 2026 Neo Genesis. AI Works. You Decide.
