Traditional code-evaluation rubrics score against expected output. WhyLab grounds validation in Docker execution against SWE-bench. The 67-problem prefilter showed selective adaptive C2 does NOT exceed fixed C2 on the baseline-fail slice. That null finding is a published result, not a buried negative. We compare the two methods and explain why null results count as evidence.
Two ways to evaluate AI-generated code
The dominant pattern for evaluating AI-generated code is rubric scoring: the AI produces output, an evaluator (human or LLM-as-judge) compares it to a reference solution along scoring dimensions (correctness, style, edge-case handling), and a numeric score is assigned. Rubric scoring is fast, cheap, and scalable, but it has a well-known failure mode: the evaluator can mark as correct code that does not actually run.
Docker validation is the alternative: the AI's output is executed in an isolated container against the actual test suite, and the binary outcome (pass/fail) is recorded. This is the methodology used by Princeton's SWE-bench, and it is the one WhyLab adopted for HF dataset 3: Gemini 2.5 Flash validated on 67 SWE-bench problems × 3 seeds × 2 conditions = 402 episodes.
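To make the contrast concrete, here is a minimal sketch of the Docker-validation loop. The container image, mount path, test command, and the `validate_in_docker` / `EpisodeResult` names are illustrative assumptions, not the exact WhyLab or SWE-bench harness code.

```python
# Sketch: run a candidate patch's test suite in a throwaway container and
# record only the binary outcome. Image name and test command are placeholders.
import subprocess
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    problem_id: str
    seed: int
    condition: str   # e.g. "fixed_c2" or "selective_adaptive_c2"
    passed: bool     # binary outcome, no partial credit

def validate_in_docker(problem_id: str, workdir: str, seed: int, condition: str,
                       image: str = "swe-eval-image:latest") -> EpisodeResult:
    """Execute the problem's test suite inside an isolated container."""
    proc = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{workdir}:/workspace",   # mount the patched repository
         "-w", "/workspace",
         image,
         "python", "-m", "pytest", "-x", "tests"],
        capture_output=True, text=True, timeout=600,
    )
    # The test runner's exit code *is* the evaluation: 0 means pass, anything else fails.
    return EpisodeResult(problem_id, seed, condition, passed=(proc.returncode == 0))
```

There is no evaluator in the loop: the test suite's exit code is the only judgment made.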
The published null result
WhyLab tested whether selective adaptive C2 (an adaptive cooperation-constraint policy applied only to the baseline-fail slice) exceeds fixed C2 (a constant cooperation-constraint policy applied uniformly). Hypothesis: selective application should target effort where it matters and produce gains. Result: on the 67-problem prefilter at T=0.7, max_attempts=3, the selective adaptive C2 condition does NOT exceed fixed C2 on pass-rate, oscillation, or regression metrics. Mean rejection count decreased — but pass-rate did not improve.
This is a null finding. Under traditional rubric scoring, this finding would have been borderline — adaptive C2 *looked like* the right idea, and a rubric evaluator might score it favorably. Under Docker validation, the binary pass/fail outcome is unambiguous. The dataset, raw episodes, and per-condition statistics are published under CC-BY-4.0 with Zenodo DOI 10.5281/zenodo.20018468.
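The comparison itself is simple enough to express in a few lines. This is a minimal sketch assuming hypothetical field names ("condition", "passed") and toy rows; it is not the published dataset schema or the actual WhyLab analysis code.

```python
# Sketch: aggregate episode records into per-condition pass rates.
from collections import defaultdict

def pass_rate_by_condition(episodes):
    """episodes: iterable of dicts with 'condition' and 'passed' keys (assumed names)."""
    totals, passes = defaultdict(int), defaultdict(int)
    for ep in episodes:
        totals[ep["condition"]] += 1
        passes[ep["condition"]] += int(ep["passed"])
    return {cond: passes[cond] / totals[cond] for cond in totals}

# The claim under test: selective adaptive C2 must *exceed* fixed C2 to count
# as a gain. On the baseline-fail slice, it did not.
rates = pass_rate_by_condition([
    {"condition": "fixed_c2", "passed": True},
    {"condition": "fixed_c2", "passed": False},
    {"condition": "selective_adaptive_c2", "passed": True},
    {"condition": "selective_adaptive_c2", "passed": False},
])
improved = rates["selective_adaptive_c2"] > rates["fixed_c2"]
```

Because each episode's outcome is a Docker pass/fail, the inequality either holds or it does not; there is no evaluator subjectivity to argue over.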
Why null results matter
- Publication bias — academic publishing systematically suppresses null results, creating a literature that overstates effect sizes. The reproducibility crisis (Open Science Collaboration, 2015) traced much of the problem to this dynamic.
- Honest scoping — a null finding clarifies the boundary of what an intervention does. Selective adaptive C2 does not exceed fixed C2 *on this slice with these parameters*. That is information.
- Reproducibility — null findings save other teams from rediscovering them. The 402 episodes, raw output, and per-condition statistics are public so future work can verify or refute.
- Career incentives — publishing null results requires a venue that accepts them. Neo Genesis self-publishes under CC-BY-4.0 because no upstream venue currently has the right incentive structure.
Side-by-side comparison
- Methodology: Rubric = compare to reference + score; Docker = execute + pass/fail
- Evaluator: Rubric = human or LLM-as-judge; Docker = test suite (deterministic)
- Cost: Rubric = low; Docker = moderate (container spin-up + test execution)
- Scalability: Rubric = high (thousands of problems/day); Docker = moderate (depends on test-suite complexity)
- False positive rate: Rubric = moderate-high (correct-looking code that does not run); Docker = near zero, bounded only by test-suite coverage
- Null result handling: Rubric = ambiguous (partial credit possible); Docker = unambiguous
- Public evidence: WhyLab = 402 episodes CC-BY-4.0; typical rubric studies = aggregate scores only
What rubric scoring does better
Rubric scoring shines when execution is impractical: subjective code review (style, naming, architectural choices), partial credit for substantially correct solutions, or evaluation of code that requires complex infrastructure (distributed systems, GPU-dependent code). For SWE-bench-style problems with a well-defined test suite, Docker validation dominates. For 'is this code well-structured?', rubric scoring is the right tool.
What Docker validation does better
Docker validation is honest. The code either passes the test suite or it does not. There is no partial credit, no rubric drift, no LLM-as-judge bias. When you report a positive result under Docker validation, the bar is clear. When you report a null result, the null is real — not an artifact of evaluator subjectivity. This is the property that lets WhyLab publish a null finding with confidence.
How to choose
- Have a deterministic test suite? Use Docker validation
- Evaluating subjective code quality (style, architecture)? Use rubric scoring
- Hybrid: use Docker for correctness + rubric for code quality (see the sketch after this list)
- Always: publish null results. They are evidence. WhyLab's HF dataset 3 is the template.
- Always: ship raw episodes, not just aggregate scores
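The hybrid option above can be as small as a single gate. This is a sketch under assumed interfaces: `run_docker_tests` and `rubric_score` are hypothetical callables standing in for whatever harness and judge you use.

```python
def evaluate(candidate, run_docker_tests, rubric_score):
    """Hybrid gate: Docker decides correctness, the rubric only ranks passing code."""
    passed = run_docker_tests(candidate)               # deterministic, binary
    if not passed:
        return {"passed": False, "quality": None}      # style never rescues broken code
    return {"passed": True, "quality": rubric_score(candidate)}  # subjective, secondary
```

The ordering is the point: correctness gates first, so rubric drift can only affect the ranking of code that already runs.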
Frequently asked
Why publish a null result?
Null findings constrain the space of true effects. If selective adaptive C2 does NOT exceed fixed C2 on the baseline-fail slice, every team building cooperation-constraint policies has new information about where the gains do not come from. Suppressing null results inflates the apparent effect size in the literature; publishing them produces an honest evidence base.
Can rubric scoring detect null results?
Rubric scoring can detect null results when the rubric is calibrated and the evaluator is consistent. The problem is rubric drift over time and rater disagreement, which can mask small null findings. Docker validation removes the evaluator from the loop and makes nulls unambiguous. For correctness questions, Docker dominates; for subjective quality questions, rubric is the right tool.
Is the WhyLab finding a refutation of adaptive C2?
No. The finding is that selective adaptive C2 does not exceed fixed C2 on this specific slice (67 SWE-bench problems, baseline-fail filter, T=0.7, max_attempts=3). The finding does not refute adaptive C2 in general; it clarifies the boundary of the effect under one well-defined configuration, which is the reproducibility-friendly way to frame it.
What's the cost of Docker validation per problem?
On a Docker-in-Docker setup with the SWE-bench harness, validation costs roughly 30 seconds to 2 minutes per problem, depending on test-suite complexity. The 402 episodes in HF dataset 3 took approximately 12 hours of compute on a Gemini 2.5 Flash + Docker pipeline. The cost is dominated by test-suite execution, not API calls.
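A quick back-of-envelope check shows the quoted totals are consistent with the per-problem range:

```python
# Sanity check on the quoted figures: 402 episodes over ~12 hours of compute.
episodes = 402
total_minutes = 12 * 60
per_episode = total_minutes / episodes   # ~1.8 minutes, inside the 30 s - 2 min band
print(f"{per_episode:.1f} min per episode")
```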
Could I use the WhyLab dataset for my own research?
Yes. It is CC-BY-4.0 with Zenodo DOI 10.5281/zenodo.20018468. The 402 episodes ship raw output, per-condition results, and the evaluation harness. Cite the dataset as: Anonymous authors (2026, under peer review). WhyLab Docker Validation Evidence. Neo Genesis Research. Zenodo DOI 10.5281/zenodo.20018468. Author identity is intentionally withheld pending venue review; the BibTeX template at /llms-full.txt reflects the same blind-review framing.
Why Gemini 2.5 Flash and not GPT-4 or Claude?
Model choice was driven by cost (Flash tier is the cheapest frontier model for 402 episodes), Docker-execution latency (Flash returns code fast), and the specific evaluation question (relative comparison of selective vs fixed adaptive C2, where the model is held constant). Cross-model replication is an open follow-up — the dataset structure supports adding new model columns without breaking the schema.
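The "new model columns" point amounts to keying results by problem, condition, and seed, so a replication run joins cleanly. A minimal sketch, assuming hypothetical column names and a placeholder problem ID rather than the dataset's published schema:

```python
import pandas as pd

# Existing rows keyed by problem, condition, and seed; pass/fail stored per model.
episodes = pd.DataFrame({
    "problem_id": ["example__repo-1234"],
    "condition": ["fixed_c2"],
    "seed": [0],
    "passed_gemini_2_5_flash": [True],
})

# A replication run with another model joins on the same keys and adds a column,
# leaving the original results untouched.
replication = pd.DataFrame({
    "problem_id": ["example__repo-1234"],
    "condition": ["fixed_c2"],
    "seed": [0],
    "passed_other_model": [False],
})
episodes = episodes.merge(replication, on=["problem_id", "condition", "seed"])
```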
References
- Princeton SWE-bench leaderboard
- Open Science Collaboration reproducibility study
- Publication bias systematic review
- Anthropic SWE-bench Verified
- Docker container isolation model
- WhyLab HF dataset (CC-BY-4.0)
- Reproducibility crisis 2018 update
Related
- EthicaAI Mixed-Safe vs Anthropic Constitutional AI: Public Evidence vs Internal Telemetry — Both approaches address multi-agent safety. Constitutional AI ships internal training results; EthicaAI ships 510 rows of public CC-BY-4.0 evidence with Welch t-test and bootstrap CI. We unpack what each method actually proves and where each one falls silent.
- Open-Source Research at Neo Genesis: NeurIPS, Datasets, Zenodo DOIs — Why every research output ships under CC-BY-4.0 to Hugging Face + Zenodo, and the rule that distinguishes open research from closed product code at Neo Genesis.
- V-Score Quality Gating: Rejecting AI Content That Falls Below 184.5 — How Neo Genesis blocks 30%+ of AI-generated drafts before they ship: V-Score formula, six-factor breakdown, and the 184.5 hard threshold that protects every published post.
- Inside HIVE MIND — Our Autonomous Content Engine — Multi-agent architecture: how research, writing, SEO optimization, and quality gating combine.
Markdown alternate available at /blog/whylab-docker-validation-vs-rubric-scoring-2026/markdown for AI agents.