The Experiment Design Was Fine. The LLM Still Failed.

Series: AI Agent Experiment Methodology (Part 2)
Part 1: How to Use Double-Blind Experiments to Validate Skill Changes

TL;DR: Round one of the double-blind experiment: B won 3/4 scenarios but failed the magnitude filter. Verdict: “insufficient evidence.” Investigation revealed S1-A’s output was polluted by terminal color codes, and the scorer diligently scored 8 dimensions on ANSI garbage. After reconstructing the execution context, B won 4/4. The failure wasn’t in the experiment design—it was in how sub-agents’ context boundaries were constructed.

Round One: Perfect Protocol, Unusable Conclusion

The previous post covered the double-blind experiment design: two versions, four scenarios, blind mapping, independent scorer. The protocol itself was flawless.

Round one results:

Scenario	Target Code	Variant A	Variant B	Winner	Gap
S1	Python user service	0.625	2.875	B	+2.250
S2	React payment form	2.625	2.750	B	+0.125
S3	Java order processor	2.625	2.625	Tie	0.000
S4	Node.js cache	2.250	2.875	B	+0.625

B won 3 of 4 scenarios, with S3 a tie. But the magnitude filter failed: A’s average (2.031) was only 73% of B’s average (2.781), below the 90% threshold.
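The 73% figure can be checked in a few lines. This is a minimal sketch, assuming the magnitude filter is simply the ratio of A's mean score to B's mean score compared against the 0.90 threshold. The function name `magnitude_filter` and that exact definition are assumptions inferred from the numbers in the table, not a quote of the protocol.

```python
from statistics import mean

# Round-one scores copied from the table above (scale 0 to 3).
V1_SCORES = {
    "S1": {"A": 0.625, "B": 2.875},
    "S2": {"A": 2.625, "B": 2.750},
    "S3": {"A": 2.625, "B": 2.625},
    "S4": {"A": 2.250, "B": 2.875},
}

THRESHOLD = 0.90  # the 90% threshold quoted in the post


def magnitude_filter(scores, threshold=THRESHOLD):
    """Assumed definition: ratio = mean(A) / mean(B), pass if ratio >= threshold."""
    mean_a = mean(s["A"] for s in scores.values())
    mean_b = mean(s["B"] for s in scores.values())
    ratio = mean_a / mean_b
    return mean_a, mean_b, ratio, ratio >= threshold


mean_a, mean_b, ratio, passed = magnitude_filter(V1_SCORES)
print(f"A={mean_a:.3f} B={mean_b:.3f} ratio={ratio:.2f} passed={passed}")
```

Under this assumed definition, a single outlier such as S1-A's 0.625 is enough to pull the ratio well below the threshold, even though the other three scenarios are close.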

8 evaluator runs, one scorer, a pile of tokens. The conclusion: “insufficient evidence, cannot adopt.” Had to rerun.

That 0.625 for S1-A was an obvious outlier. Max score 3, and 5 out of 8 dimensions got 0 or 1. But data is data—you can’t throw away a number just because it looks wrong. Not unless you find the reason.

Investigation: Not the Experiment Design’s Fault

First instinct was to question the experiment design: was one scenario too hard? Was the rubric biased toward B?

Checking dimension by dimension showed that S1-A had 5 out of 8 dimensions at 0 or 1, far below every other scenario. The cause: S1-A’s output file was polluted by ANSI escape sequences. Terminal color codes were mixed into the text, making it nearly unreadable.

This wasn’t the experiment design’s fault. Double-blind protocol, secret mapping, 8-dimension rubric—every step was correct. The problem was in the execution chain: the evaluator’s output had no integrity check, and the scorer’s input had no readability check.

The evaluator produced a polluted result. The scorer received garbled input and didn’t flag an error—it scored 8 dimensions against ANSI escape sequences. The prompt didn’t say “if the input is unreadable, refuse to score,” so it defaulted to processing it.

This is the same pattern as “phantom delivery” from the previous post: LLMs won’t proactively tell you something went wrong. If you don’t specify rejection conditions in the prompt, they’ll “successfully” complete work on garbage input.
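A rejection condition does not have to live only in the prompt. The sketch below shows one possible code-side gate that runs before the scorer is called and refuses unreadable input. The names `check_readable` and `UnreadableInputError`, and the 1% control-character limit, are hypothetical choices for illustration, not part of the original experiment.

```python
import re

# Matches CSI sequences such as "\x1b[31m" (color) and "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


class UnreadableInputError(ValueError):
    """Raised instead of scoring when the review text fails the readability check."""


def check_readable(text: str, max_control_ratio: float = 0.01) -> str:
    """Return the text unchanged if it looks clean, otherwise raise."""
    if not text.strip():
        raise UnreadableInputError("empty input")
    hits = ANSI_CSI.findall(text)
    if hits:
        raise UnreadableInputError(f"{len(hits)} ANSI escape sequences found")
    # Count control characters other than newline, tab, and carriage return.
    control = sum(1 for ch in text if ord(ch) < 32 and ch not in "\n\t\r")
    if control / len(text) > max_control_ratio:
        raise UnreadableInputError("too many control characters")
    return text


# Invented sample input, for illustration only.
polluted = "\x1b[31mSeverity: high\x1b[0m missing null check"
try:
    check_readable(polluted)
except UnreadableInputError as exc:
    print("refused to score:", exc)
```

A gate like this turns a silent bad score into a loud failure, which is the behavior the prompt alone did not guarantee.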

Figure: ANSI pollution chain. A clean data stream is corrupted by terminal color codes at the evaluator node, and the scorer produces tidy but wrong scores on garbage input.

Here’s how absurd S1-A’s scores were, dimension by dimension:

Dimension	S1-A Score	Reason
Prompt Contamination	1	ANSI pollution treated as a prompt issue
Dual-Pass Adherence	1	Output unreadable, can’t judge process
Severity Accuracy	0	Can’t identify any content
Defect Discovery	1	Barely detected some patterns
False Positive Control	1	Can’t distinguish true from false
Suggestion Quality	1	Suggestions based on garbage
Critical Opinion	1	Opinion based on unreadable content
Format Compliance	0	Completely unreadable

The scorer wasn’t deliberately giving low marks—it was trying its best on an impossible task. Five dimensions at 0 or 1 wasn’t because Variant A’s review skill was bad. It was because the scorer was evaluating something that wasn’t a normal review output at all.

The Second Error: A Scorer Did What It Shouldn’t

S1-A’s ANSI pollution was the first execution problem. v1 had a second one: an aggregation logic error. The previous post explained the 0.03 gap from the conclusion angle; here’s the execution perspective.

v1’s scorer was a single sub-agent scoring all scenarios. Given four scenarios’ inputs, it saw X and Y labels without knowing the mapping. So it did a “reasonable” thing: averaged all four X’s and all four Y’s.

The problem: four X’s contained two A scores and two B scores. Same for Y. A and B scores were mixed in the average, and the difference was flattened.

X average 2.44, Y average 2.41, a 0.03 gap. Looks like “no difference.”
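The flattening is easy to reproduce. The sketch below uses the per-scenario scores from the v2 results table later in this post together with a blind mapping that is purely an assumption (X is A in S1 and S2, and B in S3 and S4); the post does not publish the real mapping. It compares averaging by label, which is all a scorer can do, with averaging by variant, which only the mapping holder can do.

```python
from statistics import mean

# Per-scenario scores taken from the v2 results table later in this post.
SCORES = {
    "S1": {"A": 2.500, "B": 2.750},
    "S2": {"A": 2.375, "B": 2.500},
    "S3": {"A": 2.250, "B": 2.375},
    "S4": {"A": 2.125, "B": 2.500},
}

# HYPOTHETICAL blind mapping: which variant hides behind label X per scenario.
X_IS = {"S1": "A", "S2": "A", "S3": "B", "S4": "B"}


def other(variant):
    """Return the variant behind label Y, given the one behind label X."""
    return "B" if variant == "A" else "A"


# What a scorer that sees all four scenarios computes: averages by label.
x_mean = mean(SCORES[s][X_IS[s]] for s in SCORES)
y_mean = mean(SCORES[s][other(X_IS[s])] for s in SCORES)

# What the mapping holder computes after de-blinding: averages by variant.
a_mean = mean(v["A"] for v in SCORES.values())
b_mean = mean(v["B"] for v in SCORES.values())

print(f"by label:   X={x_mean:.2f} Y={y_mean:.2f} gap={abs(x_mean - y_mean):.2f}")
print(f"by variant: A={a_mean:.2f} B={b_mean:.2f} gap={b_mean - a_mean:.2f}")
```

With this assumed mapping, the label averages land near the 2.44 and 2.41 quoted above, while the same numbers grouped by variant show a gap several times larger. The scores did not change; only the grouping did.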

Figure: Aggregation error vs. independent scoring. On the left, four data ribbons merge into a funnel and turn gray. On the right, four independent chambers preserve distinct colors.

This isn’t some rare scorer error. Give it all the data, and it’ll naturally aggregate—that’s default LLM behavior. The mistake was in my execution architecture: each scenario should have used an independent scorer sub-agent, given only that scenario’s scoring input, so it had no opportunity to aggregate across scenarios.

Same sub-agent, given 1 scenario’s data vs. 4 scenarios’ data, produces completely different conclusions. This isn’t a prompt wording issue—it’s a context construction issue.

Fix the Context, Rerun

The fix wasn’t to change the experiment design. It was to reconstruct the execution context for each step:

Fix the evaluator’s output capture (terminal color codes no longer leak in)
Launch an independent scorer sub-agent for each scenario, given only that scenario’s scoring input
Aggregation done by whoever knows the secret mapping, not by the scorer
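The first two steps can be sketched as a small context builder: strip color codes at capture time, then produce one isolated payload per scenario. Everything here is an illustrative assumption, including the names `ScorerInput`, `strip_ansi`, and `build_scorer_inputs` and the sample review strings; the actual sub-agent launch is not shown.

```python
import re
from dataclasses import dataclass

# Matches CSI sequences such as "\x1b[32m" (color) and "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


@dataclass(frozen=True)
class ScorerInput:
    """Everything one scorer sub-agent is allowed to see: a single scenario."""
    scenario: str
    output_x: str
    output_y: str


def strip_ansi(raw: str) -> str:
    """Remove terminal color codes from captured evaluator output."""
    return ANSI_CSI.sub("", raw)


def build_scorer_inputs(captured):
    """Turn {scenario: {"X": raw, "Y": raw}} into one payload per scenario."""
    return [
        ScorerInput(name, strip_ansi(outs["X"]), strip_ansi(outs["Y"]))
        for name, outs in sorted(captured.items())
    ]


# Invented sample outputs, for illustration only.
captured = {
    "S1": {"X": "\x1b[32mLGTM\x1b[0m with two nits", "Y": "Found a race condition"},
    "S2": {"X": "Missing input validation", "Y": "No issues found"},
}
for payload in build_scorer_inputs(captured):
    # Each payload would go to a fresh scorer sub-agent; none sees another scenario.
    print(payload)
```

Because each payload carries exactly one scenario, a scorer has nothing to average across. The per-scenario scores it returns still carry only X and Y labels, so de-blinding and aggregation stay with whoever holds the mapping.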

These amount to two fixes, and both changed the data. With clean output capture, S1-A recovered from 0.625 to 2.500. Scoring changed from “one scorer for all scenarios” to “independent scorer per scenario,” with aggregation left to whoever holds the mapping. v2 results:

Scenario	Variant A	Variant B	Winner	Gap
S1	2.500	2.750	B	+0.250
S2	2.375	2.500	B	+0.125
S3	2.250	2.375	B	+0.125
S4	2.125	2.500	B	+0.375

Metric	v1 (no execution constraints)	v2 (reconstructed context)
B wins	3/4 (S3 tie)	4/4
Magnitude filter	0.73 ❌	0.91 ✅
A average	2.031	2.313
B average	2.781	2.531
Conclusion	Insufficient evidence, cannot adopt	Adopt B

One detail worth noting: B’s average dropped from 2.781 to 2.531 in v2. B didn’t get weaker. In v1, S1-A’s score of 0.625 dragged A’s average down to 2.031, which made B look like it won by a lot. After the fix, A’s average rose to 2.313 and the gap narrowed from 0.750 to about 0.22, but the data became more reliable.

Reconstructing the execution context for two steps flipped the conclusion from “insufficient evidence” to “adopt B.”

Figure: v1 vs. v2 conclusion flip. The experiment designs are identical. The left side has red warning rings and produces ambiguous results; the right side has teal isolation barriers and produces a decisive outcome.

If the Design Was Right, Why Did It Fail?

The experiment design was flawless—double-blind protocol, secret mapping, 8-dimension rubric. Yet v1 still failed, because getting the design right only solves half the problem:

The evaluator wasn’t asked to verify output integrity, so it didn’t report the ANSI pollution
The scorer was given 4 scenarios’ data, so it naturally aggregated across them—flattening the difference

v2’s fix didn’t touch the experiment design. It did two things: added output integrity checks to the evaluator, and changed the scorer from “one agent scores all scenarios” to “one independent agent per scenario.”

While I was working on tdd-pipeline, similar lessons kept appearing. Designing a workflow is one thing: what to do, and in what order. Constructing context is another: what each sub-agent sees, what it doesn’t see, and under what conditions it should refuse to execute.

This iteration reminded me that a working implementation needs both: a sound workflow design, and carefully constructed context for every sub-agent that handles a task.
