A seemingly perfect experiment report under a magnifying glass revealing two design flaws: rubric bias toward the tested variable and insufficient scenario coverage

AI-Designed Experiments Need Human Review

Series: AI Agent Experiment Methodology (Part 3)

Previous: The Experiment Design Was Fine, So Why Did the LLM Still Fail?

TL;DR: In a double-blind experiment, Variant B won 4/4 scenarios with clean data. But design review revealed the rubric had 3/8 dimensions directly testing the target variable, exceeding the 1/3 ceiling and nearly becoming a self-fulfilling prophecy. In a separate validation, one scenario scored perfectly while another exposed a defect; if we had run only the first, the defect would have shipped. Both traps were caught by reviewing the design, not by running the experiment. ...

2026-06-01 · 7 min · Alex Wang
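The 1/3 ceiling mentioned in the teaser above lends itself to a mechanical check before any run. A minimal sketch, assuming a rubric can be represented as a list of dimensions, each flagged during design review as targeting the tested variable or not. `RubricDimension`, `targets_tested_variable`, `target_share`, `exceeds_ceiling`, and the dimension names are hypothetical and not taken from the post:

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class RubricDimension:
    name: str
    targets_tested_variable: bool  # hypothetical flag set during design review


def target_share(rubric):
    """Fraction of rubric dimensions that directly test the target variable."""
    if not rubric:
        raise ValueError("rubric must contain at least one dimension")
    hits = sum(1 for d in rubric if d.targets_tested_variable)
    return Fraction(hits, len(rubric))


def exceeds_ceiling(rubric, ceiling=Fraction(1, 3)):
    """True when the share of target-variable dimensions is above the ceiling."""
    return target_share(rubric) > ceiling


# Illustrative 8-dimension rubric with 3 dimensions aimed at the tested variable.
rubric = [RubricDimension(f"dim_{i}", i < 3) for i in range(8)]
print(target_share(rubric), exceeds_ceiling(rubric))
```

Using `Fraction` rather than floats keeps the boundary case exact: a rubric with exactly 1/3 of its dimensions on the target variable sits at the ceiling and does not exceed it.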
A carefully designed experiment pipeline corrupted by context leaks at two nodes, contrasted with the clean rebuilt version

The Experiment Design Was Fine. The LLM Still Failed.

Series: AI Agent Experiment Methodology (Part 2)

Part 1: How to Use Double-Blind Experiments to Validate Skill Changes

TL;DR: Round one of the double-blind experiment: B won 3/4 scenarios but failed the magnitude filter. Verdict: “insufficient evidence.” Investigation revealed S1-A’s output was polluted by terminal color codes, and the scorer diligently scored 8 dimensions on ANSI garbage. After reconstructing the execution context, B won 4/4. The failure wasn’t in the experiment design; it was in how sub-agents’ context boundaries were constructed. ...

2026-05-31 · 6 min · Alex Wang
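The teaser above attributes the round-one failure to terminal color codes reaching the scorer, and says the fix was reconstructing the execution context. As a complementary guard (not the fix described in the post), captured output can be checked for ANSI escape sequences before it is scored. A minimal sketch; `strip_ansi`, `has_ansi`, and the sample string are hypothetical:

```python
import re

# CSI escape sequences such as "\x1b[31m" (red) or "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text):
    """Remove CSI escape sequences from captured terminal output."""
    return ANSI_CSI.sub("", text)


def has_ansi(text):
    """True when the text still contains at least one CSI escape sequence."""
    return ANSI_CSI.search(text) is not None


# Illustrative captured output, not real data from the experiment.
raw = "\x1b[1;32mPASS\x1b[0m review found 2 issues"
clean = strip_ansi(raw)
print(clean)
```

Failing loudly when `has_ansi` is true may be preferable to silently stripping, since the presence of escape codes suggests the sub-agent ran in a context it was not supposed to see.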
Double-blind experiment diagram showing randomized variant mapping and blind evaluation process

Testing Prompt Changes: Why You Need Double-Blind Experiments

TL;DR: You changed a skill. How do you know it’s actually better, not just confirmation bias? I ran a double-blind experiment: two versions, four scenarios, independent blind scoring. The scorer saw X=2.44, Y=2.41 and said “can’t tell them apart.” After unblinding, the simplified version won 4/0.

The 0.03 Gap

I shortened a review skill from 159 lines to 89 lines. I wanted to verify that the simplified version actually worked better, so I ran a double-blind experiment. ...

2026-05-29 · 6 min · Alex Wang
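The randomized variant mapping and blind scoring described in the teaser above can be sketched as follows. This is an illustration under stated assumptions, not the author's harness: `blind_mapping`, `unblind`, the seed, and the variant names are hypothetical, and only the X/Y labels and the two scores come from the teaser.

```python
import random


def blind_mapping(variants, labels=("X", "Y"), seed=None):
    """Randomly assign neutral labels to variants; returns label -> variant."""
    if len(variants) != len(labels):
        raise ValueError("need exactly one label per variant")
    rng = random.Random(seed)
    shuffled = list(variants)
    rng.shuffle(shuffled)
    return dict(zip(labels, shuffled))


def unblind(scores_by_label, mapping):
    """Translate scores keyed by blind label back to variant names."""
    return {mapping[label]: score for label, score in scores_by_label.items()}


mapping = blind_mapping(["original", "simplified"], seed=7)
# The scorer only ever sees "X" and "Y"; the mapping stays hidden until scoring ends.
blind_scores = {"X": 2.44, "Y": 2.41}
print(unblind(blind_scores, mapping))
```

Keeping the mapping out of the scorer's context is the point of the design: the scorer cannot favor a variant it cannot identify.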
A row of dim review dimension slots with only one glowing, then fully lit after new modules are added — but the version on the right, weighed down by math symbols, has gone dark again

Dimension Experiments: Can a 36-Year-Old Book Fix Your Review Coverage?

Series: Classic Theory Meets Agent Practice (Part 3)

Part 1: Dual-Pass Review: Why You Can’t Have Both Recall and Precision · Part 2: Strategy Genes: Pruning Review Prompts with Genetic Algorithm Thinking

TL;DR: Two controlled experiments. Code review dimensions went from 8 to 11, and known-issue detection went from 1/6 to 6/6. Design review introduced axiomatic design dimensions, and detection also went from 1/6 to 6/6. But the version with a math formula proved that more dimensions are not always better: computation consumed review attention, and findings dropped 35%. Run controlled experiments with known issues as reference, and you learn which dimensions actually work. ...

2026-05-25 · 9 min · Alex Wang
A bloated prompt pruned into compact strategy genes, with redundant fragments removed and core constraints preserved

Strategy Genes: Pruning Review Prompts with Genetic Algorithm Thinking

Series: Classic Theory Meets Agent Practice (Part 2)

Previous: Dual-Pass Review: Why Recall and Precision Cannot Both Win

TL;DR: A review prompt went from 317 lines to 135 lines (-58%), and review quality improved by 29%. What I removed was not useful procedure, but redundant content the model could infer on its own. What stayed were strategy genes: irreplaceable constraints, negative examples, and tone locks.

The previous post covered dual-pass review: splitting one review agent into a “find everything” pass and a “filter hard” pass. Valid find rate went from 75% to 92%. But it left one problem open: what the “find everything” pass chooses to report or ignore is still affected by prompt wording. ...

2026-05-24 · 10 min · Alex Wang
Two funnels side by side — the left one wide-mouthed catching many candidate issues, the right one narrow filtering only the valuable findings

Cascade Retrieval: A 15-Year-Old IR Trick Fixed My Design Review Agent

Series: Classic Theory Meets Agent Practice (Part 1)

TL;DR: A design review agent needs to find every issue AND avoid false positives. One agent can’t do both. Borrowing cascade retrieval from information retrieval (a 15-year-old method), I split it into two: a Recall Pass that casts a wide net, and a Precision Pass that filters strictly. Real defects get caught earlier, and the risk of rework during development drops. ...

2026-05-22 · 9 min · Alex Wang
A microscope and a telescope side by side, with a dashed line between them labeled 'the invisible blank layer'

The Invisible Blank Layer

Series: Breaking to Build: TDD Process Iterations (Post 3)

Post 1: What a Failed Experiment Got Right · Post 2: Using the Method to Improve the Method

TL;DR: Phase 6 already does diagnostics at the integration level, drilling into each bug’s root cause. What it doesn’t do: cross-defect pattern scanning, component gap checking, execution order analysis. Those belong to Phase 7. In small systems, Phase 7 catches a few more bugs. As the system grows, those same three tasks produce something different: building test infrastructure, hardening CI rules, driving architectural evolution. Phase 7 doesn’t make architecture decisions. But it provides the scarcest input for those decisions: evidence-based problem localization. ...

2026-05-21 · 6 min · Alex Wang
A ruler measuring its own scale marks for redundancy, then trimming the excess marks away

Using the Method to Improve the Method

Series: Breaking to Build: TDD Process Iterations (Post 2)

Previous: What a Failed Experiment Got Right

TL;DR: The TDD Pipeline taught “give principles, not steps”, but it had grown into a step-driven tool itself. I stripped the operational steps from Phases 1 through 5, keeping only principles, risk hints, and counterexamples. The model independently derived the steps I had deleted. Output quality held. The reason: Phases 1 through 5 are creative phases that need room to diverge. Removing the fixed track actually helped. The same strategy failed on Phase 6; the next post explains why. ...

2026-05-20 · 6 min · Alex Wang
An experiment dashboard where every expected metric shows red — except one gauge in the corner, glowing green

What a Failed Experiment Got Right

Series: Breaking to Build: TDD Process Iterations (Post 1)

TL;DR: I refined Phase 6 (pre-release testing) of the TDD Pipeline from step-driven to principle-driven. The goal was better output. I didn’t get it: the refined version was worse at drilling into individual bugs and building evidence chains. But comparing the two outputs revealed dimensional differences. The refined version was better at component gap checking and cross-bug pattern scanning. Those differences pointed to a judgment call: Phase 6 doesn’t need refining. It needs a layer on top of it. That layer later became Phase 7. ...

2026-05-19 · 5 min · Alex Wang
Three objects on warm cream: a compass, a crossed-out stamp, and a blank card with a hand-drawn arrow

The Upgrade — New Template and Three Transferable Lessons

TL;DR: Before-and-after comparison of the upgraded Why Articulation template, plus three transferable lessons: give principles not examples, lock critical steps with mandatory tone, and trust the model’s self-organization. Experiment limitations included.

Series: Why Make AI Articulate Why Before Acting (Article 3)

Previous: A 4-Variable A/B Test — Why Positive Examples Harm Prompt Performance

Recap

Article 1 started from Anthropic’s alignment research: teaching a model why rather than what cut misalignment from 22% to 3% (about 7×), and achieved equivalent results with 1/28 the data [1]. I adapted this into Why Articulation, a mechanism that forces AI to explain purpose, risks, and approach before writing any code. ...

2026-05-17 · 8 min · Alex Wang