TDD | Chuanxilu for Skilled Homo sapiens

The Invisible Blank Layer

Series: Breaking to Build: TDD Process Iterations (Post 3)
Post 1: What a Failed Experiment Got Right · Post 2: Using the Method to Improve the Method
TL;DR: Phase 6 already does diagnostics at the integration level, drilling into each bug’s root cause. What it doesn’t do: cross-defect pattern scanning, component gap checking, execution order analysis. Those belong to Phase 7. In small systems, Phase 7 catches a few more bugs. As the system grows, those same three tasks produce something different: building test infrastructure, hardening CI rules, driving architectural evolution. Phase 7 doesn’t make architecture decisions. But it provides the scarcest input for those decisions: evidence-based problem localization. ...

A ruler measuring its own scale marks for redundancy, then trimming the excess marks away

Using the Method to Improve the Method

Series: Breaking to Build: TDD Process Iterations (second post)
Previous: What a Failed Experiment Got Right
TL;DR: The TDD Pipeline taught “give principles, not steps,” but it had grown into a step-driven tool itself. I stripped the operational steps from Phases 1 through 5, keeping only principles, risk hints, and counterexamples. The model independently derived the steps I had deleted. Output quality held. The reason: Phases 1 through 5 are creative phases that need room to diverge. Removing the fixed track actually helped. The same strategy failed on Phase 6; the next post explains why. ...

An experiment dashboard where every expected metric shows red — except one gauge in the corner, glowing green

What a Failed Experiment Got Right

Series: Breaking to Build: TDD Process Iterations (first post)
TL;DR: I refined Phase 6 (pre-release testing) of the TDD Pipeline from step-driven to principle-driven. The goal was better output. I didn’t get it: the refined version was worse at drilling into individual bugs and building evidence chains. But comparing the two outputs revealed dimensional differences. The refined version was better at component gap checking and cross-bug pattern scanning. Those differences pointed to a judgment call: Phase 6 doesn’t need refining. It needs a layer on top of it. That layer later became Phase 7. ...

Three objects on warm cream: a compass, a crossed-out stamp, and a blank card with a hand-drawn arrow

The Upgrade — New Template and Three Transferable Lessons

TL;DR: Before-and-after comparison of the upgraded Why Articulation template, plus three transferable lessons: give principles not examples, lock critical steps with mandatory tone, and trust the model’s self-organization. Experiment limitations included.
Series: Why Make AI Articulate Why Before Acting (Article 3)
Previous: A 4-Variable A/B Test — Why Positive Examples Harm Prompt Performance
Recap
Article 1 started from Anthropic’s alignment research: teaching a model why rather than what cut misalignment from 22% to 3% (about 7×), and achieved equivalent results with 1/28 the data [1]. I adapted this into Why Articulation, a mechanism that forces AI to explain purpose, risks, and approach before writing any code. ...

Left: a stamp copying identical patterns. Right: freeform marks for independent thinking. Red X marks the imitation path as wrong

A 4-Variable A/B Test — Why Positive Examples Harm Prompt Performance

TL;DR: A 4-variable A/B test on Why Articulation, covering structure, tone, position, and examples. Positive examples made output worse. The model imitated instead of reasoning. Open-ended prompts improved quality directionally and cut tokens by 33%.
Series: Why Make AI Articulate Why Before Acting (Article 2)
Previous: From Anthropic’s Alignment Research to a Prompt Design Insight
Where We Left Off
Anthropic’s alignment research [1] landed on a sharp insight: teaching a model why beats telling it what. I took that idea and built Why Articulation into my TDD Pipeline, a mechanism that forces the model to explain its understanding before it writes any code. Early results looked good. ...

An arched gateway inscribed with WHY, two rods of different length and color on the ground

From Anthropic's Alignment Research to a Prompt Design Insight

TL;DR: Anthropic’s alignment research shows that teaching a model why works better than teaching it what: misalignment dropped from 22% to 3%. This post breaks down four experiments and distills three lessons you can use in prompt design.
I ran an A/B test comparing two prompt strategies. One group got positive examples: “do it like this.” The other got no examples. Instead, the AI had to explain why a choice was correct before acting on it. ...

Six bug patterns: components correct in isolation, broken after integration, diagnostic clarity emerging from chaos

Green Tests, Broken System: Six Bug Patterns AI Left at the Integration Layer

TL;DR: Before releasing Aristotle v1.1, I found 18 bugs. Unit tests caught four (22%). The other 14 lived at the integration layer — component wiring, config propagation, process startup seams. Root cause analysis revealed six patterns: path/environment mismatch (5), registration omission (3), startup hang (2), silent failure (2), test-production path divergence (2), integration seam errors (4). The root cause isn’t harder problems — it’s AI bypassing the defenses that experience built. Implementation and review rhythms decouple, code appearance misleads quality judgment, and integration shifts from an explicit action to an implicit assumption. Includes an eight-dimension integration checklist and a 16-type bug roadmap at the end. ...
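The counts in this summary can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers quoted in the excerpt; the variable names are made up for illustration. One inference worth flagging: the six pattern counts add up to 18, which suggests the patterns classify all 18 bugs rather than only the 14 integration-layer ones. The excerpt does not say so explicitly.

```python
# Pattern counts exactly as quoted in the TL;DR.
patterns = {
    "path/environment mismatch": 5,
    "registration omission": 3,
    "startup hang": 2,
    "silent failure": 2,
    "test-production path divergence": 2,
    "integration seam errors": 4,
}

total_bugs = 18            # bugs found before the v1.1 release
caught_by_unit_tests = 4   # bugs the unit tests caught

pattern_total = sum(patterns.values())
unit_share_pct = round(100 * caught_by_unit_tests / total_bugs)
integration_bugs = total_bugs - caught_by_unit_tests

print(pattern_total, unit_share_pct, integration_bugs)
```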

Seven Conditions to Keep AI's 5-Why from Going Off the Rails

TL;DR: The inquiry protocol sets seven conditions to keep AI’s 5-Why on track: T1–T3 are floor conditions (can’t stop until all three are met), HC1–HC4 are guardrails (prevent the process from spiraling). T2’s preventive counterfactual check is the most important design: preventive framing forces the inquiry to go deep, while counterfactual questions deliberately construct negation scenarios to counter confirmation bias.
← Previous post
The last post diagnosed three problems when AI runs 5-Why: stopping too early (insufficient depth), single-path tracking (insufficient breadth), and confirmation bias (biased reasoning). These three are independent but tend to show up together: a shallow conclusion becomes an anchor, which simultaneously compresses the exploration space and biases evidence selection. This post designs the inquiry protocol. It encodes the tacit judgment human experts use about “when to stop, when to keep going” into explicit rules that bring AI’s reasoning quality up to the standard 5-Why actually requires. ...
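The split between floor conditions and guardrails can be sketched as a small stop rule. This is an illustration under stated assumptions, not the protocol itself: the excerpt does not say what T1–T3 or HC1–HC4 actually check, so the boolean flags, the InquiryState class, and the decide function are hypothetical placeholders. Giving a tripped guardrail priority over the floor conditions is also an assumption.

```python
from dataclasses import dataclass, field

FLOOR = ("T1", "T2", "T3")                 # all must be met before stopping
GUARDRAILS = ("HC1", "HC2", "HC3", "HC4")  # any one tripping halts the inquiry


@dataclass
class InquiryState:
    # Hypothetical flags; the real checks are defined in the post, not here.
    floor_met: dict = field(
        default_factory=lambda: {t: False for t in FLOOR})
    guardrail_tripped: dict = field(
        default_factory=lambda: {h: False for h in GUARDRAILS})


def decide(state: InquiryState) -> str:
    """Return 'halt', 'stop', or 'continue' for the next 5-Why step."""
    if any(state.guardrail_tripped[h] for h in GUARDRAILS):
        return "halt"      # assumed priority: a guardrail overrides everything
    if all(state.floor_met[t] for t in FLOOR):
        return "stop"      # every floor condition holds, so concluding is allowed
    return "continue"      # at least one floor condition is still open
```

With this shape, meeting two of the three floor conditions still yields "continue", which mirrors the "can’t stop until all three are met" wording.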

Pipeline from requirements to code, each stage catching what the previous one missed

The Full Pipeline: Five Stages from Requirements to Code

This is article 6 in “Taming AI Coding Agents with TDD.” The first four covered requirements disambiguation with the GEAR protocol, tech spec guardrails, test documents before test code, and convergent review loops. Article 5 upgraded the review layer with procedural justice. This one strings everything together into a single pipeline you can actually run.
The Complete Pipeline
Product Design → Tech Spec → Test Plan → Test Code → Production Code
      ↑              ↑           ↑           ↑              ↑
  Ralph Loop     Ralph Loop  Ralph Loop  Ralph Loop     Ralph Loop
Each stage has its own inputs, outputs, and review rules: ...
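One way to picture the pipeline is as five stages run in order, each wrapped in its own review loop. The sketch below is a minimal illustration, not the author's implementation: the function names ralph_loop and run_pipeline, the produce/review/fix callables, and the max_rounds cap are all assumptions. The loop body follows the "find issues, fix, confirm convergence" flow described for the Ralph Loop elsewhere on this page.

```python
from typing import Callable, List

# Stage names as listed in the pipeline diagram.
STAGES = ["Product Design", "Tech Spec", "Test Plan", "Test Code", "Production Code"]


def ralph_loop(artifact: str,
               review: Callable[[str], List[str]],
               fix: Callable[[str, List[str]], str],
               max_rounds: int = 5) -> str:
    """Run up to max_rounds review rounds; return once a review finds no issues."""
    for _ in range(max_rounds):
        issues = review(artifact)
        if not issues:
            return artifact              # converged: reviewer has nothing left
        artifact = fix(artifact, issues)
    raise RuntimeError("review did not converge within max_rounds")


def run_pipeline(seed: str,
                 produce: Callable[[str, str], str],
                 review: Callable[[str], List[str]],
                 fix: Callable[[str, List[str]], str]) -> str:
    """Feed each stage's reviewed output into the next stage."""
    artifact = seed
    for stage in STAGES:
        draft = produce(stage, artifact)  # hypothetical generator for this stage
        artifact = ralph_loop(draft, review, fix)
    return artifact
```

In practice produce, review, and fix would be calls to an agent and a reviewer; plain functions stand in for them here so the control flow stays visible.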

Procedural justice encoded: adversarial review where every decision is verifiable

Procedural Justice Encoded: Making Every Step of AI Review Verifiable

My Ralph Loop review mechanism had a hidden problem. v0.2’s flow was straightforward: find issues → fix → confirm convergence. In part 4 of this series, I mentioned that if the creator disagrees with the reviewer’s judgment, they can present evidence in the next round for reassessment. But that was one sentence in the rules — not a formal protocol. Nobody was checking whether the review itself was sound. The reviewer might mislabel severity. The main agent might blindly accept bad suggestions. ...