The bug loop: four rounds of root cause diagnosis and regression tests breaking the spiral

The Bug Loop You Can't Escape: Root Cause Diagnosis with AI

1. The Loop That Never Ends

A few days ago, the Aristotle project [1], which aims to fully implement the GEAR protocol, finally validated all its core technical pathways. The codebase had gone through its third refactoring, core features were working, and testing was complete. Right before merging the development branch into main for release, I ran a manual test and discovered that the SKILL.md instructions weren’t being executed correctly: the model received the action but didn’t call task() to launch a background subagent. Instead, it loaded LEARN.md. As I investigated this issue, more bugs kept surfacing: ...

2026-05-01 · 17 min · Alex Wang
Pipeline from requirements to code, each stage catching what the previous one missed

The Full Pipeline: Five Stages from Requirements to Code

This is article 6 in “Taming AI Coding Agents with TDD.” The first four covered requirements disambiguation with the GEAR protocol, tech spec guardrails, test documents before test code, and convergent review loops. Article 5 upgraded the review layer with procedural justice. This one strings everything together into a single pipeline you can actually run.

The Complete Pipeline

Product Design → Tech Spec → Test Plan → Test Code → Production Code
      ↑              ↑           ↑           ↑              ↑
 Ralph Loop     Ralph Loop  Ralph Loop  Ralph Loop     Ralph Loop

Each stage has its own inputs, outputs, and review rules: ...

2026-04-30 · 9 min · Alex Wang
Procedural justice encoded: adversarial review where every decision is verifiable

Procedural Justice Encoded: Making Every Step of AI Review Verifiable

My Ralph Loop review mechanism had a hidden problem. v0.2’s flow was straightforward: find issues → fix → confirm convergence. In part 4 of this series, I mentioned that if the creator disagrees with the reviewer’s judgment, they can present evidence in the next round for reassessment. But that was one sentence in the rules — not a formal protocol. Nobody was checking whether the review itself was sound. The reviewer might mislabel severity. The main agent might blindly accept bad suggestions. ...

2026-04-30 · 10 min · Alex Wang
Ralph Loop: multi-round convergent review, two consecutive clean rounds to exit

AI Errors Converge, They Don't Randomize: The Review Loop That Catches What You Miss

This is the fourth article in “Taming AI Coding Agents with TDD.” The first covered test-driven requirements anchoring, the second introduced the GEAR protocol for disambiguation, and the third laid out what the tech spec must nail down. This one covers the last line of defense: review.

The Problem the Tech Spec Cannot Solve

Article 3 ended with an uncomfortable admission. The PRD locks down “what to build.” The tech spec locks down “how to build it.” Together they compress the AI’s improvisation space down to implementation details. That is a huge improvement. ...

2026-04-29 · 11 min · Alex Wang
PRD to tech spec: documents as guardrails, not burden

Why PRD Alone Is Not Enough: What the Tech Spec Must Cover in AI-Assisted Development

This is the third article in the “Taming AI Coding Agents with TDD” series. The first covered test-driven requirements anchoring, and the second covered the GEAR protocol for requirements disambiguation. This one fills the gap between them: after the PRD is done, what must the tech spec cover?

Requirements Locked, Code Still Wrong

Before the second Aristotle refactor, I spent two full days writing requirements. Following the structured approach from the previous article, I captured every acceptance criterion, boundary condition, error path, and platform constraint[1]. The AI consumed the document and passed all 37 static assertions plus the end-to-end tests. The codebase was split into four files by responsibility. Information flow was switched from push to pull. ...

2026-04-29 · 11 min · Alex Wang
Structured requirements vs one-liner: the trap of AI auto-filling gaps

Why AI-Assisted Development Needs Structured Requirements First: Lessons from the GEAR Protocol

This is the second article in the “Taming AI Coding Agents with TDD” series. The first article covered requirement anchoring at the test layer[1]. Tests assume clear requirements. This one goes upstream, to the practice of disambiguating requirements before a single line of code gets written.

The v1 Lesson: One-Line Requirement, 371 Lines of Pollution

Aristotle v1 had no GEAR protocol[2]. No role separation. The entire reflection feature lived in a single 371-line SKILL.md. The requirement was roughly one sentence: the system should detect when a user corrects an AI mistake, then generate a reusable rule. ...

2026-04-25 · 8 min · Alex Wang
Requirement anchoring: test plan before test code before business code

Write Test Plans Before Test Code: Requirement Anchoring in AI Development

This is the first article in the series “Taming AI Coding Agents with TDD.” The series has one thesis: AI-assisted development demands stricter process discipline than traditional development, and here is exactly how to enforce it at every step. The series follows the pipeline order — requirements, design, testing, review, implementation. This article starts at the testing layer. During Aristotle’s third refactoring, the test plan document was where I learned the hardest lesson. I’ll cover this layer first, then work backward and forward in subsequent posts. ...

2026-04-23 · 16 min · Alex Wang
Context rot: an easily overlooked problem in AI coding

Context Rot: An Easily Overlooked Problem in AI Coding

Yesterday someone in a group chat said GPT-5.4 performed worse than Doubao. When they asked questions, the model would often give irrelevant answers without even reading the question. I asked a few follow-up questions and found they had fed it a lot of documents, and the conversation had gone on for many turns. This probably wasn’t the model’s problem—it was context rot. I’ve had similar experiences myself. After talking to a model for a long time, it starts “forgetting” what we discussed earlier, or repeats mistakes that were already corrected. The model hasn’t gotten stupider. The conversation has just gotten too long. ...

2026-04-18 · 14 min · Alex Wang
Seven human-AI collaboration patterns from the Aristotle project

Looking Back: Seven Human-AI Collaboration Patterns in the Aristotle Project

Five articles in. Time to step back and look at the path itself. “Aristotle: Teaching AI to Reflect on Its Mistakes” covered the design philosophy and initial implementation. “claude-code-reflect: Same Metacognition, Different Soil” told the story of porting across platforms. “Trust Boundaries: One Idea, Two Systems” proposed a trust tiering model. “From Scars to Armor: Harness Engineering in Practice” validated the theory through refactoring. “A Markdown’s Three Lives: From Static Rules to a Git-Backed MCP Server” evolved the rule storage from append-only to the GEAR protocol. ...

2026-04-16 · 11 min · Alex Wang
A Markdown's three lives: from static rules to Git-backed MCP Server

A Markdown's Three Lives: From Static Rules to Git-Backed MCP Server

The previous article, “From Scars to Armor: Harness Engineering in Practice,” ended with Aristotle having a streamlined router (SKILL.md compressed from 371 lines to 84), an on-demand progressive disclosure architecture, and a working reflect→review→confirm workflow. But one thread never got pulled: where do confirmed rules actually live? This article follows that thread. It wasn’t planned from the start. Three concrete problems in actual use forced the design out, step by step. ...

2026-04-16 · 21 min · Alex Wang