AI Practice

Six bug patterns: components correct in isolation, broken after integration, diagnostic clarity emerging from chaos

Green Tests, Broken System: Six Bug Patterns AI Left at the Integration Layer

TL;DR: Before releasing Aristotle v1.1, I found 18 bugs. Unit tests caught four (22%). The other 14 lived at the integration layer — component wiring, config propagation, process startup seams. Root cause analysis revealed six patterns: path/environment mismatch (5), registration omission (3), startup hang (2), silent failure (2), test-production path divergence (2), integration seam errors (4). The root cause isn’t harder problems — it’s AI bypassing the defenses that experience built. Implementation and review rhythms decouple, code appearance misleads quality judgment, and integration shifts from an explicit action to an implicit assumption. Includes an eight-dimension integration checklist and a 16-type bug roadmap at the end. ...
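The figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below is only a tally of the numbers quoted above; the variable names are illustrative, and it assumes each bug is counted under exactly one pattern, which the excerpt implies but does not state.

```python
# Pattern counts exactly as quoted in the TL;DR above.
pattern_counts = {
    "path/environment mismatch": 5,
    "registration omission": 3,
    "startup hang": 2,
    "silent failure": 2,
    "test-production path divergence": 2,
    "integration seam errors": 4,
}

# Assumption: one pattern per bug, so the counts add up to the bug total.
total_bugs = sum(pattern_counts.values())

caught_by_unit_tests = 4
missed_by_unit_tests = total_bugs - caught_by_unit_tests
unit_test_catch_rate = round(100 * caught_by_unit_tests / total_bugs)

print(f"{total_bugs} bugs, {missed_by_unit_tests} missed by unit tests, "
      f"{unit_test_catch_rate}% caught")
```

The six counts sum to the 18 bugs reported, and 4 of 18 rounds to the quoted 22%.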

OMO vs SLIM: I Switched Plugins to Save Tokens. Here's What Actually Happened.

TL;DR: I switched from OMO to SLIM and ran it for 13 days. Average Tokens per message dropped 3.7% — practically flat. Broken down by task type: coding flat, writing +61%, review -53%, debug +121% (unreliable, tiny sample). Aristotle dropped 68%, but the main cause was an architecture rewrite, not the plugin. “Saving tokens” is not a global fact. It’s local. The real differences are in experience and architecture choices, not in token counts. ...

The Last Line of Defense for Inquiry: Independent Confirmation and Protocol Reflexivity

TL;DR: The inquiry protocol’s last line of defense is independent confirmation — a perspective free of confirmation bias that runs falsifiability testing to hunt for counterexamples. This post also covers how the protocol came to be (from 18 bugs of practice to a gap found while writing these articles) and plans for future reflexivity. In the previous post, I laid out the inquiry protocol’s seven conditions: three floor conditions (T1–T3) that force the AI to go deep enough, and four guardrails (HC1–HC4) that keep the inquiry process from spiraling out of control. This post covers the last line of defense — and how the protocol actually came to be. ...

Seven Conditions to Keep AI's 5-Why from Going Off the Rails

TL;DR: The inquiry protocol sets seven conditions to keep AI’s 5-Why on track: T1–T3 are floor conditions (can’t stop until all three are met), HC1–HC4 are guardrails (prevent the process from spiraling). T2’s preventive counterfactual check is the most important design: preventive framing forces the inquiry to go deep, while counterfactual questions deliberately construct negation scenarios to counter confirmation bias. The last post diagnosed three problems when AI runs 5-Why: stopping too early (depth insufficient), single-path tracking (breadth insufficient), and confirmation bias (reasoning bias). These three are independent but tend to show up together: a shallow conclusion becomes an anchor, which simultaneously compresses the exploration space and biases evidence selection. This post designs the inquiry protocol: encoding the tacit judgment of “when to stop, when to keep going” that human experts use, into explicit rules that bring AI’s reasoning quality up to the standard 5-Why actually requires. ...
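The split between floor conditions and guardrails can be pictured as a small decision function. This is a minimal sketch, not the protocol's actual implementation: the excerpt gives only the labels T1–T3 and HC1–HC4, so `InquiryState`, `next_action`, and the three action names are hypothetical, and treating any guardrail violation as an immediate halt is an assumption.

```python
from dataclasses import dataclass, field

FLOOR_CONDITIONS = ("T1", "T2", "T3")          # labels from the excerpt
GUARDRAILS = ("HC1", "HC2", "HC3", "HC4")      # labels from the excerpt


@dataclass
class InquiryState:
    """Hypothetical bookkeeping for one 5-Why inquiry."""
    floor_met: set = field(default_factory=set)
    guardrails_violated: set = field(default_factory=set)


def next_action(state: InquiryState) -> str:
    # Assumption: a violated guardrail interrupts the inquiry outright.
    if state.guardrails_violated:
        return "halt"
    # Floor rule from the excerpt: stopping is allowed only once
    # all three floor conditions are met.
    if all(cond in state.floor_met for cond in FLOOR_CONDITIONS):
        return "may_stop"
    return "keep_asking"


print(next_action(InquiryState(floor_met={"T1", "T2"})))
print(next_action(InquiryState(floor_met={"T1", "T2", "T3"})))
print(next_action(InquiryState(floor_met={"T1"}, guardrails_violated={"HC2"})))
```

Keeping the two groups separate mirrors the excerpt's framing: floor conditions decide when the inquiry is allowed to end, while guardrails constrain how it runs.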

Why AI Can't Do 5-Why Right: Stopping Too Early, Single-Path Tracking, and Confirmation Bias

TL;DR: AI fails at 5-Why in three ways: stopping too early (insufficient depth), single-path tracking (insufficient breadth), and confirmation bias (reasoning distortion). The three are independent but tend to show up together — a shallow conclusion becomes an anchor that compresses the exploration space and biases evidence selection. This post uses a real case where all four rounds of attribution went wrong to dissect each failure mode. This post sits at the intersection of two series: “Taming AI Coding Agents with TDD” and “AI Root Cause Diagnosis.” ...

The bug loop: four rounds of root cause diagnosis and regression tests breaking the spiral

The Bug Loop You Can't Escape: Root Cause Diagnosis with AI

1. The Loop That Never Ends
A few days ago, the Aristotle project [1], which aims to fully implement the GEAR protocol, finally validated all its core technical pathways. The codebase had gone through its third refactoring, core features were working, and testing was complete. Right before merging the development branch into main for release, I ran a manual test and discovered that SKILL.md instructions weren’t being executed correctly: the model received the action but didn’t call task() to launch a background subagent. Instead, it loaded LEARN.md. As I investigated this issue, more bugs kept surfacing: ...

Pipeline from requirements to code, each stage catching what the previous one missed

The Full Pipeline: Five Stages from Requirements to Code

This is article 6 in “Taming AI Coding Agents with TDD.” The first four covered requirements disambiguation with the GEAR protocol, tech spec guardrails, test documents before test code, and convergent review loops. Article 5 upgraded the review layer with procedural justice. This one strings everything together into a single pipeline you can actually run.
The Complete Pipeline
Product Design → Tech Spec → Test Plan → Test Code → Production Code
(each of the five stages is paired with its own Ralph Loop)
Each stage has its own inputs, outputs, and review rules: ...

Procedural justice encoded: adversarial review where every decision is verifiable

Procedural Justice Encoded: Making Every Step of AI Review Verifiable

My Ralph Loop review mechanism had a hidden problem. v0.2’s flow was straightforward: find issues → fix → confirm convergence. In part 4 of this series, I mentioned that if the creator disagrees with the reviewer’s judgment, they can present evidence in the next round for reassessment. But that was one sentence in the rules — not a formal protocol. Nobody was checking whether the review itself was sound. The reviewer might mislabel severity. The main agent might blindly accept bad suggestions. ...

Ralph Loop: multi-round convergent review, two consecutive clean rounds to exit

AI Errors Converge, They Don't Randomize: The Review Loop That Catches What You Miss

This is article 4 in “Taming AI Coding Agents with TDD.” The first covered test-driven requirements anchoring, the second introduced the GEAR protocol for disambiguation, the third laid out what the tech spec must nail down. This one covers the last line of defense: review. The Problem the Tech Spec Cannot Solve Article 3 ended with an uncomfortable admission. The PRD locks down “what to build.” The tech spec locks down “how to build it.” Together they compress the AI’s improvisation space down to implementation details. That is a huge improvement. ...
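The exit rule in this entry's caption (leave the loop only after two consecutive clean rounds) can be sketched as a simple streak counter. This is an illustration under stated assumptions, not the author's implementation: `ralph_loop`, `review`, `fix`, and the `max_rounds` cap are hypothetical names and parameters introduced here.

```python
from typing import Callable, List


def ralph_loop(
    review: Callable[[int], List[str]],   # hypothetical: issues found in a round
    fix: Callable[[List[str]], None],     # hypothetical: applies fixes
    clean_rounds_needed: int = 2,         # "two consecutive clean rounds"
    max_rounds: int = 10,                 # assumption: a safety cap
) -> int:
    """Run review rounds; return the round number at which the loop exits."""
    clean_streak = 0
    for round_no in range(1, max_rounds + 1):
        issues = review(round_no)
        if issues:
            clean_streak = 0              # any finding resets the streak
            fix(issues)
        else:
            clean_streak += 1
            if clean_streak == clean_rounds_needed:
                return round_no
    raise RuntimeError("review did not converge within max_rounds")


# Scripted findings per round, purely for demonstration.
scripted = [["unclear error path"], [], ["missing boundary test"], [], []]
fixed: List[str] = []
exit_round = ralph_loop(lambda n: scripted[n - 1], fixed.extend)
print(exit_round, fixed)
```

In the scripted run, the clean round 2 does not end the loop because round 3 finds a new issue and resets the streak; the loop exits only after rounds 4 and 5 are both clean.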

PRD to tech spec: documents as guardrails, not burden

Why PRD Alone Is Not Enough: What the Tech Spec Must Cover in AI-Assisted Development

This is article 3 in the “Taming AI Coding Agents with TDD” series. The first covered test-driven requirements anchoring, the second covered the GEAR protocol for requirements disambiguation. This one fills the gap that follows them: after the PRD is done, what must the tech spec cover? Requirements Locked, Code Still Wrong Before the second Aristotle refactor, I spent two full days writing requirements. Following the structured approach from the previous article, I captured every acceptance criterion, boundary condition, error path, and platform constraint[1]. The AI consumed the document and passed all 37 static assertions plus end-to-end tests. The codebase was split into four files by responsibility. Information flow was switched from push to pull. ...