Left: a stamp copying identical patterns. Right: freeform marks for independent thinking. Red X marks the imitation path as wrong

A 4-Variable A/B Test — Why Positive Examples Harm Prompt Performance

TL;DR: A 4-variable A/B test on Why Articulation: structure, tone, position, and examples. Positive examples made output worse. The model imitated instead of reasoning. Open-ended prompts improved quality directionally and cut tokens by 33%.

Series: Why Make AI Articulate Why Before Acting (Article 2)
Previous: From Anthropic’s Alignment Research to a Prompt Design Insight

Where We Left Off

Anthropic’s alignment research [1] landed on a sharp insight: teaching a model why beats telling it what. I took that idea and built Why Articulation into my TDD Pipeline, a mechanism that forces the model to explain its understanding before it writes any code. Early results looked good. ...

2026-05-15 · 8 min · Alex Wang
An arched gateway inscribed with WHY, two rods of different length and color on the ground

From Anthropic's Alignment Research to a Prompt Design Insight

TL;DR: Anthropic’s alignment research shows that teaching a model why works better than teaching it what: misalignment dropped from 22% to 3%. This post breaks down four experiments and distills three lessons you can use in prompt design.

I ran an A/B test comparing two prompt strategies. One group got positive examples (“do it like this”). The other got no examples. Instead, the AI had to explain why a choice was correct before acting on it. ...

2026-05-14 · 7 min · Alex Wang
AI Toolchain Evolution Path panorama — five levels from First Contact to AI Native

The AI Path: From First Contact to AI Native

TL;DR: How does a person grow with AI? This post maps the journey from “opening a chat box for the first time” to “thinking in AI-native ways” across five stages—First Contact, Power User, Engineer, Architect, and Native. The essence of each stage isn’t learning more tools, but a shift in mindset: from passively accepting outputs, to actively designing inputs, to orchestrating multi-agent collaboration, and ultimately reshaping your own cognitive framework. The interactive path map at the end lets you explore each stage in full detail. ...

2026-05-10 · 5 min · Alex Wang
Design docs dissolving after git rebase, a git worktree branch shielding them safely

Git Rebase Ate My Docs — Save Them with Worktree

TL;DR: git rebase and git checkout can silently delete untracked files that are listed in .gitignore, with no way to recover them; git stash -u does NOT stash git-ignored files. The fix: use git worktree to create a local-assets branch, storing design docs in a git-tracked safe space. Three commands handle daily use: dp-save.sh to save, --prune to clean, --restore to recover. Real project data shows zero document loss after introducing worktree. Full script at alexwwang/design-doc-worktree.

One afternoon I had AI run git rebase -i to tidy up the last dozen commits. No conflicts. Clean terminal. Everything went smoothly. ...

2026-05-08 · 10 min · Alex Wang
Six bug patterns: components correct in isolation, broken after integration, diagnostic clarity emerging from chaos

Green Tests, Broken System: Six Bug Patterns AI Left at the Integration Layer

TL;DR: Before releasing Aristotle v1.1, I found 18 bugs. Unit tests caught four (22%). The other 14 lived at the integration layer — component wiring, config propagation, process startup seams. Root cause analysis revealed six patterns: path/environment mismatch (5), registration omission (3), startup hang (2), silent failure (2), test-production path divergence (2), integration seam errors (4). The root cause isn’t harder problems — it’s AI bypassing the defenses that experience built. Implementation and review rhythms decouple, code appearance misleads quality judgment, and integration shifts from an explicit action to an implicit assumption. Includes an eight-dimension integration checklist and a 16-type bug roadmap at the end. ...

2026-05-07 · 15 min · Alex Wang
OMO vs SLIM: I Switched Plugins to Save Tokens. Here's What Actually Happened.

TL;DR: I switched from OMO to SLIM and ran it for 13 days. Average tokens per message dropped 3.7%, which is practically flat. Broken down by task type: coding flat, writing +61%, review -53%, debug +121% (unreliable, tiny sample). Aristotle dropped 68%, but the main cause was an architecture rewrite, not the plugin. “Saving tokens” is not a global fact. It’s local. The real differences are in experience and architecture choices, not in token counts. ...

2026-05-06 · 9 min · Alex Wang
The Last Line of Defense for Inquiry: Independent Confirmation and Protocol Reflexivity

TL;DR: The inquiry protocol’s last line of defense is independent confirmation: a perspective free of confirmation bias that runs falsifiability testing to hunt for counterexamples. This post also covers how the protocol came to be (from 18 bugs of practice to a gap found while writing these articles) and plans for future reflexivity.

In the previous post, I laid out the inquiry protocol’s seven conditions: three floor conditions (T1–T3) that force the AI to go deep enough, and four guardrails (HC1–HC4) that keep the inquiry process from spiraling out of control. This post covers the last line of defense and how the protocol actually came to be. ...

2026-05-06 · 7 min · Alex Wang
Seven Conditions to Keep AI's 5-Why from Going Off the Rails

TL;DR: The inquiry protocol sets seven conditions to keep AI’s 5-Why on track: T1–T3 are floor conditions (the inquiry can’t stop until all three are met) and HC1–HC4 are guardrails (they prevent the process from spiraling). T2’s preventive counterfactual check is the most important design element: preventive framing forces the inquiry to go deep, while counterfactual questions deliberately construct negation scenarios to counter confirmation bias.

The last post diagnosed three problems when AI runs 5-Why: stopping too early (insufficient depth), single-path tracking (insufficient breadth), and confirmation bias (reasoning distortion). These three are independent but tend to show up together: a shallow conclusion becomes an anchor, which simultaneously compresses the exploration space and biases evidence selection. This post designs the inquiry protocol, which encodes the tacit judgment of “when to stop, when to keep going” that human experts use into explicit rules that bring AI’s reasoning quality up to the standard 5-Why actually requires. ...

2026-05-05 · 7 min · Alex Wang
Why AI Can't Do 5-Why Right: Stopping Too Early, Single-Path Tracking, and Confirmation Bias

TL;DR: AI fails at 5-Why in three ways: stopping too early (insufficient depth), single-path tracking (insufficient breadth), and confirmation bias (reasoning distortion). The three are independent but tend to show up together: a shallow conclusion becomes an anchor that compresses the exploration space and biases evidence selection. This post uses a real case where all four rounds of attribution went wrong to dissect each failure mode.

This post sits at the intersection of two series: “Taming AI Coding Agents with TDD” and “AI Root Cause Diagnosis.” ...

2026-05-05 · 7 min · Alex Wang
The bug loop: four rounds of root cause diagnosis and regression tests breaking the spiral

The Bug Loop You Can't Escape: Root Cause Diagnosis with AI

1. The Loop That Never Ends

A few days ago, the Aristotle project [1], which aims to fully implement the GEAR protocol, finally validated all its core technical pathways. The codebase had gone through its third refactoring, core features were working, and testing was complete. Right before merging the development branch into main for release, I ran a manual test and discovered that SKILL.md instructions weren’t being executed correctly: the model received the action but didn’t call task() to launch a background subagent. Instead, it loaded LEARN.md. As I investigated this issue, more bugs kept surfacing: ...

2026-05-01 · 17 min · Alex Wang