AI Agent

Watercolor: gears appear to run normally, but two critical gears in the center have been quietly welded together

The Half-Life of Protocol Compliance (Part 2): Deep Root Causes

TL;DR: Part 1 found that the agent merged the protocol’s five-role separation into four from round 2 onward. This piece digs into root causes: attention dilution drops the “no merging” constraint below a threshold; the EOS bias provides motivation to simplify; stateless architecture creates a positive feedback loop for drift. The v0.21.0 reload mechanism is a band-aid, not a cure.

Part 2 of 3. Part 1: Why Agents Won’t Loop

Part 1 established protocol drift: starting from R2’, the agent merged the five-role separation into four while keeping its output format pristine. Re-reading the protocol after user outbursts only restored compliance briefly. Within a few rounds, the constraint was quietly bypassed again. ...

Watercolor: a gear mechanism frozen mid-rotation, a hand reaching in to push, an open rulebook below

The Half-Life of Protocol Compliance (Part 1): Why Agents Won't Loop

TL;DR: Agents won’t loop autonomously in long multi-round tasks. They keep asking “should I continue?” Worse: from round 2 onward, the agent quietly merged the protocol’s mandatory five-role separation into four. Clean formatting hid the violation. This isn’t context compression. It’s protocol drift: systematic degradation in long-horizon tasks.

Part 1 of 3. Part 2: Deep Root Causes

June 11, 8 PM

I told my agent to run a Ralph Review Loop on six module test plans. This is a multi-round review protocol I defined in my open-source tool tdd-pipeline [1]: each round dispatches independent subagents to find issues, locate files, confirm defects, and evaluate fixes. The loop stops after two consecutive rounds with zero Critical/High/Medium findings. The protocol was unambiguous: “Fixes do not require user confirmation.” ...
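The stopping rule in the excerpt above (two consecutive rounds with zero Critical/High/Medium findings) can be sketched roughly as follows. This is a minimal illustration, not code from tdd-pipeline: the names `is_clean` and `should_stop`, the severity strings, and the idea of representing each round as a list of severity labels are all assumptions.

```python
# Hypothetical sketch of the loop's stopping rule; names and data shapes are
# illustrative assumptions, not taken from tdd-pipeline.
BLOCKING_SEVERITIES = {"Critical", "High", "Medium"}


def is_clean(round_findings):
    """A round is clean when none of its findings is Critical, High, or Medium."""
    return not any(severity in BLOCKING_SEVERITIES for severity in round_findings)


def should_stop(history, required_clean_rounds=2):
    """Return True once the most recent `required_clean_rounds` rounds were all clean."""
    if len(history) < required_clean_rounds:
        return False
    return all(is_clean(findings) for findings in history[-required_clean_rounds:])


# Each inner list holds the severities reported in one review round.
history = [["High", "Low"], ["Low"], []]
print(should_stop(history))
```

Under these assumptions, a finding below Medium does not reset the count, because the rule as quoted names only Critical, High, and Medium.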

A seemingly perfect experiment report under a magnifying glass revealing two design flaws: rubric bias toward the tested variable and insufficient scenario coverage

AI-Designed Experiments Need Human Review

Series: AI Agent Experiment Methodology (Part 3)

Previous: The Experiment Design Was Fine, So Why Did the LLM Still Fail?

TL;DR: In a double-blind experiment, Variant B won 4/4 scenarios with clean data. But design review revealed the rubric had 3/8 dimensions directly testing the target variable, exceeding the 1/3 ceiling and nearly becoming a self-fulfilling prophecy. In a separate validation, one scenario scored perfectly while another exposed a defect; if we had run only the first, the defect would have shipped. Both traps were caught by reviewing the design, not by running the experiment. ...
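The rubric check mentioned above comes down to one comparison: 3/8 = 0.375, which is above 1/3 ≈ 0.333. A minimal sketch of that check, assuming a hypothetical helper named `exceeds_ceiling` (the post does not show how the check was actually performed):

```python
from fractions import Fraction


def exceeds_ceiling(target_dims, total_dims, ceiling=Fraction(1, 3)):
    """True when the share of rubric dimensions that directly test the
    target variable is strictly above the ceiling."""
    return Fraction(target_dims, total_dims) > ceiling


print(exceeds_ceiling(3, 8))  # 3/8 > 1/3, so this prints True
print(exceeds_ceiling(2, 8))  # 2/8 <= 1/3, so this prints False
```

Exact fractions are used here so that a rubric sitting exactly at the ceiling (for example 3 of 9 dimensions) is not flagged because of floating-point rounding.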

A carefully designed experiment pipeline corrupted by context leaks at two nodes, contrasted with the clean rebuilt version

The Experiment Design Was Fine. The LLM Still Failed.

Series: AI Agent Experiment Methodology (Part 2)

Part 1: How to Use Double-Blind Experiments to Validate Skill Changes

TL;DR: Round one of the double-blind experiment: B won 3/4 scenarios but failed the magnitude filter. Verdict: “insufficient evidence.” Investigation revealed S1-A’s output was polluted by terminal color codes, and the scorer diligently scored 8 dimensions on ANSI garbage. After reconstructing the execution context, B won 4/4. The failure wasn’t in the experiment design; it was in how sub-agents’ context boundaries were constructed. ...

Double-blind experiment diagram showing randomized variant mapping and blind evaluation process

Testing Prompt Changes: Why You Need Double-Blind Experiments

TL;DR: You changed a skill. How do you know it’s actually better, not just confirmation bias? I ran a double-blind experiment: two versions, four scenarios, independent blind scoring. The scorer saw X=2.44, Y=2.41 and said “can’t tell them apart.” After unblinding, the simplified version won 4/0.

The 0.03 Gap

I shortened a review skill from 159 lines to 89 lines. I wanted to verify that the simplified version actually worked better, so I ran a double-blind experiment. ...
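The blinding step behind the X and Y labels can be sketched as a random assignment of variants to neutral labels, with the mapping kept away from the scorer until scoring is done. This is only an illustration: the function name `blind_mapping`, the variant names, and the optional seed are assumptions, not details from the post.

```python
import random


def blind_mapping(variants, labels=("X", "Y"), seed=None):
    """Randomly assign each variant to a neutral label.

    The returned dict (label -> variant) is the unblinding key and should be
    withheld from the scorer until all scores are recorded.
    """
    if len(variants) != len(labels):
        raise ValueError("need exactly one label per variant")
    rng = random.Random(seed)  # seed is optional, only for reproducible demos
    shuffled = list(variants)
    rng.shuffle(shuffled)
    return dict(zip(labels, shuffled))


key = blind_mapping(["original", "simplified"])
print(sorted(key))  # the scorer only ever sees the labels: ['X', 'Y']
```

The scorer works from the labels alone; the key is consulted only at unblinding.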