AI Practice

Medieval castle at night with green surveillance lights along walls, a red crack visible in the shadows

1754 Tests All Green, Then a Code Review Found 6 Assassins

Prologue: 1,754 perfect green lights Late on the night Aristotle v1.6.0 was shipping, the team watched the test panel. Green indicators lit up like dominoes. Python side: 1,166 assertions. TypeScript side: 588 checks. Total: 1,754 automated test cases. All green. In code terms, that’s cameras and infrared sensors on every wall. A fly couldn’t sneak through without setting off alarms. The team leaned back. The system looked like an iron fortress. ...

Split architectural structure, left warm amber TypeScript tower, right cool cyan Python engine room, central subprocess bridge with five constraint pillars

One System, Two Languages: The Five Constraints Behind Aristotle v1.6's Architecture

TL;DR: Five constraints shaped the Watchdog-Intervention Bridge’s cross-language architecture. Watchdog has to intercept LLM tool calls synchronously, so it runs in TypeScript. Intervention has to reuse the existing reflection engine and rule system, so it stays in Python. The Bridge adds zero new infrastructure, so it uses subprocess. Communication can’t block every tool call, so batching replaces real-time streaming. Each decision was the least bad option under the circumstances. The last post covered what the Watchdog-Intervention Bridge does in Aristotle v1.6. This one is about why it looks the way it does. ...

Watchdog-Intervention Bridge three-layer architecture transitioning from post-mortem reflection (warm amber) to real-time interception (cool cyan)

From 'Post-Mortem Reflection' to 'Real-Time Interception': Aristotle v1.6.0's Watchdog-Intervention Bridge

TL;DR: Aristotle v1.6.0 introduces the Watchdog-Intervention Bridge, shifting from “reflect after the fact” to “intercept in real time.” A TypeScript watchdog detects 21 signal types around tool calls. A Python intervention layer handles 13 violation types, connected via a subprocess bridge. MCP tools expanded from 10 stubs to 25 full implementations. Two known bugs remain. Open source on GitHub, MIT license. A Hypothesis Overturned From v1.0 to v1.5, Aristotle answered one question: when AI makes a mistake, how do you make it remember and not repeat it? ...

A terminal window emitting four data streams: stock quotes, GDP curves, corporate info, academic papers, all flowing from a single source

kdatasrc-helper: Let AI Agents Query Financial Data Directly

Problem: Data Sources Exist, But Agents Can’t Use Them kimi CLI’s datasource plugin is a good piece of work. A-share, HK, and US stock quotes, macroeconomic indicators, corporate registries, academic paper search: six data sources covering most day-to-day investment research needs. Install it, type one command in kimi, and you get results. But I work in opencode, not directly in kimi. When an AI agent needs to query financial data, a few things get in the way. ...

omo vs oms: Fallback Chains Deep Dive

This is Part 2 of When Your AI Coding Tool Needs Three Configs. Part 1 covered the config design, file structure, and orchestration philosophy. This article focuses on fallback mechanisms. omo = oh-my-openagent, oms = oh-my-opencode-slim. Model and provider names are anonymized as provider-a/model-x etc. Why Bother Understanding Fallback omo and oms both support fallback—automatic switching to backup when the primary model is unavailable. But their mechanisms differ completely: omo is a multi-layer pipeline that degrades step by step; oms uses startup model selection + runtime abort retry. You need to understand this difference to configure a reliable chain. ...

When Your AI Coding Tool Needs Three Configs

Why I Need Three OpenCode Configs I have three opencode.json files in my ~/.config/opencode/ directory. The reason is simple: I wanted to run oh-my-openagent (omo from here on) and oh-my-opencode-slim (oms from here on) side by side, comparing them to understand where each one’s boundaries lie. omo is the full version—it comes with a batch of built-in agents (Sisyphus, Atlas, Prometheus, Oracle, Explore, Librarian, Metis, Momus, etc.), plus the ones I register on demand. The core is the fallback chain and the Sisyphus orchestrator: throw a refactoring task at Sisyphus, and it breaks the task down for Prometheus to plan, Atlas to execute the plan and distribute subtasks, Explore to search code, Oracle to analyze, then Sisyphus aggregates the results. oms is the slim version—it also has an orchestrator as the main agent responsible for executing tasks, but the difference is in the review phase: oms uses council multi-model consensus, where multiple councillors review results in parallel, and the Council agent synthesizes outputs from all councillors to reach a final conclusion. ...

A seemingly perfect experiment report under a magnifying glass revealing two design flaws: rubric bias toward the tested variable and insufficient scenario coverage

AI-Designed Experiments Need Human Review

Series: AI Agent Experiment Methodology (Part 3) Previous: The Experiment Design Was Fine, So Why Did the LLM Still Fail? TL;DR: In a double-blind experiment, Variant B won 4/4 scenarios with clean data. But design review revealed the rubric had 3/8 dimensions directly testing the target variable, exceeding the 1/3 ceiling and nearly becoming a self-fulfilling prophecy. In a separate validation, one scenario scored perfectly while another exposed a defect—if we had run only the first, the defect would have shipped. Both traps were caught by reviewing the design, not by running the experiment. ...

A carefully designed experiment pipeline corrupted by context leaks at two nodes, contrasted with the clean rebuilt version

The Experiment Design Was Fine. The LLM Still Failed.

Series: AI Agent Experiment Methodology (Part 2) Part 1: How to Use Double-Blind Experiments to Validate Skill Changes TL;DR: Round one of the double-blind experiment: B won 3/4 scenarios but failed the magnitude filter. Verdict: “insufficient evidence.” Investigation revealed S1-A’s output was polluted by terminal color codes, and the scorer diligently scored 8 dimensions on ANSI garbage. After reconstructing the execution context, B won 4/4. The failure wasn’t in the experiment design—it was in how sub-agents’ context boundaries were constructed. ...

Double-blind experiment diagram showing randomized variant mapping and blind evaluation process

Testing Prompt Changes: Why You Need Double-Blind Experiments

TL;DR: You changed a skill. How do you know it’s actually better, not just confirmation bias? I ran a double-blind experiment: two versions, four scenarios, independent blind scoring. The scorer saw X=2.44, Y=2.41 and said “can’t tell them apart.” After unblinding: simplified version won 4/0. The 0.03 Gap I shortened a review skill from 159 lines to 89 lines. Wanted to verify the simplified version actually worked better, so I ran a double-blind experiment. ...

A row of dim review dimension slots with only one glowing, then fully lit after new modules are added — but the version on the right, weighed down by math symbols, has gone dark again

Dimension Experiments: Can a 36-Year-Old Book Fix Your Review Coverage?

Series: Classic Theory Meets Agent Practice (Part 3) Part 1: Dual-Pass Review: Why You Can’t Have Both Recall and Precision · Part 2: Strategy Genes: Pruning Review Prompts with Genetic Algorithm Thinking TL;DR: Two controlled experiments. Code review dimensions went from 8 to 11, and known-issue detection went from 1/6 to 6/6. Design review introduced axiomatic design dimensions, and detection also went from 1/6 to 6/6. But the version with a math formula proved that more dimensions are not always better — computation consumed review attention, and findings dropped 35%. Run controlled experiments with known issues as reference, and you learn which dimensions actually work. ...