Chuanxilu for Skilled Homo sapiens

A carefully designed experiment pipeline corrupted by context leaks at two nodes, contrasted with the clean rebuilt version

The Experiment Design Was Fine. The LLM Still Failed.

Series: AI Agent Experiment Methodology (Part 2)
Part 1: How to Use Double-Blind Experiments to Validate Skill Changes

TL;DR: Round one of the double-blind experiment: B won 3/4 scenarios but failed the magnitude filter. Verdict: “insufficient evidence.” Investigation revealed S1-A’s output was polluted by terminal color codes, and the scorer diligently scored 8 dimensions on ANSI garbage. After reconstructing the execution context, B won 4/4. The failure wasn’t in the experiment design; it was in how sub-agents’ context boundaries were constructed. ...
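One way to guard against the color-code pollution described above is to strip ANSI escape sequences from a sub-agent's output before the scorer sees it. The sketch below is only an illustration: the function names (`strip_ansi`, `has_ansi`, `prepare_for_scoring`) are hypothetical and not taken from the post, and the regular expression covers the common escape forms rather than every terminal control sequence.

```python
import re

# Matches common ANSI escape sequences: two-character escapes and
# CSI sequences such as color codes (e.g. ESC[31m, ESC[0m).
ANSI_ESCAPE = re.compile(r"\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])")


def strip_ansi(text: str) -> str:
    """Remove ANSI escape sequences from text."""
    return ANSI_ESCAPE.sub("", text)


def has_ansi(text: str) -> bool:
    """Report whether text still contains an ANSI escape sequence."""
    return ANSI_ESCAPE.search(text) is not None


def prepare_for_scoring(raw_output: str) -> str:
    """Hypothetical pre-scoring step: clean the output, refuse empty results."""
    cleaned = strip_ansi(raw_output).strip()
    if not cleaned:
        raise ValueError("output is empty after removing terminal codes")
    return cleaned
```

Checking `has_ansi` on anything handed to a scorer is a cheap way to fail loudly instead of scoring garbage.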

Watercolor style: a winding path leading to a small flag on a hilltop, with broader mountain ranges and clouds stretching beyond

AI Path L0→L1 Upgrade Guide (5): Graduation Checklist & Next Steps

📖 This is Part 5 of 5 in the “AI Path L0→L1 Upgrade Guide” series: Series Navigation + Graduation Checklist.

Series Navigation

| Part | Topic | Core Content |
| --- | --- | --- |
| Part 1 | Understanding Your Tools | LLM fundamentals (not a search engine), working memory vs. long-term memory, mainstream platforms and specialized tools |
| Part 2 | From Vague Questions to Precise Instructions | The RBGO prompt framework, Chain-of-Thought reasoning, format constraints |
| Part 3 | Turning AI Into Your Collaboration Partner | Iterative follow-up questions, context management (new conversations / progress summaries / chunked processing), role-playing |
| Part 4 | Building Your Personal System | Prompt library, scenario-to-tool mapping (international and China options), layered knowledge management |
| Part 5 | Graduation & Next Steps | L1 graduation checklist, L1→L2 dual-path preview |

...

Double-blind experiment diagram showing randomized variant mapping and blind evaluation process

Testing Prompt Changes: Why You Need Double-Blind Experiments

TL;DR: You changed a skill. How do you know it’s actually better, not just confirmation bias? I ran a double-blind experiment: two versions, four scenarios, independent blind scoring. The scorer saw X=2.44, Y=2.41 and said “can’t tell them apart.” After unblinding: simplified version won 4/0.

The 0.03 Gap

I shortened a review skill from 159 lines to 89 lines. Wanted to verify the simplified version actually worked better, so I ran a double-blind experiment. ...
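The blinding step can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: it assumes the X/Y labels are reshuffled independently for each scenario, that the scorer returns one number per label, and that the names `blind_labels` and `unblind` are hypothetical.

```python
import random


def blind_labels(scenarios, variants=("A", "B"), labels=("X", "Y"), seed=None):
    """Build a secret key {scenario: {label: variant}}.

    Assumption: the mapping is re-randomized per scenario, so the scorer
    cannot learn which label corresponds to which version.
    """
    rng = random.Random(seed)
    key = {}
    for scenario in scenarios:
        shuffled = list(variants)
        rng.shuffle(shuffled)
        key[scenario] = dict(zip(labels, shuffled))
    return key


def unblind(key, blind_scores):
    """Turn {scenario: {label: score}} into win counts per real variant.

    Ties go to the first label encountered; a real pipeline would likely
    treat ties separately.
    """
    wins = {}
    for scenario, scores in blind_scores.items():
        best_label = max(scores, key=scores.get)
        winner = key[scenario][best_label]
        wins[winner] = wins.get(winner, 0) + 1
    return wins
```

If the labels are reshuffled per scenario, the per-label averages the scorer sees can sit very close together even when one version wins every scenario, which is one plausible reading of the near-identical X and Y scores above.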

Watercolor style: a wooden desk with a partially open drawer revealing neatly organized pastel index cards in three rows, three sample cards fanned out on the desk surface

What My Prompt Library Looks Like: A Real Template

The biggest obstacle to building a Prompt library isn’t the tool. It’s knowing how to organize it. Yesterday you picked 5 Prompts; today I’ll show you a complete real template.

Directory Structure

This structure uses the Markdown folder approach. You can copy it directly:

prompt-library/
├── writing/
│   ├── email.md
│   ├── article-summary.md
│   └── ...
├── analysis/
│   ├── data-interpretation.md
│   ├── case-breakdown.md
│   └── ...
├── daily/
│   ├── meeting-notes.md
│   └── ...
└── README.md (global notes)

The record format for each Prompt: ...
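If you would rather generate that skeleton than create it by hand, a small script along these lines should do it. It is a sketch under assumptions: `scaffold` and `LAYOUT` are hypothetical names, only the files named in the tree are created, and the entries shown as "..." are left for you to fill in.

```python
from pathlib import Path

# Folders and files named in the directory tree above.
# The "..." entries are deliberately not invented here.
LAYOUT = {
    "writing": ["email.md", "article-summary.md"],
    "analysis": ["data-interpretation.md", "case-breakdown.md"],
    "daily": ["meeting-notes.md"],
}


def scaffold(root):
    """Create the prompt-library skeleton under root; return created paths."""
    root = Path(root)
    created = []
    for folder, files in LAYOUT.items():
        (root / folder).mkdir(parents=True, exist_ok=True)
        for name in files:
            path = root / folder / name
            path.touch(exist_ok=True)  # keeps existing files untouched
            created.append(path)
    readme = root / "README.md"  # global notes
    readme.touch(exist_ok=True)
    created.append(readme)
    return created


if __name__ == "__main__":
    for path in scaffold("prompt-library"):
        print(path)
```

Running it twice is harmless, since existing folders and files are kept as they are.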

Watercolor style: an open notebook with five card-shaped slots, three filled with colored cards and two blank, scattered sticky notes nearby, a warm cream desk surface

Today's Practice: Organize Your First 5 Prompts

Today’s Practice

From your recent AI conversations (coding, writing, analysis), pick 5 prompts that actually worked well. Record them using the template from Part 4: original prompt + effectiveness rating + iteration notes. Where you record them doesn’t matter: a notes app, Notion, or a plain text file all work. Don’t overthink the tool. If you can’t find your chat history, spend 20 minutes creating 5 prompts you’ll definitely use at work. For example: “Check the edge cases in this code,” “Rewrite this technical article for beginners,” “Extract the 3 main issues from these 100 pieces of user feedback.” ...
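If you prefer a structured record over free-form notes, the three fields mentioned above could be captured like this. It is only a sketch: the `PromptRecord` class, the `title` field, and the 1 to 5 rating scale are assumptions for illustration, not the Part 4 template itself.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRecord:
    """One prompt-library entry: original prompt, rating, iteration notes."""

    title: str  # assumed field, used as a heading
    prompt: str  # the original prompt text
    rating: int  # assumed scale: 1 (poor) to 5 (excellent)
    iteration_notes: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the record as a small Markdown section."""
        lines = [
            f"## {self.title}",
            "",
            "Original prompt:",
            self.prompt,
            "",
            f"Effectiveness rating: {self.rating}/5",
            "",
            "Iteration notes:",
        ]
        lines += [f"- {note}" for note in self.iteration_notes] or ["- (none yet)"]
        return "\n".join(lines)


record = PromptRecord(
    title="Edge cases",
    prompt="Check the edge cases in this code",
    rating=4,
    iteration_notes=["Works better when the language is named explicitly"],
)
print(record.to_markdown())
```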

Watercolor style: a neatly organized workbench with labeled glass jars, a leather journal, curated tools, and a two-drawer cabinet symbolizing tiered knowledge management

AI Path L0→L1 Upgrade Guide (4): Building Your Personal System

📖 This is Part 4 of 5 in the “AI Path L0→L1 Upgrade Guide” series.
Part 1: Understanding Your Tools · Part 2: From Vague Questions to Precise Instructions · Part 3: Turning AI Into Your Collaboration Partner · Part 4: Building Your Personal System · Part 5: Graduation & Next Steps

Over three weeks we’ve picked up follow-up questions, context management, role-playing… the skills are piling up, and here’s the problem: how do you manage all these scattered abilities in one place? Week 4 is about exactly that: building a prompt library, choosing the right tools, and setting up knowledge management. Turning what you’ve learned into a personal system. ...

A row of dim review dimension slots with only one glowing, then fully lit after new modules are added — but the version on the right, weighed down by math symbols, has gone dark again

Dimension Experiments: Can a 36-Year-Old Book Fix Your Review Coverage?

Series: Classic Theory Meets Agent Practice (Part 3)
Part 1: Dual-Pass Review: Why You Can’t Have Both Recall and Precision · Part 2: Strategy Genes: Pruning Review Prompts with Genetic Algorithm Thinking

TL;DR: Two controlled experiments. Code review dimensions went from 8 to 11, and known-issue detection went from 1/6 to 6/6. Design review introduced axiomatic design dimensions, and detection also went from 1/6 to 6/6. But the version with a math formula proved that more dimensions are not always better: computation consumed review attention, and findings dropped 35%. Run controlled experiments with known issues as reference, and you learn which dimensions actually work. ...
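Scoring a review run against a set of known issues can be as simple as a set comparison. The sketch below assumes each finding has already been matched to a known-issue ID by hand or by some other step; the function name `detection_report` and the issue IDs are hypothetical and not from the post.

```python
def detection_report(known_issues, reported_ids):
    """Compare the issue IDs a review reported against the known set.

    Assumption: matching findings to known issues happened upstream, so
    both arguments are plain collections of IDs.
    """
    known = set(known_issues)
    reported = set(reported_ids)
    detected = known & reported
    return {
        "detected": sorted(detected),
        "missed": sorted(known - reported),
        "rate": f"{len(detected)}/{len(known)}",
    }


# Hypothetical IDs for six seeded issues.
KNOWN = ["K1", "K2", "K3", "K4", "K5", "K6"]

baseline = detection_report(KNOWN, ["K3"])
extended = detection_report(KNOWN, ["K1", "K2", "K3", "K4", "K5", "K6"])
print(baseline["rate"], extended["rate"])
```

The "missed" list is usually the more useful output, because it shows which dimension a prompt version still fails to cover.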

Watercolor style: a proposal document on a wooden desk surrounded by three theatrical masks (green supporter, red critic, amber neutral) in triangular formation, symbolizing role-playing to expose blind spots

Role-Playing in Practice: Make AI Your Devil's Advocate

Yesterday we talked about follow-up questions: three questions to dig out the hidden assumptions behind an AI answer. The catch is, you need to know what to ask. Some blind spots you simply can’t see from your own perspective. That’s when you give AI a different identity.

The Devil’s Advocate

Please act as a strict reviewer. Go through this plan point by point and identify every risk and weakness. Don’t hold back: the sharper, the better. ...

A bloated prompt pruned into compact strategy genes, with redundant fragments removed and core constraints preserved

Strategy Genes: Pruning Review Prompts with Genetic Algorithm Thinking

Series: Classic Theory Meets Agent Practice (Part 2)
Previous: Dual-Pass Review: Why Recall and Precision Cannot Both Win

TL;DR: A review prompt went from 317 lines to 135 lines (-58%), and review quality improved by 29%. What I removed was not useful procedure, but redundant content the model could infer on its own. What stayed were strategy genes: irreplaceable constraints, negative examples, and tone locks.

The previous post covered dual-pass review: splitting one review agent into a “find everything” pass and a “filter hard” pass. Valid find rate went from 75% to 92%. But it left one problem open: what the “find everything” pass chooses to report or ignore is still affected by prompt wording. ...

Watercolor style: a translucent stack of papers on a desk with three follow-up checkpoints, symbolizing hidden assumptions behind AI answers

Advanced Follow-Up: 3 Questions That Expose AI's Hidden Assumptions

The previous post was about how long conversations drift. After writing it, I noticed something else: drift does not only happen after a conversation gets long. It can also happen inside any answer that looks complete. AI answers quickly, and its conclusions often sound smooth. But it rarely says upfront: what assumptions does this conclusion depend on? If those assumptions are not checked, I end up accepting them by default. Accept enough unchecked assumptions, and the later analysis may be built on the wrong foundation. ...