Prompt Engineering

Double-blind experiment diagram showing randomized variant mapping and blind evaluation process

Testing Prompt Changes: Why You Need Double-Blind Experiments

TL;DR: You changed a skill. How do you know it’s actually better and not just confirmation bias? I ran a double-blind experiment: two versions, four scenarios, independent blind scoring. The scorer saw X=2.44, Y=2.41 and said “can’t tell them apart.” After unblinding, the simplified version won 4 to 0.

The 0.03 Gap

I shortened a review skill from 159 lines to 89 lines. I wanted to verify that the simplified version actually worked better, so I ran a double-blind experiment. ...
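The blind-mapping idea in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the setup from the post: it assumes the X/Y labels are re-randomized for each scenario, and the function names `assign_blind_labels` and `unblind`, the scenario names, and the demo scores are all hypothetical.

```python
import random


def assign_blind_labels(variants, scenarios, seed=None):
    """For each scenario, randomly map the two real variants to labels X and Y.

    Assumption: the mapping is re-drawn per scenario, so the scorer cannot
    learn which label corresponds to which variant.
    """
    rng = random.Random(seed)
    mapping = {}
    for scenario in scenarios:
        shuffled = list(variants)
        rng.shuffle(shuffled)
        mapping[scenario] = dict(zip(("X", "Y"), shuffled))
    return mapping


def unblind(mapping, blind_scores):
    """Translate (scenario, label) scores back to variants and count wins.

    A scenario with tied scores counts as a win for nobody.
    """
    wins = {}
    for scenario, labels in mapping.items():
        real = {variant: blind_scores[(scenario, label)]
                for label, variant in labels.items()}
        for variant in real:
            wins.setdefault(variant, 0)
        top = max(real.values())
        leaders = [v for v, score in real.items() if score == top]
        if len(leaders) == 1:
            wins[leaders[0]] += 1
    return wins


if __name__ == "__main__":
    # Hypothetical variants, scenarios, and scores, for illustration only.
    mapping = assign_blind_labels(["original", "simplified"],
                                  ["s1", "s2", "s3", "s4"], seed=1)
    truth = {"original": 2.0, "simplified": 3.0}
    blind_scores = {(s, label): truth[variant]
                    for s, labels in mapping.items()
                    for label, variant in labels.items()}
    print(unblind(mapping, blind_scores))
```

Under this assumption, per-label averages can look nearly identical while the per-scenario comparison is lopsided, because each label mixes both variants across scenarios. The mapping has to stay hidden from the scorer until all scores are recorded.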

Watercolor style: a wooden desk with a partially open drawer revealing neatly organized pastel index cards in three rows, three sample cards fanned out on the desk surface

What My Prompt Library Looks Like: A Real Template

The biggest obstacle to building a Prompt library isn’t the tool; it’s knowing how to organize it. Yesterday you picked 5 Prompts; today I’ll show you a complete real template.

Directory Structure

This structure uses the Markdown folder approach. You can copy it directly:

prompt-library/
├── writing/
│   ├── email.md
│   ├── article-summary.md
│   └── ...
├── analysis/
│   ├── data-interpretation.md
│   ├── case-breakdown.md
│   └── ...
├── daily/
│   ├── meeting-notes.md
│   └── ...
└── README.md (global notes)

The record format for each Prompt: ...
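If you would rather generate the folders than create them by hand, a small script along these lines could scaffold the layout above. It is only a sketch: the `LAYOUT` dictionary and the `scaffold` function are hypothetical names, only the files named in the tree are created (the elided `...` entries are left out), and the README text is a placeholder.

```python
import tempfile
from pathlib import Path

# Only the files named in the directory tree; elided entries are not guessed.
LAYOUT = {
    "writing": ["email.md", "article-summary.md"],
    "analysis": ["data-interpretation.md", "case-breakdown.md"],
    "daily": ["meeting-notes.md"],
}


def scaffold(root):
    """Create the folder layout under root and return the newly created files.

    Existing files are left untouched, so the function is safe to re-run.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for folder, files in LAYOUT.items():
        (root / folder).mkdir(exist_ok=True)
        for name in files:
            path = root / folder / name
            if not path.exists():
                path.touch()
                created.append(path)
    readme = root / "README.md"
    if not readme.exists():
        # Placeholder content; the post only says README.md holds global notes.
        readme.write_text("# Prompt library\n\nGlobal notes.\n", encoding="utf-8")
        created.append(readme)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is written to the working tree.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(Path(tmp) / "prompt-library"):
            print(path.relative_to(tmp))
```

Skipping existing files means re-running the script never overwrites a prompt you have already written.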

Watercolor style: a proposal document on a wooden desk surrounded by three theatrical masks (green supporter, red critic, amber neutral) in triangular formation, symbolizing role-playing to expose blind spots

Role-Playing in Practice: Make AI Your Devil's Advocate

Yesterday we talked about follow-up questions: three questions to dig out the hidden assumptions behind an AI answer. The catch is, you need to know what to ask. Some blind spots you simply can’t see from your own perspective. That’s when you give AI a different identity.

The Devil’s Advocate

“Please act as a strict reviewer. Go through this plan point by point and identify every risk and weakness. Don’t hold back; the sharper, the better.” ...

A bloated prompt pruned into compact strategy genes, with redundant fragments removed and core constraints preserved

Strategy Genes: Pruning Review Prompts with Genetic Algorithm Thinking

Series: Classic Theory Meets Agent Practice (Part 2)
Previous: Dual-Pass Review: Why Recall and Precision Cannot Both Win

TL;DR: A review prompt went from 317 lines to 135 lines (-58%), and review quality improved by 29%. What I removed was not useful procedure, but redundant content the model could infer on its own. What stayed were strategy genes: irreplaceable constraints, negative examples, and tone locks.

The previous post covered dual-pass review: splitting one review agent into a “find everything” pass and a “filter hard” pass. Valid find rate went from 75% to 92%. But it left one problem open: what the “find everything” pass chooses to report or ignore is still affected by prompt wording. ...

Watercolor style: a translucent stack of papers on a desk with three follow-up checkpoints, symbolizing hidden assumptions behind AI answers

Advanced Follow-Up: 3 Questions That Expose AI's Hidden Assumptions

The previous post was about how long conversations drift. After writing it, I noticed something else: drift does not only happen after a conversation gets long. It can also happen inside any answer that looks complete. AI answers quickly, and its conclusions often sound smooth. But it rarely says upfront: what assumptions does this conclusion depend on? If those assumptions are not checked, I end up accepting them by default. Accept enough unchecked assumptions, and the later analysis may be built on the wrong foundation. ...

Watercolor style: a winding paper trail across a desk, with three stations symbolizing mixed directions, data citation errors, and requirement bleed-through

Long Conversation Failures: Lessons from 3 Drift Disasters

The previous exercise was to run a 15-turn conversation with AI, using progress summaries and new conversations as checkpoints. If you actually did it, you probably noticed something else too: drift doesn’t always look the same. The three cases below are all failures I’ve run into myself. Here’s what happened, why it happened, and how to avoid it.

Failure 1: Work Directions Got Mixed Together

What Happened

I was figuring out the approach for a project. I first discussed Approach A, building a data dashboard, with AI. After 4 turns, it didn’t feel deep enough, so I switched to Approach B, automated reports, for another 3 turns. Then I thought maybe we could fold in Approach C’s real-time push capability. The conversation kept jumping among the three directions for a dozen turns. ...

A ruler measuring its own scale marks for redundancy, then trimming the excess marks away

Using the Method to Improve the Method

Series: Breaking to Build: TDD Process Iterations (second post)
Previous: What a Failed Experiment Got Right

TL;DR: The TDD Pipeline taught “give principles, not steps,” but it had grown into a step-driven tool itself. I stripped the operational steps from Phases 1 through 5, keeping only principles, risk hints, and counterexamples. The model independently derived the steps I had deleted. Output quality held. The reason: Phases 1 through 5 are creative phases that need room to diverge. Removing the fixed track actually helped. The same strategy failed on Phase 6; the next post explains why. ...

Three objects on warm cream: a compass, a crossed-out stamp, and a blank card with a hand-drawn arrow

The Upgrade — New Template and Three Transferable Lessons

TL;DR: Before-and-after comparison of the upgraded Why Articulation template, plus three transferable lessons: give principles not examples, lock critical steps with mandatory tone, and trust the model’s self-organization. Experiment limitations included.

Series: Why Make AI Articulate Why Before Acting (Article 3)
Previous: A 4-Variable A/B Test — Why Positive Examples Harm Prompt Performance

Recap

Article 1 started from Anthropic’s alignment research: teaching a model why rather than what cut misalignment from 22% to 3% (about 7×), and achieved equivalent results with 1/28 of the data [1]. I adapted this into Why Articulation, a mechanism that forces AI to explain purpose, risks, and approach before writing any code. ...

Left: a stamp copying identical patterns. Right: freeform marks for independent thinking. Red X marks the imitation path as wrong

A 4-Variable A/B Test — Why Positive Examples Harm Prompt Performance

TL;DR: A 4-variable A/B test on Why Articulation, covering structure, tone, position, and examples. Positive examples made output worse. The model imitated instead of reasoning. Open-ended prompts improved quality directionally and cut tokens by 33%.

Series: Why Make AI Articulate Why Before Acting (Article 2)
Previous: From Anthropic’s Alignment Research to a Prompt Design Insight

Where We Left Off

Anthropic’s alignment research [1] landed on a sharp insight: teaching a model why beats telling it what. I took that idea and built Why Articulation into my TDD Pipeline, a mechanism that forces the model to explain its understanding before it writes any code. Early results looked good. ...

An arched gateway inscribed with WHY, two rods of different length and color on the ground

From Anthropic's Alignment Research to a Prompt Design Insight

TL;DR: Anthropic’s alignment research shows that teaching a model why works better than teaching it what — misalignment dropped from 22% to 3%. This post breaks down four experiments and distills three lessons you can use in prompt design. I ran an A/B test comparing two prompt strategies. One group got positive examples — “do it like this.” The other got no examples. Instead, the AI had to explain why a choice was correct before acting on it. ...