<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>AI Agent Experiment Methodology on Chuanxilu for Skilled Homo sapiens</title><link>https://blog.chuanxilu.net/en/categories/ai-agent-experiment-methodology/</link><description>Recent content in AI Agent Experiment Methodology on Chuanxilu for Skilled Homo sapiens</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Mon, 01 Jun 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://blog.chuanxilu.net/en/categories/ai-agent-experiment-methodology/index.xml" rel="self" type="application/rss+xml"/><item><title>AI-Designed Experiments Need Human Review</title><link>https://blog.chuanxilu.net/en/posts/2026/06/experiment-design-review/</link><pubDate>Mon, 01 Jun 2026 10:00:00 +0800</pubDate><guid>https://blog.chuanxilu.net/en/posts/2026/06/experiment-design-review/</guid><description>A double-blind experiment succeeded, but design review revealed a rubric biased toward the tested variable and insufficient scenario coverage. Both design flaws were caught by review, not by running the experiment.</description></item><item><title>The Experiment Design Was Fine. The LLM Still Failed.</title><link>https://blog.chuanxilu.net/en/posts/2026/05/execution-context-design/</link><pubDate>Sun, 31 May 2026 10:00:00 +0800</pubDate><guid>https://blog.chuanxilu.net/en/posts/2026/05/execution-context-design/</guid><description>A double-blind experiment with flawless design still produced unusable results. The culprit wasn&amp;#39;t the protocol—it was the execution context. ANSI-polluted output fed into a scorer that diligently scored garbage, and a single sub-agent aggregated across scenarios it shouldn&amp;#39;t have. I show how reconstructing the execution context flipped the conclusion from &amp;#39;insufficient evidence&amp;#39; to &amp;#39;adopt B&amp;#39;.</description></item><item><title>Testing Prompt Changes: Why You Need Double-Blind Experiments</title><link>https://blog.chuanxilu.net/en/posts/2026/05/double-blind-experiment-ai-prompt-validation/</link><pubDate>Fri, 29 May 2026 10:00:00 +0800</pubDate><guid>https://blog.chuanxilu.net/en/posts/2026/05/double-blind-experiment-ai-prompt-validation/</guid><description>A/B testing AI skills isn&amp;#39;t about showing users two options and tracking conversions. It&amp;#39;s about running two skills through AI agents, then having another agent blindly score the results. I use real experiment data to show why double-blind is necessary, and how to avoid five failure modes.</description></item></channel></rss>