AI Agent Experiment Methodology on Chuanxilu for Skilled Homo sapiens

https://blog.chuanxilu.net/en/categories/ai-agent-experiment-methodology/
Recent content in AI Agent Experiment Methodology on Chuanxilu for Skilled Homo sapiens
Last updated: Mon, 01 Jun 2026 10:00:00 +0800

AI-Designed Experiments Need Human Review
https://blog.chuanxilu.net/en/posts/2026/06/experiment-design-review/
Mon, 01 Jun 2026 10:00:00 +0800
A double-blind experiment succeeded, but design review revealed a rubric biased toward the tested variable and insufficient scenario coverage. Both design flaws were caught by review, not by running the experiment.

The Experiment Design Was Fine. The LLM Still Failed.
https://blog.chuanxilu.net/en/posts/2026/05/execution-context-design/
Sun, 31 May 2026 10:00:00 +0800
A double-blind experiment with flawless design still produced unusable results. The culprit wasn't the protocol; it was the execution context. ANSI-polluted output fed into a scorer that diligently scored garbage, and a single sub-agent aggregated across scenarios it shouldn't have. I show how reconstructing the execution context flipped the conclusion from 'insufficient evidence' to 'adopt B'.

Testing Prompt Changes: Why You Need Double-Blind Experiments
https://blog.chuanxilu.net/en/posts/2026/05/double-blind-experiment-ai-prompt-validation/
Fri, 29 May 2026 10:00:00 +0800
A/B testing AI skills isn't about showing users two options and tracking conversions. It's about running two skills through AI agents, then having another agent blindly score the results. I use real experiment data to show why double-blind is necessary, and how to avoid five failure modes.
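
The scorer-blinding step described above might be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure from the posts: the names `strip_ansi`, `blind_pair`, `run_blind_trial`, and `toy_scorer` are hypothetical, the scorer is a plain function standing in for a scoring agent, and only the scorer's side of the blinding is shown. Terminal color codes are stripped before scoring so the scorer does not see ANSI-polluted text.

```python
import random
import re

# Matches common ANSI CSI escape sequences such as "\x1b[32m" (colors, cursor moves).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def strip_ansi(text):
    """Remove ANSI escape sequences so the scorer sees clean text."""
    return ANSI_ESCAPE.sub("", text)


def blind_pair(output_a, output_b, rng):
    """Return (blinded, key).

    blinded maps neutral labels "X"/"Y" to cleaned outputs in random order;
    key maps each neutral label back to the real arm ("A" or "B").
    """
    arms = [("A", strip_ansi(output_a)), ("B", strip_ansi(output_b))]
    rng.shuffle(arms)
    blinded = {"X": arms[0][1], "Y": arms[1][1]}
    key = {"X": arms[0][0], "Y": arms[1][0]}
    return blinded, key


def run_blind_trial(output_a, output_b, scorer, rng):
    """Score both outputs under neutral labels, then unblind the scores."""
    blinded, key = blind_pair(output_a, output_b, rng)
    # The scorer receives only the text, never the arm name or the key.
    scores = {label: scorer(text) for label, text in blinded.items()}
    return {key[label]: score for label, score in scores.items()}


def toy_scorer(text):
    """Hypothetical stand-in for a scoring agent: counts words."""
    return len(text.split())


result = run_blind_trial(
    "\x1b[32mshort answer\x1b[0m",  # arm A, with color codes left in by a terminal
    "a somewhat longer answer",      # arm B
    toy_scorer,
    random.Random(0),
)
print(result)
```

Whichever way the shuffle falls, the unblinded scores are attributed to the correct arm, because the key stays outside the scorer's view until scoring is finished. In a real run the scorer would be another agent and each scenario would be scored separately.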