<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Sub-Agent on Chuanxilu for Skilled Homo sapiens</title><link>https://blog.chuanxilu.net/en/tags/sub-agent/</link><description>Recent content in Sub-Agent on Chuanxilu for Skilled Homo sapiens</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Sun, 31 May 2026 10:00:00 +0800</lastBuildDate><atom:link href="https://blog.chuanxilu.net/en/tags/sub-agent/index.xml" rel="self" type="application/rss+xml"/><item><title>The Experiment Design Was Fine. The LLM Still Failed.</title><link>https://blog.chuanxilu.net/en/posts/2026/05/execution-context-design/</link><pubDate>Sun, 31 May 2026 10:00:00 +0800</pubDate><guid>https://blog.chuanxilu.net/en/posts/2026/05/execution-context-design/</guid><description>A double-blind experiment with flawless design still produced unusable results. The culprit wasn&amp;#39;t the protocol—it was the execution context. ANSI-polluted output fed into a scorer that diligently scored garbage, and a single sub-agent aggregated across scenarios it shouldn&amp;#39;t have. I show how reconstructing the execution context flipped the conclusion from &amp;#39;insufficient evidence&amp;#39; to &amp;#39;adopt B&amp;#39;.</description></item></channel></rss>