[{"content":"Yesterday someone in a group chat said GPT-5.4 performed worse than Doubao. When they asked questions, the model would often give irrelevant answers without even reading the question. I asked a few follow-up questions and found they had fed it a lot of documents, and the conversation had gone on for many turns. This probably wasn\u0026rsquo;t the model\u0026rsquo;s problem—it was context rot.\nI\u0026rsquo;ve had similar experiences myself. After talking to a model for a long time, it starts \u0026ldquo;forgetting\u0026rdquo; what we discussed earlier, or repeats mistakes that were already corrected. The model hasn\u0026rsquo;t gotten stupider. The conversation has just gotten too long.\nThis article systematically discusses how to manage context effectively during vibe coding or writing, to avoid wasting tokens and time on context rot.\nWhat Is Context Rot\nThe term \u0026ldquo;context rot\u0026rdquo; was first used in a Hacker News discussion in June 2025. In July, Chroma Research published the first systematic research report, testing 18 mainstream models (GPT-4.1, Claude 4, Gemini 2.5-Pro, etc.) and found that model accuracy dropped 20-50% as input tokens grew from 10K to 100K[1]. In September, Anthropic formally adopted the term in its official engineering blog \u0026ldquo;Effective context engineering for AI agents\u0026rdquo;[2], spreading it widely throughout the industry. Context rot is not about running out of the context window—the model\u0026rsquo;s performance degrades long before reaching that limit.\nThe root cause lies in the transformer\u0026rsquo;s self-attention mechanism. Self-attention uses softmax normalization—the weights must sum to 1. This makes attention a zero-sum game: the longer the context, the more tokens share the distribution, and the less attention weight each token receives. From 10K to 100K tokens, the average attention weight of a single token shrinks by about a factor of 10. 
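The zero-sum budget can be seen directly in the softmax arithmetic. A toy sketch in pure Python, using uniform logits for simplicity (real attention scores are anything but uniform):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# With comparable logits, softmax splits a fixed attention budget of 1.0
# across all tokens, so each token's share shrinks with context length.
for n in (10_000, 100_000):
    weights = softmax([0.0] * n)
    avg = sum(weights) / n
    print(f"{n:>7} tokens -> average weight per token: {avg:.1e}")
```

With uniform scores each token gets exactly 1/n of the budget, so growing from 10K to 100K tokens cuts the average share tenfold. Real models skew the distribution heavily, but the fixed budget of 1.0 is what makes a long context a crowding problem.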
The information hasn\u0026rsquo;t disappeared. It\u0026rsquo;s just been diluted to the point where it can no longer affect the output. A related 2023 study by Xiao et al. found that models pile disproportionate attention onto the first few tokens—\u0026ldquo;Attention Sinks\u0026rdquo; that must be retained for generation to stay stable[3]—further explaining why models use information from the middle of a long context so inefficiently.\nMore troublesome is the U-shaped position bias. Transformers naturally tend to pay more attention to tokens at the beginning (primacy effect) and end (recency effect), while information in the middle receives significantly less attention. A 2023 paper by Liu et al. named this phenomenon \u0026ldquo;Lost in the Middle\u0026rdquo;[4]—the model\u0026rsquo;s ability to retrieve information from the middle of context is clearly weaker than from both ends. A 2026 study by Chowdhury further proved that this bias is not a side effect of training, but an inherent property of the causal decoder + residual connection architecture[5].\nAutoregressive generation also compounds errors. Each token\u0026rsquo;s generation depends on the output of all previous tokens, including any small biases introduced earlier. A single token\u0026rsquo;s bias may be trivial, but after accumulating over thousands of steps, the model may unconsciously drift in the wrong direction.\nIn plain English: the longer the conversation, the less clearly the model \u0026ldquo;sees\u0026rdquo; the current task; things said in the middle are most easily ignored.\nThis isn\u0026rsquo;t a problem with any particular model. It\u0026rsquo;s an inherent characteristic of the transformer architecture. GPT-5.4, Claude, Doubao—as long as the foundation is a transformer, they can\u0026rsquo;t escape this constraint. 
The difference is only where degradation starts and how fast it progresses.\nHow Common Is the Problem\nI pulled compaction records from my own OpenCode session database. Compaction is the automatic summary compression that tools perform when context approaches full capacity—how often it occurs indirectly reflects the severity of context rot.\n30 compactions, distributed across 7 sessions:\n| Session | Total Messages | Compaction Count | Avg Tokens Before Compaction |\n| --- | --- | --- | --- |\n| Reflection Skill Dual-Platform Blog Series Planning | 992 | 9 | 80,273 |\n| Git-Based Aristotle MCP Solution Design | 844 | 5 | 114,697 |\n| Technical Blog Series Part 4 Topic Planning | 542 | 3 | 108,893 |\n| Fix opencode config paths in docs | 306 | 3 | 82,235 |\n| Aristotle Series \u0026ldquo;Smooth Implementation\u0026rdquo; Section Warning Setup | 115 | 5 | 19,225 |\n| Hugo Personal Blog Overall Plan | 196 | 3 | 39,294 |\n| Git Initialization and Project Assessment Report Generation | 182 | 2 | 86,209 |\nAll 7 sessions experienced compaction. The longest one—the blog series planning—had 992 messages and was compressed 9 times. Looking at it by conversation turns is more intuitive:\n| Compaction # | Cumulative Messages | Tokens Before Compaction | Messages Since Last Compaction |\n| --- | --- | --- | --- |\n| 1 | 45 | 59,599 | — |\n| 2 | 141 | 101,614 | 96 |\n| 3 | 277 | 80,659 | 136 |\n| 4 | 403 | 63,297 | 126 |\n| 5 | 467 | 74,360 | 64 |\n| 6 | 561 | 105,226 | 94 |\n| 7 | 657 | 93,583 | 96 |\n| 8 | 796 | 76,208 | 139 |\n| 9 | 910 | 67,913 | 114 |\nThe number of conversation turns between compactions fluctuates between 64 and 139. This means context fills up every 60-140 turns (including tool calls, file reads, code output). And these data only reflect the frequency of context compression—before compression happens, context rot is already occurring.\nBelow are five response strategies I\u0026rsquo;ve summarized from practice, arranged in the chronological order in which they come up during a task.\nStrategy 1: Start a New Session for New Tasks\nThis is the simplest and most easily overlooked one.\nA session has been running for two hours. The context is already heavy. 
You say \u0026ldquo;let\u0026rsquo;s switch to something else\u0026rdquo;—continue in the current session, or start a new one?\nContinuing in the current session means the new task\u0026rsquo;s context has to share space with the old task\u0026rsquo;s history. File contents from the old task, decision reasoning, error corrections—they\u0026rsquo;re all there. According to the context rot principle above, longer context means more severe attention dilution—when the model processes the new task, its attention gets scattered by information from the old task. Worse, the model might \u0026ldquo;extract\u0026rdquo; irrelevant patterns from the old task\u0026rsquo;s context, interfering with its judgment on the current task.\nLooking back at my data confirms this. That 992-message session on \u0026ldquo;reflection skill dual-platform blog series planning\u0026rdquo; contained three logical units: writing the first blog post, writing the second blog post, series planning discussion. By all rights it should have been split into three sessions. But it wasn\u0026rsquo;t—three serialized posts needed coherence, and keeping them in one session let the model \u0026ldquo;remember\u0026rdquo; the style and conventions established earlier. This touches on the boundary between compacting and splitting sessions, which Strategy 5 will explore in detail.\nCore judgment: if the new task isn\u0026rsquo;t in the same logical unit as the current task, start a new session. If multiple tasks have strong dependencies—where later output quality directly depends on earlier context—keep them in the same session, but pair it with proactive compacting. Developing and testing the same feature can share a session, but developing feature A and designing feature B shouldn\u0026rsquo;t. 
If you\u0026rsquo;re just \u0026ldquo;doing it on the side,\u0026rdquo; don\u0026rsquo;t be lazy—start a new session.\nOne rule of thumb: when you find yourself repeatedly reminding the model \u0026ldquo;we\u0026rsquo;re doing X now, not Y,\u0026rdquo; you should have started a new session long ago.\nStrategy 2: Don\u0026rsquo;t Load MCPs and Skills You Don\u0026rsquo;t Need\nAfter starting a new session, the next step is to control the context baseline at initialization.\nEvery loaded MCP server and skill occupies context space[6]. Even if your current task doesn\u0026rsquo;t use them, their tool descriptions, parameter schemas, and usage instructions are already in the context. By the attention dilution principle, this irrelevant information scatters the model\u0026rsquo;s attention away from the current task.\nActual impact: if you\u0026rsquo;ve installed 10 MCP servers, each registering 5-8 tools, then 50-80 tool descriptions permanently reside in context. Every time the model responds, it has to \u0026ldquo;see\u0026rdquo; these tools, even if the current task only needs 3 of them. A later Anthropic engineering blog analyzed exactly this problem and proposed using code execution in place of direct tool calls to compress token consumption[7].\nClaude Code\u0026rsquo;s skill system uses semantic matching for on-demand loading—only the description is loaded, not the full content[8]. But even so, descriptions from dozens of skills add up. MCP servers are heavier—every server registers the complete schema for all its tools when it starts.\nPrinciple: only load the tools the current task needs. MCP servers can be configured per project (.mcp.json) instead of globally. 
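In Claude Code, for example, a project-scoped config keeps a server\u0026rsquo;s schemas out of every other project\u0026rsquo;s context. A minimal .mcp.json sketch; the server name and package here are purely illustrative:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"]
    }
  }
}
```

Only sessions opened in this project pay the context cost of that server\u0026rsquo;s tool schemas; every other session starts from a leaner baseline.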
Keep only the skills you actually use, and clean up unused ones regularly.\nOpenCode has similar layering: skills load on demand (descriptions stay resident, full content enters context only when invoked), while MCP servers register complete schemas at startup—a higher loading cost. The difference in context overhead between the two maps directly onto the principle of \u0026ldquo;don\u0026rsquo;t load what you don\u0026rsquo;t need.\u0026rdquo;\nStrategy 3: Delegate Subtasks to Subagents, Isolate Context\nOnce in the execution phase, the most effective context management tool is isolation.\nThis is the most important lesson learned from Aristotle\u0026rsquo;s development. The initial Aristotle injected the complete 371-line SKILL.md of the reflection protocol into the main session. Reflection is a subtask, but all its details—the 5-Why analysis template, error classification, rule generation protocol—were crammed into the main session\u0026rsquo;s context. After the reflection subagent finished, background_output(full_session=true) pulled the complete RCA report back into the main session. Result: the main session\u0026rsquo;s context got completely polluted by the reflection task, and the main task\u0026rsquo;s space was squeezed out.\nThe redesigned solution uses Progressive Disclosure: the 371 lines were split into 4 files, loaded on demand. The Coordinator only does lightweight orchestration (84 lines), while the Reflector runs in an isolated sub-session. 
The main session only receives a one-line notification.\nThis lesson generalizes: any subtask with a complex intermediate process should be executed in an isolated environment, bringing only the final conclusion back to the main session.\nApplicable to:\n- Code exploration—let the subagent search the codebase, return only a conclusion summary\n- Solution design—let the subagent do multi-solution comparison, return only the recommended solution and rationale\n- Test execution—let the subagent run tests, return only pass/fail and key error information\n- Documentation generation—let the subagent write the first draft; the main session only reviews and revises\nThe subagent\u0026rsquo;s intermediate process—search paths, trial-and-error records, intermediate versions—takes up a lot of context but doesn\u0026rsquo;t help subsequent work at all. Isolated execution means these intermediate products exist only in the subagent\u0026rsquo;s own context and won\u0026rsquo;t pollute the main session.\nStrategy 4: Roll Back on Wrong Responses Immediately, Don\u0026rsquo;t Repeatedly Correct\nDuring execution, how you handle errors directly affects context quality.\nThe AI gives wrong code. You point out the error. It apologizes and gives a modified version—still wrong. You correct it again. It modifies again. After three rounds, six more messages are in the conversation, and the context is stuffed with wrong code, corrections, wrong again, corrections again. These intermediate exchanges don\u0026rsquo;t help subsequent work at all, but they certainly occupy context space. By the attention dilution principle, they\u0026rsquo;re weakening the model\u0026rsquo;s attention to key information about the current task.\nWorse, repeated corrections can establish a wrong \u0026ldquo;inertia\u0026rdquo; in the conversation. 
In subsequent responses, the model might reference previous wrong versions and bring back already-corrected problems.\nCorrect approach: upon discovering a wrong response, roll back directly to the state before the error, then give the correct instruction again. Don\u0026rsquo;t patch on top of errors, and don\u0026rsquo;t let the error process pollute the context.\nSpecific operations:\n- Claude Code: the /rewind command (alias /undo), or press Esc+Esc. It supports three rollback modes: roll back code only, roll back conversation only, or roll back both. Rollback relies on checkpoints created automatically before every file edit[9].\n- OpenCode: the session.revert() API, with a rollback button in the UI. Two modes: roll back conversation only (keep file modifications), or roll back conversation and code.\n⚠️ Two points to note. First, neither rollback tracks side effects from Bash commands—if you executed npm install or rm in Bash, rollback won\u0026rsquo;t undo those operations. Second, the rollback habit is deeply counter-intuitive. Human instinct is to patch on top of errors, not to pretend they never happened. Building this habit requires deliberate practice.\nHonestly, I rarely use rollback myself. One reason is that the scenario changed—after switching from purely conversational ChatGPT to agentic tools like Claude Code and OpenCode, where the model directly operates on files and runs commands, strings of consecutive errors happen significantly less often. Another reason is\u0026hellip; I simply lack the awareness. I only discovered the /rewind feature while reading the tool documentation, and learned that this was even possible. Knowing is one thing; when I encounter errors, I still subconsciously correct them instead of rolling back. I\u0026rsquo;m still building this habit.\nExtreme case: if a session is already full of correction noise, don\u0026rsquo;t hesitate—start a new session and bring clean context into it. 
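Both tools reduce to the same idea: snapshot the conversation before a risky turn, then truncate back to the snapshot instead of appending corrections. A toy Python sketch of that idea (illustrative only; neither /rewind nor session.revert() is actually implemented this way, and the real tools also snapshot file state):

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    messages: list = field(default_factory=list)
    checkpoints: list = field(default_factory=list)  # saved message counts

    def add(self, role: str, text: str) -> None:
        self.messages.append((role, text))

    def checkpoint(self) -> None:
        # Snapshot the conversation length before the model responds.
        self.checkpoints.append(len(self.messages))

    def rewind(self) -> None:
        # Truncate to the last checkpoint: the wrong response and any
        # corrections it provoked vanish from context entirely.
        self.messages = self.messages[: self.checkpoints.pop()]

s = Session()
s.add("user", "implement the parser")
s.checkpoint()
s.add("assistant", "<wrong code>")
s.add("user", "no, that branch is wrong")
s.rewind()  # instead of a third round of corrections
s.add("user", "implement the parser, handling nested quotes")
print(len(s.messages))  # prints 2 -- the error exchange left no trace
```

The point of truncating rather than appending an \u0026ldquo;ignore the above\u0026rdquo; message is that the wrong version no longer competes for attention at all.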
Context purity is more important than continuity.\nRollback has a side effect: after an error is rolled back, no trace remains in the conversation. The context is clean, but the lesson from the error is lost too. This got me thinking—can we save the error\u0026rsquo;s context before rolling back?\nThis is a new feature I\u0026rsquo;m considering adding to Aristotle: intercept rollback operations, capture the error scene before the rollback executes (the wrong instruction, the model\u0026rsquo;s response, the user\u0026rsquo;s correction intent), and trigger a reflection process. The goal isn\u0026rsquo;t just to clean context, but to transform \u0026ldquo;why rollback was needed\u0026rdquo; into reusable experience—recording error patterns, trigger conditions, and avoidance methods, reducing the likelihood of similar errors in the future.\nRollback cleans the context. It shouldn\u0026rsquo;t erase the lesson. Discarded errors, properly reflected on and recorded, are the cheapest lessons.\nStrategy 5: Compact Proactively, Don\u0026rsquo;t Wait for Auto-Trigger\nA subtask is done, and you\u0026rsquo;re ready to switch to the next one. At this point there\u0026rsquo;s a key action: proactive compaction.\nNever wait for automatic compaction. Automatic compaction is triggered by a token counter, with uncontrollable timing. It might fire while you\u0026rsquo;re debugging a complex bug—you just finished reading three files, the model just located the root cause and hasn\u0026rsquo;t had time to propose a fix yet, and the context fills up. Compressed. All the file contents and reasoning get squeezed into a summary. The model then works from the summary, losing key details.\nLooking at my data, that MCP solution design session had 45-277 messages between compactions. 
This means you can\u0026rsquo;t predict which round will trigger automatic compaction—it could interrupt your workflow at any moment.\nCorrect approach: in the gaps between subtasks of the same task, compact proactively. For example, a feature is written and you\u0026rsquo;re about to start the next one—compact first. A deep debugging session ends and you\u0026rsquo;re about to switch to documentation work—compact first.\nKey principle: before compacting, make sure the current subtask\u0026rsquo;s key conclusions have already landed—code written to files (not left in the conversation), decisions recorded to external storage. If your conclusions exist only in the conversation context, after compaction they survive only as summaries, and details may be lost.\nAristotle\u0026rsquo;s GEAR protocol writes reflection rules to the Git repository rather than keeping them in conversation, partly for this reason. The file system is a persistence layer that compaction cannot touch. Important things go in files, not in conversation.\nThe first strategy above says \u0026ldquo;new tasks start new sessions,\u0026rdquo; while this one says \u0026ldquo;compact proactively during subtask gaps.\u0026rdquo; Where\u0026rsquo;s the boundary?\nThe key is distinguishing between tasks and between subtasks. Between-task switching—done with feature A, starting feature B—should start a new session. Between-subtask switching—code written, starting tests—compacting in the same session is enough.\nBut there are fuzzy cases too. New session or compact: the boundary isn\u0026rsquo;t always clear. As time invested in the task increases and understanding of the problem deepens, what initially looked like \u0026ldquo;one big task\u0026rdquo; might be re-decomposed into several independent tasks, and what initially looked like \u0026ldquo;independent tasks\u0026rdquo; might reveal hidden dependencies. Subtask division changes, so the choice between compacting and splitting must adjust too.\nThat blog series planning session is a textbook gray area. 
992 messages, three logical units—writing the first post, writing the second, series planning—should have been three sessions. But the three posts needed coherence, so they stayed together. The 9 compactions weren\u0026rsquo;t a cost. They were an investment in active context management. Without proactive compacting, context rot erodes model performance before you notice. Later posts went to independent sessions—the fourth had only 542 messages and 3 compactions, less than half the first three combined. Enough conventions had accumulated to work outside the main session, while avoiding the weight of a single oversized session.\nThe basis for judgment isn\u0026rsquo;t the number of tasks, but the strength of information dependency between tasks. Strong dependency: keep together, pair with proactive compaction. Weak dependency: split apart, each manages its own context.\nThe Relationship Between the Five Strategies\nThese five are arranged in chronological task order, but they all revolve around the same core: fighting context rot, keeping the model\u0026rsquo;s effective attention on the current task.\n- New tasks, new sessions—different tasks don\u0026rsquo;t share context, cutting rot off at the source\n- Lean loading—reduce attention competition from irrelevant information\n- Subagent isolation—subtask intermediates don\u0026rsquo;t pollute the main session\n- Error rollback—don\u0026rsquo;t let error processes squeeze out effective space\n- Proactive compaction—periodically clean up completed subtasks, leaving context space for current work\nThe transformer\u0026rsquo;s attention mechanism isn\u0026rsquo;t perfect. In an era of increasingly long contexts, active context management isn\u0026rsquo;t just optimization—it\u0026rsquo;s a necessity. 
If you don\u0026rsquo;t manage it, the model\u0026rsquo;s attention gets diluted by irrelevant information until it can\u0026rsquo;t \u0026ldquo;see\u0026rdquo; what you want.\nReferences:\n[1] Chroma Research, \u0026ldquo;Context Rot: How Increasing Input Tokens Impacts LLM Performance\u0026rdquo; (2025-07): research.trychroma.com/context-rot\n[2] Anthropic Applied AI Team, \u0026ldquo;Effective context engineering for AI agents\u0026rdquo; (2025-09-29): anthropic.com/engineering/effective-context-engineering-for-ai-agents\n[3] Xiao et al., \u0026ldquo;Efficient Streaming Language Models with Attention Sinks\u0026rdquo; (2023): arxiv.org/abs/2309.17453\n[4] Liu et al., \u0026ldquo;Lost in the Middle: How Language Models Use Long Contexts\u0026rdquo; (2023): arxiv.org/abs/2307.03172\n[5] Chowdhury, \u0026ldquo;Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias\u0026rdquo; (2026): arxiv.org/abs/2603.10123\n[6] Anthropic, \u0026ldquo;Introducing the Model Context Protocol\u0026rdquo; (2024-11-25): anthropic.com/news/model-context-protocol\n[7] Anthropic, \u0026ldquo;Code execution with MCP: Building more efficient agents\u0026rdquo; (2025-11-04): anthropic.com/engineering/code-execution-with-mcp\n[8] Anthropic, \u0026ldquo;Equipping agents for the real world with Agent Skills\u0026rdquo; (2025-10-16): anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills\n[9] Anthropic, \u0026ldquo;Enabling Claude Code to work more autonomously\u0026rdquo; (2025-09-29): anthropic.com/news/enabling-claude-code-to-work-more-autonomously\nSeries Articles:\n- Aristotle: Teaching AI to Reflect on Its Mistakes\n- claude-code-reflect: Same Metacognition, Different Soil\n- Trust Boundaries: One Idea, Two Systems\n- From Scars to Armor: Harness Engineering in Practice\n- A Markdown\u0026rsquo;s Three Lives: From Static Rules to a Git-Backed MCP Server\n- Looking Back: Seven Human-AI Collaboration Patterns in the Aristotle Project 
","permalink":"https://blog.chuanxilu.net/en/posts/2026/04/managing-context-length-in-ai-coding-sessions/","summary":"\u003cp\u003eYesterday someone in a group chat said GPT-5.4 performed worse than Doubao. When they asked questions, the model would often give irrelevant answers without even reading the question. I asked a few follow-up questions and found they had fed it a lot of documents, and the conversation had gone on for many turns. This probably wasn\u0026rsquo;t the model\u0026rsquo;s problem—it was context rot.\u003c/p\u003e\n\u003cp\u003eI\u0026rsquo;ve had similar experiences myself. After talking to a model for a long time, it starts \u0026ldquo;forgetting\u0026rdquo; what we discussed earlier, or repeats mistakes that were already corrected. The model hasn\u0026rsquo;t gotten stupider. The conversation has just gotten too long.\u003c/p\u003e","title":"Context Rot: An Easily Overlooked Problem in AI Coding"},{"content":"Five articles in. Time to step back and look at the path itself.\nAristotle: Teaching AI to Reflect on Its Mistakes covered the design philosophy and initial implementation. claude-code-reflect: Same Metacognition, Different Soil told the story of porting across platforms. Trust Boundaries: One Idea, Two Systems proposed a trust tiering model. From Scars to Armor: Harness Engineering in Practice validated the theory through refactoring. A Markdown\u0026rsquo;s Three Lives: From Static Rules to a Git-Backed MCP Server evolved the rule storage from append-only to the GEAR protocol.\nFive articles about design and technology. This one is about the human—the specific ways AI and I collaborated throughout the project. Looking back at the full development process from early April to mid-April, I\u0026rsquo;ve distilled seven collaboration patterns. They\u0026rsquo;re not a parallel list. 
They form an evolutionary line—from high-trust launch to metacognitive closure, each pattern a correction and deepening of the one before.\nPattern 1: Human Gives Philosophy, AI Fills in the Details\nThe design and implementation of the original Aristotle.\nI gave AI three design principles—immediate trigger, session isolation, human in the loop—plus the 5-Why root cause analysis framework. AI delivered the complete SKILL.md (394 lines), test script, and README in three commits. Done.\nThis pattern runs on high-trust launch. The human defines \u0026ldquo;why\u0026rdquo; and \u0026ldquo;what.\u0026rdquo; AI handles \u0026ldquo;how.\u0026rdquo; When the problem space is clear enough and the platform infrastructure is solid, AI\u0026rsquo;s execution is strong. OpenCode\u0026rsquo;s skill system and the omo background task infrastructure had already solved the hardest parts. AI just needed to compose them.\nBut \u0026ldquo;done in one pass\u0026rdquo; carries hidden risk. The 37 static assertions verified that protocol steps executed in order. They did not verify whether the side effects were acceptable. Tests won\u0026rsquo;t tell you that the main session got flooded with 371 lines of context. They won\u0026rsquo;t tell you that users needed to open a separate terminal to review drafts. Passing tests created the illusion of \u0026ldquo;it works,\u0026rdquo; and I skipped manual verification.\nWhen tools are smooth enough, humans naturally treat review as optional. Smoothness itself becomes the trap.\nPattern 2: Platform Reality Keeps Correcting AI\u0026rsquo;s Assumptions\nSame design philosophy, different platform. Claude Code. Completely different experience.\nAI failed repeatedly to install the plugin—wrong marketplace.json format, wrong skill invocation path, config changes that didn\u0026rsquo;t take effect. Once installed, it hit a permission pitfall: bypassPermissions had a confirmed bug that silently rejected writes outside the project root. 
Later, the main session and sub-session shared an API endpoint, and concurrent requests triggered ECONNRESET errors.\nEvery time, AI confidently proposed a solution based on lessons from the previous round. Every time, platform reality pushed back. V1 introduced bypassPermissions to suppress dialogs → writes got rejected. V2 moved writes to a resumed session → forgot about preparation-phase atomicity. V3 merged everything into a single Bash command → testing revealed bypassPermissions couldn\u0026rsquo;t actually be removed.\nAI has rich theoretical knowledge but zero awareness of platform-specific implicit rules. Experience accumulated on OpenCode doesn\u0026rsquo;t transfer to Claude Code. Every ecosystem\u0026rsquo;s \u0026ldquo;obvious\u0026rdquo; details need to be relearned from scratch. It\u0026rsquo;s like a senior Java developer writing Rust for the first time: architectural skills transfer, but platform conventions don\u0026rsquo;t.\nLooking back, the differences between the three approaches seem obvious. Getting there took real effort.\nPattern 3: The Human Makes Architectural Decisions at Critical Moments\nAfter two weeks of using the original Aristotle, four problems surfaced: context pollution (371 lines injected wholesale), report leakage (full RCA pulled back into the main session), broken review flow (task sessions are non-interactive), and wasted attention (model selection popup). AI\u0026rsquo;s analysis concluded that the four problems were independent and needed separate fixes.\nI made a different call: all four problems pointed to the same structural deficiency—no separation between the \u0026ldquo;coordinator\u0026rdquo; and \u0026ldquo;executor\u0026rdquo; roles. Based on this judgment, the fix wasn\u0026rsquo;t four independent patches. It was an architectural restructuring: splitting the monolithic 371-line file into four on-demand files (Progressive Disclosure), each with a clear responsibility.\nAI couldn\u0026rsquo;t make this decision. 
AI can analyze symptoms and propose fixes for each problem individually. But attributing scattered problems to a common root cause and making an architectural decision based on that attribution—that\u0026rsquo;s a human cognitive advantage. AI\u0026rsquo;s 5-Why analysis finds surface causes. Stringing four independent 5-Why chains into one systemic architectural insight requires cross-domain abstraction.\nAnother example. During GEAR protocol design, AI suggested \u0026ldquo;L should connect directly to S, O is an unnecessary middleman\u0026rdquo;—citing CQRS as an analogy: commands go through the coordinator, queries go direct, standard practice.\nI corrected this. L is the agent helping the user write code. O is Aristotle, an independent reflection skill. They run in different contexts. L\u0026rsquo;s context should be reserved for the user\u0026rsquo;s primary task—no reflection infrastructure details should enter it. AI\u0026rsquo;s suggestion was later analyzed by Aristotle\u0026rsquo;s own reflection mechanism via 5-Why. The root cause: \u0026ldquo;a default negative judgment about indirection layers.\u0026rdquo;\nIn general software design, removing middlemen is usually reasonable. In agent systems, the isolation layer is the product.\nPattern 4: Real Usage Exposes What Design Documents Can\u0026rsquo;t\nDuring the design phase, I confidently wrote \u0026ldquo;zero context pollution in the main session,\u0026rdquo; \u0026ldquo;transparent to the user,\u0026rdquo; \u0026ldquo;won\u0026rsquo;t interrupt the workflow.\u0026rdquo; All 37 tests passed. 
The logic was correct at the code level—Coordinator did launch Reflector, Reflector did generate a DRAFT.\nThen I actually used it:\n| What the Design Promised | What Actually Happened |\n| --- | --- |\n| Zero context pollution | SKILL.md\u0026rsquo;s 371 lines fully injected + full RCA report pulled back |\n| Transparent to the user | Model selection dialog popped up immediately, consuming a conversation turn |\n| Won\u0026rsquo;t interrupt workflow | Review flow was broken; user needed a separate terminal |\nThis wasn\u0026rsquo;t AI\u0026rsquo;s fault. AI faithfully implemented the protocol in SKILL.md. The problem was that the protocol itself didn\u0026rsquo;t account for side effects. Tests verified \u0026ldquo;did the protocol execute correctly.\u0026rdquo; They didn\u0026rsquo;t verify \u0026ldquo;are the side effects acceptable.\u0026rdquo;\nLater, during claude-code-reflect development, I put this lesson into practice: let AI test the system. Testing revealed three blind spots in the design documents—bypassPermissions as a platform quirk, API concurrency as an environment constraint, heredoc variable non-expansion as a Bash implementation detail. None of these were foreseeable at design time. When AI tests a system, it\u0026rsquo;s not just an executor. It\u0026rsquo;s a design verification participant.\nAutomated tests verify correctness. Manual testing verifies experience. Neither replaces the other.\nPattern 5: AI Does the Research, Human Makes the Call\nThe previous patterns are about design and implementation. But there\u0026rsquo;s another collaboration mode running throughout the entire project, less visible but always present—AI doing research that improves the quality of my decisions and reduces the chance of mistakes.\nWhen refactoring Aristotle, I needed to confirm a critical fact: are OpenCode\u0026rsquo;s task() sessions actually non-interactive? Not a guess. 
AI examined OpenCode\u0026rsquo;s source code and database, empirically verifying that all 47 task sessions contained exactly 1 user message (a system prompt), with zero follow-up interaction. It also found GitHub Issues #4422, #16303, and #11012, all pointing to the same conclusion. This wasn\u0026rsquo;t AI\u0026rsquo;s \u0026ldquo;opinion.\u0026rdquo; It was empirical data. Based on this evidence, I made the architectural decision to move review into the main session instead of the sub-agent session. Without this research, I might have kept going down the wrong path of \u0026ldquo;let users switch into the sub-agent session for review.\u0026rdquo;\nWhen designing the MCP Server, AI produced a comparison of Git vs. SQLite. Git\u0026rsquo;s advantages (transparent, lightweight, no runtime dependency) and SQLite\u0026rsquo;s advantages (query power, complex indexing) were laid out objectively. I chose Git based on the judgment that \u0026ldquo;a frequently-debugged early system needs transparency.\u0026rdquo; AI also researched the Dream subsystem\u0026rsquo;s sandbox design from Claude Code\u0026rsquo;s leaked source, Cursor Bugbot\u0026rsquo;s multi-round parallel analysis strategy, and the latest practices in harness engineering. These research outputs went directly into the third and fifth blog posts, backing up the trust tiering model and GEAR protocol design.\nEven model selection involved AI-supported research decisions. When I ran into technical issues during claude-code-reflect development, I specifically had a deep conversation with Sonnet 4.6 to analyze the causes. But for the project implementation itself, I judged that the current model was fully capable and used it to move forward. This too is research—judgment about model capability is itself \u0026ldquo;research data\u0026rdquo; accumulated through daily use.\nThe pattern: AI provides information and options. The human chooses based on judgment. 
AI\u0026rsquo;s research capabilities—rapidly retrieving documentation, examining source code, comparing approaches, synthesizing evidence—dramatically reduced my decision-making cost. Tasks that used to take hours of flipping through docs, reading source code, and searching Issues now produce structured research results in minutes. The quality of my decisions didn\u0026rsquo;t drop (because the final judgment was still mine), but the speed and confidence improved significantly.\nAnd AI\u0026rsquo;s research didn\u0026rsquo;t just serve my decisions. It served the writing. The OpenClaw security incident analysis, CVE inventory, and industry discussions in the third article—all of that material was gathered with AI\u0026rsquo;s help. AI didn\u0026rsquo;t write my conclusions. It found and organized the evidence so my conclusions could stand.\nPattern 6: AI Executes and Verifies, Human Sets Direction and Priorities The rule management system went through six stages of iteration, each forced out by a concrete problem encountered in actual use:\n1. Rules couldn\u0026rsquo;t be rolled back → introduced Git\n2. Read/write interference → introduced read-write separation and a state machine\n3. AI executing git commands was unreliable → introduced the MCP Server\n4. Rules had no structure → introduced YAML frontmatter and search dimensions\n5. Production and review goals conflicted → introduced role separation\n6. Design was reusable → abstracted into the GEAR protocol\nThroughout this iteration, AI did the bulk of concrete execution: writing 8 MCP Server tools (75 pytest cases, all passing), designing two-phase streaming filtering, implementing atomic writes and cold-start migration. This work is voluminous, detail-heavy, and deterministic—exactly AI\u0026rsquo;s strength zone.\nBut every directional decision was mine. Why Git instead of SQLite? Because \u0026ldquo;visible and tangible\u0026rdquo; transparency is critical for a frequently-debugged early system. 
Why not let AI execute git commands directly? Because the rule repository is the user\u0026rsquo;s long-term knowledge accumulation—one wrong command can destroy the entire history. Why split O/R/C/L/S into five roles? Because R optimizes for coverage and C optimizes for precision—mixing them lets two goals interfere with each other.\nAI can produce high-quality output on \u0026ldquo;how to implement.\u0026rdquo; But \u0026ldquo;why this choice\u0026rdquo; requires human judgment. Especially when tradeoffs involve values—transparency vs. performance, security vs. flexibility, isolation vs. efficiency—these judgments aren\u0026rsquo;t technical problems at their core.\nPattern 7: The Reflection System Reflects on Its Own Designer\u0026rsquo;s Mistakes During GEAR protocol design, AI suggested that L should bypass O and connect directly to S. That error was later subjected to a 5-Why root cause analysis by Aristotle\u0026rsquo;s own reflection mechanism, which produced this rule:\nA default negative judgment about indirection layers—assuming every additional coordination layer is unnecessary complexity. This judgment is usually valid in general software design, but wrong in Aristotle\u0026rsquo;s context. Aristotle\u0026rsquo;s indirection layer isn\u0026rsquo;t overhead; it\u0026rsquo;s the product itself. The entire point of the skill is to make the reflection infrastructure invisible to the primary agent. Removing the indirection layer removes the product value.\nAI made an error while designing a reflection system. The reflection system reflected on that error and generated a preventive rule. A bit recursive, but exactly what the system was designed to do—learn from mistakes, even the designer\u0026rsquo;s own.\nThis reveals a deeper collaboration pattern: an AI system\u0026rsquo;s metacognitive capability can feed back into the system\u0026rsquo;s own design. 
When AI can examine its own decision-making process, identify cognitive biases, and generate preventive rules, it has evolved from an \u0026ldquo;execution tool\u0026rdquo; into a \u0026ldquo;cognitive partner.\u0026rdquo; This partnership isn\u0026rsquo;t built on the assumption that AI is always right. It\u0026rsquo;s built on joint reflection about errors.\nThe Evolution of Seven Patterns Arranged chronologically, the seven patterns form a clear evolutionary line:\n| Phase | Dominant Pattern | Human Role | AI Role |\n| --- | --- | --- | --- |\n| Initial design | Human gives philosophy, AI fills details | Principle setter | Solution implementer |\n| Cross-platform port | Platform reality corrects assumptions | Problem discoverer | Solution iterator |\n| Architecture refactoring | Human makes critical architectural decisions | Root-cause synthesizer | Concrete executor |\n| Usage validation | Real usage exposes design blind spots | Experience verifier | Test participant |\n| Research support | AI does research, human makes the call | Decision maker | Research assistant |\n| System iteration | AI executes and verifies, human sets direction | Direction setter | Executor and verifier |\n| Metacognitive closure | The reflection system reflects on itself | Corrector + confirmer | Self-examiner + learner |\nThe trajectory: humans gradually shift from \u0026ldquo;full involvement\u0026rdquo; to \u0026ldquo;intervention at key decision points,\u0026rdquo; while AI gradually gains \u0026ldquo;limited autonomy + self-reflection\u0026rdquo; capability. This direction and the trust tiering model in the GEAR protocol (Level 0 → Level 3) are two sides of the same coin—as trust accumulates, checkpoints move backward. But judging when it\u0026rsquo;s time to move them remains a human responsibility.\nThis is not a picture of \u0026ldquo;AI gets stronger, humans become less important.\u0026rdquo; The opposite—as AI capability grows, human judgment becomes more critical, because every decision\u0026rsquo;s blast radius expands. 
One wrong reflection rule, auto-loaded, can skew decisions across dozens of subsequent sessions. Stronger AI demands more precise human steering.\nThe key to steering AI isn\u0026rsquo;t prompt engineering, and it isn\u0026rsquo;t letting AI run autonomously. It\u0026rsquo;s intervening at the right moments—knowing when to let go, when to step in, when to reflect. The Aristotle project itself has been the training ground for that judgment.\n","permalink":"https://blog.chuanxilu.net/en/posts/2026/04/seven-human-ai-collaboration-patterns-in-aristotle/","summary":"\u003cp\u003eFive articles in. Time to step back and look at the path itself.\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"/en/posts/2026/04/aristotle-ai-reflection/\"\u003eAristotle: Teaching AI to Reflect on Its Mistakes\u003c/a\u003e covered the design philosophy and initial implementation. \u003ca href=\"/en/posts/2026/04/claude-code-reflect-different-soil/\"\u003eclaude-code-reflect: Same Metacognition, Different Soil\u003c/a\u003e told the story of porting across platforms. \u003ca href=\"/en/posts/2026/04/a-trust-boundary-design-experiment/\"\u003eTrust Boundaries: One Idea, Two Systems\u003c/a\u003e proposed a trust tiering model. \u003ca href=\"/en/posts/2026/04/from-scars-to-armor-harness-engineering-practice/\"\u003eFrom Scars to Armor: Harness Engineering in Practice\u003c/a\u003e validated the theory through refactoring. 
\u003ca href=\"/en/posts/2026/04/from-markdown-to-mcp-server-gear-protocol/\"\u003eA Markdown\u0026rsquo;s Three Lives: From Static Rules to a Git-Backed MCP Server\u003c/a\u003e evolved the rule storage from append-only to the GEAR protocol.\u003c/p\u003e","title":"Looking Back: Seven Human-AI Collaboration Patterns in the Aristotle Project"},{"content":"The previous article, From Scars to Armor: Harness Engineering in Practice, ended with Aristotle having a streamlined router (SKILL.md compressed from 371 lines to 84), an on-demand progressive disclosure architecture, and a working reflect→review→confirm workflow.\nBut one thread never got pulled: Where do confirmed rules actually live?\nThis article follows that thread. It wasn\u0026rsquo;t planned from the start. Three concrete problems in actual use forced the design out, step by step.\nFirst Hurdle: The Append-Only Trap Aristotle\u0026rsquo;s reflection ultimately writes a rule — a Markdown snippet telling future AI sessions \u0026ldquo;how to handle this type of situation.\u0026rdquo; The initial implementation was crude and brutal: all rules appended to ~/.config/opencode/aristotle-learnings.md, a single file constantly growing with new rules.\nThis approach worked. But after two weeks of use, three problems surfaced.\nProblem One: No Way to Roll Back One day AI generated a rule: \u0026ldquo;pandas groupby results must be processed with .reset_index() for proper serialization.\u0026rdquo; The rule itself wasn\u0026rsquo;t wrong, but the trigger condition was written too broadly. Subsequent simple aggregation tasks forced reset_index() calls too, which actually broke multi-index structures. Once confirmed and written, every subsequent session read that rule — until I manually opened the file, found that rule, deleted it.\nThis wasn\u0026rsquo;t as simple as \u0026ldquo;delete one line.\u0026rdquo; Finding a specific rule in mixed Markdown content requires visual scanning. 
Delete the wrong one, and there\u0026rsquo;s no git history to recover. Rules were immutable — once written, they stayed there until manual intervention.\nProblem Two: Project-Level Rules Scattered Everywhere, No Unified Management The first design did distinguish between user-level and project-level — user-level rules in ~/.config/opencode/aristotle-learnings.md, project-level rules in each project\u0026rsquo;s .opencode/aristotle-project-learnings.md. Separating the two files was the right idea.\nBut after separation, both had identical dilemmas — both append-only, neither had version control. Worse, project-level rules were scattered. When I accumulated five lessons across ten projects, those fifty rules were distributed across ten different directories. Searching and managing became a nightmare. Want to check \u0026ldquo;which project previously hit a data leak pitfall\u0026rdquo;? You had to flip through directories one by one.\nProblem Three: No Structure Between Rules Dozens of rules laid flat in a Markdown file, each just a heading plus a few lines. No category tags, no confidence scores, no \u0026ldquo;how this rule came to be.\u0026rdquo; When I wanted to find \u0026ldquo;all lessons related to data cleaning,\u0026rdquo; I had to keyword search — but the wording AI used when generating rules, and the wording I used when searching, often didn\u0026rsquo;t match. Rule says \u0026ldquo;null value handling omission.\u0026rdquo; I search for \u0026ldquo;missing values.\u0026rdquo; No match.\nCommon root of all three problems: flat append-only files can\u0026rsquo;t support \u0026ldquo;stateful knowledge management.\u0026rdquo; Even with user-level and project-level separation, without version management, without structured metadata, without unified search entry points, separation is just physical isolation, not true governance capability.\nSecond Hurdle: Why Git? Four Decision Points My first thought for improvement wasn\u0026rsquo;t Git. 
But code is text, rules are text too — why can\u0026rsquo;t rules have version management like code?\n\u0026ldquo;Feels right\u0026rdquo; and \u0026ldquo;stands up to scrutiny\u0026rdquo; are different things. In repeated discussions with AI, introducing Git went through four key decision points. Each solved a physical determinism problem in multi-agent collaboration.\nDecision One: Version Rollback — The \u0026ldquo;Undo Button\u0026rdquo; What if agent B produces hallucinations or logic errors when reviewing agent A\u0026rsquo;s output rule and corrupts the file? If we wrote our own version management — like backing up .bak files before each change — complexity would spiral: how to manage backups of backups? How to diff between multiple versions?\nGit is the world\u0026rsquo;s most mature \u0026ldquo;undo button\u0026rdquo; system. git revert or git checkout can roll back to any historical version in seconds, zero extra cost.\nDecision Two: Physical Isolation of Read-Write Conflicts When one agent is writing a rule file, another agent trying to read might read \u0026ldquo;half-written\u0026rdquo; incomplete content. In single-process software this isn\u0026rsquo;t a problem, but in environments where multiple AI sessions run in parallel, it\u0026rsquo;s a real risk.\nGit\u0026rsquo;s staging area and commit history naturally provide logical isolation. Write operations happen on disk, read operations use git show HEAD:file to read directly from Git\u0026rsquo;s object store for the previous stable version. This Snapshot Read eliminates read-write conflicts — readers and writers always see different versions.\nDecision Three: From \u0026ldquo;Modify File\u0026rdquo; to \u0026ldquo;Commit Transaction\u0026rdquo; Simple file state marking (writing status: pending in text) isn\u0026rsquo;t reliable. 
Physical state and logical state can decouple — the file exists on disk but the status flag is wrong, or the status flag is right but the file content got accidentally overwritten.\nWe need \u0026ldquo;modify file\u0026rdquo; and \u0026ldquo;activate file\u0026rdquo; to be two independent actions. git commit is essentially an atomic transaction. Only after commit does a rule become \u0026ldquo;officially live\u0026rdquo; in the system; anything uncommitted is considered untrusted. This gives consumers an absolutely reliable boundary.\nDecision Four: Lightweight and Transparent I evaluated SQLite. Databases are stronger on query capability, but they have two fatal flaws here: invisibility — you can\u0026rsquo;t open the database with a text editor to see rule content, so debugging and audit costs are high; and deployment cost — an extra runtime dependency.\nGit is file-based. You can open the folder and read the .md content directly while gaining database-level version control. This visible, tangible transparency is crucial for an early system under frequent debugging.\nCommon Conclusion of the Four Decisions Choosing Git solved four engineering problems at once — version control, physical isolation, a transaction mechanism, audit traceability — through one lightweight tool users already have.\nOn the \u0026ldquo;secure foundation\u0026rdquo; Git provides, subsequent designs — atomic writes, state machines, read-write separation — have something to build on.\nThird Hurdle: Git-Backed Filesystem Design Details Atomic Writes When rule files are written to disk, I use a \u0026ldquo;temporary file + rename\u0026rdquo; strategy — first write to a .tmp file, then os.rename() to replace the original. This guarantees two properties:\n1. Other processes (including simultaneously running AI sessions) never read \u0026ldquo;half-written\u0026rdquo; files.\n2. Even if a crash happens mid-write, the original file stays intact.\nSounds like over-engineering? Actually not. 
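The \u0026ldquo;temporary file + rename\u0026rdquo; write path described above fits in a few lines. This is a minimal sketch, not the project\u0026rsquo;s actual code; os.replace() here is the cross-platform atomic variant of the os.rename() mentioned in the text:

```python
import os
import tempfile

def atomic_write(path: str, content: str) -> None:
    """Write `content` to `path` so readers never observe a partial file."""
    # The temp file must live in the same directory as the target, so the
    # final rename stays on one filesystem and remains atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the swap
        os.replace(tmp_path, path)  # atomic: readers see old file or new file
    except BaseException:
        os.remove(tmp_path)  # crash mid-write: target file stays intact
        raise
```

A parallel session reading `path` at any moment sees either the complete old version or the complete new version, which is exactly the two properties listed above.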
AI agents often run in parallel across multiple sessions. If session A is writing a rule file while session B happens to be reading it, then without atomic-write guarantees B might read incomplete content and make decisions based on it. This isn\u0026rsquo;t a theoretical risk — it\u0026rsquo;s a problem I encountered in actual use, and it\u0026rsquo;s common in database write scenarios, which is exactly why databases use locks. But since reflection content is written to files in large chunks and the same reflection file rarely updates, I chose atomic writes without locking, keeping things lightweight and non-blocking.\nState Machine Rules no longer have just one \u0026ldquo;written\u0026rdquo; state, but a full lifecycle:\npending → staging → verified\n ↘ rejected (recoverable)\npending: the rule was just generated and hasn\u0026rsquo;t been reviewed. The file exists on disk, but not in Git.\nstaging: the reviewer is checking it. This step \u0026ldquo;locks\u0026rdquo; the rule to prevent modification while the reviewer works.\nverified: review passed; execute git add \u0026amp;\u0026amp; commit. This is the terminal state — consumers only see rules in this state.\nrejected: review failed. But not deleted — the rule is moved to the rejected/ directory with all metadata preserved, and can be restored later.\nWhy preserve rejected rules instead of deleting them outright? Because I discovered some rejected rules aren\u0026rsquo;t \u0026ldquo;completely wrong,\u0026rdquo; just \u0026ldquo;not applicable in specific scenarios.\u0026rdquo; Keeping them lets a future restore reactivate them rather than regenerating from scratch.\nRead-Write Separation When consumers (the future Agent L) read rules, they don\u0026rsquo;t read files on disk directly; they use git show HEAD:file to read Git\u0026rsquo;s committed snapshots. This means consumers only ever see verified rules, never a producer\u0026rsquo;s half-written drafts.\nRead-write separation is a key design decision. It doesn\u0026rsquo;t solve performance problems. 
It solves trust problems — consumers don\u0026rsquo;t need to trust disk file state, only Git commit history. Git commit\u0026rsquo;s atomicity became the contract between producer and consumer.\nCold Start First run, system detects old aristotle-learnings.md file, automatically executes migration: parse old Markdown format, generate YAML frontmatter for each rule (including status, category, confidence, etc.), write to Git repo. After migration completes, old file renamed to .bak backup.\nMigration isn\u0026rsquo;t as simple as \u0026ldquo;cutting old files into pieces.\u0026rdquo; Old rules have no structured metadata, need heuristic inference — parse error categories from Markdown headings, extract rule summaries from paragraphs. Inference isn\u0026rsquo;t necessarily accurate, so during migration confidence defaults to 0.7 (conservative), verified_by marked as \u0026quot;migration\u0026quot;, convenient for later manual review.\nThese design ideas came from repeated discussions with AI. I saved nine discussion records total. From initial \u0026ldquo;Git-MCP skill management plan\u0026rdquo; to finally converging on \u0026ldquo;GEAR protocol spec,\u0026rdquo; step-by-step iteration, each step recording the problems at that time, design decisions, and reasons for those choices.\nFourth Hurdle: Why MCP Server? With design direction, next step is implementation. A key technical selection question: where should these Git operations execute?\nMost direct approach: write bash commands in SKILL.md — let AI agent call git add and git commit itself. But I quickly excluded this option for three reasons:\nReliability. AI-generated git commands can have spelling errors, path errors, even destructive operations (like accidental git reset --hard). The rule repo is user\u0026rsquo;s long-term knowledge accumulation. One wrong git command can destroy entire history.\nConsistency. Every rule write needs to execute same state checks, frontmatter formatting, atomic write flows. 
Putting this logic in a prompt for AI to execute, consistency can\u0026rsquo;t be guaranteed — models sometimes \u0026ldquo;creatively\u0026rdquo; skip steps.\nTestability. Flows described in prompts are hard to test automatically.\nThese three reasons reflect the task\u0026rsquo;s character: it is a highly deterministic, standardized action. Its logic can be implemented in code, with quality guaranteed through test cases covering every node from initialization to migration to lifecycle management. Wrapping these operations as standardized tools for AI to call on demand is the higher-determinism, safer choice.\nSo MCP (Model Context Protocol) entered the picture: an independent Python process communicating with the AI agent via stdio JSON-RPC. The agent doesn\u0026rsquo;t execute git commands directly; it calls MCP-provided tools to achieve its goals. After several iterations, I defined eight such tools:\n| Tool | Operation | Purpose |\n| --- | --- | --- |\n| init_repo | Initialize | Create directory structure and Git repo, migrate old rules |\n| write_rule | Produce | Create rule file (pending state), write YAML frontmatter |\n| read_rules | Retrieve | Multi-dimensional combined query (status, category, intent tags, error summary) |\n| stage_rule | Review | Mark rule as entering staging state |\n| commit_rule | Confirm | Set status to verified, execute git add \u0026amp;\u0026amp; commit |\n| reject_rule | Reject | Move to rejected/ directory, preserving metadata |\n| restore_rule | Restore | Restore from rejected/ to the official directory |\n| list_rules | List | Lightweight metadata query (doesn\u0026rsquo;t load rule bodies) |\nEach tool is a deterministic Python function with input validation, error handling, and test coverage. The AI agent operates the rule repo by calling these tools, never bypassing them to execute git commands directly.\nThe MCP Server doesn\u0026rsquo;t give AI more capability; it adds boundaries to AI\u0026rsquo;s capability. 
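For a flavor of what a deterministic tool means in practice, here is a sketch of a commit_rule-style function. The state check and the git add \u0026amp;\u0026amp; commit sequence follow the description above, but the names, signature, and commit message are my illustration, not the project\u0026rsquo;s actual API:

```python
import subprocess
from pathlib import Path

class RuleStateError(Exception):
    """Raised when a rule is not in a committable state."""

def commit_rule(repo: Path, rule_file: str, current_status: str) -> str:
    # Only rules locked by the reviewer (staging) may be committed.
    if current_status != "staging":
        raise RuleStateError(f"cannot commit from state {current_status!r}")
    if not (repo / rule_file).exists():
        raise FileNotFoundError(rule_file)
    # The agent never runs git itself; the tool runs it with fixed arguments,
    # so a "creative" or destructive command has no entry point.
    subprocess.run(["git", "add", rule_file], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", f"verify rule: {rule_file}"],
                   cwd=repo, check=True)
    return "verified"  # the rule's new terminal state
```

Because the arguments to git are fixed in code, something like an accidental git reset --hard simply cannot happen through this interface.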
This design philosophy follows the trust calibration discussed in Part four: not distrusting AI, but narrowing \u0026ldquo;the places where errors can happen\u0026rdquo; to a predictable range through structured interfaces.\nFifth Hurdle: Retrieval Dimensions — How to Find \u0026ldquo;Relevant\u0026rdquo; Rules? With the MCP Server ready, rules have a lifecycle and Git version management. But another problem remains: when AI starts a new task, how does it know which rules relate to that task?\nThe initial implementation only supported filtering by status (verified) and category (the 8 types such as HALLUCINATION). In actual use, I found that rules under the same category might cover completely different technical scenarios — \u0026ldquo;HALLUCINATION\u0026rdquo; can mean \u0026ldquo;invented a non-existent API method\u0026rdquo; or \u0026ldquo;incorrectly claimed a config item doesn\u0026rsquo;t exist.\u0026rdquo; Categories were too coarse. Use large models for semantic comparison directly? That would make the MCP tools too heavy and sacrifice their determinism. So I decided query filtering would use only regex matching, converting semantic comparison into keyword queries.\nAfter consideration, I introduced three retrieval dimensions into the query design:\nIntent tags (intent_tags): the rule\u0026rsquo;s applicable technical field (domain) and specific goal (task_goal). Like domain: \u0026quot;database_operations\u0026quot;, task_goal: \u0026quot;connection_pool_management\u0026quot;.\nFailed skill (failed_skill): the tool or skill that errored. Like failed_skill: \u0026quot;prisma_client\u0026quot;.\nError summary (error_summary): a one-sentence description of the error site. Like \u0026quot;P2024 connection pool timeout in serverless\u0026quot;.\nThese three dimensions are automatically filled in by AI when generating rules. 
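Assembled into a rule file, this metadata would sit in YAML frontmatter ahead of the rule body. The values below reuse the examples from the text; the exact field names and nesting are my illustration, not the project\u0026rsquo;s confirmed schema:

```yaml
---
status: verified              # pending | staging | verified | rejected
category: HALLUCINATION       # error category, e.g. HALLUCINATION
confidence: 0.85              # migrated rules default to 0.7
intent_tags:
  domain: database_operations
  task_goal: connection_pool_management
failed_skill: prisma_client
error_summary: "P2024 connection pool timeout in serverless"
---
```

Phase 1 of the streaming filter only needs regex matches against these KV pairs, so a non-matching file is skipped before any YAML parsing happens.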
An inference step is added when rules are generated — the technical field is inferred from the error context, the task goal from the user\u0026rsquo;s original request, and the errored tool from the code involved.\nRetrieval dimensions can be combined: query all \u0026ldquo;database operation\u0026rdquo; rules, or more precisely, rules involving \u0026ldquo;connection pool management + timeout.\u0026rdquo; With 500 rules, Phase 1 frontmatter filtering needs only about 80ms.\nStreaming Filter One engineering detail from the retrieval implementation is worth mentioning. The read_rules tool uses a two-phase search:\nPhase 1: read only the first 50 lines of each file (YAML frontmatter usually ends within the first 20) and use regex to match KV pairs in the frontmatter. Non-matching files are skipped outright — no YAML parsing.\nPhase 2: do full frontmatter parsing and rule-body loading only for the files hit in Phase 1.\nWhy two phases? Because YAML parsing is an order of magnitude slower than regex matching. If all 500 rules went through YAML parsing, retrieval latency would spike from 80ms to nearly a second. The two-phase design excludes \u0026ldquo;definitely unneeded files\u0026rdquo; as early as possible, paying the parsing cost only where necessary. (Though whether my local system will ever accumulate 500 rules, I genuinely don\u0026rsquo;t know.)\nSixth Hurdle: S — Translating Intent to Queries With three retrieval dimensions in place, the next question: who translates natural language like \u0026ldquo;I want to do a database migration\u0026rdquo; into MCP query parameters?\nThe answer seems obvious — it can\u0026rsquo;t be L, to avoid polluting its context. The natural next thought: put it in Agent O, letting it handle routing, intent extraction, and query construction together. But would this blow up SKILL.md\u0026rsquo;s context? 
Query construction in particular needs to call the MCP service for reflection results, which return a lot of content.\nSo progressive disclosure is applied again (really the same concept as decoupling in program design, just expressed in a different scenario): query construction is extracted as an independent concern, named S (Searcher). S\u0026rsquo;s input is intent tags (domain: \u0026quot;database_operations\u0026quot;, task_goal: \u0026quot;schema_migration\u0026quot;); its output is a read_rules() parameter dict. Concretely, S:\nIf there is a domain, sets the intent_domain parameter.\nIf there is a task_goal, sets the intent_task_goal parameter.\nIf there is a failed_skill, sets the failed_skill parameter.\nIf there is an error description, extracts 2-3 keywords from it, joined with | as the keyword parameter.\nAll parameters are AND-combined, then read_rules() is called. S does no semantic understanding and no fuzzy matching — it\u0026rsquo;s a deterministic parameter constructor.\nHere\u0026rsquo;s a deliberate design choice: S has an independent agent identity in the design, but in the current implementation it\u0026rsquo;s just a function call inside O. Not a contradiction — a phased strategy. Query construction is simple enough for now that it isn\u0026rsquo;t worth spawning an independent subagent. But if future versions need semantic retrieval (vector matching), cross-repo joint queries, or query-result caching, S\u0026rsquo;s complexity will grow. The agent identity in the design reserves an evolution path from function to independent process.\nLightweight implementation first, protocol-layer reservation — the whole project\u0026rsquo;s design philosophy stays consistent.\nBut S is only one link in the retrieval chain. 
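The rules above make S small enough to sketch directly. The parameter names mirror the retrieval dimensions in the text, but the exact read_rules() signature is an assumption:

```python
def build_query(intent: dict) -> dict:
    """S: deterministic translation from intent tags to read_rules() params."""
    params = {"status": "verified"}  # consumers only ever see verified rules
    if intent.get("domain"):
        params["intent_domain"] = intent["domain"]
    if intent.get("task_goal"):
        params["intent_task_goal"] = intent["task_goal"]
    if intent.get("failed_skill"):
        params["failed_skill"] = intent["failed_skill"]
    if intent.get("error_description"):
        # Extract 2-3 keywords and join with | for regex OR matching.
        words = [w for w in intent["error_description"].split() if len(w) > 3]
        params["keyword"] = "|".join(words[:3])
    return params  # all parameters are AND-combined by read_rules()
```

No semantic understanding, no fuzzy matching: the same intent dict always produces the same query, which is what keeps S testable.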
S might return 20 rules — throw them all at the agent executing the user\u0026rsquo;s task, and their complete bodies will fill the context window, squeezing out space for the main task.\nThis raises a deeper design question: who stands between L and the reflection infrastructure, doing the filtering and compression?\nFor now O handles it; if context explosion shows up later, a separate agent can be split out for filtering, keeping each task\u0026rsquo;s context length under control. This is toolchain thinking applied to single-node complexity — and context length is an intuitive measure of an agent task\u0026rsquo;s complexity. Phased implementation doesn\u0026rsquo;t compromise the architectural principle — as the next section explains, O in the middle isn\u0026rsquo;t an expedient; it\u0026rsquo;s an architectural necessity for the learning chain.\nSeventh Hurdle: O\u0026rsquo;s Expanded Role — From Router to Knowledge Service Provider O (Orchestrator) in Aristotle\u0026rsquo;s original design was just a router — the user inputs /aristotle, O parses parameters, decides whether to start reflection or review, then hands off.\nBut in the learning chain, O\u0026rsquo;s role fundamentally changed. It no longer just distributes tasks; it becomes an isolation layer.\nL and O Aren\u0026rsquo;t the Same Agent Here\u0026rsquo;s a pitfall that both I and the AI helping me design stepped into.\nAristotle\u0026rsquo;s historical implementation had the O, R, C roles all completed in the same main-session context — load SKILL.md to become O, load REFLECT.md to start reflection, load REVIEW.md to do review. All in the same agent process.\nSo when designing the learning chain, the AI naturally assumed L was also the same agent — \u0026ldquo;L connects directly to S is fine; O is an unnecessary middleman.\u0026rdquo; It even used CQRS as an analogy: commands go through the coordinator, queries fetch directly — as a matter of course.\nI corrected this judgment.\nL is the agent helping the user write code; O is Aristotle, an independent reflection skill. They run in different contexts. 
L\u0026rsquo;s context should be reserved for the user\u0026rsquo;s main task as much as possible — reflection-infrastructure details (MCP, frontmatter, query construction) shouldn\u0026rsquo;t enter L\u0026rsquo;s context at all.\nThis distinction doesn\u0026rsquo;t matter in the P1+P2 phases, because reflection and review are themselves user-initiated operations, so occupying main-session context is reasonable. But in the learning chain, L executes the user\u0026rsquo;s main task — there, any intrusion of reflection infrastructure into L\u0026rsquo;s context is pollution; the only thing L needs is the reflection rule that helps solve the current task\u0026rsquo;s problem.\nThree Things O Does In-Between O does three things in the learning chain that L shouldn\u0026rsquo;t do:\n1. Intent extraction. L says \u0026ldquo;I want to do database migration, any pitfalls encountered before?\u0026rdquo; — O infers domain: \u0026quot;database_operations\u0026quot;, task_goal: \u0026quot;schema_migration\u0026quot; from that sentence. L doesn\u0026rsquo;t need to know what intent_tags are.\n2. Query construction and execution. O calls the S function, constructs MCP query parameters, calls read_rules(), and gets the raw results. These are internal operations of the reflection infrastructure, invisible to L.\n3. Filtering and compression. S might return 20 rules. O deduplicates, sorts by relevance, keeps at most 5, then compresses each to a 3-4 line summary — error description, pitfall-avoidance points, positive and negative examples, rule ID. L only sees this refined summary.\nL\u0026rsquo;s perspective is simple: it asked a question and received a few lessons. It doesn\u0026rsquo;t know about MCP, read_rules, or frontmatter. This is minimal pollution.\nA Valuable Mistake Worth mentioning: the \u0026ldquo;O is an unnecessary middleman\u0026rdquo; judgment later received a 5-Why root cause analysis from Aristotle\u0026rsquo;s own reflection mechanism. 
The conclusion is telling:\nA default negative judgment toward \u0026ldquo;indirect layers\u0026rdquo; — assuming every extra coordination layer is unnecessary complexity. This judgment is usually reasonable in general software design, but wrong for Aristotle. Aristotle\u0026rsquo;s indirect layer isn\u0026rsquo;t overhead. It\u0026rsquo;s the product itself. The entire skill exists to make reflection infrastructure invisible to the mainline agent. Remove the layer, remove the product value.\nAI made an error while designing a reflection system. The reflection system reflected on that error and generated a prevention rule. Somewhat matryoshka — but that\u0026rsquo;s exactly the point. Learn from errors, even the designer\u0026rsquo;s own.\nEighth Hurdle: Role Separation — O, R, C, S, L Each Does Their Job With S and O\u0026rsquo;s expanded design, the complete picture of the five roles becomes clear:\n| Role | Goal | Pursues |\n| --- | --- | --- |\n| O (Orchestrator) | Coordinate + isolate | Route correctly, minimize context pollution |\n| R (Resource Creator) | Produce rules | Recall — better to over-generate than to miss |\n| C (Checker) | Review rules | Precision — format, logic, deduplication |\n| S (Searcher) | Intent → query | Deterministic translation, no guessing |\n| L (Learner) | Consume rules | Execute main tasks + avoid known traps |\nR and C differ in an essential way — R pursues coverage, C pursues precision. Mixed together, the two goals interfere with each other. Role separation isn\u0026rsquo;t about \u0026ldquo;division of labor\u0026rdquo;; it\u0026rsquo;s about goal isolation.\nMore bluntly: R is an automated agent, and its output might contain logical errors, even hallucinations. If R-generated rules enter the production environment without review — treated by L as \u0026ldquo;must-follow lessons\u0026rdquo; — one wrong rule will pollute the decisions of every subsequent session. 
This isn\u0026rsquo;t a hypothetical; it\u0026rsquo;s a problem I hit in actual use: R wrote a rule with overly broad trigger conditions, L misapplied it in subsequent tasks, and new errors resulted. The rule repo\u0026rsquo;s influence is global — one bad rule\u0026rsquo;s destructive power far exceeds one good rule\u0026rsquo;s benefit.\nC exists to block this risk. C is the system\u0026rsquo;s only role with git commit permission — R can only write; C alone can approve. R-produced rules must pass C\u0026rsquo;s schema validation, format check, and deduplication verification before reaching the verified state that L can see. This two-step produce-then-audit flow is essentially software engineering\u0026rsquo;s Code Review — not distrust of developers, but recognition that a single perspective\u0026rsquo;s blind spots need another perspective to fill.\nL and R/C never communicate directly; they interact only through the Git repo. O is the sole coordinator — L sends requests to O, and O returns summaries to L. L doesn\u0026rsquo;t know R and C exist.\nR only writes, C only reviews, L only learns, O coordinates in between, S translates. Read-write separation guarantees L never reads R\u0026rsquo;s half-written drafts. This isolation isn\u0026rsquo;t over-design — the Progressive Disclosure architecture discussed in Part four is essentially one implementation form of role separation.\nNinth Hurdle: From Implementation to Protocol — GEAR At this point, Aristotle has a complete rule management system: Git-backed storage, a YAML frontmatter schema, a state machine, multi-dimensional retrieval, role separation.\nWhen organizing the documentation, I realized these designs aren\u0026rsquo;t limited to Aristotle — any AI agent that needs to \u0026ldquo;learn from errors and persist knowledge across sessions\u0026rdquo; faces the same problems. 
So I abstracted the core designs into an independent protocol spec, named GEAR (Git-backed Error Analysis \u0026amp; Reflection).\nGEAR defines five roles (O/R/C/L/S), a state machine, a frontmatter schema, and a Δ decision factor. Aristotle is GEAR\u0026rsquo;s first implementation — O is implemented by SKILL.md + REFLECT.md + REVIEW.md, R by REFLECTOR.md, and C by the schema validation in REVIEW.md. L and S are still in planning.\nΔ Decision Factor This mechanism is part of the GEAR protocol spec but not yet implemented in Aristotle. It is documented here as forward-looking design — the current behavior remains fixed semi mode for all rules.\nOne mechanism in GEAR is still being designed — the Δ decision factor:\nΔ = confidence × (1 − risk_weight)\nIt decides a rule\u0026rsquo;s path from staging to verified:\n| Δ Value | Review Behavior |\n|---|---|\n| Δ \u0026gt; 0.7 | auto: automatic commit, no manual confirmation needed |\n| 0.4 \u0026lt; Δ ≤ 0.7 | semi: show diff, wait for user confirmation |\n| Δ ≤ 0.4 | manual: force manual review |\nrisk_weight is determined by error category — hallucination (HALLUCINATION) has weight 0.8, syntax error (SYNTAX_API_ERROR) weight 0.2. High-risk rules need more review; low-risk rules can pass faster.\nCurrently Aristotle uses a fixed semi mode — all rules go through user confirmation. The system is still accumulating data, success-rate statistics aren\u0026rsquo;t yet sufficient, and there is no basis for adjusting review levels automatically. Once the P4 phase implements evolution_stats.json, the Δ factor can actually land.\nThe Δ factor borrows from progressive trust models — not blanket trust or distrust of AI-generated rules, but a review threshold adjusted dynamically by confidence and risk weight.
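To make the mechanism concrete, here is a minimal Python sketch of the Δ computation — the two category weights come from the spec text above, while the default weight for unlisted categories is a placeholder of mine, not part of GEAR:

```python
# Sketch of the GEAR Δ decision factor (spec-level; not yet live in Aristotle).
RISK_WEIGHT = {
    "HALLUCINATION": 0.8,     # high risk: a wrong "lesson" poisons future sessions
    "SYNTAX_API_ERROR": 0.2,  # low risk: cheap to detect and revert
}

def review_mode(confidence: float, category: str) -> str:
    """Map a rule's Δ = confidence × (1 − risk_weight) to a review behavior."""
    delta = confidence * (1 - RISK_WEIGHT.get(category, 0.5))  # 0.5 default is assumed
    if delta > 0.7:
        return "auto"    # automatic commit, no manual confirmation needed
    if delta > 0.4:
        return "semi"    # show diff, wait for user confirmation
    return "manual"      # force manual review
```

A high-confidence syntax-error rule (Δ = 0.9 × 0.8 = 0.72) would auto-commit, while an equally confident hallucination rule (Δ = 0.9 × 0.2 = 0.18) would still be forced into manual review.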
This is consistent with the \u0026ldquo;trust calibration\u0026rdquo; discussed in Part four, and it also lets users quantify the risk of being lazy.\nCurrent Status and Next Steps The GEAR protocol implementation has completed P1 and P2 (the first two phases):\nP1 (MCP Infrastructure): 8 tools, YAML frontmatter schema, multi-dimensional retrieval, atomic writes, cold-start migration. All 75 pytest tests passing. P2 (Aristotle Skill Layer Integration): REVIEW.md refactored into an MCP tool call chain, REFLECTOR.md output protocol extension, C-role schema validation. Not yet implemented:\nP3 (Learner + Searcher): let the AI automatically retrieve relevant rules before a task starts. This is the key to GEAR\u0026rsquo;s self-healing loop — evolving from \u0026ldquo;manually triggered reflection\u0026rdquo; to \u0026ldquo;automatic learning that avoids pitfalls.\u0026rdquo; P4 (Evolution Model): actual integration of the Δ decision factor, plus automatic adjustment of review levels. P5 (Documentation Wrap-up) Looking back at this design path: from one append-only Markdown file, to a Git-backed filesystem, to an MCP Server, to the GEAR protocol. No step was planned in advance; each was forced out by a concrete problem encountered in actual use.\nRules couldn\u0026rsquo;t roll back → introduce Git. Read-write interference → introduce read-write separation and a state machine. AI executing git commands directly was unreliable → introduce an MCP Server. No structure between rules → introduce YAML frontmatter and retrieval dimensions. Production and review goals conflicted → introduce role separation. The design proved reusable → abstract it as the GEAR protocol.\nEvery step solves one concrete problem.
Connected together, this path points to an ultimate goal — a self-healing loop:\nL executes a task → O retrieves relevant rules through S → L still makes an error after learning → an error report is submitted to O → O spins up R to generate a new rule → C reviews → verified → the next L retrieves the updated rules through S\nEvery failure creates new knowledge; every piece of new knowledge reduces the probability of a similar failure. The loop wasn\u0026rsquo;t designed from day one, but every iteration paved the way for it — Git gave version management, MCP gave a structured interface, retrieval dimensions gave precise matching, role separation gave trust boundaries, and O in the middle gave context isolation. Put all of these together, and the loop runs.\nDon\u0026rsquo;t try to anticipate every possible requirement at the start of system design; discover problems in use, and refine abstractions while solving them. This is harness engineering\u0026rsquo;s core method — iterate on the smallest viable solution, let every iteration solve one real problem, and eventually discover the loop is already underfoot.\nAppendix: GEAR Protocol Conformance Requirements A system claiming GEAR compliance must meet the following conditions:\nRole separation. Production (R), review (C), and consumption (L) are handled by different agents or processes. The same agent cannot execute two roles on the same rule at once. Git-backed storage. All verified rules must enter the repo via git commit. Consumers read through git show HEAD: or an equivalent mechanism. State machine enforcement. Rules may only move through the pending → staging → verified or rejected paths; no skipping allowed. Frontmatter schema. Each rule must contain YAML frontmatter with at minimum the id, status, scope, category, confidence, risk_level, intent_tags, and created_at fields. Intent-driven retrieval. The system must support querying rules by intent_tags.domain and intent_tags.task_goal. Rejected rule preservation. Rejected rules keep their original metadata and can be restored through a restore operation. Atomic writes.
Rule files written via \u0026ldquo;write temporary file + rename\u0026rdquo; method, no partial writes allowed. Reference Links Aristotle project repo: github.com/alexwwang/aristotle GEAR protocol spec (the project\u0026rsquo;s internal GEAR.md file, currently in git-mcp branch) Previous article From Scars to Armor: Harness Engineering in Practice Aristotle project is open source on GitHub, MIT license. Issues and PRs welcome.\n","permalink":"https://blog.chuanxilu.net/en/posts/2026/04/from-markdown-to-mcp-server-gear-protocol/","summary":"\u003cp\u003eThe previous article, \u003ca href=\"/en/posts/2026/04/from-scars-to-armor-harness-engineering-practice/\"\u003eFrom Scars to Armor: Harness Engineering in Practice\u003c/a\u003e, ended with Aristotle having a streamlined router (SKILL.md compressed from 371 lines to 84), an on-demand progressive disclosure architecture, and a working reflect→review→confirm workflow.\u003c/p\u003e\n\u003cp\u003eBut one thread never got pulled: \u003cstrong\u003eWhere do confirmed rules actually live?\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis article follows that thread. It wasn\u0026rsquo;t planned from the start. Three concrete problems in actual use forced the design out, step by step.\u003c/p\u003e","title":"A Markdown's Three Lives: From Static Rules to Git-Backed MCP Server"},{"content":"Three articles in. Back to code — and a hard look in the mirror.\nThe first post, Aristotle: Teaching AI to Reflect on Its Mistakes, covered the design philosophy and a smooth implementation — three commits in one go. The second, claude-code-reflect: Same Metacognition, Different Soil, described the adaptation cost of moving the same philosophy to Claude Code — continuous iteration from V1 to V3. The third, Trust Boundaries: The Same Idea on Open and Closed Platforms, proposed a tiered trust model and a harness engineering framework.\nPart three gave us theory. 
This article returns to code — using Aristotle\u0026rsquo;s refactoring to validate how those theories land in real engineering practice.\nFrom \u0026ldquo;Smooth\u0026rdquo; to \u0026ldquo;Scars\u0026rdquo; The first version of Aristotle was genuinely smooth to implement. Three commits. Complete SKILL.md to test script to README, done in one flow. Not because the problem was simple. OpenCode\u0026rsquo;s infrastructure solved the hardest parts.\nIn the first post, I confidently wrote three design principles. The second was \u0026ldquo;complete session isolation\u0026rdquo; — \u0026ldquo;reflection happens in a background sub-session, main session context is zero-pollution, won\u0026rsquo;t affect current tasks.\u0026rdquo; I also said \u0026ldquo;the entire process is transparent to users, won\u0026rsquo;t interrupt workflow.\u0026rdquo;\nAfter actually using it, I found those two claims delivered on exactly nothing.\nThe main session\u0026rsquo;s context was not zero-pollution. When /aristotle triggered, the full 371-line SKILL.md was injected into the parent session. Reflection was supposed to be \u0026ldquo;isolated metacognition,\u0026rdquo; but just starting it consumed huge chunks of the main session\u0026rsquo;s tokens. When the subprocess finished, background_output(full_session=true) pulled the entire RCA report back to the parent session. Error classification, root cause chains, suggested rules — all flooded into the working context. The goal was to help the AI understand its limits. The process of understanding ended up disrupting normal work. The sub-agent session created by task() is non-interactive — this is OpenCode\u0026rsquo;s architectural limit. The first version assumed users could jump into the sub-agent\u0026rsquo;s session for review. In practice, users could only manually open another terminal. The review flow broke in actual use. The process wasn\u0026rsquo;t transparent either. 
Startup popped a model selection dialog, consuming one round of conversation. Promises vs. reality:\n| Promise from Part 1 | Actual Behavior |\n|---|---|\n| Main session context zero-pollution | Massive context pollution: SKILL.md 371 lines fully injected + RCA report fully pulled back |\n| Entire process transparent to users | Immediately demands user choice: model selection popup consumes one conversation round |\n| Won\u0026rsquo;t interrupt workflow | Severely interrupts current workflow: results write directly into the main session. Even opening another terminal for review doesn\u0026rsquo;t work — because it\u0026rsquo;s a sub-agent session, review is structurally impossible. The flow is broken. |\nWhy such a gap between design and implementation?\nThe root cause: I over-trusted automated test results, so I didn\u0026rsquo;t test manually. 37 static assertions plus an E2E live test verified that the protocol steps executed in order — Coordinator starts Reflector, Reflector reads session, generates DRAFT. But tests verified \u0026ldquo;did the protocol execute correctly,\u0026rdquo; not \u0026ldquo;are the protocol\u0026rsquo;s side effects acceptable.\u0026rdquo; Tests won\u0026rsquo;t tell you the main session got stuffed with 371 lines of context. Tests won\u0026rsquo;t tell you users need to open another terminal to review. Tests gave a passing illusion. I thought the design principles had landed. In reality, the principles were violated at step one of implementation.\nA self-fulfilling prophecy: \u0026ldquo;When tools are smooth enough, humans naturally treat review as optional.\u0026rdquo; Automated tests passing gave me the same smoothness — \u0026ldquo;protocol works\u0026rdquo; — and made me feel manual testing was unnecessary. So I skipped examining side effects. Until I used it myself.\nOnce the root cause was clear, the refactoring direction was obvious. The original design principles weren\u0026rsquo;t wrong — isolation, transparency, low friction.
But the implementation didn\u0026rsquo;t enforce them. The refactoring isn\u0026rsquo;t about overthrowing the design. It\u0026rsquo;s about locking design principles into code structure, making violations structurally impossible.\nFrom Four Problems to One Architecture The four problems looked independent. But analysis revealed they all point to the same structural defect: the first version used a single agent skill, with no distinction between \u0026ldquo;coordinator\u0026rdquo; and \u0026ldquo;executor\u0026rdquo; roles and their responsibility boundaries.\nContext pollution: the coordinator doesn\u0026rsquo;t need to know the executor\u0026rsquo;s full protocol. But with only one agent role, everything got stuffed into a single skill file, and at runtime all of it was disclosed into the main session\u0026rsquo;s context. Report leakage: the coordinator doesn\u0026rsquo;t need to see the executor\u0026rsquo;s full output. But with only one agent, the output was sent back to the main session too. Broken review: the executor shouldn\u0026rsquo;t handle review — it\u0026rsquo;s non-interactive. But the first version had users reviewing in the executor\u0026rsquo;s session, merely moving the reflection process into a sub-agent session — because the first version designed only one agent role. Wasted attention: the coordinator\u0026rsquo;s startup flow should be zero-interaction. But the first version used a popup dialog at startup. I didn\u0026rsquo;t think through this interaction problem; I over-trusted the model\u0026rsquo;s and the prompt\u0026rsquo;s ability to produce good design. This version involved a critical investigation: the broken review. By examining OpenCode\u0026rsquo;s source code and database evidence, I confirmed that task() creates architecturally non-interactive sessions — all 47 task sessions had exactly 1 user message (the system prompt), none with follow-up interaction.
OpenCode\u0026rsquo;s GitHub Issues #4422, #16303, and #11012 confirmed this too.\nThis means review can\u0026rsquo;t happen in a sub-agent session. It must be implemented in the main session. The original implementation idea needs to be scrapped. But the principle stands. What to do? File an issue with OpenCode and wait for them to add user interaction support in sub-sessions? Or change my own design? The answer is obvious.\nFrom this constraint, a natural question: if user review is unavoidable in the workflow, and it can\u0026rsquo;t happen in a sub-session, what about launching a dedicated main session just for this? If review happens in the main session, the main session needs to know which reflection records exist — that\u0026rsquo;s the origin of aristotle-state.json. Need to load a specific record\u0026rsquo;s DRAFT report — that\u0026rsquo;s the /aristotle review N command. Need to handle confirm, revise, reject — that\u0026rsquo;s the interactive review flow.\nFurther: since the review protocol and the startup protocol are only used in different scenarios, was it necessary to put them in the same 371-line file? Do both scenarios need the same content loaded? After splitting by responsibility: routing logic stays in SKILL.md (84 lines), reflection startup logic in REFLECT.md (110 lines), review logic in REVIEW.md (167 lines), sub-agent analysis protocol in REFLECTOR.md (172 lines). Each file is loaded only in its scenario. Context usage drops significantly. The main session\u0026rsquo;s context pollution is minimized.\nAfter splitting, SKILL.md is just routing logic. But the first version had a model selection popup before startup, consuming one conversation round. Since starting reflection only needs REFLECT.md (+110 lines), the popup is completely unnecessary. Delete it, replace with command-line parameter --model. Default uses the current session model. Advanced users override via parameter. 
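As a hedged sketch of what that flag handling amounts to (the real parsing lives in the SKILL.md routing prompt, not in Python; --focus is the other parameter this refactor added):

```python
import argparse

def parse_args(argv: list[str], session_model: str) -> argparse.Namespace:
    """Illustrative /aristotle flag parsing: zero interaction by default."""
    parser = argparse.ArgumentParser(prog="/aristotle")
    # No popup: the default is the current session's model, so starting a
    # reflection costs no extra conversation round.
    parser.add_argument("--model", default=session_model)
    parser.add_argument("--focus", default=None)  # optional focus hint
    return parser.parse_args(argv)
```

The default path requires nothing from the user; only advanced users pay the cost of specifying a model.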
Starting reflection goes from a two-step operation to one.\nFinally, information flow. The first version\u0026rsquo;s background_output(full_session=true) pulled the sub-agent\u0026rsquo;s complete analysis back to the main session. After refactoring, this call is deleted entirely. When the sub-agent finishes, the main session outputs one line of notification. When users want to review, they actively pull the DRAFT report via /aristotle review N. Information flow goes from \u0026ldquo;sub-agent pushes everything to main session\u0026rdquo; to \u0026ldquo;sub-agent writes state file, user pulls on demand.\u0026rdquo;\nThe entire reasoning process distills into three principles:\nDerive architecture from constraints, not from ideal flows. First confirm what the platform can do (task sessions are non-interactive), then design the flow. Split by responsibility, load by scenario. Each file corresponds to one clear responsibility. Each scenario loads only what it needs. Put the user in the driver\u0026rsquo;s seat. Notify, don\u0026rsquo;t push. Pull, don\u0026rsquo;t inject. Command-line parameters, not popups. First Constraint: Context Boundary — 371 Lines to 84 Direct approach: split the 371-line SKILL.md monolith into four on-demand files.\n| Scenario | Files Loaded | Lines |\n|---|---|---|\n| Command routing | SKILL.md | 84 |\n| Starting reflection | SKILL.md + REFLECT.md | 194 |\n| Reviewing rules | SKILL.md + REVIEW.md | 251 |\n| Sub-agent analysis | REFLECTOR.md | 172 |\nWhen /aristotle starts, only the 84-line routing file loads. The complete analysis protocol only goes to the sub-agent. The main session is protected from the start.\nImplementation: the Coordinator passes the REFLECTOR.md location to the sub-agent via the SKILL_DIR environment variable, not by inlining the full protocol.
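A rough Python sketch of this pass-by-reference handoff — illustrative only, not the actual task() interface of OpenCode, and the prompt wording is invented:

```python
import os
from pathlib import Path

# Coordinator side: hand the sub-agent a pointer, not the protocol itself.
def build_subagent_prompt(skill_dir: str) -> str:
    os.environ["SKILL_DIR"] = skill_dir  # location travels via the environment
    # The prompt stays tiny: it names the file instead of inlining 172 lines.
    return "Read $SKILL_DIR/REFLECTOR.md and follow its analysis protocol."

# Sub-agent side: resolve the pointer and load the full protocol locally,
# so the parent session never carries REFLECTOR.md in its context.
def load_protocol() -> str:
    skill_dir = Path(os.environ["SKILL_DIR"])
    return (skill_dir / "REFLECTOR.md").read_text(encoding="utf-8")
```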
The sub-agent receives the prompt and reads the file itself.\nThe first version had no context boundary between main session and sub-agent — the sub-agent\u0026rsquo;s protocol was unconditionally injected into the main session\u0026rsquo;s context. After the fix, each scenario loads only the minimum information it needs. Think of hiring an external auditor. The auditor needs to see all the books. But you don\u0026rsquo;t need the auditor\u0026rsquo;s working papers spread across your desk. You need the conclusion, not every page of scratch paper. Reflection is a post-incident activity. It shouldn\u0026rsquo;t interfere with ongoing incident response.\nGit commit 39dffae completed this refactoring. 371 lines to 84 lines, a 77% reduction. Functionality didn\u0026rsquo;t decrease — it increased (the --focus parameter, state tracking, cross-session joint reflection, revision of already-written rules).\nSecond Constraint: Information Flow — One-Way Completion Notification Delete the background_output() call entirely. When the sub-agent finishes, the parent session outputs one line:\n🦉 Aristotle done [current]. Review: /aristotle review N The parent session no longer retrieves any analysis content. Review happens via /aristotle review N in a dedicated review session — REVIEW.md (167 lines) loads, reads the DRAFT report, and presents it for user confirmation.\nInformation flow went from bidirectional to strictly one-way: sub-agent → state file → user actively pulls.\nThe DRAFT marker means \u0026ldquo;pending verification.\u0026rdquo; Users must see it with their own eyes, manually confirm, before rules land. Part three asked the Level 0 trust question: \u0026ldquo;how much authorization can this model\u0026rsquo;s RCA quality support right now?\u0026rdquo; The answer: not enough for fully automatic writing.\nAnother consideration: when the sub-agent finishes, it triggers a completion notification. 
If the main session is processing another user request at that moment, the analysis report flooding in will disrupt current work. One-way notification plus user pull hands information flow control to the user.\nInterestingly, writing this I realize: claude-code-reflect\u0026rsquo;s /reflect review was forced (platform limitation), while Aristotle\u0026rsquo;s /aristotle review N is an active choice. Even if OpenCode had no limitations, after thinking it through I\u0026rsquo;d still design it this way — launch a dedicated main session to review DRAFTs — rather than the original approach of pulling up each sub-session individually.\nHow this design is implemented on OpenCode is worth explaining. Quite interesting.\nThird Constraint: Architectural Reality — Review Returns to Main Session This was the most convoluted problem.\nThe core issue: OpenCode\u0026rsquo;s task() creates non-interactive sessions — GitHub issues and database evidence both confirm this. The first version\u0026rsquo;s \u0026ldquo;user jumps into sub-session to review\u0026rdquo; flow wasn\u0026rsquo;t feasible in practice.\nThe solution wasn\u0026rsquo;t to bypass the limitation (like claude-code-reflect\u0026rsquo;s bypassPermissions). It was to acknowledge the limit and redesign the flow:\nSub-agent only does analysis and generates DRAFT — no user interaction Leverage OpenCode\u0026rsquo;s openness and trust: review happens in the main session via /aristotle review N Introduce ~/.config/opencode/aristotle-state.json state tracking file to manage reflection record lifecycle Support multiple reflections, distinguished by serial number. Users find scenarios worth re-reflecting via /aristotle sessions (in real life, we also occasionally recall mistakes we\u0026rsquo;ve made before, don\u0026rsquo;t we?) State flow: draft → confirmed → revised (allows re-reflect for re-analysis)\nSub-agent does analysis. Main session does review and writing. This separation isn\u0026rsquo;t a forced compromise. 
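That lifecycle can be sketched as a tiny state machine — the field names and exact transition set below are my reading of the flow, not the real file format:

```python
import json
from pathlib import Path

# Minimal sketch of aristotle-state.json record lifecycle (assumed schema).
STATE_FILE = Path.home() / ".config" / "opencode" / "aristotle-state.json"
ALLOWED = {
    "draft": {"confirmed"},    # user approves the DRAFT report
    "confirmed": {"revised"},  # already-written rules can be revised
    "revised": {"draft"},      # re-reflect re-opens analysis (my assumption)
}

def transition(records: dict, serial: str, new_state: str) -> dict:
    """Advance one reflection record, rejecting illegal state jumps."""
    current = records[serial]["state"]
    if new_state not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_state}")
    records[serial]["state"] = new_state
    return records

def save(records: dict, path: Path = STATE_FILE) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
```

Encoding the allowed transitions makes "no skipping" a property the code enforces rather than a convention the prompt requests.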
OpenCode\u0026rsquo;s non-interactive task session limit looks like a platform constraint on the surface. Underneath, it\u0026rsquo;s a healthy architectural boundary: agents executing sub-tasks should maintain independent context, free from main session interference.\nCompare with Part two\u0026rsquo;s claude-code-reflect. On Claude Code, this separation was achieved through V2→V3\u0026rsquo;s \u0026ldquo;move writes to resumed session\u0026rdquo; — lots of detours. On OpenCode, the platform\u0026rsquo;s architectural limit guided us directly to the correct separation.\nA deeper observation: how much a platform trusts developers, and how much users trust AI — sometimes these align, sometimes they oppose. In Aristotle\u0026rsquo;s scenario, they aligned. OpenCode\u0026rsquo;s openness lets you make separation cleaner. Level 0 trust design requires exactly that cleanliness.\nFourth Constraint: Minimizing User Attention Cost Simplest fix. Clearest principle.\nDelete the question tool popup. Replace with command-line parameter: /aristotle --model sonnet. Unspecified defaults to current session model.\nSmall change. The principle behind it isn\u0026rsquo;t small: starting reflection should be low-barrier. Every added interaction step reduces the probability that users will start reflection. Model selection is an advanced need for different scenarios. Worth providing to users. Not worth making part of the default flow. So the configuration moves to startup time rather than asking again after launch. In real life, we also prefer working with people who explain the situation upfront, rather than asking permission at every step.\nThe first version used an interactive popup — every time relying on the user to make a choice. After changing to command-line parameters, \u0026ldquo;no popup\u0026rdquo; is structurally guaranteed. 
The default doesn\u0026rsquo;t depend on user judgment.\nAll Four Together: Progressive Disclosure Architecture The four fixes aren\u0026rsquo;t isolated. Together they form Progressive Disclosure.\nThe final architecture, phase by phase:\nReflect Phase (/aristotle): load REFLECT.md (110 lines) → fire the Reflector as a background task → the Reflector produces a DRAFT → update the state file → emit a one-line notification → STOP.\nReview Phase (/aristotle review 1): load REVIEW.md (167 lines) → read the Reflector session and extract the DRAFT report → present the DRAFT to the user → handle confirm/revise/reject → write rules on confirm → re-reflect if requested.\n371 lines to 84 lines, a 77% reduction. Functionality didn\u0026rsquo;t decrease — it increased (the --focus parameter, state tracking, cross-session joint reflection, revision of already-written rules).\nFour files in the final structure:\nSKILL.md (84 lines) — routing layer, parameter parsing and phase dispatch REFLECT.md (110 lines) — reflection phase protocol, sub-agent startup and state tracking REVIEW.md (167 lines) — review phase protocol, DRAFT review, rule writing, revision REFLECTOR.md (172 lines) — sub-agent analysis protocol, error analysis, DRAFT generation Test assertions expanded from 37 to 63, covering file structure, progressive disclosure, SKILL.md content, hook logic, error pattern detection (Chinese and English), and architectural guarantees.\nEvery layer of this architecture is a product of trust judgments:\nFile splitting: don\u0026rsquo;t trust the parent session to absorb the sub-agent\u0026rsquo;s full context impact without problems One-way notification: don\u0026rsquo;t trust users to immediately handle asynchronous information from the sub-agent Main session review: don\u0026rsquo;t trust the sub-agent session to properly handle user interaction — and it shouldn\u0026rsquo;t No popup by default: don\u0026rsquo;t trust user attention to be unlimited But \u0026ldquo;don\u0026rsquo;t trust\u0026rdquo; here isn\u0026rsquo;t a negative judgment. It\u0026rsquo;s precise trust calibration — each component is trusted to do what it\u0026rsquo;s best at, not trusted to do what\u0026rsquo;s beyond its capabilities. Think of a symphony orchestra. It\u0026rsquo;s not that you don\u0026rsquo;t trust the horn player to play violin. Each section keeps to its own score. When it\u0026rsquo;s your solo, you get the full part (REFLECTOR.md, 172 lines). When it\u0026rsquo;s not your turn, you only need to know what\u0026rsquo;s next (SKILL.md, 84 lines). Nobody needs the full score.\nBack to the Origin: Trust-Driven Design Tradeoffs Part three proposed a harness engineering framework: \u0026ldquo;between users, tools, and language models, let trust relationships drive architectural decisions — not the other way around.\u0026rdquo; Part four validates this framework through Aristotle\u0026rsquo;s refactoring.\nThe trust curve across both projects:\n| Phase | Trust Judgment | Code Manifestation |\n|---|---|---|\n| First Aristotle | Implicit trust — didn\u0026rsquo;t consider boundaries | 371-line full injection, full report retrieval |\n| Discovering problems | Trust calibration — realized boundaries were missing | Promise vs. reality comparison, four architectural defects exposed |\n| Refactoring Aristotle | Active constraints — lock boundaries with code structure | Progressive Disclosure |\n| claude-code-reflect | Constrained choices — platform limits shape the design | bypassPermissions, resumed session |\nTwo paths, same destination. Both converge to: sub-agent analyzes, main session reviews, user approves. The difference is, on OpenCode you have the opportunity to actively choose the correct constraints. On Claude Code, you\u0026rsquo;re forced to find workarounds under platform limits.\nAn open question: as model capabilities improve, Aristotle\u0026rsquo;s DRAFT report quality will gradually improve. When trust shifts from Level 0 to Level 1-2, does the Progressive Disclosure architecture need to change?\nProbably not.
Even if sub-agent output quality is high enough, context isolation still has value. You trust the output quality. You don\u0026rsquo;t trust asynchronous information influx disrupting the main session. These two trust dimensions are independent.\nClosing Four scars became a set of armor.\nNot every scar becomes armor. Some problems need platform-level support — like interactive capability for sub-agent sessions, auto-notification mechanisms. Some constraints under the current architecture can only be mitigated, not eliminated.\nBut when trust judgments can be transformed into code structure, scars become armor\u0026rsquo;s raw material. Progressive Disclosure isn\u0026rsquo;t about showing off architectural skills. It\u0026rsquo;s about solidifying trust relationships into verifiable code constraints. The boundary between main session and sub-agent, one-way information flow, main session review, no popup by default — every constraint is a concrete trust judgment.\nAristotle and claude-code-reflect aren\u0026rsquo;t about \u0026ldquo;which is better.\u0026rdquo; They\u0026rsquo;re two points on the same trust curve, twins that inspire and iterate on each other. The real question was never \u0026ldquo;whether there should be human in the loop.\u0026rdquo; It\u0026rsquo;s: how much authorization can this model\u0026rsquo;s reliability on this task in this environment support right now?\nAs model capabilities improve, that answer will change. Checkpoints will shift backward. Automation will increase. Review frequency will decrease.\nBut deciding \u0026ldquo;it\u0026rsquo;s time to shift\u0026rdquo; — that\u0026rsquo;s always human responsibility.\nAristotle project: https://github.com/alexwwang/aristotle\n","permalink":"https://blog.chuanxilu.net/en/posts/2026/04/from-scars-to-armor-harness-engineering-practice/","summary":"\u003cp\u003eThree articles in. 
Back to code — and a hard look in the mirror.\u003c/p\u003e\n\u003cp\u003eThe first post, \u003ca href=\"/en/posts/2026/04/aristotle-ai-reflection/\"\u003eAristotle: Teaching AI to Reflect on Its Mistakes\u003c/a\u003e, covered the design philosophy and a smooth implementation — three commits in one go. The second, \u003ca href=\"/en/posts/2026/04/claude-code-reflect-different-soil/\"\u003eclaude-code-reflect: Same Metacognition, Different Soil\u003c/a\u003e, described the adaptation cost of moving the same philosophy to Claude Code — continuous iteration from V1 to V3. The third, \u003ca href=\"/en/posts/2026/04/a-trust-boundary-design-experiment/\"\u003eTrust Boundaries: The Same Idea on Open and Closed Platforms\u003c/a\u003e, proposed a tiered trust model and a harness engineering framework.\u003c/p\u003e","title":"From Scars to Armor: Harness Engineering in Practice"},{"content":" Fundamentum autem est iustitiae fides, id est dictorum conventorumque constantia et veritas.\n— Cicero, De Officiis\nThe foundation of justice is fides — constancy and truthfulness in words and agreements.\nThe first two posts told the story of two projects. Aristotle: Teaching AI to Reflect on Its Mistakes runs on OpenCode — three commits, done. claude-code-reflect: Same Metacognition, Different Soil runs on Claude Code — V1 through V3, hitting walls the entire way.\nBoth projects solve the same problem: teaching AI agents to learn from mistakes and persist those lessons as durable rules. But the implementation process made me realize there\u0026rsquo;s a question more important than technology choice. When should we trust AI\u0026rsquo;s judgment, and when should we step in?\nThis question doesn\u0026rsquo;t just apply to users trusting AI. How much a platform trusts developers also determines how much autonomy you can give the AI. 
Two layers of trust, stacked together — that\u0026rsquo;s what these experiments are really exploring.\nA Quick Comparison of the Two Projects Same core logic: detect correction signals → spawn a subagent for 5-Why root cause analysis → generate prevention rules → user confirms → write to persistent memory → auto-load next session. But on different platforms, the implementation looks nothing alike:\n| Dimension | Aristotle (OpenCode) | claude-code-reflect (Claude Code) |\n|---|---|---|\n| System primitives | task(), session_read() fully transparent, paths visible | Interfaces opaque; session_read() unavailable under some model/provider combos |\n| Permission model | Full system access; skills can do whatever the system can do | Skill system is a sandbox; bypassPermissions is a forced choice — the auto mode\u0026rsquo;s safety classifier may be unavailable in background sessions, causing deadlock |\n| Concurrency control | task() is atomic; subagent launch can\u0026rsquo;t be interrupted | Multi-step preparation calls can be interleaved by user requests; must merge into a single Bash command |\n| State management | Built-in notification system | state.json + manual session resume |\n| Implementation time | 3 commits | V1→V2→V3, continuous iteration |\nDetails in post one and post two. What I want to talk about is what\u0026rsquo;s behind the differences.\nA Tiered Trust Model A Turning Point When I compared the two implementations side by side, I expected to conclude \u0026ldquo;OpenCode is better.\u0026rdquo;\nOn reflection, that\u0026rsquo;s too simple.\nclaude-code-reflect\u0026rsquo;s human-in-the-loop design — requiring users to run /reflect review, read the RCA report with their own eyes, and manually confirm before writing to memory — isn\u0026rsquo;t just a workaround for system limitations.
It\u0026rsquo;s an active response to a real question:\nHow much do you trust this model\u0026rsquo;s root cause analysis right now?\nIf the Reflector\u0026rsquo;s RCA quality isn\u0026rsquo;t stable enough, fully automatic memory writes create systemic risk. Bad prevention rules get auto-loaded, silently affecting dozens of subsequent sessions without you noticing. By the time you discover the problem, contamination has already spread.\nAn Analogy from Code Review Culture This tension plays out every day in engineering teams: who gets to push to production?\nA high-trust team gives engineers direct push access. Deploys are fast. But if someone\u0026rsquo;s judgment is off, mistakes go straight into production. A conservative team requires code review, staging validation, and manual approval. Slower — but every gate is a chance to catch something.\nNeither approach is universally right. It depends on three things: how much track record you have with this person, how expensive a mistake would be, and whether the system can roll back.\nThe pattern is the same everywhere. A new joiner gets every PR reviewed line by line. After shipping reliably for a year, they earn the right to self-merge routine changes. Senior engineers with years of demonstrated judgment might get direct production access — but the team still runs random audit reviews, and every deploy leaves an auditable trail.\nCheckpoints shift backward as trust accumulates. From \u0026ldquo;review everything\u0026rdquo; to \u0026ldquo;review only high-risk changes\u0026rdquo; to \u0026ldquo;audit after the fact.\u0026rdquo; But the decision to shift is always deliberate, always based on evidence, and always reversible if the evidence changes.\nMapping This to AI Agents We\u0026rsquo;re in the early days of human-AI collaboration. 
Think "every PR needs review, no exceptions."

The corresponding trust model looks like this:

- **Level 0 (now):** Every RCA requires human review before writing to memory. Users confirm each prevention rule with their own eyes. This is claude-code-reflect's current design.
- **Level 1:** RCAs auto-write, flagged "pending verification." Weekly batch review to confirm or revoke.
- **Level 2:** High-confidence RCAs auto-archive. Low-confidence ones enter the review queue. Users periodically spot-check high-confidence archives.
- **Level 3:** Fully automatic, complete audit logs, random sampling as quality assurance.

Aristotle (OpenCode version) operates closer to Level 2-3. That's not a problem — it's what the open system can do. But whether it should run that way depends on your actual trust in the model's RCA quality.

### An Honest Confession

I ran automated tests on Aristotle with zero human intervention. The test script verified that the Coordinator correctly launched the Reflector, verified that session IDs were properly passed — then it ended, printing "to view analysis results, run opencode -s <id>."

That review step? I didn't do it.

My excuse was "claude-code-reflect isn't done yet, limited bandwidth." But this behavior itself proves the point: when tools are smooth enough, humans naturally treat review as optional — even when the system design requires it.

This isn't criticism. It's an honest observation about human-AI collaboration. Trust boundaries drift quietly in practice. Faster tools make lowering your guard easier than slower ones.

## Design Philosophy — Who Trusts Whom

User trust in AI is only half the picture.
How much a platform trusts its developers determines how far that user trust can actually be expressed — and that's a separate layer entirely.

### OpenCode Treats You Like a Developer

OpenCode is fully open source. `task()` launches subagents. `session_read()` reads session content. Memory file paths are completely transparent. Whatever the system can do, skills can do. Complexity goes entirely into the problem itself — how to do root cause analysis, how to classify errors, how to generate useful prevention rules.

Aristotle's flow is crystal clear as a result: user triggers → Coordinator collects metadata → Reflector analyzes in an isolated subagent → generates rules → user confirms → writes. No surprises. No hidden traps.

### Claude Code Treats You Like a User

Claude Code is a closed-source commercial product. The skill system is a sandbox Anthropic gives you within boundaries it deems safe — you can do what Anthropic decides to expose.

Implementing the same flow, complexity goes into fighting system boundaries. `bypassPermissions` is a forced choice — ideally, background subagents should use auto permission mode. But auto mode relies on an internal safety classifier that evaluates each tool call's risk level. This classifier needs an interactive session to display permission dialogs and receive user decisions. Background sessions run in non-interactive mode — the classifier can't complete its decision loop, so all tool calls in the subagent deadlock. I had to use `bypassPermissions` and compensate with hand-written path restrictions in prompts. Concurrency control has no system primitives — multi-step preparation can be interrupted by user requests. State management is emulated via the filesystem.
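A minimal sketch of what that filesystem emulation can look like (hypothetical field names; only the `.reflect/reflections/` layout is taken from the posts, the rest is illustrative): each reflection gets its own directory with a `state.json` recording its lifecycle, and "checking on a reflection" just means reading the file back, since there is no notification bus.

```python
import json
from pathlib import Path

def write_state(root: Path, reflection_id: str, status: str, session_id: str) -> Path:
    """Persist a reflection's lifecycle state as a plain JSON file on disk."""
    state_dir = root / ".reflect" / "reflections" / reflection_id
    state_dir.mkdir(parents=True, exist_ok=True)
    state_file = state_dir / "state.json"
    state_file.write_text(json.dumps({
        "id": reflection_id,
        "session_id": session_id,
        "status": status,  # e.g. "preparing" / "analyzing" / "ready-for-review"
    }, indent=2))
    return state_file

def read_state(root: Path, reflection_id: str) -> dict:
    """State lookup is just re-reading the file; nothing pushes updates to you."""
    path = root / ".reflect" / "reflections" / reflection_id / "state.json"
    return json.loads(path.read_text())
```

The upside of this emulation is transparency: every piece of state is inspectable with plain `ls` and `cat`. The downside is exactly what the posts describe: no built-in way to be notified when the state changes.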
Notifications rely on users manually resuming sessions.

The Known Issues list honestly records these problems — UUID collisions, cross-compaction notification loss, incomplete error recovery… each one a scar from patching at the system's edge.

### Where Two Layers of Trust Intersect

On the surface, this looks like the old "open source vs. closed source" debate. But the reality is more nuanced than the labels suggest.

Users trusting AI is layer one: how much confidence I have in the model's root cause analysis determines whether RCA results write automatically or need human approval. This trust isn't theoretical — during claude-code-reflect development, I ran into a series of technical issues and deliberately had a deep conversation with Sonnet 4.6 to analyze the causes. But for the actual project implementation, I judged that glm-5.1 was fully capable and used it to drive the work. That's trust in action too: my assessment of different models' capabilities on different tasks directly shaped work allocation.

Platforms trusting users is layer two: how much room a platform gives me determines how far layer-one trust can be expressed. In that conversation with Sonnet 4.6, I realized the real reason behind all the pitfalls I hit on Claude Code wasn't that the model wasn't smart enough, or that my design was flawed — it was that "how much the platform trusts its users" was decided from the start, and that determined how far you could go.

Stack the two layers: OpenCode's high trust (in developers) lets me approach Level 2-3. Claude Code's low trust pushes me back to Level 0-1. Neither is "better" — they just have different trust foundations at this moment.

But dig one level deeper.
Low platform trust isn't necessarily arrogance — it might come from a clear-eyed assessment of model capabilities. If you believe current models' RCA isn't stable enough, then restricting developers to Level 0-1 isn't blocking innovation. It's preventing systemic risk. Claude Code's sandbox design and claude-code-reflect's human-in-the-loop share the same judgment: right now, the model's autonomous decisions don't deserve full trust.

## Trust Boundaries in the Wild

So far this has been about two specific projects. But the same trust dynamics play out at industry scale.

The two layers of trust discussed above aren't abstract theorizing. Two events from the past few months serve as real-world footnotes.

### The OpenClaw Incident: Where Platform Trust Ends

OpenClaw is an open-source autonomous AI agent platform with over 340,000 GitHub stars. It can execute shell commands, read and write files, automate browsers, manage email and calendars. Users connected their Claude subscription OAuth tokens through OpenClaw, turning a $200/month subscription into $1,000-5,000 of API-equivalent usage.

Anthropic's response came in three steps. January 2026: silent technical block — subscription OAuth tokens stopped working outside the Claude Code CLI, no advance notice. February: legal compliance documentation explicitly prohibiting subscription tokens for third-party tools. April: subscriptions no longer cover third-party tool usage; pay-as-you-go required.

Community reaction was fierce. DHH called it "very customer hostile" on X[1]. George Hotz published "Anthropic Is Making a Huge Mistake"[2]. On Hacker News, users compared it to an all-you-can-eat buffet: the promise of unlimited subscriptions met actual unlimited consumers.

OpenAI took the opposite approach — OpenClaw can connect to ChatGPT Pro subscriptions.
OpenClaw's creator Peter Steinberger joined OpenAI[3], while OpenClaw moved to a foundation to stay independent. For a moment, the "open vs. closed" narrative seemed to have a clear answer.

But consider the other angle. OpenClaw hands shell access, filesystem control, and email and calendar management to an agent that prompt injection can hijack. Zenity Labs demonstrated a zero-click attack chain[4]: indirect prompt injection via a Google Document → agent creates a Telegram backdoor → modifies SOUL.md for persistence → deploys scheduled re-injection every two minutes → establishes a traditional C2 channel for full system compromise. Every step uses OpenClaw's intended capabilities — no software vulnerability required. Gartner's assessment: "unacceptable cybersecurity risk"[5].

A side note on these security issues. They triggered a real trust crisis among users. But OpenClaw's developers responded in March 2026 — disclosing nine CVEs, patching each one, and adding credential encryption at rest, plugin capability controls, and sandbox hardening[6]. The problems were real. The response was also real. That process itself validates what this post is about: it's not about never breaking things. It's about pulling back to a safe boundary quickly when things do break.

Viewed this way, Anthropic's block isn't purely a business decision. The "we treat you as a user" philosophy extends to the ecosystem level: when the trust foundation isn't there, pull back into your own harness and open up incrementally within controlled boundaries — rather than handing shell access to a third-party agent that prompt injection can hijack. As we've seen, Claude Code is gradually adding Coordinator, Team Mode, and background tasks — each step within its own defined boundaries.

And OpenAI?
Is supporting OpenClaw genuine trust in users, or an open strategy to capture the power-user ecosystem? The two strategies rest on fundamentally different assumptions about "platform-developer-user" trust relationships. I don't have an answer, but the question is worth asking.

### Harness Engineering: Trust-Driven Design Tradeoffs

Harness engineering is an emerging engineering discipline[7]: building the infrastructure that turns a language model from a text predictor into a reliable, safe agent — not the model itself, but everything around it. In late March 2026, Claude Code's source code leaked to npm due to a packaging error[8], providing a wealth of reference material for exploring harness engineering.

This methodology answers not "can the model do it" but "under what conditions is the model allowed to do it." Each layer sets trust boundaries along different dimensions. This gives future agent designers a practical thinking framework: not copying specific implementation patterns, but making trust relationships the basis for design decisions and tradeoffs.

Here are a few examples to illustrate.

**1. Physical constraints vs. prompt instructions — when to use which?**

Harness engineering has a pattern called computational control: using code structure to make violations impossible, rather than relying on prompts to ask the model to comply. For example, task list storage can be designed so agents only have an `updateStatus(taskId, newStatus)` interface — no `deleteTask()` or `editHistory()`. This means the model can't secretly mark unfinished tasks as complete. Not because it wouldn't try, but because it physically can't — the list structure and change history are not writable from its perspective. The trust judgment here: don't trust the model to honestly report its own progress.
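A toy version of that interface (hypothetical names illustrating the pattern, not any real harness's API): the store exposes a status update and read-only copies, and keeps an append-only history the caller can never rewrite.

```python
ALLOWED = {"todo", "in_progress", "done"}

class TaskStore:
    """The agent-facing surface: update a status, read a copy. No delete, no history edits."""

    def __init__(self, task_ids):
        self._status = {t: "todo" for t in task_ids}
        self._history = []  # append-only; never handed out by reference

    def update_status(self, task_id: str, new_status: str) -> None:
        if task_id not in self._status:
            raise KeyError(f"unknown task {task_id!r}")
        if new_status not in ALLOWED:
            raise ValueError(f"invalid status {new_status!r}")
        # Every change is recorded before it takes effect.
        self._history.append((task_id, self._status[task_id], new_status))
        self._status[task_id] = new_status

    def snapshot(self) -> dict:
        return dict(self._status)   # a copy: mutating it changes nothing inside

    def history(self) -> list:
        return list(self._history)  # likewise a copy
```

There is no delete and no way to reach the history except through copies, so "quietly marking work as done" always leaves a record; the prompt never has to ask for honesty. (Python's underscore convention is only a convention, of course; a real harness would put the store on the other side of a process or API boundary.)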
With prompt instructions — "please accurately report completion status" — the model might cut corners under context pressure. Computational control transforms the trust question from "will the model comply" to "can the model violate."

Conversely, if the trust foundation is sufficient — simple tasks, cheap verification, low impact from errors — prompt instructions are enough. The cost of computational control is reduced flexibility. Not every scenario justifies that cost.

**2. Adversarial review vs. direct user review — when to use which?**

Harness engineering's Evaluator Agent is a separate agent with an adversarial mindset — "try to break it." Cursor's Bugbot in its early version (2025) used a notable approach[9]: running eight parallel analysis passes on the same diff, randomizing the diff order in each pass, then using majority voting to suppress single-pass hallucinations — only when multiple passes independently flagged the same issue was it reported as a real bug. The trust judgment: don't trust the model to objectively evaluate its own work.

For high-risk decisions — code merges, security changes, production deployments — adversarial review is worth it. For routine small changes, direct user review is more efficient. The cost of adversarial review is doubled latency and expense.

**3. Red-light gating vs. free scheduling — when to use which?**

In harness engineering's multi-layer agent architecture, Workers can't touch core code until the Coordinator approves the plan. The leaked Claude Code Coordinator Mode[8] reveals a workflow split into Research, Synthesis, Implementation, and Verification phases — Workers can't overstep their bounds in the first three phases.
This is computational gating — don't trust Workers to judge priorities and dependencies themselves.

If models are mature enough and tasks are independent, free scheduling reduces bottlenecks. But the coordination overhead of concurrent agents is real — multiple agents tend to fall into repetitive work or stick to low-risk surface changes, avoiding the core problems that require deeper thinking. Red-light gating costs parallelism but ensures correct direction.

**4. How tight should the sandbox be?**

In harness engineering, the granularity of sandbox isolation is itself a trust judgment. The leaked Claude Code source reveals that its Dream subsystem[8] — a background agent that consolidates memories — is restricted to read-only bash: it can inspect the project but can't modify anything. The trust judgment: don't trust background agents not to produce side effects without supervision.

Higher isolation means better security but less flexibility. In a fully trusted sandbox, a background agent might produce unexpected modifications. In a fully untrusted sandbox, the agent can't read even necessary configurations. The tradeoff depends on the task's risk level.

### The Essence of These Tradeoffs

Behind every tradeoff is a trust judgment: do I trust the model to do this? If not fully, what structural constraint compensates? What does the constraint cost? Is that cost worth it?

Harness engineering doesn't provide standard answers.
It provides a thinking framework: between users, tools, and language models, let trust relationships drive architectural decisions — not the other way around.

## What These Two Projects Tell Us

Aristotle and claude-code-reflect aren't about "which is better." They're two points on the same trust curve.

The real question was never "should there be a human in the loop." It's: how much authorization can this model's reliability on this task support right now?

As model capabilities improve, that answer will change. Checkpoints will shift backward. Automation will increase. Review frequency will decrease. But deciding "it's time to shift" — that's always a human responsibility.

And that judgment doesn't just happen between users and AI. Platform trust in developers. Platform assessments of model capabilities. Developer dependence on toolchains. Multiple layers of trust intertwine to form the real picture of human-AI collaboration. Trust boundaries will keep drifting as model capabilities, platform strategies, and user understanding evolve. This post doesn't provide answers — it offers a framework for continuous inquiry.

## Interested in Contributing?

Both projects are MIT-licensed. Contributions welcome.

Aristotle (OpenCode): Current priorities include a sensible default model selection for the Reflector in non-interactive contexts (currently `opencode run` hangs at the model prompt), graceful degradation paths for `session_read()` across model/provider combinations, and a rule deduplication mechanism (semantically similar rules accumulate over time).

claude-code-reflect (Claude Code): Six known issues remain.
The three most critical: preparation-phase atomicity (multi-step operations must merge into a single Bash call, otherwise the UI appears frozen with no feedback), auto-notification when subagents complete, and session ID collisions on retry. Non-English correction-signal coverage and RCA prompt quality improvements are also valuable.

If you notice recurring model errors in daily use, consider turning them into test cases and submitting a PR — real-world correction patterns are the core material that makes these tools genuinely useful.

Aristotle: https://github.com/alexwwang/aristotle

claude-code-reflect: https://github.com/alexwwang/claude-code-reflect

## Appendix: OpenClaw March 2026 Security Fixes

The security issues mentioned above, with the full vulnerability list and fix timeline.

### CVE List (Disclosed March 18-21, 2026)

| CVE | Severity | Issue | Patched In |
|---|---|---|---|
| CVE-2026-22171 | High (CVSS 8.2) | Path traversal in Feishu media download → arbitrary file write | 2026.2.19 |
| CVE-2026-28460 | Medium (CVSS 5.9) | Shell line-continuation bypasses command allowlist → command injection | 2026.2.22 |
| CVE-2026-29607 | Medium (CVSS 6.4) | Allow-always wrapper bypass → approve safe command, swap payload, RCE | 2026.2.22 |
| CVE-2026-32032 | High (CVSS 7.0) | Untrusted SHELL env variable → arbitrary shell execution on shared hosts | 2026.2.22 |
| CVE-2026-32025 | High (CVSS 7.5) | WebSocket brute-force, no rate limiting → full session hijack from browser | 2026.2.25 |
| CVE-2026-22172 | Critical (CVSS 9.9) | WebSocket scope self-declaration → low-priv user becomes full admin | 2026.3.12 |
| CVE-2026-32048 | High (CVSS 7.5) | Sandbox escape → sandboxed sessions spawn unsandboxed children | 2026.3.1 |
| CVE-2026-32049 | High (CVSS 7.5) | Oversized media payload DoS → crash service remotely, no auth needed | 2026.2.22 |
| CVE-2026-32051 | High (CVSS 8.8) | Privilege escalation → operator.write scope reaches owner-only surfaces | 2026.3.1 |

Sources: [6][10]

### Architecture Security Fixes (March 21, PR #51790)

- Credential encryption at rest: AES-256-GCM + HKDF-SHA256 key derivation, macOS Keychain for master-key storage
- Plugin capability-based access control: declarative capabilities field in manifests, all 71 bundled extensions updated
- File permission hardening: atomic writes to eliminate the TOCTOU race between writeFileSync and chmodSync
- Unbounded cache mitigation: size bounds added to 9 in-memory Map caches

Sources: [11][12][13]

Sources:

1. DHH on Anthropic: x.com/dhh/status/2009664622274781625
2. George Hotz, "Anthropic Is Making a Huge Mistake": geohot.github.io/blog
3. Peter Steinberger joins OpenAI: TechCrunch
4. Zenity Labs, "OpenClaw or OpenDoor?" (2026-02-04): labs.zenity.io
5. Gartner, "OpenClaw: Agentic Productivity Comes With Unacceptable Cybersecurity Risk" (2026-01-29): gartner.com
6. OpenClaw, "Nine CVEs in Four Days: Inside OpenClaw's March 2026 Vulnerability Flood" (2026-03-28): openclawai.io
7. Birgitta Böckeler, "Harness Engineering for Coding Agent Users" (2026-04-02): martinfowler.com
8. Claude Code source leak analysis (2026-03-31): github.com/soufianebouaddis/claude-code-doc
9. Jon Kaplan, "Building a Better Bugbot" (2026-01-15): cursor.com/blog/building-bugbot
10. OpenClaw GitHub, "Security audit remediation: encryption, capabilities, hardening" PR #51790 (2026-03-21): github.com
11. OpenClaw 3.22 Release (2026-03-22): openclaws.io
12. OpenClaw 2026.3.28 Release (2026-03-28): blink.new
13. OpenClaw, "The February Security Storm" (2026-03-04): openclaws.io
# claude-code-reflect: Same Metacognition, Different Soil

Same metacognitive ability, different soil. The growing patterns look nothing alike.

My previous post, Aristotle: Teaching AI to Reflect on Its Mistakes, had three core principles: immediate trigger, session isolation, human in the loop. These sound platform-agnostic. But when I moved the same philosophy to Claude Code, I discovered something: platform differences are much larger than expected.

## First Hurdle: Plugin System Differences

Claude Code's plugin and OpenCode's skill are completely different systems. Just getting the plugin installed and recognized took several rounds of struggle.

- The marketplace.json format was wrong: the plugin installed but wasn't recognized.
- The skill call path was wrong: the system couldn't find the entry point.
- The loading mechanism was misunderstood: configuration changes wouldn't take effect.

AI repeatedly failed the installation. It took multiple rounds to figure out the correct format and location.

This raises a question: why does the same model-driven vibe coding, when designing tasks with the same goals in Claude Code, not even get the plugin system right?

The answer: the implicit rules of different platforms run deeper than surface differences. I used to think OpenCode's skill system and Claude Code's followed the same protocol.
They are highly similar on the surface. In practice, I discovered that Claude Code's plugin loading mechanism, configuration format, and path conventions all differ in detail and carry extra restrictions.

Experience the model accumulated on the first platform can't be directly migrated. Each ecosystem's "common sense" details need to be learned again. Checking documentation is still important; it has just shifted from humans checking to teaching the AI to check.

What looks like "same standard protocol + switch platform" is actually understanding another ecosystem's design details from scratch.

## Second Hurdle: Permission Model Pitfalls

The real problem was yet to come. The same reflection-subagent design went smoothly on OpenCode. On Claude Code, it hit walls repeatedly. When the reflection task started, the main session conversation was constantly interrupted by user-confirmation popups from subtasks. Users easily typed into the wrong place, causing serious context pollution, and wrong responses led to AI misunderstandings.

The subagent often failed to start entirely. Reflection tasks that did start often executed in the main session instead. This blocked the user's workflow and seriously polluted the context. The experience completely deviated from the design goal. Unacceptable.

Why does this happen? The root cause is non-atomic preparation. Starting a reflection task involves multiple independent steps: generating a session UUID, creating a directory, writing state.json, writing the prompt file, and launching a background subprocess. In Claude Code's default ask permission mode, each Bash or Write call triggers a user-confirmation popup. Between each popup, control returns to the main session.
The user's next message might slip in — at best the preparation flow is interrupted; at worst the reflection task starts directly in the main session and the context is completely polluted.

The V1 solution introduced bypassPermissions: skip all confirmation popups and let the preparation flow complete in one go. This did solve the interruption problem, but bypassPermissions does more than that — it changes the entire reflection flow's permission model. When background sub-sessions run in non-interactive mode, without it even basic file writes are rejected. In other words, bypassPermissions guarantees atomicity on one hand and becomes the source of subsequent permission issues on the other. I'll return to this detail below.

After finally getting the subagent to start (the V1 refactoring introduced the bypassPermissions solution), file writes were rejected again. After some investigation, I discovered:

Claude Code's background sub-sessions have a confirmed bug: bypassPermissions silently rejects writes outside the project root directory.

When the solution hit this bug, it manifested as follows: user-level rules (like skill updates under ~/.claude/skills/) needed precisely to write outside the project root. But background sub-sessions were designed as non-interactive, and the locations they needed to write sat exactly on the permission boundary. So saving files failed.

## Exploring Around the Pitfall: Solution Iteration v2 to v3

So I came up with a v2 solution to work around the write-permission problem: move all final writes to the user-confirmed interactive session (the resumed session). The background sub-session would only do analysis and generate drafts. This way the background sub-session only writes to .reflect/reflections/{id}/ inside the project root, avoiding that bug.

But v2 still had a problem: the atomicity of the preparation phase was forgotten.
If the preparation process was interrupted, it would leave an inconsistent state. The problem solved in the V1 refactoring came back.

So I continued with a v3 solution, merging all preparation steps into a single Bash command to eliminate the interruption window. At the same time I decided to abandon the OMC dependency and maintain only the standalone branch.

### Why Abandon the OMC Dependency

OMC brings two core capabilities:

notepad_write_priority, for cross-compaction notification — when the background subagent completes analysis, it injects a priority notification through the notepad, ensuring the reminder is still visible after context compression. But in the v3 version's redesigned write path, users need to actively resume the subagent session to do the review and writing, so the value of this notification mechanism has greatly decreased — users already know they triggered a reflection. /reflect inspect and /reflect list are enough.

project_memory_add_note / project_memory_add_directive, providing structured project-memory management. Standalone uses the Write tool to write .reflect/project-memory.json directly — functionally equivalent, just without OMC's unified management layer. For this project's usage scenario, the difference is barely perceptible.

Standalone is completely sufficient. The OMC dependency in the main branch isn't cost-effective:

First, standalone's file-based solution is more transparent — which file gets written, and what gets written, users can see and control completely, fitting this project's human-in-the-loop design philosophy. Second, the benefits OMC brings have already been marginalized — after the v3 write-path redesign, the notepad notification's value dropped significantly, and project memory written directly as JSON via the Write tool is functionally equivalent.
This isn't "giving up something valuable for convenience" — it's "the value was already marginal to begin with." Third, OMC itself needs separate installation — an extra step and cognitive burden for users — plus the ongoing cost of maintaining two branches (every SKILL.md change needs to be synchronized), and I already had the big write-path redesign to do.

So I had the v3 solution:

| Phase | Session Type | bypassPermissions | Write Scope |
|---|---|---|---|
| Preparation | Main session (1 atomic Bash call) | Yes — for atomicity | .reflect/reflections/ |
| Background Analysis | Background sub-session | Yes — required for non-interactive writes | .reflect/reflections/{id}/ |
| Review+Write | Interactive (resumed) session | No | .reflect/ + ~/.claude/ |

## Implementing v3: More Pitfalls Ahead

I used a ralph loop to execute the v3 solution changes. Cross-platform path compatibility is one such detail — Windows Git Bash and POSIX systems handle paths differently.

This step went relatively smoothly. The v3 solution drew a clear boundary between "preparation" and "analysis," concentrating on solving the write-permission issues. What came next was where I really stepped into pitfalls.

### Testing Proved That bypassPermissions Can't Be Removed

In theory, the background sub-session only writes to .reflect/reflections/{id}/ inside the project root, so bypassPermissions shouldn't be needed.

In practice? Without it, I couldn't even write files. Theory and platform reality don't always agree.

The final solution added bypassPermissions back, and added path restrictions in the prompt as defense in depth: open in permissions, constrained in logic.

A table to review the V1→V3 iteration.
Looking back, it's simple, but figuring it out really took some effort:

| Dimension | V1 | V2 | V3 |
|---|---|---|---|
| Preparation Phase | Multi-step independent calls, can be interrupted | Multi-step independent calls (same as V1) | Single atomic Bash command |
| Background Write Location | Tried to write ~/.claude/ (rejected) | Only write to project root | Only write to project root |
| Final Write Location | Background sub-session writes directly | Moved to resumed session | Moved to resumed session |
| bypassPermissions | Introduced — suppress popups | Tried to remove — theoretically not needed | Added back — actually required |
| OMC Dependency | Yes | Yes | Abandoned, standalone only |

Iteration isn't linear progress, but constant trade-offs between atomicity, permission safety, and dependency complexity. Each solution solves the previous version's problems, then exposes new boundary conditions.

### Testing Revealed API Concurrency Errors

Another problem surfaced during testing. The main session and sub-session share the API endpoint, and concurrent requests triggered ECONNRESET errors.

Troubleshooting took a few detours. First I tried specifying a different model — I suspected a model-switching issue. Then I checked the third-party API configuration — I suspected a routing problem. Finally I confirmed: the API I was using had a concurrency limit, and concurrent requests to the same endpoint get rejected. Switching to an API with looser limits made the problem go away.

### Solution: Retry Mechanism

Since concurrency limits objectively exist, let's add a retry mechanism:

```bash
(
  MAX_RETRIES=3
  RETRY_DELAY=10
  attempt=0
  while [ $attempt -lt $MAX_RETRIES ]; do
    claude -p "$(cat prompt.txt)" \
      --session-id $SESSION_ID \
      --model ${REFLECT_SUBAGENT_MODEL:-sonnet} \
      --permission-mode bypassPermissions \
      --output-format json 2>>stderr.log
    [ $? -eq 0 ] && break
    attempt=$((attempt + 1))
    sleep $RETRY_DELAY
  done
) &
```

The script spawns a background sub-shell wrapping the claude -p call.
After a failure it waits 10 seconds and retries, up to 3 times. A configurable model parameter, REFLECT_SUBAGENT_MODEL, also lets users choose the model according to their own API's concurrency limits.

Verification succeeded, matching expectations. But this is an uncontrollable external risk; the current design can only mitigate its impact, not eliminate it.

## Finally: 6 Known Issues Remain

Not every problem has an elegant solution. Let's honestly face the unresolved ones:

- The preparation phase confuses users (looks like freezing, actually analyzing in the background)
- No way to automatically notify the user when the sub-session completes
- Session ID might conflict during retry
- The Read tool doesn't display accurately when rendering markdown
- Insufficient error-recovery options
- Cross-compaction notification reliability

Solving these problems requires platform-level support, or making trade-offs under current constraints. Engineering is like this — not all problems have perfect solutions.

## The Value of AI-Driven Testing

The entire testing process was completed by AI. That alone isn't the point. The point is that several problems discovered during testing sat in the blind spot of the original solution documentation: the bypassPermissions behavior is a platform characteristic, not a design problem; API concurrency is an environment limitation, also not a design problem; a heredoc variable failing to expand is a Bash implementation detail, not a design problem at all.

If designed in traditional ways, these problems might only be exposed after launch. Let AI test the system, and it can discover edge cases the human solution didn't foresee. This point is worth emphasizing — if you're designing a system, let AI test it.
AI isn\u0026rsquo;t just an executor, it\u0026rsquo;s also a participant in design verification.\nNext Post Preview Next post: Trust Boundaries: The Same Idea on Open and Closed Platforms — systematically comparing the differences between the two systems, from skill systems to permission models to concurrency control to underlying design philosophy, seeing what the same metacognitive mechanism grows into on different soil, and what insights it gives us for future AI practice.\n","permalink":"https://blog.chuanxilu.net/en/posts/2026/04/claude-code-reflect-different-soil/","summary":"\u003cp\u003eSame metacognitive ability, different soil. The growing patterns look nothing alike.\u003c/p\u003e\n\u003cp\u003eMy previous post, \u003ca href=\"/en/posts/2026/04/aristotle-ai-reflection/\"\u003eAristotle: Teaching AI to Reflect on Its Mistakes\u003c/a\u003e, had three core principles: immediate trigger, session isolation, human in the loop. These sound platform-agnostic. But when I moved the same philosophy to Claude Code, I discovered something: platform differences are much larger than expected.\u003c/p\u003e\n\u003ch2 id=\"first-hurdle-plugin-system-differences\"\u003eFirst Hurdle: Plugin System Differences\u003c/h2\u003e\n\u003cp\u003eClaude Code\u0026rsquo;s plugin and OpenCode\u0026rsquo;s skill are completely different systems. Just getting the plugin installed and recognized took several rounds of struggle.\u003c/p\u003e","title":"claude-code-reflect: Same Metacognition, Different Soil"},{"content":"\u0026ldquo;Knowing yourself is the beginning of all wisdom.\u0026rdquo; — Aristotle\nEvery time I work with an AI coding assistant, I run into the same problem.\nMistakes that were corrected get repeated in the next session. The model isn\u0026rsquo;t stupid. There\u0026rsquo;s a structural gap in memory.\nFor example. Last week I corrected a mistake the model made. It apologized, I accepted, we kept working. 
Today I started a new session, and the same mistake appeared again.\nThe correction just\u0026hellip; evaporated.\nThis isn\u0026rsquo;t a one-off problem. It\u0026rsquo;s the norm in every conversation. The model remembers the current context, but not the mistakes it made before and how they were corrected.\nThis made me realize something. The best time to reflect is right when the mistake happens. No delays. No switching tools. No interrupting the current workflow.\nWhen cognitive balance breaks — the model makes a mistake, the user corrects it — the metacognitive rebalancing should start soon after the conflict appears. But it shouldn\u0026rsquo;t interfere with the original task.\nI learned this from cognitive science. But the Vibe Coding tools and plugins I use don\u0026rsquo;t have this capability yet.\nSo I built one.\nThree Design Principles When designing Aristotle, I set three principles for myself.\nFirst, let the user trigger reflection, but remind them. When a signal of error correction is detected, remind the user to run /aristotle.\nDon\u0026rsquo;t make them actively think, \u0026ldquo;Oh, I should record this mistake.\u0026rdquo; The user might be focused on correcting a complex error and forget to reflect.\nBut fully automatic reflection carries a risk. If the user hasn\u0026rsquo;t finished entering the full correction before reflection starts, the reflection will likely be incomplete.\nSecond, complete session isolation. The reflection process happens in a background sub-session with zero pollution of the main session context. It won\u0026rsquo;t affect the current task.\nAlso, when the model makes a mistake, the user might already be impatient. If the main flow had to wait for a reflection task before continuing, the user experience would be terrible.\nAnother benefit of isolation is that the reflection session can access the complete conversation history without disrupting the current work rhythm. 
After all, reflection is the model\u0026rsquo;s responsibility, not the user\u0026rsquo;s obligation.\nThird, \u0026ldquo;human in the loop\u0026rdquo;. Generated rules don\u0026rsquo;t get persisted without user approval.\nAI might mistake a temporary correction for a general rule. Or treat a special case as a general pattern. The user\u0026rsquo;s judgment is the last line of defense.\nHow Aristotle Works The core is 5-Why root cause analysis. Starting from the surface error, ask \u0026ldquo;why\u0026rdquo; layer by layer to find the true root cause.\nFor example. If the model outputs incorrect code, the first layer asks why — it might have misunderstood the requirement. The second layer asks why it misunderstood — maybe the context was insufficient. The third layer asks why the context was insufficient — maybe the user didn\u0026rsquo;t explicitly state a constraint.\nAsking five times usually pinpoints the problem.\nErrors are divided into 8 categories: MISUNDERSTOOD_REQUIREMENT, ASSUMED_CONTEXT, PATTERN_VIOLATION, HALLUCINATION, INCOMPLETE_ANALYSIS, WRONG_TOOL_CHOICE, OVERSIMPLIFICATION, SYNTAX_API_ERROR.\nClassification isn\u0026rsquo;t the goal. It\u0026rsquo;s to make subsequent rule matching more precise.\nRules are divided into two levels: user-level and project-level. User-level rules follow the individual. Project-level rules are shared within the team.\nThe specific scope judgment mechanism is an engineering detail, so I won\u0026rsquo;t go into it here. What matters is that the layered rules let the results of reflection be reused in the appropriate scope — neither over-generalized nor limited to one person\u0026rsquo;s experience.\noh-my-opencode\u0026rsquo;s background tasks naturally support isolated sub-sessions. The main session triggers a background task. 
The Reflector reads the conversation history in a completely isolated environment, does root cause analysis, and generates rule suggestions.\nThe entire process is transparent to the user and doesn\u0026rsquo;t interrupt the workflow.\nSmooth Implementation OpenCode\u0026rsquo;s skill system plus the omo background task infrastructure made Aristotle\u0026rsquo;s implementation surprisingly smooth. It only took 3 commits.\nThe first commit was the complete SKILL.md, 394 lines. I wrote the entire protocol in one go — including the Coordinator-Reflector dual-layer architecture, the 5-Why root cause analysis template, and the Stop Hook automatic detection logic.\nThe Coordinator only does lightweight orchestration. It collects metadata like the session ID, project directory, and language. The Reflector does the real analysis in a completely isolated background session.\nThis design was clear from the start. No back-and-forth adjustments.\nThe second commit was the test script: 37 static assertions plus E2E live tests. The tests covered the complete chain from trigger mechanism to rule generation.\nWhen writing tests, I found several edge cases — like empty conversation history and multi-round correction scenarios. I added explanations for these in the SKILL.md.\nThe third commit was the README, writing down the design philosophy and usage clearly and bringing this project to a close.\nThe whole process went smoothly. 37 static assertions plus E2E tests all passed. The full chain — from trigger to rule generation — ran end to end without a hitch.\nNot because the problem was simple. But because OpenCode\u0026rsquo;s infrastructure already solved the hardest parts. The skill system makes implementing custom commands natural. The omo task() background task natively supports session isolation. 
The session read/write APIs are complete.\nI just had to combine these capabilities and express Aristotle\u0026rsquo;s design philosophy with a clear framework.\n(Added 2026-04-11: The \u0026ldquo;smoothness\u0026rdquo; described here turned out to be the smoothness of the tests. What the tests didn\u0026rsquo;t cover — that\u0026rsquo;s a story for part four.)\nReflection Itself Is Worth Reflecting On I thought about the name Aristotle for a long time. I finally chose it because of that sentence.\n\u0026ldquo;Knowing yourself is the beginning of all wisdom.\u0026rdquo;\nGiving AI the ability to reflect is essentially letting AI recognize its own limitations. This isn\u0026rsquo;t simple error recording. It\u0026rsquo;s structured metacognition.\nThe model\u0026rsquo;s mistakes aren\u0026rsquo;t a question of being smart or stupid. They happen because it has no awareness of its blind spots. Aristotle makes these blind spots explicit through root cause analysis and transforms them into learnable rules.\nThere\u0026rsquo;s a concept in cognitive science called metacognition. It refers to \u0026ldquo;thinking about thinking,\u0026rdquo; or \u0026ldquo;knowing what you know and knowing what you don\u0026rsquo;t know.\u0026rdquo;\nHuman learning largely depends on metacognitive ability. When we encounter difficulties, we stop and reflect. \u0026ldquo;Why did I make this mistake?\u0026rdquo; \u0026ldquo;How do I avoid it next time?\u0026rdquo;\nThis reflection isn\u0026rsquo;t accidental. It\u0026rsquo;s a core step in the learning process.\nWhat AI was missing is exactly this step. The model can learn and adjust in real time during conversations, but this learning is temporary and situational. After the session ends, the experience is lost.\nAristotle structures this step. It lets the model \u0026ldquo;know what it doesn\u0026rsquo;t know\u0026rdquo; and transforms this awareness into persistent knowledge.\nThis isn\u0026rsquo;t magic. 
Rules generated by Aristotle still need human review. The rules may also be limited in the scenarios where they apply.\nBut it fills in the missing metacognitive step at the system level. It makes AI\u0026rsquo;s learning go from temporary and isolated to persistent and cumulative.\nWhat\u0026rsquo;s Next Moving the same philosophy to Claude Code wasn\u0026rsquo;t that simple.\nOpenCode has a complete skill system and background task infrastructure. Implementing Aristotle was a natural fit. Claude Code\u0026rsquo;s environment is different. Many capabilities need to be redesigned.\nIn the next post, I\u0026rsquo;ll talk about this process and the technical challenges encountered. The core philosophy stays the same, but the implementation path is completely different.\nThe key to mastering AI isn\u0026rsquo;t prompts. It\u0026rsquo;s giving AI the ability to learn from mistakes. This is the starting point of this series. And what I consider the most important point.\n","permalink":"https://blog.chuanxilu.net/en/posts/2026/04/aristotle-ai-reflection/","summary":"\u003cp\u003e\u0026ldquo;Knowing yourself is the beginning of all wisdom.\u0026rdquo; — Aristotle\u003c/p\u003e\n\u003cp\u003eEvery time I work with an AI coding assistant, I run into the same problem.\u003c/p\u003e\n\u003cp\u003eMistakes that were corrected get repeated in the next session. The model isn\u0026rsquo;t stupid. There\u0026rsquo;s a structural gap in memory.\u003c/p\u003e\n\u003cp\u003eFor example. Last week I corrected a mistake the model made. It apologized, I accepted, we kept working. 
Today I started a new session, and the same mistake appeared again.\u003c/p\u003e","title":"Aristotle: Teaching AI to Reflect on Its Mistakes"},{"content":"Chuanxilu for Skilled Homo sapiens (能工智人的传习录), inspired by Wang Yangming\u0026rsquo;s philosophy of the Unity of Knowledge and Action.\nThis is a personal space for an AI practitioner, covering:\nAI Practice: Agent exploration, prompt engineering, LLM applications, technical reflections\nTech Tinkering: Toolchains, productivity, interesting projects\nLife Notes: Reading notes, parenting and education, daily reflections\nKnowledge is the beginning of action; action is the completion of knowledge.\n","permalink":"https://blog.chuanxilu.net/en/about/","summary":"\u003cp\u003eChuanxilu for Skilled Homo sapiens (能工智人的传习录), inspired by Wang Yangming\u0026rsquo;s philosophy of the Unity of Knowledge and Action.\u003c/p\u003e\n\u003cp\u003eThis is a personal space for an AI practitioner, covering:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAI Practice\u003c/strong\u003e: Agent exploration, prompt engineering, LLM applications, technical reflections\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTech Tinkering\u003c/strong\u003e: Toolchains, productivity, interesting projects\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLife Notes\u003c/strong\u003e: Reading notes, parenting and education, daily reflections\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eKnowledge is the beginning of action; action is the completion of knowledge.\u003c/p\u003e","title":"About"},{"content":"Why Blog Knowledge is the beginning of action; action is the completion of knowledge.\nThe idea behind this blog aligns with Hugo\u0026rsquo;s philosophy — less tooling, more content. In an era where AI evolves daily, watching without practicing is not enough. 
A place to document and solidify real practice is needed.\nWhy Hugo After some research, I chose Hugo with the PaperMod theme:\nHugo: Blazing fast builds, zero runtime dependencies, great CJK support\nPaperMod: Clean and minimal, dark mode, SEO-friendly, active community\nCloudflare Pages: Global CDN, generous free tier\nTechnology shouldn\u0026rsquo;t be a barrier to writing. Less tooling, more content.\nWhat\u0026rsquo;s Next AI Practice: Agent exploration, prompt engineering, LLM applications, technical reflections\nTech Tinkering: Toolchains, productivity, interesting projects\nLife Notes: Reading notes, parenting and education, daily reflections\nUnity of knowledge and action. This is where it begins.\n","permalink":"https://blog.chuanxilu.net/en/posts/2026/04/hello-world-the-inaugural-post/","summary":"\u003ch2 id=\"why-blog\"\u003eWhy Blog\u003c/h2\u003e\n\u003cp\u003eKnowledge is the beginning of action; action is the completion of knowledge.\u003c/p\u003e\n\u003cp\u003eThe idea behind this blog aligns with Hugo\u0026rsquo;s philosophy — less tooling, more content. In an era where AI evolves daily, watching without practicing is not enough. A place to document and solidify real practice is needed.\u003c/p\u003e\n\u003ch2 id=\"why-hugo\"\u003eWhy Hugo\u003c/h2\u003e\n\u003cp\u003eAfter some research, I chose Hugo with the PaperMod theme:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eHugo\u003c/strong\u003e: Blazing fast builds, zero runtime dependencies, great CJK support\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003ePaperMod\u003c/strong\u003e: Clean and minimal, dark mode, SEO-friendly, active community\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCloudflare Pages\u003c/strong\u003e: Global CDN, generous free tier\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eTechnology shouldn\u0026rsquo;t be a barrier to writing. Less tooling, more content.\u003c/p\u003e","title":"Hello World · The Inaugural Post"}]