From Shipping
to Studying
Agent Systems
Engineer turned researcher. I've built multi-agent LLM systems serving real enterprise users. Now I want to understand why they work, and how to make them work better.
4-stage cascade router (Keyword → Embedding → LLM → Hybrid) tested on CLINC150 with 3 seeds. R4 Hybrid matches full-LLM accuracy (82.6% vs 82.9%, McNemar p > 0.3) while cutting 74% of LLM calls. Total experiment cost: $0.44. ACL workshop paper ready for arXiv.
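The McNemar comparison behind that p-value uses only the paired queries where the two routers disagree. A minimal sketch of the exact two-sided form, with illustrative function and variable names (not the actual experiment code):

```python
# Exact two-sided McNemar test from paired per-query correctness.
# Only discordant pairs (one router right, the other wrong) carry signal.
from math import comb

def mcnemar_exact(correct_a: list, correct_b: list) -> float:
    """Return the exact two-sided McNemar p-value for paired predictions."""
    b = sum(a and not x for a, x in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(x and not a for a, x in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0  # the routers never disagree
    # Under H0 the discordant pairs split 50/50: double the binomial tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With a 1-vs-9 discordant split over 10 disagreements this returns roughly 0.021; a perfectly balanced split returns 1.0, matching the intuition that symmetric disagreement is no evidence of an accuracy difference.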
Aria
→AI-powered decision system with a 7-member "parliament" of competing analyst personas. 6-agent pipeline (macro → screening → analysis → parliament → verdict), ~10K LOC across 13 Python modules, FastAPI dashboard with SSE streaming. First-week directional accuracy: 80% (8/10).
Multi-Agent LLM Platform
→3 agents, 11 MCP tools, 3 orchestration modes. Chat interface: 6.6s avg / 80% accuracy. Quick UI: 5.9s avg / 60% accuracy. The counterintuitive finding that structured input doesn't always win drove my research into Intent Density.
taxFormatTool: 永盛 Accounting Data Conversion System
→56 commits, 14K LOC, 28 API endpoints. Multi-tenant platform with row-level security, configurable field mapping, and post-export locking. Replaced a manual Excel workflow that took accountants ~4 hours per client; now completes in under 10 minutes. Serving 10+ clients across 3 POS vendors.
egcpa_helper: Internal Work Assistant
→~100 commits, 4 modules, v2.0.0. Equity CTE traversal, Odoo ERP integration, work logs, case tracking. Used daily by 50+ staff for task management and case oversight. Biggest lesson: ERP integration ran 3x over budget; 60+ commits on the Odoo spike alone taught me to enforce read-only boundaries.
POS → MERP Converter
→Turned a tedious manual process into a one-click pipeline. Maps POS export fields to the ERP schema, validates data integrity, and flags gaps. Reduced per-client monthly data entry from ~3 hours to ~5 minutes; now used by the entire accounting team for all POS-integrated clients.
Monthly Report Generator
→End-to-end report automation: ingests raw POS data, runs financial calculations, and generates formatted PDF reports. Cut the monthly close from 2 days to ~30 minutes per client. Now handles reporting for 8+ clients; accountants review and approve rather than build from scratch.
Cross-conversation memory skill for AI assistants, built on top of MemPalace by milla-jovovich. Adapts the palace metaphor into a lightweight Claude Code skill: wings for topics, notes for sessions, tunnels for cross-references. The Hot Cache pattern loads full context in ~170 tokens. Includes visualization, stats, and Obsidian export.
E-commerce Platform
→Full-stack e-commerce application built during CS coursework at UTSA. Complete shopping experience with JWT auth, an admin dashboard, and Docker deployment.
The Path Here
I started building things with my hands: 10 years of competitive robotics across WRO, FLL, FRC, VEX, and APRA, always as part of a team. In FRC we had 20+ members split into mechanical, electrical, programming, and strategy sub-teams; in WRO my team of three earned a WRO World Championship representing the USA. Those years taught me that orchestration matters more than any single component. A mediocre robot with excellent sub-system coordination beats a brilliant one with poor integration.
The turning point was CMU's Robotics Feiyue Program in 2019. Walking through the Gates Center for Computer Science, seeing labs where robots learned from experience rather than following fixed rules, I realized the next frontier wasn't mechanical; it was intelligence. That's when I decided to study CS at UTSA, shifting from hardware systems to software, from robots to AI.
At an accounting firm, I got to test that lesson about orchestration with real stakes. I designed and shipped a multi-agent LLM platform with hybrid task routing, dual-interface design, and RAG knowledge bases, serving 50+ daily users handling real financial data. Along the way I built 4 production systems from scratch.
But building exposed gaps that engineering alone can't close. My mock router handles 70% of queries through keywords, but where exactly does the remaining 30% fail, and why does Claude succeed where rules don't? I tuned a 5-iteration agent loop by instinct, but I want to know the principled way to set that threshold. I can make agents work, but I want to understand why. These are the questions that pull me toward formal research in LLM agent systems.
AI & Full-Stack Engineer
B.S. Computer Science
CMU Robotics Feiyue Program
Competitive Robotics
International
FIRST Robotics
Skills & Certifications
Leadership & Mentoring
Open Questions
Questions I ran into while shipping production agent systems: the ones where the right answer wasn't obvious from the code, and where I suspect a systematic answer exists.
LLM Agent Orchestration in Enterprise Environments
Building a multi-agent system for an accounting firm taught me that the hard problem isn't making agents work; it's making them work together, reliably, at acceptable cost. My system uses hybrid routing (rule-based + LLM-classified) to delegate tasks across models, and most design choices were made by intuition rather than principle. These are the questions I'd like to answer more rigorously.
When does structured input beat natural language?
My dual-interface experiment showed Chat hitting 80% accuracy while a form-based Quick UI only reached 60%, the opposite of what I expected. The form's dropdown selections get stringified into natural language before reaching the orchestrator, and that stringification step is where information gets lost. The interesting question: is there a measurable property of the input ("intent density," effective slots per token) that predicts when structured input wins and when it loses?
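One way to make "intent density" operational is sketched below as known slots filled per token. The slot patterns, the crude tokenizer, and the example strings are all illustrative assumptions; defining a robust version of this metric is exactly the open question:

```python
# Illustrative "intent density" score: known semantic slots filled per token.
# The slot patterns and tokenizer here are assumptions for demonstration.
import re

def intent_density(text: str, slot_patterns: dict) -> float:
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0.0
    filled = sum(1 for pat in slot_patterns.values() if re.search(pat, text))
    return filled / len(tokens)

# Hypothetical slots for a trading-style query:
SLOTS = {
    "ticker": r"\b[A-Z]{2,5}\b",
    "action": r"\b(buy|sell|hold)\b",
    "date": r"\b\d{4}-\d{2}-\d{2}\b",
}

# A terse chat query fills the same slots in far fewer tokens than the
# stringified output of a form, so it scores a higher density.
chat = "sell NVDA 2025-01-15"
form = "The user has selected the sell action for ticker NVDA on the date 2025-01-15"
```

Under this toy definition the chat query scores well above the stringified form, which would be consistent with the accuracy gap I observed; whether a learned version of the metric predicts it is the research question.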
How far can cheap routers go before the LLM has to step in?
My three-mode orchestrator showed keyword rules handling ~70% of queries on their own. My follow-up CLINC150 experiment (3 seeds, McNemar-tested) found a keyword → embedding → LLM cascade matches full-LLM routing accuracy (82.6% ± 1.2pp vs 82.9% ± 0.6pp) while calling the LLM on only 26% of queries, a 74% reduction in LLM cost. That's one dataset in English commercial domains. I don't yet know whether enterprise/domain-specific traffic is systematically more cascadeable than general-purpose traffic, or whether the cascade's ceiling simply rises and falls with keyword coverage.
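The cascade can be sketched as a chain of increasingly expensive classifiers, each handling only what the previous stage could not decide confidently. Everything below (the names, the 0.85 similarity threshold, the stub classifiers) is an illustrative assumption, not the experiment code:

```python
# Sketch of a keyword → embedding → LLM cascade router.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class Route:
    intent: str
    stage: str  # which cascade stage made the decision

def cascade_route(
    query: str,
    keyword_rules: Dict[str, str],                    # keyword -> intent
    embed_match: Callable[[str], Tuple[str, float]],  # -> (intent, similarity)
    llm_classify: Callable[[str], str],
    sim_threshold: float = 0.85,
) -> Route:
    q = query.lower()
    # Stage 1: keyword rules, near-free, absorb the bulk of routine traffic.
    for kw, intent in keyword_rules.items():
        if kw in q:
            return Route(intent, "keyword")
    # Stage 2: embedding similarity, accept only high-confidence matches.
    intent, sim = embed_match(query)
    if sim >= sim_threshold:
        return Route(intent, "embedding")
    # Stage 3: the LLM sees only the residual hard cases.
    return Route(llm_classify(query), "llm")
```

The cost saving falls out of how few queries survive to stage 3; the open question is whether the similarity threshold can be set per domain in a principled way rather than tuned by hand.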
Writing & Blog
Paper readings, system reflections, and thoughts on where LLM agents are heading.
Hybrid Router: Matching LLM Accuracy with 74% Fewer LLM Calls on CLINC150
I tested four routing strategies on CLINC150 across 3 random seeds (n=1,200 pooled LLM calls). A keyword→embedding→LLM cascade matches full-LLM accuracy (82.6% ± 1.2pp vs 82.9% ± 0.6pp, McNemar not significant in 3/3 seeds) while calling the LLM on only 26% of queries, a 74% LLM cost reduction with no accuracy loss.
Dual Interface Experiment: Chat vs. Quick UI Intent Parsing
Two interfaces, same MCP tools. The UI was 39% faster on ambiguous queries. The difference was intent density, not interface quality.
Paper Read: To CoT or Not: Chain-of-Thought Isn't Always the Answer
UT Austin's Durrett meta-analyzes 100+ papers. CoT mostly helps on math only. What this means for agent routing costs.
Dissecting Claude Code: What 512K Lines of Leaked Source Reveal
Six-layer architecture, fail-closed tool design, three-tier memory, five-level compression, KAIROS daemon mode, and anti-distillation: a deep technical read.
Paper Read: TheAgentCompany: Agents Complete Only 24% of Tasks
CMU's Neubig benchmarks agents on real workplace tasks. Best model: 24%. Why that's both damning and expected.
Paper Read: Multi-Agent ToT Validator: Reasoning Needs a Referee
UTSA's Najafirad adds a validator agent to Tree-of-Thought. The pattern matters more than the 5.6% gain.
Paper Read: The Context Trap: Modular vs. Monolithic
E2E audio models degrade on multi-turn dialogue. But is modularity inherently better, or just a crutch for weaker models?
Paper Read: SalesBot: Strategy ≠ Planning in Agent Design
CoT-injected dialogue strategies work for sales. But does the approach generalize to domains with wider strategy trees?
From Odoo Read-Only to Multi-Agent: My Architecture Evolution
How a 100-commit internal platform with Odoo JSON-RPC integration led me to design a multi-agent orchestrator.
Three-Mode Orchestrator: Mock → Ollama → Claude Hybrid Routing
My orchestrator supports mock, local Qwen2.5:7b, and cloud Claude Sonnet. The latency gap is 5x. Here's the design.
32 Skills + RAG: Building a Domain Knowledge System for Accounting
22 tax/accounting skills, a pure JSON RAG pipeline, and the Claude Agent SDK. How I built a knowledge system without ML models.
Paper Read: AutoGen: Comparing to My Multi-Agent System
Microsoft's conversational multi-agent framework vs. my centralized orchestrator. Same problem, opposite design choices.
Paper Read: ReAct: Why My Agent Loop Works
I built a ReAct loop without knowing it had a name. Comparing my mock vs. Claude gap to the paper's ablation studies.
Let's Connect
Interested in collaborating on LLM agent systems research. Open to discussion and feedback on my work.