Zanwen (Ryan) Fu

ML Engineer · Founder
zanwenfu.com

LinkedIn  ·  zanwen.fu@duke.edu

Incoming MLE @ Robinhood Central AI · May 2026
MS Computer Science (AI/ML) @ Duke · BComp CS with Distinction @ NUS


I build agentic AI systems that survive production. The model is the easy part. The harness around it — memory, rollback, observability, the architecture between the LLM and the user — is where reliability actually lives.


Three projects

VYNN AI  ·  sole engineer · ~500 pilot users · vynnai.com

Institutional equity research end-to-end in under 7 minutes. LangGraph supervisor orchestrates 7 specialized agents; the LLM never touches a number. All financial math is deterministic Python; LLMs produce narrative that a regex validator blocks if citation coverage drops below 95%. A custom 1,293-line Excel formula evaluator keeps the DCF workbook and downstream JSON consistent without requiring Excel at runtime. Reproducibility validated empirically: CV 0.016–0.035 across 9 production runs, paraphrase stability 0.983.

stock-analyst  ·  api-runner  ·  vynnai-web  ·  blog
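
The citation-coverage gate described above can be sketched in a few lines. This is a minimal illustration, not VYNN's actual validator: the `[n]` marker format, function names, and 95% threshold default are assumptions for the example.

```python
import re

# Hypothetical sketch of a citation-coverage gate: every sentence in the
# generated narrative should carry at least one [n] citation marker, and
# the report is blocked if coverage falls below a threshold.
CITATION = re.compile(r"\[\d+\]")

def citation_coverage(sentences: list[str]) -> float:
    """Fraction of sentences containing at least one [n] citation marker."""
    if not sentences:
        return 1.0
    cited = sum(1 for s in sentences if CITATION.search(s))
    return cited / len(sentences)

def passes_gate(sentences: list[str], threshold: float = 0.95) -> bool:
    """Block the narrative unless citation coverage meets the threshold."""
    return citation_coverage(sentences) >= threshold
```

The point of a regex gate rather than an LLM check is determinism: the same narrative always passes or fails the same way.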


AutoCodeRover  ·  acquired by Sonar · ISSTA 2024

Autonomous code repair agent. I designed the Self-Fix Agent — when a patch fails, an LLM-as-a-Judge diagnoses which pipeline stage failed, generates corrective feedback, and replays from that stage while preserving upstream state via UUID-targeted responses. I also built the JetBrains IDE plugin end-to-end in Kotlin: GumTree 3-way AST merge, PSI-based context enrichment, embedded SonarLint 10.3.0. AutoCodeRover moved from 38.4% → 51.6% on SWE-bench Verified during my contribution period; Sonar's Foundation Agent, built on the core, reached 79.2% on Verified — top-ranked among autonomous remediation agents as of Feb 2026.

auto-code-rover  ·  jetbrains-ide-plugin  ·  blog
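
The stage-replay idea behind the Self-Fix Agent can be sketched abstractly. This is an illustrative toy, not ACR's real code: the `Pipeline` class, its method names, and the caching scheme are invented for the example; only the core idea (cache each stage's output under a UUID, replay from the blamed stage while upstream state persists) comes from the description above.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    stages: list                                 # list of (name, fn) pairs
    cache: dict = field(default_factory=dict)    # uuid -> stage output
    order: list = field(default_factory=list)    # uuids in execution order

    def run(self, x, start: int = 0):
        # Reuse cached upstream outputs for stages before `start`.
        for uid in self.order[:start]:
            x = self.cache[uid]
        del self.order[start:]
        for name, fn in self.stages[start:]:
            x = fn(x)
            uid = str(uuid.uuid4())
            self.cache[uid] = x
            self.order.append(uid)
        return x

    def replay_from(self, failed_stage: int, patched_fn):
        # Swap in corrected behavior for the blamed stage, keep upstream intact.
        name, _ = self.stages[failed_stage]
        self.stages[failed_stage] = (name, patched_fn)
        return self.run(None, start=failed_stage)
```

In the real system an LLM-as-a-Judge picks `failed_stage` and generates the corrective feedback; here the replay mechanics are the part being shown.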


taste  ·  Agent OS kernel · v0 shipped

The implementation of the thesis in Beyond the Harness. Planner / Worker / Monitor split across three Claude tiers on git as the memory substrate: branches are execution contexts, commits are checkpoints, git worktree is process isolation, git reset --hard is rollback. Three demos shipped with committed transcripts and full cost telemetry — real-Claude run at $0.0964 / 43s / 15-of-15 tests green, parallel worktrees at ~60% wall-clock reduction, hermetic rollback where a regression is caught by pytest and the session branch stays clean. 40 tests, CI-green, pip-installable.

taste-is-all-you-need  ·  design thesis
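
The commit-as-checkpoint / reset-as-rollback primitive is easy to demonstrate directly against git. This is a minimal sketch of the idea, not taste's actual API; the helper and file names are invented for the example.

```python
import pathlib
import subprocess
import tempfile

def git(repo: pathlib.Path, *args: str) -> str:
    """Run a git command in `repo` and return its stdout."""
    return subprocess.run(["git", "-C", str(repo), *args],
                          check=True, capture_output=True, text=True).stdout.strip()

repo = pathlib.Path(tempfile.mkdtemp())
git(repo, "init", "-q")
git(repo, "config", "user.email", "agent@example.com")
git(repo, "config", "user.name", "agent")

# Checkpoint: commit known-good state before the worker acts.
(repo / "module.py").write_text("def f():\n    return 1\n")
git(repo, "add", "-A")
git(repo, "commit", "-qm", "checkpoint: baseline")
good = git(repo, "rev-parse", "HEAD")

# Worker introduces a regression...
(repo / "module.py").write_text("def f():\n    return None  # broken\n")

# ...monitor detects it and rolls back to the checkpoint.
git(repo, "reset", "--hard", good)
assert (repo / "module.py").read_text().endswith("return 1\n")
```

Because rollback is a git primitive, the recovery path is the same one every developer already trusts, rather than bespoke agent state management.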


Research

architectural-damping · A deterministic downstream calculator absorbs 83% of LLM-layer prompt-injection successes — and the exact figure (ρ = 0.83) is predictable ex ante from source code. 6/6 frozen attackability predictions held on the pilot. Identifies attack-surface rotation as a failure mode distinct from Nasr et al. 2025's ASR recovery. System under study: VYNN AI (offline replica).
speculative-decoding-t4 · Sequoia predicts 1.68× speedup on T4; I measured 0.56×. A four-term decomposition reconciles the 3× gap to within 1.1% of measurement noise. The natural A100 optimization (cross-iteration KV persistence) measurably worsens T4 — the fourth hidden assumption, surfaced by attempting the optimization.
football-llm + scaling study · QLoRA-tuned Llama-3.1 8B on FIFA World Cup prediction (52.3% result acc, 29.7% exact score; anonymized variants beat named ones, ruling out team-name memorization). The follow-up scaling study is the more interesting result: under the standard reporting convention, QLoRA at n=192 beats 5-shot ICL by 12.5pp — but under a coherence-required metric (label + score + ground truth all agreeing), the gap collapses to a tie at 42.2%. Magnitude/direction decomposition shows the LLM's real edge is in score magnitude (19pp pregame on O/U 2.5), driven by pretrained scoreline priors tabular features can't replicate.
LUMINA · Four-agent citation screening for medical systematic reviews. 0.982 mean sensitivity / 0.018 FNR across 15 SRMAs (~150K citations). On 4 held-out benchmark SRMAs from Tran et al. 2024 (Ann Intern Med), perfect 1.000 sensitivity with 20–40pp specificity improvements over their GPT-3.5 PICOS baseline. Sole first author.

What I think

The harness is the bottleneck, not the model. When agents fail in production, the infrastructure around the LLM broke — not the LLM itself. taste is my attempt at what that infrastructure should look like; Beyond the Harness is the argument behind it.

Context engineering is the real leverage. Most agent failures I've debugged trace back to what the agent didn't know, not what it reasoned poorly about. The architectural-damping study extends this outward: even what lives between the LLM and the user is a context layer, and it can absorb LLM-layer compromise before it reaches anyone.

"Usually right" isn't good enough. VYNN's recommendation validator exists because LLMs fabricate financial numbers. ACR's GumTree merge exists because git apply fails when code has diverged. The speculative-decoding gap showed that a cost model built on "usually right" assumptions can predict 1.68× and deliver 0.56× on different hardware. Systems that run unsupervised have to hold on the edge cases, not just the common ones.


Writing


Open to full-time SWE / ML engineering roles starting 2027. If you're building something hard, I'd love to hear about it.
zanwen.fu@duke.edu

Last updated: April 2026
