MemClaw Blog · April 2026
The Karpathy Loop Changed How We Think About AI Research.
Here’s What It’s Still Missing.
Yanki · 10 min read
In early March, Andrej Karpathy pushed a 630-line Python script to GitHub and went to sleep. By morning, his agent had run 50 experiments, discovered a better learning rate, and committed the proof to git — without a single human instruction in between.
By day two, the agent had completed 700 experiments and found 20 optimizations on code Karpathy had already hand-tuned for months — including a bug in his attention implementation he’d missed entirely. Shopify CEO Tobi Lütke tried it overnight: 37 experiments, and he woke up to a 0.8B model outperforming his hand-tuned 1.6B. Half the parameters, better results. Then he pointed it at Liquid, Shopify’s templating engine — 53% faster rendering, 61% fewer memory allocations, 93 automated commits.
The industry dubbed it the Karpathy Loop.
And it’s about to change everything.
The Pattern, Not the Project
Strip away the ML specifics and what Karpathy built is a universal design pattern with three primitives:
- One file the agent can edit — the single surface area for experimentation.
- One objective metric — unambiguous, machine-evaluable, no committee required.
- One time budget per cycle — a fixed window that forces the agent to commit or revert.
Fig 1 — The Karpathy Loop: one agent, one file, one metric, repeat forever
The human writes a program.md — goals, constraints, stopping criteria — and walks away. The agent loops: hypothesize → edit → run → measure → keep or discard → repeat. All night. All weekend. Indefinitely.
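The loop above is simple enough to sketch in a few lines. This is an illustrative skeleton, not Karpathy's actual script: `hypothesize` and `run_experiment` are placeholders for the LLM call and the training run, and the metric here is just a random number standing in for validation loss.

```python
import random

def hypothesize(history):
    """Placeholder for the LLM step: propose a change given past results."""
    return {"change": f"scale lr by {random.uniform(0.5, 2.0):.2f}"}

def run_experiment(hypothesis):
    """Placeholder: apply the edit to the one file, train, return the metric."""
    return random.random()  # stand-in for e.g. val loss; lower is better

def karpathy_loop(cycles):
    best, history = float("inf"), []
    for _ in range(cycles):
        hyp = hypothesize(history)
        metric = run_experiment(hyp)
        keep = metric < best           # the single objective metric decides
        if keep:
            best = metric              # "commit": the change survives
        history.append((hyp, metric, keep))  # the log a human reads later
    return best, history

best, history = karpathy_loop(50)
```

Everything else in the pattern — the time budget, the git commits, the program.md constraints — hangs off this one decision point: did the metric improve, yes or no.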
This isn’t AutoML with a fresh coat of paint. AutoML uses random mutations and evolutionary search. The Karpathy Loop uses an LLM agent that reads research, forms hypotheses, and reasons about why a change should work before trying it. The loop is intelligent, not stochastic.
Karpathy himself put it plainly: any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm. The question isn’t whether this pattern works. The question is whether your problem fits inside it.
Most do.
From Solo Agent to Fleet Reality
Here’s where the conversation gets interesting — and where the original autoresearch repo reveals its limits.
Karpathy’s loop is beautiful in isolation. One agent, one file, one metric, one GPU. But production isn’t isolation.
Fig 2 — Autoresearch works solo. Production fleets need infrastructure it doesn’t provide.
Production is:
- Multiple agents running parallel experiment loops across different objectives.
- Multiple teams — R&D optimizing model quality while DevOps optimizes inference latency while Security audits every change.
- Discoveries that need to flow — Agent A finds that a warmup schedule change improves convergence, and Agent B needs to know this before it wastes 200 cycles rediscovering it.
- Governance — which agent wrote what, when, why, and who’s allowed to see it.
In Karpathy’s repo, the “memory” is a git log. The “governance” is that only one agent touches train.py. The “knowledge sharing” is a human reading results.tsv in the morning.
That works for a demo. It does not work for an organization running 50 agents across five teams with compliance requirements and a board that asks questions.
What the Karpathy Loop Actually Needs
Let’s map the loop’s implicit requirements to what a real deployment demands:
Persistent, Structured Memory
Every experiment cycle produces knowledge: what was tried, what worked, what failed, and why. In autoresearch, this lives in git commits and a flat TSV. In a fleet deployment, you need:
- Semantic search across thousands of experiment results.
- Entity extraction linking outcomes to models, hyperparameters, and architectures.
- Contradiction detection — flagging when Agent A’s finding conflicts with Agent B’s before one of them wastes a night of compute chasing a dead end.
Cross-Agent, Cross-Fleet Recall
The whole point of running multiple agents is parallelism. But parallelism without shared memory is just expensive duplication. When Agent A discovers that QKNorm needs a scalar multiplier for attention sharpening, every agent in the fleet should have access to that finding — filtered by relevance, scoped by permissions.
Governed Access
Not every agent should see everything. The agent optimizing your public model’s architecture shouldn’t have access to the proprietary dataset agent’s findings. The intern’s experimental loop shouldn’t be able to overwrite the production baseline. You need visibility scopes, trust tiers, and an audit trail on every read and write.
Knowledge That Compounds
This is the meta-lesson from both the Karpathy Loop and Meta’s Hyperagents research (published the same month): the most powerful AI systems don’t just solve problems — they get better at solving problems. Hyperagents, left to self-improve across diverse domains, independently invented persistent memory and performance tracking as core infrastructure. They converged on exactly the primitives that governed memory provides.
The loop needs memory that improves itself. Per-agent retrieval tuning. Deduplication of near-identical findings. Lifecycle management that transitions stale knowledge to archived status before it poisons future experiments.
Enter MemClaw
MemClaw is governed shared memory for AI agent fleets. It was built precisely for this architecture — the one the industry is now converging on.
Fig 3 — MemClaw as fleet memory: write → recall → insights (reflect) → evolve (learn). The full compounding loop.
Here’s how it maps to the Karpathy Loop at scale:
Write once, recall everywhere. memclaw_write stores an experiment outcome. The LLM enrichment pipeline classifies it, extracts entities (model name, hyperparameter, metric delta), scans for PII, detects contradictions with existing knowledge, and updates the knowledge graph — in one call. Every agent in the fleet can recall it via memclaw_recall, filtered by their visibility scope and trust tier.
Hybrid search for experiment archaeology. Vector similarity finds conceptually related results. BM25 keyword search catches exact hyperparameter values. The composite ranking surfaces what matters. memclaw_recall with include_brief=true synthesizes a context paragraph from the top results — so your agent gets a briefing, not a raw list of 200 results it has to parse.
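The composite ranking can be sketched with a toy implementation. This is an illustration of the general hybrid-search idea, not MemClaw's internal scoring: the keyword score is a crude stand-in for BM25, the two-dimensional vectors stand in for real embeddings, and the `alpha` blend weight is an assumed parameter.

```python
import math

def cosine(a, b):
    """Vector similarity: finds conceptually related results."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    """Crude stand-in for BM25: fraction of query terms found verbatim,
    which is what catches exact hyperparameter values."""
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def hybrid_rank(query, query_vec, docs, alpha=0.6):
    """Composite: alpha * vector similarity + (1 - alpha) * keyword match."""
    scored = [
        (alpha * cosine(query_vec, d["vec"])
         + (1 - alpha) * keyword_score(query, d["text"]), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, key=lambda s: -s[0])]

docs = [
    {"text": "warmup 4.7 improved val_bpb", "vec": [0.9, 0.1]},
    {"text": "QKNorm needs a scalar multiplier", "vec": [0.2, 0.8]},
]
top = hybrid_rank("warmup val_bpb", [1.0, 0.0], docs)
```

The point of the blend: pure vector search would miss an exact query for `val_bpb` buried in a long note, and pure keyword search would miss a conceptually identical finding phrased differently. Combining both is what makes experiment archaeology workable.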
Contradiction detection before wasted compute. Agent A writes: “Increasing warmup to 4.7 improved val_bpb by 3%.” Agent B writes: “Warmup above 3.0 degraded val_bpb on depth-24 models.” MemClaw flags the contradiction via RDF triple analysis, links both to the relevant entities, and surfaces it to any agent querying warmup strategies. No human required.
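In spirit, the triple-based check looks something like the following sketch. This is a naive illustration of the idea, not MemClaw's RDF analysis: findings are reduced to (subject, predicate, object, author) tuples, and a contradiction is flagged when two agents assert opposing predicates about the same entity and metric.

```python
# Predicates treated as mutually contradictory (an assumed, minimal set).
OPPOSITES = {("improves", "degrades"), ("degrades", "improves")}

def find_contradictions(triples):
    """Flag pairs asserting opposing effects on the same (entity, metric)."""
    flags = []
    for i, (s1, p1, o1, agent1) in enumerate(triples):
        for s2, p2, o2, agent2 in triples[i + 1:]:
            if s1 == s2 and o1 == o2 and (p1, p2) in OPPOSITES:
                flags.append((agent1, agent2, s1, o1))
    return flags

triples = [
    ("warmup>3.0", "improves", "val_bpb", "agent_a"),
    ("warmup>3.0", "degrades", "val_bpb", "agent_b"),
]
conflicts = find_contradictions(triples)
```

A real system has to handle the hard part this sketch skips — deciding that two free-text findings refer to the same entity at all — but the payoff is the same: the conflict surfaces before either agent burns a night of compute on it.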
Lifecycle automation. Experiment results aren’t permanent truths. The finding that worked on your 125M model may not transfer to 1.3B.
MemClaw’s lifecycle lets you transition knowledge as your research evolves. The Crystallizer merges near-duplicate findings into canonical atomic facts with full provenance.
Fig 4 — Recall → Write → Insights (reflect) → Evolve (learn) → next cycle.
Reflection and feedback — the self-improving edge. memclaw_insights reflects over your memory store first — surfacing contradictions, failure patterns, stale knowledge, and divergence across agents. Focus it on what matters (contradictions, failures, stale, divergence, patterns, discover) and findings are saved as permanent insight-type memories. Then memclaw_evolve closes the loop: informed by those insights, it records the experiment outcome linked to the memories that influenced the decision — and adjusts their weights automatically. Reflect first, then learn. The fleet doesn’t just remember. It learns which memories actually led to good outcomes.
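The evolve step can be pictured as a weight update over memories. The update rule below is entirely hypothetical — MemClaw's actual adjustment is internal — but it shows the shape of the feedback: memories that influenced a good outcome gain weight, memories that led the agent astray lose it.

```python
def evolve(memories, influenced_ids, outcome_good, step=0.1):
    """Hypothetical update: nudge the weight of each memory that influenced
    this decision, up on a good outcome, down on a bad one, clamped to [0, 1]."""
    for mid in influenced_ids:
        delta = step if outcome_good else -step
        memories[mid]["weight"] = min(1.0, max(0.0, memories[mid]["weight"] + delta))
    return memories

memories = {"m1": {"weight": 0.5}, "m2": {"weight": 0.5}}
evolve(memories, influenced_ids=["m1"], outcome_good=True)
```

Over many cycles, recall ranking that respects these weights is what turns "remembering" into "learning which memories actually led to good outcomes."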
Full audit trail. Every write, every read, every transition — logged, timestamped, attributed to a specific agent and fleet. When the board asks “how did we arrive at this architecture decision?”, you have the answer. When compliance asks “which agent accessed the proprietary training data findings?”, you have that too.
The Implementation
Getting MemClaw into a Karpathy-style loop is intentionally simple.
For a single-agent setup (the original autoresearch pattern): add MemClaw as an MCP server in your agent’s config — paste the JSON block, add your API key, and the 9 tools appear in your agent’s tool surface. Then modify program.md to instruct the agent:
- After each experiment cycle, memclaw_write the result with the metric, the change description, and the outcome.
- Before each new hypothesis, memclaw_recall for related prior results (set include_brief=true for a synthesized context window).
- Periodically run memclaw_insights to surface contradictions and patterns across accumulated results.
- After each outcome, memclaw_evolve to record whether the finding held — closing the feedback loop so the fleet learns which memories actually drive results.
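Wired into the experiment loop, the per-cycle calls look roughly like this. The stub class below is purely illustrative — it mimics the write/recall shape described above in memory so the flow is visible, and the real MCP tool signatures may differ.

```python
class MemClawStub:
    """In-memory stand-in for the MemClaw MCP tools (illustrative only)."""

    def __init__(self):
        self.store = []

    def write(self, content, **meta):
        # Real memclaw_write also runs enrichment: classification,
        # entity extraction, PII scan, contradiction check.
        self.store.append({"content": content, **meta})

    def recall(self, query, include_brief=False):
        hits = [m for m in self.store if query.lower() in m["content"].lower()]
        if include_brief:
            # Stand-in for the synthesized briefing paragraph.
            return " ".join(m["content"] for m in hits)
        return hits

mc = MemClawStub()

# After a cycle: record the metric, the change, and the outcome.
mc.write("warmup 4.7 improved val_bpb by 3%", metric="val_bpb", kept=True)

# Before the next hypothesis: pull related prior results as a briefing.
brief = mc.recall("warmup", include_brief=True)
```

The agent's prompt then includes `brief` instead of re-deriving last night's findings from scratch — which is the whole difference between a loop that iterates and one that compounds.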
For a multi-agent fleet: use the OpenClaw plugin. Each agent auto-stamps its fleet_id on every write. Visibility scopes control which fleets can see which findings. Trust tiers control which agents can write to shared knowledge vs. fleet-local knowledge. The audit trail tracks everything.
The result: your Karpathy Loop doesn’t just iterate — it compounds. Every cycle’s findings become part of the fleet’s permanent, searchable, governed knowledge base. Every future cycle starts smarter than the last. And when you scale from one agent to ten to a hundred, the knowledge infrastructure is already there.
The Bigger Picture
Karpathy said it best: in 2026, you’re not writing code. You’re spinning up AI agents. The Karpathy Loop proved that an agent can run research autonomously. Meta’s Hyperagents proved that agents left to self-improve independently invent persistent memory as core infrastructure.
The question is no longer whether your agents need governed memory. It’s whether you build it yourself — or use a platform designed for it from day one.
The Hyper-Agent Generation is here. The agents are running. The only question left is whether their discoveries compound or die in silos.
Governed memory for the Hyper-Agent Generation.
Start Free at memclaw.net →
MemClaw is governed shared memory for AI agent fleets — multi-agent, multi-fleet, multi-tenant, with permissions, audit trails, and self-learning built in. Built by Caura.ai.