Benchmark results on LoCoMo and LongMemEval, and the approach that shapes them.
The future we’re building for is not one smart assistant per person. It’s fleets — dozens, then thousands of agents working on behalf of a company, across teams and customers, most of them acting without a human in the immediate loop. Call it the age of agent fleets, or the paperclip era. The unit of deployment is the fleet, not the chatbot.
Public agent-memory benchmarks haven’t caught up to that shape. LoCoMo and LongMemEval, the two most cited, each measure one agent, one user, one long conversation. They’re honest benchmarks. They just measure last year’s problem.
We ran MemClaw through both. Here’s what we’re proud of.
Accuracy puts us in the same conversation as the top memory systems on the market. Scores across the field cluster in a narrow band, and ours sit comfortably inside it. That’s not the axis we’ve been pushing hardest along.
Latency and token efficiency are. Running one agent, a few hundred milliseconds of search latency disappears inside the LLM call behind it. Running a thousand agents making millions of recall calls a day, latency and token bills stop being microbenchmark curiosities and start deciding whether a deployment is viable. Those are the numbers the graphic above is pointing at, and they are the ones that compound as agent count grows.
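To make the compounding argument concrete, here is a back-of-envelope sketch. Every number in it is an illustrative placeholder, not a MemClaw benchmark figure: the point is only that per-call overheads that vanish at one agent become a line item at a thousand.

```python
def daily_memory_cost(agents: int, recalls_per_agent: int,
                      latency_ms: float, tokens_per_recall: int,
                      usd_per_1k_tokens: float) -> dict:
    """Aggregate daily latency and token spend for recall calls."""
    calls = agents * recalls_per_agent
    return {
        "calls": calls,
        "latency_hours": calls * latency_ms / 1000 / 3600,
        "token_usd": calls * tokens_per_recall / 1000 * usd_per_1k_tokens,
    }

# One agent: the overhead hides inside the LLM call behind it.
solo = daily_memory_cost(1, 200, 300, 800, 0.01)

# A thousand agents, same per-call numbers: 200k calls/day,
# hours of cumulative search latency, a real daily token bill.
fleet = daily_memory_cost(1000, 200, 300, 800, 0.01)
```

Nothing in the per-call profile changed between the two lines; only the multiplier did. That is why we optimize the per-call numbers.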
We care about them because our architecture was shaped, from day one, around fleets of agents rather than single-agent scenarios.
A single-agent benchmark misses most of what makes MemClaw interesting.
A single-agent benchmark cannot ask whether agent #17’s mistake this morning prevented agents #1 through #40 from repeating it this afternoon. It cannot ask whether a new agent joining the fleet inherits what the fleet already knows, or starts from zero. It cannot ask whether a memory created inside the sales fleet is visible — or correctly invisible — to an agent in support. It cannot ask any governance question at all.
Those are the questions that decide whether a memory system is deployable inside a company.
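To illustrate the governance question in code: below is a minimal sketch of scope-filtered recall, where each memory carries a team scope and an optional fleet-shared flag. The `FleetStore` class and its fields are hypothetical, invented for this post; they are not MemClaw's actual API.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    team: str             # owning team, e.g. "sales", "support"
    shared: bool = False  # visible across the whole fleet?

class FleetStore:
    """Hypothetical sketch: recall filtered by the caller's team."""

    def __init__(self):
        self._memories: list[Memory] = []

    def write(self, text: str, team: str, shared: bool = False):
        self._memories.append(Memory(text, team, shared))

    def recall(self, team: str) -> list[str]:
        # An agent sees its own team's memories plus fleet-shared ones.
        return [m.text for m in self._memories
                if m.shared or m.team == team]

store = FleetStore()
store.write("Customer X prefers async status updates", team="sales")
store.write("Outage postmortem: watch for retry storms",
            team="support", shared=True)

sales_view = store.recall("sales")      # sales note + the shared one
support_view = store.recall("support")  # only the shared one
```

A support agent never sees the sales-scoped note, and any agent benefits from the fleet-shared postmortem: visibility and correct invisibility in one retrieval path.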
For a single chatbot, you’ll do well with any of us — MemClaw, Mem0, or Zep. The accuracy numbers are comparable, and the choice often comes down to stack fit and latency or token budget.
MemClaw was designed for the shape past that: a fleet of agents sharing what they learn, under governance, across teams. That’s where our design choices start paying for themselves.
The field needs a benchmark for the problem it is actually running into: one that measures cross-agent recall, outcome propagation between agents, and governance-aware retrieval under realistic permission shapes. We’re working toward one. If you’re thinking about this too, we’d love to compare notes.
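As one illustration of what a case in such a benchmark might look like, here is a sketch of a cross-agent recall scenario encoded as data. The field names and the scenario are ours, purely illustrative; this is not a published spec.

```python
# One hypothetical benchmark case: agent #17 records an outcome in the
# morning, a different agent in the same fleet probes for it later, and
# scoring checks both propagation and governance boundaries.
cross_agent_case = {
    "id": "cross-recall-001",
    "setup": [
        {"agent": "agent-17", "team": "sales", "action": "write",
         "memory": "Bulk-discount workflow fails for EU VAT customers"},
    ],
    "probe": {
        "agent": "agent-03", "team": "sales",
        "query": "known issues with the bulk-discount workflow?",
    },
    "expected": {
        # The outcome must propagate to an agent that never saw it firsthand.
        "must_recall": ["EU VAT"],
        # And it must stay invisible outside the permitted scope.
        "must_not_recall_teams": ["support"],
    },
}
```

The key property is that the probing agent is not the writing agent: a single-agent benchmark cannot express this case at all.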
Self-host the OSS in minutes, or start on the managed platform. Same engine either way.