
Fast, Token-Efficient,
and Built for Fleets

Benchmark results on LoCoMo and LongMemEval, and the approach that shapes them.

2026-04-19 · Caura.AI · 5 min read

The future we’re building for is not one smart assistant per person. It’s fleets — dozens, then thousands of agents working on behalf of a company, across teams and customers, most of them acting without a human in the immediate loop. Call it the age of agent fleets, or the paperclip era. The unit of deployment is the fleet, not the chatbot.

Public agent-memory benchmarks haven’t caught up to that shape. LoCoMo and LongMemEval, the two most cited, each measure one agent, one user, one long conversation. They’re honest benchmarks. They just measure last year’s problem.

We ran MemClaw through both. Here’s what we’re proud of.

MemClaw Benchmark Results

LoCoMo: 77.6% accuracy (LLM-judge) · 96.6% token savings vs full context
LongMemEval: 72.5% accuracy (LLM-judge) · 98.2% token savings vs full context
Search latency: 23 ms p50 (warm) · 27 ms p95

01 What we optimize for

Accuracy puts us in the same conversation as the top memory systems on the market. Scores across the field cluster in a narrow band, and ours sit comfortably inside it. That’s not the axis we’ve been pushing hardest along.

Latency and token efficiency are. Running one agent, a few hundred milliseconds of search latency disappears inside the LLM call behind it. Running a thousand agents making millions of recall calls a day, latency and token bills stop being microbenchmark curiosities and start deciding whether a deployment is viable. Those are the numbers the graphic above is pointing at, and they are the ones that compound as agent count grows.

We care about them because our architecture was shaped, from day one, around fleets of agents rather than single-agent scenarios.

02 What these benchmarks cannot measure

Most of what makes MemClaw interesting.

A single-agent benchmark cannot ask whether agent #17’s mistake this morning prevented agents #1 through #40 from repeating it this afternoon. It cannot ask whether a new agent joining the fleet inherits what the fleet already knows, or starts from zero. It cannot ask whether a memory created inside the sales fleet is visible — or correctly invisible — to an agent in support. It cannot ask any governance question at all.

Those are the questions that decide whether a memory system is deployable inside a company.

MemClaw was designed around that shape:

  • Scoped memory. Every write is stamped as agent-private, fleet-wide, or cross-fleet. Recall is filtered accordingly, by default.
  • Continuous cross-agent learning. Agents report what happened after acting on recalled memory. Successes reinforce. Failures write a preventive rule at fleet scope, so every other agent sees the lesson before repeating it.
  • Governance built in. Tenant isolation, per-agent trust tiers, PII quarantine before cross-fleet exposure, full audit log. None of this moves a recall@k number. All of it moves whether you can deploy at all.
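To make the scope semantics in the first bullet concrete, here is a minimal sketch of how scope-filtered recall could behave. The names (`Memory`, `visible`, `recall`) and the in-memory store are illustrative assumptions for this post, not the actual MemClaw API:

```python
from dataclasses import dataclass

# Illustrative data model only -- not the MemClaw API. Each write carries
# one of the three scopes described above.
@dataclass
class Memory:
    text: str
    owner_agent: str
    owner_fleet: str
    scope: str  # "agent-private" | "fleet-wide" | "cross-fleet"

def visible(mem: Memory, agent: str, fleet: str) -> bool:
    """Default recall filter: visibility is decided by the memory's scope."""
    if mem.scope == "agent-private":
        return mem.owner_agent == agent      # only the writing agent sees it
    if mem.scope == "fleet-wide":
        return mem.owner_fleet == fleet      # any agent in the same fleet
    return mem.scope == "cross-fleet"        # visible to every fleet

store = [
    Memory("refund SOP v2", "sales-17", "sales", "fleet-wide"),
    Memory("draft notes", "sales-17", "sales", "agent-private"),
    Memory("brand tone guide", "ops-1", "ops", "cross-fleet"),
]

def recall(agent: str, fleet: str) -> list[str]:
    return [m.text for m in store if visible(m, agent, fleet)]
```

Under this sketch, a support agent recalls only the cross-fleet memory, while another sales agent also inherits the fleet-wide SOP but never agent #17's private notes. That default is the point: nothing leaks across a scope boundary unless it was written there.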

03 Which to pick

For a single chatbot, you’ll do well with any of us — MemClaw, Mem0, or Zep. The accuracy numbers are comparable, and the choice often comes down to stack fit and latency or token budget.

MemClaw was designed for the shape past that: a fleet of agents sharing what they learn, under governance, across teams. That’s where our design choices start paying for themselves.

04 Toward a fleet-native benchmark

The field needs a benchmark for the problem the field is actually running into. One that measures cross-agent recall, outcome propagation between agents, and governance-aware retrieval under realistic permission shapes. We’re working toward one. If you’re thinking about this too, we’d love to compare notes.

Give your fleet a memory.

Self-host the OSS in minutes, or start on the managed platform. Same engine either way.

Start on memclaw.net → · View on GitHub