Most AI memory systems were built around a single-agent abstraction: one user, one conversation, one retrieval stream, one context window. That assumption is reasonable for a chatbot. It quietly breaks the moment you run a fleet.
When dozens of agents read and write the same persistent state, the hard questions stop being about retrieval quality. They become: who is allowed to see this memory, which version is current, where a fact came from, and how knowledge crosses an agent boundary without leaking. Those are distributed-systems and database-consistency questions wearing retrieval clothing — and bigger context windows do not answer a single one of them.
So we wrote them down. “Governed Shared Memory for Multi-Agent LLM Systems” is now on arXiv (2606.24535). The paper does three things: it formalizes the problem, it defines an architecture, and it measures our own production service through an open harness — including the two architectural bugs that measurement caught.
AI memory is evolving from a context-window problem into a distributed-systems problem.
— Governed Shared Memory for Multi-Agent LLM Systems, §3
From conversation history to shared operational state
A support agent updates a billing record. Hours later a planning agent, an analytics dashboard, a routing orchestrator, and a compliance audit all depend on that change. Memory here is no longer “what was said” — it is shared operational state, and its correctness story has more in common with a distributed database than with a retrieval index.
The paper makes this precise. A fleet-memory system is defined as a five-tuple:
The central challenge is no longer retrieving semantically relevant text. It is maintaining shared state that stays operationally correct as the fleet writes to it. A write is not an immutable conversational artifact; it is a state transition that can supersede, restrict, or contradict what came before.
Four ways fleet memory breaks
The framing isn’t abstract. Each gap in G, P, or T maps to a concrete, observable failure — and naive semantic retrieval is vulnerable to all four, because eligibility is governed by embedding similarity rather than explicit policy.
Unauthorized leakage
An agent retrieves memory outside its authorized scope — a support agent pulling billing notes meant only for finance.
Stale propagation
Updates don’t synchronize. One agent rewrites a shipping address while another keeps acting on the old one.
Contradiction persistence
Conflicting facts coexist and stay retrievable. Append-only stores have no principled way to choose between them.
Provenance collapse
A retrieved fact can’t be traced to its writer, source, or time. Debugging becomes guesswork and audits become unverifiable.
Governed memory, by design
The architecture answers each failure mode with a primitive: scoped retrieval, temporal supersession, provenance tracking, and policy-governed propagation. Semantic similarity is necessary but never sufficient — every retrieval must also satisfy the governance policy attached to the candidate row’s scope.
Memory carries an explicit visibility scope at write time:
| Scope | Visibility |
|---|---|
| Agent-local | Visible only to the writing agent |
| Team-shared | Shared among a defined group of agents |
| Tenant-global | Shared across the tenant environment |
| Restricted | Explicitly policy-constrained |
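To make the table concrete, here is a minimal sketch of a write-time scope and the visibility check it implies. The names (`Scope`, `Agent`, `MemoryRow`, `can_read`, `allowed_agents`) are illustrative assumptions, not MemClaw's schema, and modeling "a defined group of agents" as a single `team_id` is a simplification.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    AGENT_LOCAL = "agent-local"
    TEAM_SHARED = "team-shared"
    TENANT_GLOBAL = "tenant-global"
    RESTRICTED = "restricted"


@dataclass(frozen=True)
class Agent:
    agent_id: str
    team_id: str
    tenant_id: str


@dataclass(frozen=True)
class MemoryRow:
    row_id: str
    writer: Agent
    scope: Scope  # fixed when the row is written
    text: str
    # Assumption: RESTRICTED rows carry an explicit allow-list.
    allowed_agents: frozenset[str] = frozenset()


def can_read(reader: Agent, row: MemoryRow) -> bool:
    """Return True if `reader` may see `row` under this toy policy."""
    # Tenant isolation applies to every scope.
    if reader.tenant_id != row.writer.tenant_id:
        return False
    if row.scope is Scope.AGENT_LOCAL:
        return reader.agent_id == row.writer.agent_id
    if row.scope is Scope.TEAM_SHARED:
        return reader.team_id == row.writer.team_id
    if row.scope is Scope.TENANT_GLOBAL:
        return True
    # RESTRICTED: deny unless the reader is explicitly listed.
    return reader.agent_id in row.allowed_agents
```

Under this sketch, a billing note written by a finance agent with `Scope.TEAM_SHARED` is readable by other finance agents and invisible to a support agent in the same tenant, regardless of how similar the query embedding is.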
Retrieval is no longer a single similarity lookup. It’s a pipeline — semantic candidate generation, policy filtering, temporal resolution, provenance enrichment, then ranked delivery — and each stage encodes one of the governance dimensions.
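The five stages can be sketched as one function. This is an illustration under stated assumptions, not the service's implementation: `Candidate`, `retrieve`, the precomputed `similarity` score, and the `readers` set standing in for a scope policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    row_id: str
    key: str                  # the fact this row is about
    text: str
    writer: str
    written_at: float         # write timestamp, seconds
    readers: frozenset[str]   # assumption: stand-in for a scope policy
    similarity: float         # assumption: precomputed embedding score


def retrieve(reader: str, rows: list[Candidate], k: int = 5,
             min_similarity: float = 0.5) -> list[dict]:
    # 1. Semantic candidate generation: similarity nominates rows.
    candidates = [r for r in rows if r.similarity >= min_similarity]

    # 2. Policy filtering: similarity alone never grants access.
    allowed = [r for r in candidates if reader in r.readers]

    # 3. Temporal resolution: keep only the newest row per key.
    latest: dict[str, Candidate] = {}
    for r in allowed:
        current = latest.get(r.key)
        if current is None or r.written_at > current.written_at:
            latest[r.key] = r

    # 4. Provenance enrichment: attach writer, time, and row id.
    hits = [
        {
            "text": r.text,
            "score": r.similarity,
            "provenance": {
                "row_id": r.row_id,
                "writer": r.writer,
                "written_at": r.written_at,
            },
        }
        for r in latest.values()
    ]

    # 5. Ranked delivery: best score first, truncated to k.
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:k]
```

One design choice worth noting: because filtering runs before temporal resolution here, a reader gets the newest version visible to that reader. Whether a restricted newer row should instead hide the older one is a policy decision this sketch does not settle.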
ArgusFleet: a measurement, not a benchmark
To make the primitives falsifiable, the paper introduces ArgusFleet — an open-source Python 3.12 harness (github.com/caura-ai/argusfleet) that exercises the live REST API with one experiment per governance dimension: leakage probes scope, contradiction probes time, provenance walks derivation chains, and propagation measures authorized visibility and cross-fleet leakage.
Two things make it unusual. First, every workload is seeded and deterministic, with a per-run nonce so re-running against a stateful production service produces fresh writes instead of collisions — and every number in the paper regenerates from committed event traces. Second, this is explicitly a measurement of one production service, not a baseline shootout. The point isn’t to win a leaderboard. It’s to run the formalism against reality and see what the formalism predicts that reality then confirms — or breaks.
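A seeded workload with a per-run nonce might look like the following. This is a sketch of the general technique rather than ArgusFleet's code; `build_workload`, `new_nonce`, and the key layout are assumptions made for illustration.

```python
import random
import uuid


def new_nonce() -> str:
    """Fresh per-run tag so repeated runs never collide on keys."""
    return uuid.uuid4().hex[:8]


def build_workload(seed: int, nonce: str, n: int = 3) -> list[dict]:
    """Generate n facts deterministically from `seed`.

    The nonce only prefixes the key, so the content of a run is
    reproducible while its writes stay distinct from earlier runs.
    """
    rng = random.Random(seed)  # local RNG, no global state touched
    facts = []
    for _ in range(n):
        customer = rng.randrange(1000, 10000)
        street_no = rng.randrange(1, 500)
        facts.append({
            "key": f"{nonce}:customer-{customer}:address",
            "value": f"{street_no} Example Street",
        })
    return facts
```

Two runs with the same seed and different nonces produce identical values under different keys, which is what lets a stateful service be probed repeatedly without one run's writes contaminating the next.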
The scope invariant predicted a concrete violation on one API path. The experiment measured it. The operator fixed it, and a re-probe confirmed the fix. That loop is the contribution — not a clean scoreboard.
The numbers, measured against memclaw.net
All four experiments ran against the production memclaw.net service from a freshly provisioned tenant. The positive results are clean where the architecture is exercised end to end.
Provenance is the cleanest result
All 50 depth-four derivation chains were reconstructed with the correct writer identity at every hop (completeness 1.000, accuracy 1.000) at a per-hop fetch latency of 291 ms p50. Each chain modeled a real incident narrative: observation, hypothesis, mitigation, verification.
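As an illustration of what reconstructing a chain involves, the sketch below walks `derived_from` pointers from the last row back to the first and scores the result. The field names, the in-memory `store` standing in for per-hop fetches, and the exact metric definitions are assumptions; the paper's definitions of completeness and accuracy may differ.

```python
def walk_chain(store: dict[str, dict], leaf_id: str,
               max_hops: int = 16) -> list[dict]:
    """Follow derived_from pointers from the leaf back to the root."""
    chain: list[dict] = []
    seen: set[str] = set()
    current = leaf_id
    # Stop on a missing row, a cycle, or the hop limit.
    while current in store and current not in seen and len(chain) < max_hops:
        seen.add(current)
        row = store[current]  # stands in for one fetch per hop
        chain.append(row)
        current = row.get("derived_from")
    chain.reverse()  # root first, leaf last
    return chain


def score_chain(chain: list[dict],
                expected: dict[str, str]) -> tuple[float, float]:
    """Return (completeness, accuracy) against expected id -> writer.

    Assumed definitions: completeness is the share of expected hops
    recovered; accuracy is the share of recovered hops whose writer
    matches the expected writer.
    """
    recovered = [r for r in chain if r["id"] in expected]
    completeness = len(recovered) / len(expected)
    correct = sum(1 for r in recovered if r["writer"] == expected[r["id"]])
    accuracy = correct / len(recovered) if recovered else 0.0
    return completeness, accuracy
```

With a four-row chain (observation, hypothesis, mitigation, verification) where every row is fetchable and carries the right writer, this scoring yields 1.0 for both metrics. Removing a middle row cuts completeness without affecting accuracy.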
The consistency cost lives at write time, not read time
Under strong write mode, a freshly written fact became visible to authorized readers in effectively one search round-trip (0.83 s p50) — not the tens of seconds a naive batched probe schedule had suggested. Reads stay fast; the price of correctness is paid by the writer, which is exactly where you want it in a fleet that recalls far more often than it writes.
The bugs are the point
A design paper would have stopped at the clean numbers. A live evaluation surfaced two production-relevant issues that no design-only treatment would have caught — and that’s the part we’re proudest to publish.
A dedup optimization starved a correctness mechanism
Contradiction resolution worked perfectly when both conflicting writes were admitted: the detection rate was 1.000. Across all fact-runs, however, the detection rate was only 0.490. The cause wasn’t the detector. A synchronous near-duplicate gate was rejecting the second, contradictory write before the asynchronous contradiction detector ever saw it. An optimization meant to suppress noise was silently suppressing the signal.
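The ordering problem is easy to reproduce in miniature. The sketch below is a toy model, not the production code: the gate uses `difflib.SequenceMatcher` as a stand-in for whatever similarity measure the service uses, the detector is a hypothetical same-key, different-value check, and it runs synchronously here for simplicity. `write_detector_first` is one possible reordering shown for contrast, not the operator's fix.

```python
from difflib import SequenceMatcher


def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Toy near-duplicate test on raw text similarity."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def contradicts(old: dict, new: dict) -> bool:
    """Hypothetical detector: same key, different value."""
    return old["key"] == new["key"] and old["value"] != new["value"]


def write_gate_first(store: list[dict], row: dict) -> dict:
    """Dedup gate runs first, so the detector never sees rejected rows."""
    if any(near_duplicate(e["text"], row["text"]) for e in store):
        return {"admitted": False, "contradiction": False}
    flagged = any(contradicts(e, row) for e in store)
    store.append(row)
    return {"admitted": True, "contradiction": flagged}


def write_detector_first(store: list[dict], row: dict) -> dict:
    """Contradictions bypass the gate; only true duplicates are dropped."""
    flagged = any(contradicts(e, row) for e in store)
    if not flagged and any(near_duplicate(e["text"], row["text"]) for e in store):
        return {"admitted": False, "contradiction": False}
    store.append(row)
    return {"admitted": True, "contradiction": flagged}
```

A contradictory address update such as "Customer 42 ships to 98 Elm Street" differs from "Customer 42 ships to 12 Elm Street" by two characters, so a text-similarity gate treats it as noise and `write_gate_first` drops it unflagged. If rejected writes can never be detected, a conditional rate of 1.000 and an overall rate of 0.490 would imply that both writes were admitted in roughly 49% of fact-runs.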
A read path that resolved identity, then ignored it
Tenant isolation held everywhere. But the GET-by-id path evaluated only the tenant projection of a row’s scope and skipped the sub-tenant check, so at measurement time a low-trust agent could fetch, by identifier, a cross-fleet row that the trust ladder should have denied. It’s the textbook confused-deputy pattern: a handler that resolves the caller’s identity and then fails to use it. The search path enforced the full predicate; the direct-fetch path did not.
Both failures are invisible in a design document and invisible in a single-agent benchmark. You only see them when the formalism meets a running, multi-tenant service — which is the paper’s central argument for live evaluation.
Long-context retrieval alone is not memory
The conclusion the paper defends, with evidence: a bigger context window does not give you scoped access, temporal correctness, provenance, or safe propagation. Governed shared memory demands explicit systems-level abstractions — and the only honest way to know whether your abstractions hold is to measure them against a live service and publish what breaks.
That’s the discipline behind MemClaw, and now it’s on the record.
Read the paper.
“Governed Shared Memory for Multi-Agent LLM Systems” — the full formalism, methodology, and committed event traces. The harness is open source; so is MemClaw.