Most AI memory systems were built around a single-agent abstraction: one user, one conversation, one retrieval stream, one context window. That assumption is reasonable for a chatbot. It quietly breaks the moment you run a fleet.

When dozens of agents read and write the same persistent state, the hard questions stop being about retrieval quality. They become: who is allowed to see this memory, which version is current, where a fact came from, and how knowledge crosses an agent boundary without leaking. Those are distributed-systems and database-consistency questions wearing retrieval clothing — and bigger context windows do not answer a single one of them.

So we wrote them down. “Governed Shared Memory for Multi-Agent LLM Systems” is now on arXiv (2606.24535). The paper does three things: it formalizes the problem, it defines an architecture, and it measures our own production service through an open harness — including the two architectural bugs that measurement caught.

AI memory is evolving from a context-window problem into a distributed-systems problem.

— Governed Shared Memory for Multi-Agent LLM Systems, §3

FIG 1. A fleet writes and recalls through MemClaw, which governs memory along four dimensions: scope, time, provenance, and propagation. ArgusFleet probes the same live REST surface with one experiment per dimension.

01 The fleet-memory problem

From conversation history to shared operational state

A support agent updates a billing record. Hours later a planning agent, an analytics dashboard, a routing orchestrator, and a compliance audit all depend on that change. Memory here is no longer “what was said” — it is shared operational state, and its correctness story has more in common with a distributed database than with a retrieval index.

The paper makes this precise. A fleet-memory system is defined as a five-tuple:

FIG 2. The single-agent line of work optimizes M alone. Fleet memory has to maintain correctness across A, G, P, and T simultaneously, across interacting reads and writes from many autonomous actors over time.

The central challenge is no longer retrieving semantically relevant text. It is maintaining shared state that stays operationally correct as the fleet writes to it. A write is not an immutable conversational artifact; it is a state transition that can supersede, restrict, or contradict what came before.

02 The failure modes

Four ways fleet memory breaks

The framing isn’t abstract. Each gap in G, P, or T maps to a concrete, observable failure — and naive semantic retrieval is vulnerable to all four, because eligibility is governed by embedding similarity rather than explicit policy.

Failure	Name	What goes wrong
Scope failure	Unauthorized leakage	An agent retrieves memory outside its authorized scope, such as a support agent pulling billing notes meant only for finance.
Time failure	Stale propagation	Updates don’t synchronize. One agent rewrites a shipping address while another keeps acting on the old one.
Resolution failure	Contradiction persistence	Conflicting facts coexist and stay retrievable. Append-only stores have no principled way to choose between them.
Provenance failure	Provenance collapse	A retrieved fact can’t be traced to its writer, source, or time. Debugging becomes guesswork and audits become unverifiable.

03 The architecture

Governed memory, by design

The architecture answers each failure mode with a primitive: scoped retrieval, temporal supersession, provenance tracking, and policy-governed propagation. Semantic similarity is necessary but never sufficient — every retrieval must also satisfy the governance policy attached to the candidate row’s scope.

Memory carries an explicit visibility scope at write time:

Scope	Visibility
Agent-local	Visible only to the writing agent
Team-shared	Shared among a defined group of agents
Tenant-global	Shared across the tenant environment
Restricted	Explicitly policy-constrained
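The four scopes in the table can be read as a visibility predicate evaluated per caller and per row. The sketch below is one plausible encoding, not the service's actual policy engine: the `Caller` and `Row` shapes, the allow-list used for restricted rows, and the rule that no scope crosses a tenant boundary are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    AGENT_LOCAL = "agent-local"
    TEAM_SHARED = "team-shared"
    TENANT_GLOBAL = "tenant-global"
    RESTRICTED = "restricted"


@dataclass(frozen=True)
class Caller:
    agent_id: str
    team_id: str
    tenant_id: str


@dataclass(frozen=True)
class Row:
    writer_id: str
    team_id: str
    tenant_id: str
    scope: Scope
    allowed_agents: frozenset[str] = frozenset()  # assumed policy for RESTRICTED rows


def can_read(caller: Caller, row: Row) -> bool:
    # Assumption: no scope ever crosses a tenant boundary.
    if caller.tenant_id != row.tenant_id:
        return False
    if row.scope is Scope.AGENT_LOCAL:
        return caller.agent_id == row.writer_id
    if row.scope is Scope.TEAM_SHARED:
        return caller.team_id == row.team_id
    if row.scope is Scope.TENANT_GLOBAL:
        return True
    # RESTRICTED: modeled here as an explicit allow-list.
    return caller.agent_id in row.allowed_agents


finance = Caller("fin-1", "finance", "acme")
support = Caller("sup-1", "support", "acme")
billing_note = Row("fin-2", "finance", "acme", Scope.TEAM_SHARED)
```

With these definitions, `can_read(finance, billing_note)` is true and `can_read(support, billing_note)` is false, regardless of how similar the note is to the support agent's query.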

Retrieval is no longer a single similarity lookup. It’s a pipeline — semantic candidate generation, policy filtering, temporal resolution, provenance enrichment, then ranked delivery — and each stage encodes one of the governance dimensions.

FIG 3. Where a vector DB returns the top-k by similarity, governed retrieval filters by policy, resolves by time, and enriches by provenance before it ranks. The same query returns different rows to different callers, by design rather than by configuration.
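The five stages can be sketched as a short function in which each step narrows or annotates the candidate set. This is an illustrative toy, not the production pipeline: the `Memory` fields are hypothetical stand-ins (a precomputed `similarity` in place of an embedding search, a `visible_to` set in place of a real scope policy, and a `superseded` flag in place of temporal resolution).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    memory_id: str
    text: str
    similarity: float            # stand-in for an embedding similarity score
    visible_to: frozenset[str]   # stand-in for the row's scope policy
    superseded: bool             # stand-in for temporal state
    writer: str


def governed_retrieve(caller: str, rows: list[Memory], k: int = 3,
                      min_similarity: float = 0.5) -> list[dict]:
    # 1. Semantic candidate generation.
    candidates = [m for m in rows if m.similarity >= min_similarity]
    # 2. Policy filtering: similarity alone never makes a row eligible.
    allowed = [m for m in candidates if caller in m.visible_to]
    # 3. Temporal resolution: drop rows that a later write replaced.
    current = [m for m in allowed if not m.superseded]
    # 4. Provenance enrichment: attach the writer to each result.
    enriched = [
        {"id": m.memory_id, "text": m.text, "writer": m.writer, "score": m.similarity}
        for m in current
    ]
    # 5. Ranked delivery.
    return sorted(enriched, key=lambda r: r["score"], reverse=True)[:k]


everyone = frozenset({"support-agent", "finance-agent"})
rows = [
    Memory("m1", "Invoice 881 disputed", 0.93, frozenset({"finance-agent"}), False, "finance-agent"),
    Memory("m2", "Ship to 12 Oak St", 0.88, everyone, True, "support-agent"),
    Memory("m3", "Ship to 14 Elm Ave", 0.86, everyone, False, "support-agent"),
]
support_view = governed_retrieve("support-agent", rows)
finance_view = governed_retrieve("finance-agent", rows)
```

The same three rows yield different result lists for the two callers, and the superseded address never reaches either of them even though it scores higher than its replacement.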

04 The harness

ArgusFleet: a measurement, not a benchmark

To make the primitives falsifiable, the paper introduces ArgusFleet — an open-source Python 3.12 harness (github.com/caura-ai/argusfleet) that exercises the live REST API with one experiment per governance dimension: leakage probes scope, contradiction probes time, provenance walks derivation chains, and propagation measures authorized visibility and cross-fleet leakage.

Two things make it unusual. First, every workload is seeded and deterministic, with a per-run nonce so re-running against a stateful production service produces fresh writes instead of collisions — and every number in the paper regenerates from committed event traces. Second, this is explicitly a measurement of one production service, not a baseline shootout. The point isn’t to win a leaderboard. It’s to run the formalism against reality and see what the formalism predicts that reality then confirms — or breaks.
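The combination of a fixed seed and a per-run nonce can be illustrated in a few lines. The sketch below is not taken from ArgusFleet; `build_workload`, `new_nonce`, and the key layout are hypothetical. It shows only the general idea: the seed fixes the content of the workload, while the nonce namespaces the keys so a rerun against a stateful service writes fresh rows.

```python
import random
import uuid


def build_workload(seed: int, nonce: str, n: int = 5) -> list[dict]:
    rng = random.Random(seed)          # private RNG: same seed, same sequence
    subjects = ["billing", "shipping", "routing"]
    writes = []
    for i in range(n):
        subject = rng.choice(subjects)
        writes.append({
            "key": f"{nonce}/{subject}/{i}",   # nonce keeps reruns from colliding
            "subject": subject,
            "value": rng.randint(0, 999),
        })
    return writes


def new_nonce() -> str:
    # Random per-run tag; deliberately NOT derived from the seed.
    return uuid.uuid4().hex[:12]


run_a = build_workload(seed=7, nonce=new_nonce())
run_b = build_workload(seed=7, nonce=new_nonce())
```

Both runs carry identical subjects and values, so results stay comparable, while their keys differ and never overwrite each other on the server.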

predict → measure → remediate

The scope invariant predicted a concrete violation on one API path. The experiment measured it. The operator fixed it, and a re-probe confirmed the fix. That loop is the contribution — not a clean scoreboard.

05 What the live service did

The numbers, measured against memclaw.net

All four experiments ran against the production memclaw.net service from a freshly provisioned tenant. The positive results are clean where the architecture is exercised end to end.

Metric	Value	Detail
Provenance chains reconstructed	100%	50 chains, depth 4
Write → visible (p50)	0.83 s	tight poll; p95 1.63 s
Cross-fleet leak rate	0.000	n = 80 probes
Fleet-sibling visibility	97.5%	n = 120 probes

Provenance is the cleanest result

All 50 depth-four derivation chains reconstructed with the correct writer identity at every hop — completeness 1.000, accuracy 1.000 — at a per-hop fetch latency of 291 ms p50. Each chain modeled a real incident narrative: observation, hypothesis, mitigation, verification.

FIG 4. The harness fetches the leaf and walks back to the root. Every ancestor was reachable, and the observed writer matched the planned writer at each step, which is the property that makes audits and debugging possible.
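A leaf-to-root walk of this kind can be sketched against an in-memory dictionary. The record shape (`parent_id`, `writer`, `step`), the agent names, and the exact definitions of completeness and accuracy used here are assumptions for illustration; the paper's own definitions may differ, and a real harness would fetch each hop over the REST API instead of reading a dict.

```python
def walk_chain(store: dict[str, dict], leaf_id: str) -> list[dict]:
    """Follow parent_id links from the leaf back to the root."""
    chain, seen = [], set()
    current = leaf_id
    while current is not None and current not in seen:  # 'seen' guards against cycles
        seen.add(current)
        record = store.get(current)
        if record is None:          # ancestor missing: the chain is incomplete
            break
        chain.append(record)
        current = record["parent_id"]
    return chain


def score_chain(chain: list[dict], planned_writers: list[str]) -> tuple[float, float]:
    # planned_writers is ordered leaf to root, matching the walk order.
    reached = min(len(chain), len(planned_writers))
    completeness = reached / len(planned_writers)
    matches = sum(r["writer"] == w for r, w in zip(chain, planned_writers))
    accuracy = matches / reached if reached else 0.0
    return completeness, accuracy


store = {
    "obs": {"id": "obs", "parent_id": None, "writer": "monitor-agent", "step": "observation"},
    "hyp": {"id": "hyp", "parent_id": "obs", "writer": "triage-agent", "step": "hypothesis"},
    "mit": {"id": "mit", "parent_id": "hyp", "writer": "ops-agent", "step": "mitigation"},
    "ver": {"id": "ver", "parent_id": "mit", "writer": "qa-agent", "step": "verification"},
}
chain = walk_chain(store, "ver")
scores = score_chain(chain, ["qa-agent", "ops-agent", "triage-agent", "monitor-agent"])
```

Deleting any intermediate record from `store` lowers completeness, and changing any `writer` lowers accuracy, so the two numbers separate "could not reach the ancestor" from "reached it but the attribution is wrong".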

The consistency cost lives at write time, not read time

Under strong write mode, a freshly written fact became visible to authorized readers in effectively one search round-trip (0.83 s p50) — not the tens of seconds a naive batched probe schedule had suggested. Reads stay fast; the price of correctness is paid by the writer, which is exactly where you want it in a fleet that recalls far more often than it writes.
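The difference between a tight poll and a batched probe schedule comes down to how often visibility is checked after the write. The sketch below shows one way to measure write-to-visible latency and summarize it as p50 and p95; it is not the harness's code. `time_to_visible`, `summarize`, and `FakeClock` are hypothetical names, and the fake clock exists only so the example runs without a live service.

```python
import time
from statistics import median, quantiles


def time_to_visible(is_visible, *, interval=0.05, timeout=30.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_visible() until it returns True; return elapsed time or None on timeout."""
    start = clock()
    while True:
        if is_visible():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)   # a small interval keeps the measurement error small


def summarize(latencies):
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile.
    p95 = quantiles(latencies, n=20, method="inclusive")[18]
    return {"p50": median(latencies), "p95": p95}


class FakeClock:
    """Deterministic clock so the sketch can run without a live service."""

    def __init__(self):
        self.now = 0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
elapsed = time_to_visible(lambda: fake.now >= 3, interval=1, timeout=10,
                          clock=fake, sleep=fake.sleep)
summary = summarize([1, 2, 3, 4, 5])   # made-up sample values, not paper data
```

The polling interval bounds how much a measurement can overstate the true latency, which is why a coarse batched schedule can report tens of seconds for a write that was visible much earlier.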

06 The negative results

The bugs are the point

A design paper would have stopped at the clean numbers. A live evaluation surfaced two production-relevant issues that no design-only treatment would have caught — and that’s the part we’re proudest to publish.

A dedup optimization starved a correctness mechanism

Contradiction resolution worked perfectly when both conflicting writes were admitted: the detection rate was 1.000. Across all fact-runs, however, the detection rate was only 0.490. The cause wasn’t the detector. A synchronous near-duplicate gate was rejecting the second, contradictory write before the asynchronous contradiction detector ever saw it. An optimization meant to suppress noise was silently suppressing the signal.

FIG 5. The gap between the two bars is a pipeline-ordering bug, not a model failure. When the contradictory write survives the dedup gate, supersession is flawless; the fix is to route structural conflicts past the gate before resolving them.
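The ordering problem is easy to reproduce in miniature. In the sketch below, a text-similarity gate stands in for the near-duplicate check and a same-key, different-value test stands in for the contradiction detector. Everything here is a simplification for illustration: the function names, the 0.9 threshold, and the synchronous detector are assumptions, whereas the real detector is asynchronous and the real gate is not necessarily based on `difflib`.

```python
from difflib import SequenceMatcher


def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= threshold


def conflicts(prior: dict, new: dict) -> bool:
    # Structural conflict: same key, different value.
    return prior["key"] == new["key"] and prior["value"] != new["value"]


def ingest_gate_first(store: dict, write: dict) -> str:
    prior = store.get(write["key"])
    if prior and near_duplicate(prior["text"], write["text"]):
        return "dropped"                 # the contradiction check never runs
    if prior and conflicts(prior, write):
        store[write["key"]] = write
        return "superseded"
    store[write["key"]] = write
    return "stored"


def ingest_conflict_first(store: dict, write: dict) -> str:
    prior = store.get(write["key"])
    if prior and conflicts(prior, write):   # conflicts bypass the dedup gate
        store[write["key"]] = write
        return "superseded"
    if prior and near_duplicate(prior["text"], write["text"]):
        return "dropped"
    store[write["key"]] = write
    return "stored"


first = {"key": "order-7/address", "value": "12 Oak Street",
         "text": "ship to 12 Oak Street, Springfield"}
second = {"key": "order-7/address", "value": "14 Oak Street",
          "text": "ship to 14 Oak Street, Springfield"}

buggy, fixed = {}, {}
buggy_results = [ingest_gate_first(buggy, first), ingest_gate_first(buggy, second)]
fixed_results = [ingest_conflict_first(fixed, first), ingest_conflict_first(fixed, second)]
```

The two address strings differ by a single character, so the similarity gate treats the correction as noise. Only the ordering of the two checks decides whether the store ends up holding the stale address or the new one.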

A read path that resolved identity, then ignored it

Tenant isolation held everywhere. But the GET-by-id path evaluated only the tenant projection of a row’s scope and skipped the sub-tenant check, so at measurement time a low-trust agent could fetch, by identifier, a cross-fleet row that the trust ladder should have denied. It’s the textbook confused-deputy pattern: a handler that resolves the caller’s identity and then fails to use it. The search path enforced the full predicate; the direct-fetch path did not.

FIG 6. Scope enforcement was bimodal across API paths. The gap was disclosed and remediated server-side during the study; a re-probe found zero cross-fleet reads for the low-trust credential. The paper flags every affected number as “as measured.”

why this matters

Both failures are invisible in a design document and invisible in a single-agent benchmark. You only see them when the formalism meets a running, multi-tenant service — which is the paper’s central argument for live evaluation.

07 The takeaway

Long-context retrieval alone is not memory

The conclusion the paper defends, with evidence: a bigger context window does not give you scoped access, temporal correctness, provenance, or safe propagation. Governed shared memory demands explicit systems-level abstractions — and the only honest way to know whether your abstractions hold is to measure them against a live service and publish what breaks.

That’s the discipline behind MemClaw, and now it’s on the record.

Read the paper.

“Governed Shared Memory for Multi-Agent LLM Systems” — the full formalism, methodology, and committed event traces. The harness is open source; so is MemClaw.

