Skill FactoryAgent SkillsGovernanceJune 24, 2026

How a Skill Is Born

From one agent’s hard-won lesson to a governed capability your whole fleet can use — and why it shouldn’t be trapped in any one vendor’s stack.

You hire a brilliant contractor. They spend three weeks learning the quirks of your deployment pipeline — the staging environment that lies about its health checks, the one region where migrations have to run in a specific order, the Slack channel you have to ping before touching billing. Then their contract ends, and all of it walks out the door. The next contractor starts from zero.

That’s what running AI agents feels like today.

The popular complaint is that agents “forget.” That’s not quite the problem. The sharper issue is that agents never knew your organization in the first place. No model’s training data contains your runbooks, your incident history, your half-deprecated internal API, or the lesson someone’s agent learned at 2 a.m. last Tuesday. You can paper over this with longer prompts and bigger context windows — but that’s just re-hiring the contractor every morning and re-explaining everything before lunch. It doesn’t compound. And compounding is the whole point of an organization.

This piece is about the mechanism that makes agent experience compound — specifically the most interesting moment in it: the instant a fuzzy pile of lived experience crystallizes into a skill. We’ll go in three movements: the concept (why a company brain matters), the mechanism (how a skill is actually born in MemClaw), and the proof— a live run where we watch one get born and then try to break the gate that guards it.

The company brain

Most “AI memory” products are a fancier database. You write things in, you search them out. Useful, but it’s a filing cabinet — and a filing cabinet doesn’t get smarter. It just gets fuller.

A company brain is different. The right mental model isn’t storage; it’s metabolism. Raw experience goes in constantly — every action, every outcome, every correction from a human. That material gets digested: enriched, de-duplicated, cross-linked, checked for contradictions. And every so often, when enough related experience has accumulated, something new gets synthesizedout of it — a distilled, reusable capability that didn’t exist before. That last step is the one everybody underinvests in. The whole industry has poured itself into the digestion half — storing and recalling experience well. The synthesis half — turning it into new capability — is where the real leverage is, and it’s still wide open.

Why skills are the unit that matters

The field is shifting away from one-off prompting — cramming instructions into a context window and hoping — and toward skills: modular capabilities an agent discovers and loads only when a task calls for them. Anthropic formalized this with Agent Skills and opened it up as a shared standard; the major agent tools are adopting it. The clever part is progressive disclosure: an agent sees only a skill’s name and one-line description until a task actually matches it — and only then loads the full instructions. So an agent can keep hundreds of skills within reach while paying almost nothing for the ones it isn’t using. Think of it as a reference shelf the agent only pulls a book from when it needs one.

A skill, then, is where institutional memory stops being reference material and becomes executable behavior. “We learned that eu-west migrations must run in dependency order” is a memory. A skill is the agent actually doing it correctly, every time, without being told.

The life of a skill

A living capability has a life cycle, and an agentic organization has to handle all of it:

  • Born (self-capture). Several agents accumulate enough experience that a genuine, reusable pattern emerges. The system notices and distills it into a candidate skill. This is the moment a filing cabinet can’t produce — and it’s what the rest of this piece is about.
  • Raised (evolution). Skills improve from feedback. When a human corrects an agent or an approach quietly stops working, that signal should flow back and sharpen the skill — so the next agent inherits the fix.
  • Retired (maintenance). Skills rot. The runbook changes, the API is deprecated, the lesson stops being true. A capability nobody prunes becomes a confidently-wrong liability.

A crucial caveat the research keeps surfacing: skills a human has reviewed reliably help, while skills an agent writes and trusts entirely on its own often don’t. Letting agents capture skills with no oversight isn’t a feature — it’s a way to mass-produce confident-sounding nonsense. So the real question isn’t “can an agent write a skill?” It’s “can you let agents freely proposeskills while keeping a trustworthy gate between a proposal and something your whole fleet will actually run?” (We focus on born here, and return to raised and retired in follow-ups.)

The part nobody should skip: don’t get locked in

Your company brain is your most defensible asset — the one thing a competitor can’t copy and a model vendor doesn’t have. So the worst possible place to keep it is inside a single vendor’s agent runtime. If your hard-won skills only work in one company’s harness, you don’t own your company brain; you’re renting it. The way out is open, portable surfaces. Skills are just folders with a SKILL.mdfile — an open format. And the delivery layer should speak MCP, the open standard for how agents talk to tools and data. Build it that way and the same skill works whether your agent runs on Claude, on an open-source harness, or on whatever ships next year. That principle drove every design choice below.

How a skill is born in MemClaw

Here’s the machinery, end to end. To keep it honest and vendor-neutral, we assume nothing about the agent’s runtime — just that it speaks MCP. No SDK, no plugin, no proprietary surface. The journey has five stops: capture → forge → sentinel → lifecycle → delivery.

📝
Capture
agents write memories
⚒️
Forge
cluster & distill
🛡️
Sentinel
deterministic scan
⛩️
Lifecycle
candidate → active
🚀
Delivery
active-only, MCP
capture → forge → sentinel → lifecycle → delivery

1 · The raw material: memories

It starts with experience. An agent finishes a task and writes down what happened — one MCP call:

memclaw_write({
  "content": "Deploying to eu-west, the migration failed until I ran it
              in dependency order: accounts -> billing -> ledger. Health
              checks lie for ~90s after cutover; wait before rollback.",
  "memory_type": "outcome",
  "visibility": "scope_team",
  "fleet_id": "platform-eng"
})

The agent doesn’t classify or clean any of this. On the way in, the memory is enriched: a type, title, summary, tags, and a salience weight are inferred; PII is scanned inline; entity extraction and contradiction-checking run asynchronously. Across a single task an agent leaves a trail of these memories — a session trace, tied together by a run id. That trace, not any one note, is the natural unit of “what an agent figured out this time.” Most traces never become a skill, and that’s correct: a skill should emerge from a pattern that recurs across sessions, not from a single run.

2 · Forge — the miner that distills patterns

Forge turns accumulated memory into a skill candidate. It runs on a schedule, once per opted-in tenant, and distills session traces, not loose notes. On each run it groups related traces into a cluster and checks that the cluster clears a deliberately conservative bar — all tunable per tenant: enough corroborating sessions (min_cluster_size), the pattern showing up across multiple agents (min_distinct_agents— the single knob that most separates captured wisdom from one agent’s fluke), and recency (freshness_window_days). A qualifying cluster is handed to an LLM with a tight distillation prompt that returns a fixed schema: a display name, a ≤160-byte description (the trigger sentence progressive disclosure shows the agent), and the full SKILL.md body. Two guards run first — a fingerprint checked against a poison table (a previously-rejected idea stays quiet until a cool-off passes), and a no-overwrite guard on anything a human already curated. The result is written with status: "candidate" and source: "forge" — and never auto-promotes. Birth is proposal, not activation.

3 · Sentinel — the gate that keeps the fleet safe

A candidate isn’t trustworthy just because an LLM wrote it confidently. Before it goes anywhere near an agent, it passes Sentinel— a scanner built to be boring on purpose: fully deterministic, no LLM, no network. It looks for the things that turn a “skill” into an attack or an accident: prompt-injection markers, shell-injection in bundled scripts, path violations(absolute, traversal, hidden, executable — fatal, the write is refused), PII, and size caps. A critical finding flips the candidate to quarantined; a fatal one refuses it outright. (We put this under live fire below.)

4 · The lifecycle — a candidate grows up

Skills live on a small, explicit state machine — seven statuses in all:

candidate -> staged -> active
                \-> rejected     (with poison-table cool-off)
                \-> quarantined  (security review)
        ... later in life: stale, deprecated

A clean candidate becomes staged and lands in a review inbox, where an authorized human can approve (staged → active, with a fresh re-scan), reject (the fingerprint enters the cool-off table), quarantine, defer, or edit. This inbox is the governance answer to the research warning above: agents propose freely; a trusted reviewer holds the gate to active. A tenant that decides its Forge output has earned trust can turn on auto-promotion for spotless candidates — moving the human from gatekeeper to auditor on its own schedule.

5 · Delivery — active-only, and harness-agnostic

Only active skills are ever visible to an agent, and that rule lives on the server, not in any client. When an agent browses the catalog over MCP, the server forces a status = active filter on every path:

memclaw_doc op=search collection=skills query="deploy migrations eu-west"
// -> returns only ACTIVE skills. Progressive disclosure over MCP:
//    name + description first; the full SKILL.md only when the
//    agent decides the skill fits the task.

For runtimes that prefer skills on disk there’s a second surface — a server endpoint (/skills/installable) that returns the same active-only set. Same gate, same guarantee, different shape. The non-negotiable rule across both: the gate is server-side and surface-agnostic.Your governance never depends on a particular vendor’s agent behaving itself — and your skills, being plain SKILL.md over an open protocol, travel with you. And none of this turns on by accident: the whole Skill Factory is gated behind a per-tenant flag that defaults to off.

We ran it: watching a skill get born

Talk is cheap. We ran the entire pipeline on a live build with a real model, watched a single skill get born from scratch, and then tried to smuggle six malicious skills past the gate. Here’s what happened.

3
agents corroborated
6
raw memories
1
skill born, scanned clean
6/6
adversarial skills blocked

The seed: 6 memories, 3 agents, one recurring lesson

Independently, three agents kept hitting the same eu-west deployment wall over three sessions — the migration dependency order, and the health endpoint that lies for ~90 seconds after cutover. Six raw memories, none cleaned by hand. Forge folded them into three labeled session-traces, clustered them by shared entities into one cluster, and checked the gates:

agent-1
Migration failed until I ran them in dependency order: accounts → billing → ledger. Billing-first violates an FK and the batch rolls back.
agent-2
Confirmed the migration order matters — a teammate hit the same FK error running them alphabetically.
agent-3
Third time the order has been the deciding factor. accounts → billing → ledger worked cleanly.
agent-1
After cutover the health endpoint reported 'unhealthy' for ~90s, then went green on its own. Almost rolled back for nothing.
agent-2
The eu-west health check lies right after cutover (~1–2 min). Do not roll back during that window; it self-recovers.
agent-3
Post-cutover health shows unhealthy ~90s before stabilizing. Added a wait step instead of rolling back.
cluster size    = 3  >= min_cluster_size (3)     OK
distinct agents = 3  >= min_distinct_agents (3)  OK   <- corroborated, not one agent's fluke
=> eligible clusters: 1

A single agent repeating itself three times would have failed the diversity gate. This one passed, and the cluster was distilled by a real LLM into a coherent skill (trimmed here):

# EU-West migration order and cutover health wait
status: candidate   source: forge   sentinel: clean   cites: 6 memories

## When to use
Deploying to eu-west when the release includes database migrations plus a
traffic cutover.

## Procedure
1. Run the `accounts` migration first, then `billing`, then `ledger`.
2. Perform the cutover.
3. After cutover, wait ~90s before trusting an `unhealthy` health result;
   recheck before deciding to roll back.

## Critical details
- Running `billing` before `accounts` violates a foreign key and rolls back
  the batch.
- eu-west health checks report false `unhealthy` for ~90s after cutover.

## Evidence
Confirmed across 3 successful traces from 3 agents; same order, same window.

It landed as status=candidate, source=forge, Sentinel scan clean — and, true to the design, it did not auto-promote. We then queried the catalog the way an agent would, with the active-only filter:

SELECT count(*) ... WHERE collection='skills' AND status='active'
-> 0   (the born candidate is HIDDEN from agents until a human approves it)

Born, but not live. Exactly the governance promise: the skill exists in the catalog and is invisible to every agent until a reviewer moves it to active.

The active-only gate
Every MCP read path forces status='active' server-side. The born candidate sits in the catalog — and is invisible to every agent until a human approves it.
WHERE status='active'
0
rows visible → HIDDEN

Sentinel under fire

A clean skill passing is table stakes. The real question is whether a maliciousone can slip through. So we fed Sentinel six adversarial candidates. It caught all six — deterministically, with byte-precise locations and no LLM in the loop:

💉 Prompt-injection
caught: 'IGNORE ALL PREVIOUS INSTRUCTIONS' in the body
QUARANTINED
🐚 Shell-injection
caught: 'curl …/pwn | sh' in a bundled script
QUARANTINED
📁 Path traversal
caught: '../../etc/passwd' in support_files
REFUSED
🪪 Leaked PII
caught: SSN + email in the body
FLAGGED
📦 Oversized body
caught: 40050 B > 40000-byte cap
REFUSED
📏 Oversized description
caught: 321 B > 160-byte cap
REFUSED

The mapping is the whole point: fatalfindings (path violations, hard size caps) refuse the write — the candidate is never persisted; critical findings (injection) quarantine it off the approval inbox; warn findings (PII) let it proceed but surface on the card. An injected or oversized distillation never reaches an agent.

The verdict

A skill is born the way real institutional knowledge is: several agents hit the same wall, the system notices the pattern, distills it into something reusable, scans it for danger, and a trusted human waves it through. Born from corroboration across 3 agents; distilled into a real SKILL.md; 6/6adversarial skills blocked; delivered active-only and server-side; never auto-promoted. Then it’s available to everyone, on whatever runtime they use — and it’s yours to take anywhere.

That’s birth. A skill’s life is only getting started — next it has to learnfrom how it’s used and eventually retirewhen it stops being true. Those harder halves, evolution and maintenance, are where a company brain either compounds or quietly rots. We’ll dig into both in the next posts in this series.


MemClaw is built by Caura.ai — governed memory for the hyper-agent generation. Open source, free tier, no credit card. Get started.