Documentation

The Treatise

A setup manual for the learning layer. Models do the thinking; the framework does the bookkeeping. This is how the pieces fit, what each call does, and which knob to turn.

The Premise

An agent should get sharper with every event it sees instead of staying frozen at the prompt it shipped with. That is the goal the whole system is built around, which is why the instruction set is treated as a living document rather than a string you hand-edit. Each interaction - a correction, an abandoned session, a metric that moved - enters as data, gets judged by a model, and feeds back into what the agent is told next time.

The framework should take in any data and learn from it, so ingestion is deliberately dumb at the door and smart downstream. You push raw JSON. The system normalizes it, decides whether it is worth a model call, classifies what it means, and turns repeated evidence into principles the agent can act on. Nothing is discarded into a log that no one reads.

It should never lose what it learned and never trust a single bad afternoon, which is why every principle carries scope, confidence, and status. Learning is earned by repetition, proven before promotion, and isolated so one user or project can never contaminate another. The agent you run next quarter is the one that taught itself, on the evidence, in the open.

What This Is Not

The point is to change what an agent does, not just what it remembers, so this is a learning system and not a memory store with a learning label. Memory stores extract facts, manage context windows, and build knowledge graphs. The agent remembers more and behaves the same. Here, memory is one component and behavior change is the output.

It is also not a runtime you have to adopt or a chatbot product you have to host, because the design wraps an agent you already run. The framework observes the data that runtime already produces and hands evolved instructions back through getInstructions. Nothing upstream is replaced, and nothing happens on the outside world unless you wire an actuator to make it happen.

The Shape of the System

Judgment and bookkeeping must never blur together, so the framework splits into three layers with one pillar across the top. Middleware is every act of judgment, expressed as model calls. Substrate is deterministic bookkeeping - scope, memory, recall, the bandit, storage. Orchestration runs many agents over a durable async queue. Communication is the pillar: how the system reaches the world, other agents, and humans.

Models do the thinking and the framework does the bookkeeping, which is the line that decides where any piece of code belongs. If it derives meaning from content, it is a middleware and it calls a model. If it counts, compares scope keys, runs statistics, or persists state, it is substrate and it never guesses. Communication only carries decisions outward; it does not make them.

Persistence is a set of small interfaces rather than a baked-in database, so the lesson repo, the vector store, the work queue, and the bandit store are each swappable and ship with an in-memory default. Supply durable implementations and the queue and stores can be backed by a shared service, so a restart resumes un-drained events and workers share one queue. The in-flight signal buffer and the gate baseline, by contrast, are per-process working state and are not shared. The top object that ties it together is Evolve, which owns the middleware registry, the lesson repo, the vector store, the bandit, the actuators, the agent registry, and the coordinator.

Install and First Boot

Sensible defaults should let a zero-config instance run, so the core ships with built-in middlewares, an in-memory vector store, an in-memory lesson repo, and an in-memory work queue out of the box. You construct one object: new Evolve with a provider, a defaultScope, and optionally an embedder. Everything else has a working default you can override later.

Tests and examples must run offline with no key and no network, which is why a deterministic MockProvider and a LocalEmbedder ship in the testing module. The embedder produces stable vectors for the same input, the clock and id generator are injectable, and the closed-loop example runs end to end without touching a real model. Keep the default LocalEmbedder and the recall path works with zero vendor dependency.

Ingest Anything

The framework should take in any data and never fetch it itself, so ingest accepts any JSON and the developer always pushes it. evolve.ingest(json, opts) takes a blob of any shape, merges the optional scope over your defaultScope, defaults the type to external_data, and preserves an optional context object as a first-class field rather than a stray string. No adapter is required to start.

Model calls must never block the agent that is serving users, so ingest is non-blocking: it normalizes the event, writes it to a work queue, and returns queued before any model runs. queueDepth lets you observe backlog. A worker calls evolve.drain to pull events and run the heavy loop - gate, classify, buffer, distill, reconcile - off the hot path. Chat, metrics, on-chain blobs, and tickets all enter the same way, with no domain code in core.
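The enqueue-then-drain split described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the framework's implementation: IngestQueueSketch, QueuedEvent, IngestOptions, and the shape of the returned receipt are hypothetical names chosen for the example, and the handler passed to drain stands in for the gate, classify, buffer, distill, reconcile loop.

```typescript
// Hypothetical stand-ins for illustration; not the framework's real types.
type ScopeMap = Record<string, string>;

interface QueuedEvent {
  id: number;
  type: string;
  scope: ScopeMap;
  payload: unknown;
  context?: Record<string, unknown>;
}

interface IngestOptions {
  type?: string;
  scope?: ScopeMap;
  context?: Record<string, unknown>;
}

class IngestQueueSketch {
  private events: QueuedEvent[] = [];
  private nextId = 1;
  private defaultScope: ScopeMap;

  constructor(defaultScope: ScopeMap) {
    this.defaultScope = defaultScope;
  }

  // Normalize and enqueue only. No model call happens on this path.
  ingest(payload: unknown, opts: IngestOptions = {}): { status: "queued"; id: number } {
    const event: QueuedEvent = {
      id: this.nextId++,
      type: opts.type ?? "external_data",
      // The optional scope is merged over the default scope.
      scope: { ...this.defaultScope, ...(opts.scope ?? {}) },
      payload,
      // Context is kept as a structured field, not flattened into a string.
      ...(opts.context ? { context: opts.context } : {}),
    };
    this.events.push(event);
    return { status: "queued", id: event.id };
  }

  queueDepth(): number {
    return this.events.length;
  }

  // A worker pulls up to `max` events and runs the heavy loop on each one.
  drain(handle: (event: QueuedEvent) => void, max = 10): number {
    const batch = this.events.splice(0, max);
    for (const event of batch) handle(event);
    return batch.length;
  }
}
```

The point of the shape is that ingest does constant-time work regardless of how slow the model is; backlog shows up in queueDepth rather than in the latency of the serving agent.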

Context is the place a developer says what the data means, so when you pass context it becomes the primary text the system reasons and embeds against. A numeric blob with no text still works: a deterministic serializer renders it to a stable key and value form, so an mcap-shaped event and an engagement-shaped event each get a consistent vector and can be recalled later by meaning.
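One way a deterministic serializer for text-free payloads could work is shown below. The "path=value" line format and the names flattenStable and renderStable are assumptions made for this sketch; the document does not specify the actual output format. What matters is the property: the same data always yields the same text, whatever the key order of the incoming JSON.

```typescript
// Illustrative format only; the framework's actual serializer output is not specified here.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Flatten a JSON value into "path=value" lines, visiting object keys in sorted order
// so that key order in the input never changes the rendered text.
function flattenStable(value: Json, prefix: string, out: string[]): void {
  if (value === null || typeof value !== "object") {
    out.push(`${prefix}=${String(value)}`);
    return;
  }
  if (Array.isArray(value)) {
    value.forEach((item, index) => flattenStable(item, `${prefix}[${index}]`, out));
    return;
  }
  for (const key of Object.keys(value).sort()) {
    const child = value[key];
    if (child === undefined) continue;
    flattenStable(child, prefix === "" ? key : `${prefix}.${key}`, out);
  }
}

function renderStable(value: Json): string {
  const lines: string[] = [];
  flattenStable(value, "", lines);
  return lines.join("\n");
}
```

Because the rendered text is stable, an embedder that is itself deterministic will produce the same vector for the same blob, which is what makes later recall by meaning possible.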

The Judgment Gate

Models do the thinking, but burning a model on every tick of a high-volume numeric stream is pointless and expensive, so a JudgmentGate decides which events earn a model call using math and structure alone, never strings. For numeric payloads it only passes an event when a tracked field moves beyond a z-score or percent threshold against a per-scope rolling baseline. That baseline is scope-keyed and tracked per worker process.
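A z-score gate over a scope-keyed baseline might look like the sketch below. Several details are assumptions rather than documented behavior: the class name ZScoreGateSketch, the use of a running (not windowed) baseline maintained with Welford's update, the warm-up rule that nothing passes until minSamples values have been seen, and the omission of the percent-threshold variant.

```typescript
// Hypothetical sketch; the real JudgmentGate interface is not shown in this document.
interface BaselineStat {
  n: number;
  mean: number;
  m2: number; // running sum of squared deviations (Welford)
}

class ZScoreGateSketch {
  private baselines: Record<string, BaselineStat> = {};
  private zThreshold: number;
  private minSamples: number;

  constructor(zThreshold = 3, minSamples = 5) {
    this.zThreshold = zThreshold;
    this.minSamples = minSamples;
  }

  // Returns true when `value` deviates enough from this scope's baseline to earn a model call.
  shouldJudge(scopeKey: string, field: string, value: number): boolean {
    const key = `${scopeKey}::${field}`;
    const stat = this.baselines[key] ?? { n: 0, mean: 0, m2: 0 };

    let pass = false;
    if (stat.n >= this.minSamples) {
      const std = Math.sqrt(stat.m2 / (stat.n - 1));
      // With a flat baseline, any change at all counts as a deviation.
      pass = std > 0 ? Math.abs(value - stat.mean) / std >= this.zThreshold : value !== stat.mean;
    }

    // Welford update: fold the new value into the baseline after judging it.
    stat.n += 1;
    const delta = value - stat.mean;
    stat.mean += delta / stat.n;
    stat.m2 += delta * (value - stat.mean);
    this.baselines[key] = stat;
    return pass;
  }
}
```

The decision uses only arithmetic on tracked fields, so the gate stays cheap and never inspects strings. Keying the baseline by scope means a spike in one project is judged against that project's history only.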

The gate should throttle ambient noise without ever silencing a real signal, which is why it also dedupes near-identical events inside a window and supports a sampleRate dial per event type. Explicit signal events still always pass to intent. The gate is replaceable: implement JudgmentGate and hand it in, and your policy decides what reaches the expensive layer. Embeddings are cached, so re-ingesting the same input does not re-embed it.

Middleware: The Model Layer

Intent, sentiment, contradiction, and relevance must come from models and never from keyword lists, so every act of judgment is a Middleware: a named unit with a role string, a prompt template with handlebars slots, a JSON output schema, an enabled flag, and its own model. The run method builds the prompt, calls the structured path, validates the result, repairs once on bad JSON, and on hard failure returns a typed degraded result instead of throwing. There is no substring matching anywhere in this layer.
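The run path described above (build the prompt, call the model, validate, repair once, degrade instead of throwing) can be sketched like this. All names here (MiddlewareSketch, RunResult, ModelCall, runMiddleware) are hypothetical, a validate function stands in for the JSON output schema, and the model call is synchronous purely to keep the example short; a real provider call would be asynchronous.

```typescript
// Hypothetical shapes; the real Middleware interface may differ.
type RunResult<T> =
  | { ok: true; value: T; repaired: boolean }
  | { ok: false; degraded: true; reason: string };

interface MiddlewareSketch<T> {
  id: string;
  role: string;
  promptTemplate: string; // uses {{slot}} placeholders
  validate: (raw: unknown) => T | null; // stands in for a JSON output schema
}

type ModelCall = (prompt: string) => string; // synchronous only for brevity

function fillTemplate(template: string, slots: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, name: string) => slots[name] ?? "");
}

function tryParse<T>(text: string, validate: (raw: unknown) => T | null): T | null {
  try {
    return validate(JSON.parse(text));
  } catch {
    return null; // not JSON at all
  }
}

function runMiddleware<T>(mw: MiddlewareSketch<T>, slots: Record<string, string>, call: ModelCall): RunResult<T> {
  const prompt = `${mw.role}\n\n${fillTemplate(mw.promptTemplate, slots)}`;
  try {
    const first = tryParse(call(prompt), mw.validate);
    if (first !== null) return { ok: true, value: first, repaired: false };
    // One repair attempt, then give up with a typed degraded result.
    const second = tryParse(call(`${prompt}\n\nReturn only valid JSON matching the schema.`), mw.validate);
    if (second !== null) return { ok: true, value: second, repaired: true };
    return { ok: false, degraded: true, reason: "invalid_output_after_repair" };
  } catch {
    return { ok: false, degraded: true, reason: "provider_error" };
  }
}

// Example unit: an intent classifier whose output must carry a string `intent`.
const intentSketch: MiddlewareSketch<{ intent: string }> = {
  id: "intent",
  role: "You classify what an event means.",
  promptTemplate: "Event: {{text}}",
  validate: (raw) => {
    if (typeof raw === "object" && raw !== null) {
      const intent = (raw as { intent?: unknown }).intent;
      if (typeof intent === "string") return { intent };
    }
    return null;
  },
};
```

Returning a discriminated union rather than throwing lets the caller branch on ok and treat a degraded result as data, which keeps one bad model response from stopping the drain loop.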

Each job can use a different model, set per middleware in config, which is why the registry resolves models per middleware. Six units ship: intent classifies an event, sentiment scores user content and is off by default, contradiction judges whether new information conflicts with a memory and how to resolve it, distill turns evidence into one principle, embed turns text into a vector, and narrate restates a decision in plain language. Reflection is not a middleware but the drain loop that buffers signals per scope and calls distill plus the reconciler. Give intent a cheap classifier and distill a stronger model, and flip the sentiment unit on, all through the middlewares option.

Whatever can be flexible must be flexible, so role, prompt template, schema, model, and enabled state are all overridable per middleware without forking core. Output is always typed and never free text used as a control signal. Confidence is the one number a model never gets to set: distill returns a principle but its confidence is derived by the evidence ladder, not lifted from the model response.

Scope and Isolation

Nothing learned should ever become global by accident, so scope is structural and enforced rather than a naming convention. Six levels run from global through tenant, project, user, session, to agent. Every read and write routes through the ScopeEnforcer, and a stored memory only matches a query when its scope is equal to or a superset of the query scope. An empty scope is rejected on write and is never treated as a wildcard on read.
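The matching rule can be illustrated with scopes modeled as ordered paths. This sketch makes two assumptions that the text does not confirm: that a scope can be represented as a path such as global, tenant, project, and that "superset" means containment, so a lesson stored at a broader scope is visible to queries made from narrower scopes inside it. The names ScopePath, assertWritableScope, and scopeMatches are hypothetical.

```typescript
// Assumed representation: an ordered path from broadest to narrowest level,
// for example ["global", "tenant:acme", "project:web"].
type ScopePath = string[];

// An empty scope is rejected on write.
function assertWritableScope(scope: ScopePath): void {
  if (scope.length === 0) throw new Error("empty scope rejected on write");
}

// A stored scope matches when it equals the query scope or contains it
// (that is, when the stored path is a prefix of the query path).
function scopeMatches(stored: ScopePath, query: ScopePath): boolean {
  if (stored.length === 0 || query.length === 0) return false; // never a wildcard
  if (stored.length > query.length) return false;
  return stored.every((segment, i) => segment === query[i]);
}
```

Under this reading a sibling project never matches, because its path diverges from the query path before the stored path ends.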

One agent should be able to run many isolated projects safely, which is why a principle is stamped with the narrowest common scope of the signals that produced it. A correction made under one tenant, project, and user is recalled there and at the scopes that contain it, but never at a sibling project or a different tenant. Reflection groups signals by scope bucket before distilling, so a mixed queue can narrow scope but can never generalize across siblings.
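Grouping by scope bucket before distilling can be sketched as below. It assumes a bucket is keyed on the full scope path, which the document does not spell out; SignalSketch and groupByScopeBucket are hypothetical names. The effect is that signals from sibling scopes land in different buckets and are distilled separately, so no principle is derived from a mix of siblings.

```typescript
// Assumed representation: an ordered scope path, broadest level first.
type ScopePath = string[];

interface SignalSketch {
  scope: ScopePath;
  text: string;
}

// Group signals by exact scope so that each bucket is distilled on its own.
function groupByScopeBucket(signals: SignalSketch[]): Record<string, SignalSketch[]> {
  const buckets: Record<string, SignalSketch[]> = {};
  for (const signal of signals) {
    if (signal.scope.length === 0) continue; // an empty scope is never a valid bucket
    const key = JSON.stringify(signal.scope); // unambiguous even if ids contain separators
    const bucket = buckets[key] ?? [];
    bucket.push(signal);
    buckets[key] = bucket;
  }
  return buckets;
}
```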

Shared learning still has to be possible, so lifting a lesson upward is explicit and guarded. promoteScope moves a lesson to a broader scope only when its confidence is high, its evidence spans enough distinct child scopes, and an isGlobalSafe check clears the text of secrets and paths. The default policy is manual, so the shared knowledge an agent gains is something you grant rather than something that leaks.
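The three promotion guards can be expressed as a single deterministic check. This is a sketch under assumptions: the field names, the numeric confidence scale, and the function checkPromotion are hypothetical, and isGlobalSafe is injected because the document does not say how that check is implemented.

```typescript
// Hypothetical shapes for illustration.
interface PromotableLesson {
  text: string;
  confidence: number; // assumed to be a number in [0, 1]
  evidenceScopes: string[]; // child scope keys the evidence came from
}

interface PromotionPolicySketch {
  minConfidence: number;
  minDistinctChildScopes: number;
  isGlobalSafe: (text: string) => boolean; // supplied by the caller
}

// Every guard must clear; the result lists whichever guards blocked the promotion.
function checkPromotion(
  lesson: PromotableLesson,
  policy: PromotionPolicySketch,
): { allowed: boolean; blockedBy: string[] } {
  const distinct: Record<string, true> = {};
  for (const scopeKey of lesson.evidenceScopes) distinct[scopeKey] = true;
  const distinctCount = Object.keys(distinct).length;

  const blockedBy: string[] = [];
  if (lesson.confidence < policy.minConfidence) blockedBy.push("confidence");
  if (distinctCount < policy.minDistinctChildScopes) blockedBy.push("evidence_spread");
  if (!policy.isGlobalSafe(lesson.text)) blockedBy.push("not_global_safe");
  return { allowed: blockedBy.length === 0, blockedBy };
}
```

Returning the list of blocking guards, rather than a bare boolean, keeps a refused promotion explainable in the same way other decisions are.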

Memory: One Lesson Record

Memory should be scoped, confidence-tracked, and decaying, so all of it lives in one unified Lesson record with the memory type - episodic, semantic, procedural, pattern - carried as a field rather than split across separate tables. A status field of candidate, testing, active, or retired rides the same record, so proposals and experiments are just lessons in a state rather than a parallel store.

Confidence must be earned by repetition, which is why a reconcile step sits between distillation and storage. A new principle is embedded and compared to existing lessons in the same scope bucket. A close match that the contradiction middleware does not flag is treated as a repetition: the matched lesson gains evidence and climbs the ladder at the configured counts of two, five, and ten. A genuinely new principle becomes a fresh candidate at evidence count one. The same idea seen five times becomes one lesson at evidence five, not five duplicates.
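The evidence ladder and the repetition rule can be sketched as follows. The thresholds of two, five, and ten come from the text; the level labels (unrated, low, medium, high) are an assumption borrowed from the ButtCore description, and matchedId stands in for the outcome of the embedding comparison plus the contradiction check, which are not modeled here.

```typescript
// Hypothetical record shape; only the fields needed for the ladder are shown.
type Confidence = "unrated" | "low" | "medium" | "high";
type LessonStatus = "candidate" | "testing" | "active" | "retired";

interface LadderConfig { low: number; medium: number; high: number }

interface LessonSketch {
  id: string;
  principle: string;
  evidenceCount: number;
  confidence: Confidence;
  status: LessonStatus;
}

const defaultLadder: LadderConfig = { low: 2, medium: 5, high: 10 };

// Confidence is derived from the evidence count alone, never from a model response.
function confidenceFor(evidenceCount: number, ladder: LadderConfig = defaultLadder): Confidence {
  if (evidenceCount >= ladder.high) return "high";
  if (evidenceCount >= ladder.medium) return "medium";
  if (evidenceCount >= ladder.low) return "low";
  return "unrated";
}

// `matchedId` is the id of a close, non-contradicting existing lesson, or null if none.
function reconcileSketch(
  lessons: LessonSketch[],
  principle: string,
  matchedId: string | null,
  ladder: LadderConfig = defaultLadder,
): LessonSketch[] {
  if (matchedId !== null) {
    // Repetition: the matched lesson gains evidence and may climb the ladder.
    return lessons.map((lesson) =>
      lesson.id === matchedId
        ? { ...lesson, evidenceCount: lesson.evidenceCount + 1, confidence: confidenceFor(lesson.evidenceCount + 1, ladder) }
        : lesson,
    );
  }
  // Genuinely new: a fresh candidate at evidence count one.
  const fresh: LessonSketch = {
    id: `lesson-${lessons.length + 1}`,
    principle,
    evidenceCount: 1,
    confidence: confidenceFor(1, ladder),
    status: "candidate",
  };
  return [...lessons, fresh];
}
```

Feeding the same idea through five times yields one lesson at evidence five rather than five rows, which is the deduplication the paragraph describes.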

Old and weak memory should be evicted automatically and survive restarts otherwise, so decay is persisted on write-back and eviction runs through evolve.maintain on a cadence you control. Lessons below the confidence floor are removed and emit an eviction effect; high confidence lessons survive. When the contradiction middleware finds a real conflict it returns a typed resolution - replace, qualify, coexist, or merge - and the substrate applies it deterministically, only retiring an old lesson after the configured number of corroborating conflicts.

Semantic Recall

Retrieval should be by meaning and never by substring, so recall ranks lessons by vector similarity instead of matching text. The embed middleware turns a query into a vector and a VectorStore returns the nearest lessons, filtered by scope and excluding anything retired. A query that is semantically related but lexically different still finds the right lesson, which a keyword search could never do.

There should be no vendor lock-in to use semantic recall, which is why a plain in-memory cosine VectorStore ships as the zero-dependency default. Real vector databases plug into the same interface, so you can move to one without touching the rest of the system. The embedder is configurable, and the local embedder produces stable offline vectors so the recall path is fully testable with no provider.
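A zero-dependency cosine store is small enough to sketch in full. The class and field names below are hypothetical and the real VectorStore interface may differ. The scope filter assumes scopes are ordered paths and that a lesson is visible when its stored scope equals or contains the query scope, which is one reading of the isolation rule rather than a confirmed detail.

```typescript
// Hypothetical shapes; the real VectorStore interface is not shown in this document.
type LessonStatus = "candidate" | "testing" | "active" | "retired";

interface VectorEntry {
  id: string;
  vector: number[];
  scope: string[]; // assumed ordered path, broadest level first
  status: LessonStatus;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Assumed visibility rule: the stored scope is a non-empty prefix of the query scope.
function scopeCovers(stored: string[], query: string[]): boolean {
  if (stored.length === 0 || stored.length > query.length) return false;
  return stored.every((segment, i) => segment === query[i]);
}

class CosineStoreSketch {
  private entries: VectorEntry[] = [];

  upsert(entry: VectorEntry): void {
    this.entries = this.entries.filter((e) => e.id !== entry.id);
    this.entries.push(entry);
  }

  // Nearest lessons by cosine similarity, filtered by scope and excluding retired ones.
  query(vector: number[], queryScope: string[], topK = 3): Array<{ id: string; score: number }> {
    return this.entries
      .filter((e) => e.status !== "retired" && scopeCovers(e.scope, queryScope))
      .map((e) => ({ id: e.id, score: cosineSimilarity(vector, e.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```

A linear scan like this is adequate for tests and small deployments; a real vector database would replace the scan while keeping the same two methods.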

The Bandit

The agent should pursue only what works and prove it before trusting it, so behavior is validated by a multi-armed bandit rather than a fixed A/B test. The number of arms is open, the policy is pluggable between epsilon-greedy, UCB, and Thompson sampling, and exploration against exploitation is a single dial from zero to one set by BANDIT_EXPLORATION. The policy and dial are config, not code.
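Of the three policies named above, epsilon-greedy is the simplest to show. In this sketch the exploration dial is used directly as epsilon, which is an assumption about how the dial maps onto the policy; ArmSketch and chooseArmEpsilonGreedy are hypothetical names, and the random source is injectable so the behavior can be tested deterministically.

```typescript
// Hypothetical arm shape for illustration.
interface ArmSketch {
  id: string;
  pulls: number;
  meanReward: number;
}

// With probability `exploration` pick a uniformly random arm; otherwise pick the best mean.
function chooseArmEpsilonGreedy(
  arms: ArmSketch[],
  exploration: number,
  random: () => number = Math.random,
): ArmSketch {
  const first = arms[0];
  if (first === undefined) throw new Error("at least one arm is required");

  if (random() < exploration) {
    const index = Math.min(arms.length - 1, Math.floor(random() * arms.length));
    return arms[index] ?? first;
  }

  let best = first;
  for (const arm of arms) {
    if (arm.meanReward > best.meanReward) best = arm;
  }
  return best;
}
```

At a dial of zero the function always exploits the current best arm, and at one it always explores, which matches the description of a single dial from zero to one.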

Real statistics must back any promotion, which is why each arm and each guardrail tracks a running mean and variance with Welford's online algorithm and the decision uses a real two-sample test at a configured significance level. The developer pushes what happened by key with recordOutcome, carrying a primary metric and optional guardrails; the engine maps key to the arm that was served and attributes the outcome there. Promotion requires the primary metric to win and every guardrail to stay safe, so a change that improves tone but doubles escalation is vetoed.
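The statistics behind a promotion decision can be sketched as follows. Welford's update is as named in the text; everything else is an assumption made for illustration: the document does not say which two-sample test is used, so this sketch substitutes a two-sided z approximation with unpooled variances, it treats every guardrail as "lower is better", and the names RunningStat, pValueFromZ, and decidePromotion are hypothetical.

```typescript
// Welford's online algorithm: mean and variance in one pass, without storing samples.
class RunningStat {
  n = 0;
  mean = 0;
  private m2 = 0;

  push(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  variance(): number {
    return this.n > 1 ? this.m2 / (this.n - 1) : 0;
  }
}

// Abramowitz and Stegun formula 7.1.26 for erf; absolute error is about 1.5e-7.
function erfApprox(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Two-sided p-value under a standard normal approximation.
function pValueFromZ(z: number): number {
  const cdf = 0.5 * (1 + erfApprox(Math.abs(z) / Math.SQRT2));
  return 2 * (1 - cdf);
}

function twoSampleP(a: RunningStat, b: RunningStat): number {
  if (a.n < 2 || b.n < 2) return 1; // not enough data to conclude anything
  const se = Math.sqrt(a.variance() / a.n + b.variance() / b.n);
  if (se === 0) return a.mean === b.mean ? 1 : 0;
  return pValueFromZ((a.mean - b.mean) / se);
}

interface GuardrailSketch {
  name: string;
  control: RunningStat;
  variant: RunningStat; // assumed: a higher value is worse
}

type Decision = "promote" | "veto" | "inconclusive";

// Promote only if the primary metric wins and no guardrail is significantly worse.
function decidePromotion(
  control: RunningStat,
  variant: RunningStat,
  guardrails: GuardrailSketch[],
  alpha = 0.05,
): Decision {
  for (const g of guardrails) {
    if (g.variant.mean > g.control.mean && twoSampleP(g.control, g.variant) < alpha) return "veto";
  }
  const wins = variant.mean > control.mean && twoSampleP(control, variant) < alpha;
  return wins ? "promote" : "inconclusive";
}

function statOf(values: number[]): RunningStat {
  const stat = new RunningStat();
  for (const v of values) stat.push(v);
  return stat;
}
```

For small samples a t-distribution based test would be more conservative than the normal approximation used here; the structure of the decision (primary must win, guardrails can veto) stays the same either way.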

An agent should be able to read why a variant won in plain language, so explain assembles the facts - arm means, sample sizes, deltas, p-value, guardrail status, decision - into a typed rationale deterministically, and narrate only phrases it. The model never originates a number. A testable reflection auto-creates a candidate arm, getInstructions serves the assigned arm to a given key, and on a decisive result the winning lesson is promoted to active while the loser is retired with a reason.

Communication and Actuation

The framework should poke the outside world when needed while never acting on its own, so deciding and actuating are split. The core decides that something should change and emits a typed Effect - promotion, escalation, eviction, or contradiction resolved - without assuming how it manifests. Each effect routes to zero or more Actuators that you configure, selected structurally by kind and scope.

The developer always supplies the sender, which is why the same seam covers three modes through one interface. The default InstructionStoreActuator writes a promoted lesson into the active set so getInstructions serves it, with no external access and full offline testing. A WebhookActuator serializes the effect to a sender you inject for email or webhook delivery. A CodeEditActuator hands the lesson and rationale to a writer you supply with file or repo access; core never touches the filesystem itself.

A broken sender must never crash the loop, so a throwing actuator is caught and recorded as a typed failure. Actuation is idempotent and keyed by effect id, with an applied-set you can back with your store, so a restart mid-actuation never double-applies. Alongside actuation, an AgentBus carries inter-agent messages and a Narrator makes every significant decision explainable in plain language, which is how the system communicates to both humans and other agents.
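Fault-tolerant, idempotent dispatch can be sketched as below. The names and shapes are hypothetical. Two simplifications are worth stating: actuators are selected by effect kind only (scope-based selection is omitted), and the applied-set is keyed by effect id plus actuator name so that one failing actuator can be retried without re-running the ones that already succeeded. The text says only that actuation is keyed by effect id, so the composite key is an assumption.

```typescript
// Hypothetical shapes for illustration.
type EffectKind = "promotion" | "escalation" | "eviction" | "contradiction_resolved";

interface EffectSketch {
  id: string;
  kind: EffectKind;
  payload: unknown;
}

interface ActuatorSketch {
  name: string;
  kinds: EffectKind[];
  apply: (effect: EffectSketch) => void;
}

interface ActuationOutcome {
  actuator: string;
  status: "applied" | "skipped" | "failed";
  error?: string;
}

// `appliedKeys` is the applied-set; back it with durable storage to survive restarts.
function dispatchEffect(
  effect: EffectSketch,
  actuators: ActuatorSketch[],
  appliedKeys: Set<string>,
): ActuationOutcome[] {
  const outcomes: ActuationOutcome[] = [];
  for (const actuator of actuators) {
    if (actuator.kinds.indexOf(effect.kind) === -1) continue;
    const key = `${effect.id}:${actuator.name}`;
    if (appliedKeys.has(key)) {
      outcomes.push({ actuator: actuator.name, status: "skipped" });
      continue;
    }
    try {
      actuator.apply(effect);
      appliedKeys.add(key); // recorded only after a successful apply
      outcomes.push({ actuator: actuator.name, status: "applied" });
    } catch (err) {
      // A throwing sender becomes a typed failure instead of crashing the loop.
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ actuator: actuator.name, status: "failed", error: message });
    }
  }
  return outcomes;
}
```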

Orchestration

The system coordinates many agents rather than wrapping a single object, so agents are first-class in an AgentRegistry, each with an id, a role, a scope subtree, middleware overrides, and a status. Many agents coexist, each bound to a scope subtree, all learning from the same substrate while keeping their state isolated.

Agents should hand off, spawn, and report, which is why a Coordinator routes events and work by scope and role. Handoff reassigns work within the scope rules, spawn creates a sub-agent in a child scope, and report bubbles a summary upward. These movements travel over the AgentBus so they are durable and scope-respecting rather than fire-and-forget.

The system is built to scale out without a single hardcoded brain, so the lessons, vectors, work queue, and bandit state sit behind interfaces you can point at a shared backend. ingest enqueues and returns; a worker drains. Back the queue with a shared store and two workers share it without double-processing while a restart resumes un-drained events; reflection triggers per scope bucket rather than off one global counter. The defaults are in-memory, so durable cross-process operation is a matter of supplying durable implementations.

Escalating to a Human

An agent should flag a human when it is unsure, so uncertainty is a first-class trigger with a structural definition rather than a vibe. Escalation fires when a model-reported strength or confidence falls below the uncertainty floor, when a contradiction stays unresolved past its window, when two active in-scope lessons are judged conflicting, or when an experiment hits its maximum duration still inconclusive. Every threshold is configurable.
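The four triggers reduce to a pure function over already-computed facts. The field names, the millisecond units, and the reason labels in this sketch are hypothetical; only the four conditions themselves come from the text.

```typescript
// Hypothetical fact and threshold shapes for illustration.
interface EscalationFacts {
  lowestConfidence: number;
  unresolvedContradictionAgeMs: number | null; // null when nothing is unresolved
  activeLessonsInConflict: boolean;
  experimentAgeMs: number | null; // null when no experiment is running
  experimentDecided: boolean;
}

interface EscalationThresholds {
  uncertaintyFloor: number;
  contradictionWindowMs: number;
  maxExperimentMs: number;
}

type EscalationReason =
  | "below_uncertainty_floor"
  | "contradiction_unresolved"
  | "active_lessons_conflict"
  | "experiment_inconclusive";

// Returns every trigger that fired; an empty list means no human is needed.
function escalationReasons(facts: EscalationFacts, t: EscalationThresholds): EscalationReason[] {
  const reasons: EscalationReason[] = [];
  if (facts.lowestConfidence < t.uncertaintyFloor) reasons.push("below_uncertainty_floor");
  if (facts.unresolvedContradictionAgeMs !== null && facts.unresolvedContradictionAgeMs > t.contradictionWindowMs) {
    reasons.push("contradiction_unresolved");
  }
  if (facts.activeLessonsInConflict) reasons.push("active_lessons_conflict");
  if (facts.experimentAgeMs !== null && !facts.experimentDecided && facts.experimentAgeMs >= t.maxExperimentMs) {
    reasons.push("experiment_inconclusive");
  }
  return reasons;
}
```

Keeping the check free of model calls means the decision to escalate is reproducible from the recorded facts, and a narrating model is only needed afterwards to phrase the reason.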

The reason a human is pulled in should be legible, which is why each escalation emits an EscalationEffect carrying a plain-language reason built from deterministic facts and phrased by narrate. It routes to whatever channel you wired, so escalation reaches Slack, email, or a webhook before anyone opens a dashboard, and a confident, consistent stream never raises a false alarm.

Configuration and Env

Whatever can be flexible must be flexible, so every seam is bring-your-own: the data is any JSON, the lesson repo, vector store, work queue, and bandit store are pluggable interfaces, each model is a per-middleware provider, the embedder is swappable, the metric is dev-pushed, the sender is an actuator, the bandit policy is selectable, and a middleware is a model with a role and a prompt. Swap any one without touching core.

Configuration should never be hardcoded, so core takes every setting as a constructor option on EvolveOptions and stays dependency-free: there is no hidden env coupling in core, and what you pass is exactly what runs. You set the provider and defaultScope, optionally an embedder, the per-middleware model, role, prompt, and enabled flag (sentiment is off until you enable it), the bandit policy and exploration dial, the reconcile threshold, the uncertainty floor, and any of the persistence stores. The rule is simple: if a reasonable developer would want to change it, it is a constructor option, and a zero-config instance still boots on the in-memory defaults.

Extending It

You should be able to add your own judgment without forking the framework, so a custom middleware is one registry call: give it an id, a role, a prompt template, an output schema, and a model, and it runs on the same validate-and-repair path as the built-ins. A toxicity classifier or a domain-specific intent unit drops in beside intent and sentiment.

Every other seam extends the same way, which is why custom storage, a custom vector store, a custom actuator, and a custom bandit policy each implement one interface and pass into Evolve. Pushing a metric is the smallest extension of all: call recordOutcome with a key and a number that means whatever you decide - reply rate, market cap, resolution time - and the engine never needs to know where it came from.

ButtCore: The Markdown Protocol

The same learning loop should be available with zero dependencies for a single developer and their coding assistant, which is what ButtCore is: a markdown protocol of seven files plus AGENTS.md that an agent reads at session start and writes back to as work proceeds. There is no database, no API key, and no configuration beyond the files themselves. It is a separate artifact from the framework and stands on its own.

The lesson should outlive the incident, which is why ButtCore mirrors the framework loop in files. SOUL.md holds the continuously updated identity and distilled principles, PATTERNS.md holds cross-project observations, MISTAKES.md records root causes, and LEXICON.md maps personal vocabulary to behavior. Confidence escalates by repetition the same way: one observation logs to SESSION_LOG, two earn low confidence, five earn medium, and ten with no contradictions earn high.

Local knowledge must never contaminate global identity, so ButtCore separates scope with a boundary marker that only lets data reach global files through distillation, which enforces abstraction. A project choosing PostgreSQL becomes a general preference for established relational databases before it can be stored globally, and global files never carry project names or file paths. Use ButtCore alone for individual workflows, the framework alone for production systems, or both together.

Roadmap

The framework should keep getting more capable without ever breaking the contract, so the near-term work hardens the substrate it already runs on. Today the persistence seams ship with in-memory defaults. Durable reference implementations are next: a file-backed store for single-node durability, then SQLite and Postgres with pgvector for multi-tenant production deployments. Each sits behind the existing LessonRepo, VectorStore, QueueStore, and BanditStore interfaces, so nothing in core changes.

Bring-your-own should reach further over time, which is why the roadmap extends the seams rather than adding lock-in. More reference providers and embedders, a wider set of reference actuators for common channels, and a dashboard surface for inspecting lessons, experiments, and escalations are next. Each arrives behind the same interfaces, so adopting one is a swap and never a rewrite.

The Differentiation

Memory stores what happened and learning changes what happens next, which is the whole distinction this framework is built on. Other tools extract facts, manage context, or build graphs and stop there. Here the loop is closed: observe, judge with a model, distill, prove with a bandit, and apply. The agent does not just remember more, it behaves differently.

That difference has to be trustworthy to be worth anything, which is why it is evidence-backed, scope-aware, and explainable by construction. Confidence is earned by repetition, promotion is gated by real statistics and guardrails, isolation is structural, and every significant decision can be read in plain language. The framework does the bookkeeping so the model can do the thinking, and it gets sharper with every single event.