
Agent memory

Agents remember. Past chat messages, past tool outputs, persistent facts about users — all searchable via a single recall tool the agent calls like any other function.

Why this exists

Long conversations grow past what fits in a model's context window. Even with prompt caching, older turns drop out of the active prompt. Agents that need to refer back to "what did the customer say at the start of the call" or "what was the result of that database query 15 turns ago" can't, unless something brings the past back into context.

That something is the recall tool. The agent decides when to call it; Martha decides what's relevant.

How an agent uses it

Same shape as every other tool call:

json
{
  "name": "recall",
  "arguments": {
    "query": "the metabase query about order count",
    "top_k": 5,
    "source_kinds": ["tool_output"]
  }
}

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | required | Natural-language search query. Both keyword and semantic similarity are matched. |
| top_k | integer | 5 | Number of items to return (max 20). |
| source_kinds | list of strings | all | Filter by kind: chat_message, tool_output, document_chunk, fact. Omit to search all. |
| scope | string | "any" | Where to look. See Scopes below. |
| enable_rerank | boolean | true | Cross-encoder rerank for higher precision. Disable only if you specifically want raw scores. |

Response

json
{
  "items": [
    {
      "id": "uuid",
      "source_kind": "tool_output",
      "source_ref": "tool_call_id_xyz",
      "content": "...",
      "event_time": "2026-05-02T14:23:01Z",
      "score": 0.87
    }
  ],
  "total": 1,
  "degraded": false,
  "rerank_used": true
}

degraded: true means semantic indexing is still running in the background: keyword results are complete, but semantic recall may be incomplete. Recall is still usable.

rerank_used: false means the precision rerank was skipped — items still come back, just with merge-order ranking.
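
For example, a response returned while semantic indexing is still catching up and the reranker is unreachable might look like this (the item values are illustrative, reusing the placeholders from the example above):

json
{
  "items": [
    {
      "id": "uuid",
      "source_kind": "tool_output",
      "source_ref": "tool_call_id_xyz",
      "content": "...",
      "event_time": "2026-05-02T14:23:01Z",
      "score": 0.61
    }
  ],
  "total": 1,
  "degraded": true,
  "rerank_used": false
}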

Scopes

The scope parameter picks which slice of memory to search:

| Scope | What it covers | Required for use |
| --- | --- | --- |
| session | Chat messages and tool outputs from the current chat session | Always available |
| tenant | Document chunks indexed across the whole tenant | Always available |
| user | Facts written specifically as belonging to a user, across all their sessions | Authenticated human user |
| agent | Facts written specifically as belonging to this agent, across all sessions | Agent attached to the chat |
| any | All four, weighted-merged | Default; no extra requirements |

Tenant isolation is unconditional in every scope. A recall in tenant A cannot ever return rows from tenant B.
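
A scoped recall is the same call shape with the scope argument set. For example, limiting the search to facts about the current user (the query text is illustrative):

json
{
  "name": "recall",
  "arguments": {
    "query": "display and unit preferences",
    "scope": "user",
    "source_kinds": ["fact"]
  }
}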

You can tune the merge weights via environment variables if your agent's domain calls for it:

MARTHA_RECALL_WEIGHT_SESSION=1.3
MARTHA_RECALL_WEIGHT_USER=1.1
MARTHA_RECALL_WEIGHT_AGENT=1.0
MARTHA_RECALL_WEIGHT_TENANT=1.0

Setting a weight to 0 suppresses that source class entirely.

Persistent facts

For things you want an agent to remember across sessions, the agent calls remember_fact:

json
{
  "name": "remember_fact",
  "arguments": {
    "content": "User prefers metric units and dark mode.",
    "scope": "user"
  }
}

Returns { "id": "uuid", "was_new": true }. Idempotent — the same content from the same identity in the same scope writes once.

Two scopes available:

  • user — fact belongs to the human user. Recallable in any future session that user has, regardless of which agent.
  • agent — fact belongs to this agent. Recallable in any future session this agent runs, regardless of which user.
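
The agent-scoped variant is the same call with scope set to agent; the fact content here is illustrative:

json
{
  "name": "remember_fact",
  "arguments": {
    "content": "The workspace's nightly export job runs at 02:00 UTC.",
    "scope": "agent"
  }
}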

Subsequent remember_fact calls that contradict an earlier fact retire the older one automatically. The agent doesn't have to track this — when it next calls recall, only the current fact appears. Older versions stay in the audit trail, never in default recall.
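
For instance, if the agent stored the metric-units preference above and the user later switches, the agent simply writes the new fact (content illustrative):

json
{
  "name": "remember_fact",
  "arguments": {
    "content": "User now prefers imperial units.",
    "scope": "user"
  }
}

The earlier metric-units fact is retired automatically; the next recall over user facts returns only this one.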

What's stored

| Source | What gets recorded | When |
| --- | --- | --- |
| chat_message | Every user message and assistant response | At message persist time |
| tool_output | Every tool result (full, or elided preview if oversized) | When the tool returns |
| document_chunk | Every chunk of every uploaded document | When document ingestion finalizes |
| fact | Each remember_fact call | When the agent makes it |

Tying back to large tool outputs

When a tool returns a large payload, Martha stores it under a tool_output_key (e.g. tout_abc...) and shows the agent a preview. Memory holds the same preview. To re-read the full payload, the agent calls read_tool_output(tool_output_key="tout_abc...") — same key, full content.
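
Spelled out as a tool call, re-reading the full payload looks like this, using the same key the agent saw in the preview:

json
{
  "name": "read_tool_output",
  "arguments": {
    "tool_output_key": "tout_abc..."
  }
}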

Compressed mode for very long conversations

When a conversation grows past 80% of the model's context window, Martha automatically switches the prompt to a recent tail of the last 4 messages. The agent reaches older content via recall instead.

This is invisible to end users — long sessions just keep working. The recall tool description tells the agent to reach for it when the conversation has been trimmed for length, and in practice agents do.

You don't need to do anything to enable this. It's automatic.
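
In a trimmed session, the agent recovers older turns the same way it recovers anything else, with a session-scoped recall (the query text is illustrative):

json
{
  "name": "recall",
  "arguments": {
    "query": "what the customer reported at the start of the call",
    "scope": "session",
    "source_kinds": ["chat_message"]
  }
}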

How fresh is recall

New chat messages, tool outputs, and facts are recallable via keyword search immediately. Semantic search becomes available within ~500 ms in the typical path, and within 5 seconds in the worst case (background indexing). The degraded flag in the response surfaces this.

Precision rerank

Recall over-fetches up to 20 candidates and reorders them with a cross-encoder reranker to produce the final ordering. The default model is a 306M-parameter multilingual reranker covering 70+ languages. If the reranker is unavailable, recall falls back to merge-order results; rerank_used: false in the response signals this. Recall always succeeds; rerank is a quality upgrade, not a correctness gate.
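
To get the raw merge-order results deliberately, rather than via the fallback, a call can turn rerank off per request (the query text is illustrative):

json
{
  "name": "recall",
  "arguments": {
    "query": "notes from the pricing discussion",
    "enable_rerank": false
  }
}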

You can swap rerank models per deployment via configuration. Higher-quality multilingual options are available; English-only options run roughly twice as fast.

Tuning

Operational knobs surfaced as environment variables:

| Variable | Default | What it does |
| --- | --- | --- |
| MARTHA_RERANKER_URL | unset | Reranker endpoint. Unset disables rerank entirely. |
| MARTHA_RECALL_WEIGHT_SESSION | 1.3 | Weight for session items in scope=any. |
| MARTHA_RECALL_WEIGHT_USER | 1.1 | Weight for user items in scope=any. |
| MARTHA_RECALL_WEIGHT_AGENT | 1.0 | Weight for agent items in scope=any. |
| MARTHA_RECALL_WEIGHT_TENANT | 1.0 | Weight for tenant items in scope=any. |
| MARTHA_MEMORY_JUDGE_MODEL | small/fast model | Judge model for fact contradiction. Empty string disables. |
| MARTHA_MEMORY_JUDGE_TOPK | 5 | Candidates compared per pending fact. |
| MARTHA_MEMORY_JUDGE_THRESHOLD | 0.7 | Distance cutoff for "close enough to judge". |
| MARTHA_MEMORY_ANONYMIZE_SOFT_TTL_DAYS | 30 | Days before retired facts are anonymized. |
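
As an example, a deployment that leans more heavily on tenant documents, only judges closer fact candidates, and shortens the anonymization window might set (values illustrative):

MARTHA_RECALL_WEIGHT_TENANT=1.2
MARTHA_MEMORY_JUDGE_THRESHOLD=0.6
MARTHA_MEMORY_ANONYMIZE_SOFT_TTL_DAYS=14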

Retention

Retired memory rows are anonymized, not hard-deleted. Content is replaced with [REDACTED], identity and embedding columns are cleared, and an audit-log event records the action. Rows stay in place so any references to them (supersede chains, audit pointers) remain intact. Default recall always excludes anonymized rows.

Per-tenant retention policy is configurable; the system runs in dry-run mode by default for the first observation window in any new deployment.

Tenant isolation

Every recall query filters by tenant at the SQL level. The tenant identifier comes from the request token, never from agent input. An agent cannot recall content from another tenant even if it tries to inject one — the value is overridden server-side.

Related

  • Document tools — searching uploaded document corpora (different store, different lifecycle from session memory)
  • Citations — formal source references from agent responses to document chunks

Martha is built by aiaiai-pt.