
Surfaces & capability keys

If you're building a chat surface — a web client, an embed, the CLI, an SMS bridge — this is the contract for telling Martha what your surface can render and answer. Martha is conservative: it only emits a widget or a blocking interactive tool if the surface has declared it can handle it. Everything fails closed.

capability_keys

On POST /api/chat/{session_id}, the request body may include capability_keys — the list of things this surface can render/answer:

```jsonc
{ "content": "show me the air quality stations",
  "capability_keys": ["map_widget", "report_widget", "widget_stream", "ask_user"] }
```
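
A minimal TypeScript sketch of how a web surface might assemble that request. Only the path and the `content` / `capability_keys` fields come from this page; the `CapabilityKey` union, the `ChatRequest` shape, the `buildChatRequest` helper and the JSON content type are illustrative assumptions, not Martha's client API.

```typescript
// Capability keys mentioned on this page; declare only the subset your surface supports.
type CapabilityKey =
  | "map_widget"
  | "report_widget"
  | "webxr_widget"
  | "generate_widget"
  | "widget_stream"
  | "ask_user";

// Hypothetical request descriptor: hand these fields to fetch() or any HTTP client.
interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper that builds POST /api/chat/{session_id} for one user message.
function buildChatRequest(
  sessionId: string,
  content: string,
  capabilityKeys: CapabilityKey[] = [],
): ChatRequest {
  const payload: { content: string; capability_keys?: CapabilityKey[] } = { content };
  // Sending no keys at all is the fail-closed, plain-text-only case.
  if (capabilityKeys.length > 0) {
    payload.capability_keys = Array.from(new Set(capabilityKeys)); // drop duplicates
  }
  return {
    url: `/api/chat/${encodeURIComponent(sessionId)}`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// The request from the example above (session id is made up).
const req = buildChatRequest("s-123", "show me the air quality stations", [
  "map_widget",
  "report_widget",
  "widget_stream",
  "ask_user",
]);
```

A bare surface would call `buildChatRequest(id, text)` with no keys, which omits `capability_keys` entirely and so receives plain text only.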

The server uses them two ways:

  1. Display widgets (map_widget, report_widget, webxr_widget, generate_widget) — declaring a key injects server-owned instructions so the agent may emit that widget. A surface that doesn't declare them never gets them.
  2. Interactive tools (ask_user) — declaring the key lets the agent issue a blocking question; absent it, the agent gets an immediate fallback instead of hanging.

Fail-closed by design. A surface that sends no capability_keys (bare CLI, SMS, the minimal embed) gets plain text only — no widgets, no blocking tools. This is intentional: a surface that can't render an approval/question must never be left waiting on one.
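
Martha's server-side code isn't shown here, so the following is only a hypothetical TypeScript model of the rules as described, useful for reasoning about what a given surface will receive. `resolveAbilities` and `SurfaceAbilities` are invented names.

```typescript
// Display-widget keys listed on this page.
const DISPLAY_WIDGET_KEYS = ["map_widget", "report_widget", "webxr_widget", "generate_widget"];

interface SurfaceAbilities {
  widgetKeys: string[]; // widgets the agent may be told it can emit
  streamedWidgetFrames: boolean; // widget_stream: widgets arrive as their own SSE frames
  blockingAskUser: boolean; // ask_user may suspend the turn; otherwise an immediate fallback
}

// Hypothetical model of the fail-closed rule: nothing is enabled unless explicitly
// declared, and unknown keys are simply ignored.
function resolveAbilities(capabilityKeys?: readonly string[]): SurfaceAbilities {
  const declared = new Set(capabilityKeys ?? []);
  return {
    widgetKeys: DISPLAY_WIDGET_KEYS.filter((key) => declared.has(key)),
    streamedWidgetFrames: declared.has("widget_stream"),
    blockingAskUser: declared.has("ask_user"),
  };
}
```

With no keys, every field comes back empty or `false`, which is the plain-text-only surface described above.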

Display widgets

Two delivery modes, selected by the widget_stream key:

  • Inline (default): the agent emits <report_widget>{…JSON…}</report_widget> inside the assistant text; your client parses the tags out and renders the component.
  • Streamed frames (widget_stream): the server lifts a complete widget out of the token stream and sends it as its own SSE frame, so raw JSON never appears as text mid-stream:
    data: {"type":"widget","widget_type":"report","data":{ … }}
    Consume it on a dedicated channel (e.g. an onWidget handler) and render the widget directly. Persisted/reloaded messages still contain the inline tags, so a text parser remains the fallback for history.
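
A client that declares `widget_stream` still needs both paths: a frame handler for live turns and a tag parser for history. The TypeScript sketch below assumes the tag form `<type_widget>{json}</type_widget>` and the frame shown above, with one complete JSON object per `data:` line; `extractInlineWidgets`, `handleSseLine` and the `Widget` type are hypothetical names, and non-widget frame shapes are not described on this page, so they are passed through untouched.

```typescript
interface Widget {
  widgetType: string; // e.g. "report", "map"
  data: unknown;
}

// Inline mode / history fallback: matches <report_widget>{...}</report_widget> etc.
const INLINE_WIDGET = /<(\w+)_widget>([\s\S]*?)<\/\1_widget>/g;

function extractInlineWidgets(text: string): { text: string; widgets: Widget[] } {
  const widgets: Widget[] = [];
  const stripped = text.replace(INLINE_WIDGET, (match: string, type: string, json: string) => {
    try {
      widgets.push({ widgetType: type, data: JSON.parse(json) });
      return ""; // remove the tag from the visible text
    } catch {
      return match; // leave malformed tags visible rather than silently dropping them
    }
  });
  return { text: stripped.trim(), widgets };
}

// Streamed mode: route one SSE line; widget frames go to onWidget, all others to onOther.
function handleSseLine(
  line: string,
  onWidget: (w: Widget) => void,
  onOther: (frame: Record<string, unknown>) => void,
): void {
  if (!line.startsWith("data:")) return; // ignore event:, id:, comments and blank lines
  const frame = JSON.parse(line.slice(5).trim()) as Record<string, unknown>;
  if (frame.type === "widget") {
    onWidget({ widgetType: String(frame.widget_type), data: frame.data });
  } else {
    onOther(frame);
  }
}
```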

Declare only the widget types your client actually renders — the declared keys are the source of truth for what the agent is told it can produce.
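
One way to keep that declaration honest is to derive the widget keys from whatever renderers the client actually registers. This is a sketch, not a documented pattern: the registry, the placeholder render functions and the `"<type>_widget"` naming rule (inferred from `report_widget` pairing with `widget_type` `"report"`) are assumptions.

```typescript
// Hypothetical registry of the widget components this client ships, keyed by widget type.
// The render functions are placeholders.
type Renderer = (data: unknown) => string;

const renderers: Record<string, Renderer> = {
  map: (data) => `<map ${JSON.stringify(data)}>`,
  report: (data) => `<report ${JSON.stringify(data)}>`,
};

// Derive display-widget keys from the registry so the declaration can't drift from
// what actually renders. Assumes the "<type>_widget" naming convention.
function widgetCapabilityKeys(registry: Record<string, Renderer>): string[] {
  return Object.keys(registry).map((type) => `${type}_widget`);
}

// Append non-widget keys only if the client also supports them.
const declaredKeys = [...widgetCapabilityKeys(renderers), "widget_stream"];
```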

Interactive tools: ask_user

ask_user is a blocking clarifying question (AskUserQuestion-shaped: a question + options). When the agent calls it:

  1. The chat turn suspends (durably — survives restarts) and an SSE tool_status frame is emitted with status:"awaiting_input" and the question.
  2. Your surface renders the question and collects the answer.
  3. You deliver it back:
    POST /api/chat/{session_id}/human-input
    { "tool_call_id": "...", "answer": "Staging" }
    Auth is by session ownership: the request is made as the session's own principal. Answering a clarifying question is not a privileged action, so a service-account chat surface can deliver it. (Contrast: approval decisions are privileged and go elsewhere; see Capability approvals.)
  4. The agent loop wakes and continues with the answer as the tool result.
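
Steps 1 and 3 from the surface's side might look like the TypeScript sketch below. This page only says the frame is a tool_status frame with status "awaiting_input" and the question, and that the answer must echo a `tool_call_id`; the exact field names (`type`, `tool_call_id`, `question`, `options`) and the helper names are assumptions. Step 2, rendering the question and collecting the answer, is surface-specific and omitted.

```typescript
// Assumed frame shape; only the tool_status / awaiting_input / question parts are stated on this page.
interface AwaitingInputFrame {
  type: "tool_status";
  status: "awaiting_input";
  tool_call_id: string;
  question: string;
  options?: string[];
}

interface HumanInputRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Step 1: recognise a suspended ask_user turn in the SSE stream.
function isAwaitingInput(frame: unknown): frame is AwaitingInputFrame {
  if (typeof frame !== "object" || frame === null) return false;
  const f = frame as Record<string, unknown>;
  return (
    f.type === "tool_status" &&
    f.status === "awaiting_input" &&
    typeof f.tool_call_id === "string" &&
    typeof f.question === "string"
  );
}

// Step 3: deliver the answer. Send it with the session owner's own credentials;
// no elevated privileges are involved.
function buildHumanInputRequest(sessionId: string, toolCallId: string, answer: string): HumanInputRequest {
  return {
    url: `/api/chat/${encodeURIComponent(sessionId)}/human-input`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ tool_call_id: toolCallId, answer }),
  };
}
```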

If nobody answers within the timeout (24h), the agent receives a structured "timed out" result and proceeds — it never hangs forever.

ask_user carries no permissions. It's a clarification pause; the agent keeps acting under its own grants the whole time. The human's answer is just data fed back as a tool result.

Requirements recap

For ask_user to work end-to-end: the agent must be granted ask_user (it's a platform function — see Permissions & access), and the surface must send the ask_user capability key. Miss either and it fails closed to a fallback.
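
The two-condition rule reduces to a single AND. As a hypothetical TypeScript check (representing the agent's grants as a set of strings is an assumption made for illustration):

```typescript
type AskUserPath = "blocking_question" | "fallback";

// Hypothetical check mirroring the recap: the agent grant AND the surface key are both required.
function askUserPath(agentGrants: ReadonlySet<string>, capabilityKeys: readonly string[]): AskUserPath {
  const granted = agentGrants.has("ask_user");
  const declared = capabilityKeys.includes("ask_user");
  return granted && declared ? "blocking_question" : "fallback";
}
```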

See also

  • Permissions & access model — grants, the two-surface rule, identities.
  • Capability approvals — the privileged, blocking, human-gated sibling of ask_user.

Martha is built by aiaiai-pt.