Document ingestion

Upload a document and Martha turns it into something agents can search. PDF, Markdown, HTML, and image uploads all follow the same path: parse, chunk, embed, index. When the pipeline finishes, every document tool (keyword, semantic, visual, page-image) sees the result.

How it works

Upload → Validate → Parse + chunk → Enrich (in parallel) → Finalize
           │            │               │                       │
           │            │               ├─ Vector embed         └─ Mark "ready"
           │            │               ├─ Vision-model describe
           │            │               └─ Visual index
           │            └─ Layout-aware parse + chunk + page classify
           └─ Size, type, and tenant checks

Each stage runs independently with its own retry and timeout policy. The three enrichment steps run in parallel and are non-fatal — failures degrade gracefully (e.g. lose semantic search but keep keyword search) rather than blocking the document from being usable.
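The parallel, non-fatal behaviour of the enrich stage can be illustrated with a minimal Python sketch. This is not Martha's implementation: the step functions (`embed`, `describe`, `visual_index`) and `enrich_document` are hypothetical stand-ins, and the simulated embedding failure is only there to show that one failing step does not stop the others or block the document.

```python
import asyncio


async def run_enrichment(steps):
    """Run every enrichment step concurrently and collect failures instead of raising."""
    names = list(steps)
    # return_exceptions=True keeps one failing step from cancelling the others.
    results = await asyncio.gather(
        *(steps[name]() for name in names), return_exceptions=True
    )
    return {name: res for name, res in zip(names, results) if isinstance(res, Exception)}


# Hypothetical enrichment steps; the first one simulates a provider outage.
async def embed():
    raise RuntimeError("embedding provider unavailable")


async def describe():
    return "descriptions stored"


async def visual_index():
    return "pages indexed"


def enrich_document():
    steps = {"embed": embed, "describe": describe, "visual_index": visual_index}
    failed = asyncio.run(run_enrichment(steps))
    # Enrichment failures are recorded, but the document still becomes "ready".
    return "ready", sorted(failed)
```

Calling `enrich_document()` in this sketch yields a ready document with `embed` listed as the only failed step, which corresponds to losing semantic search while keeping keyword search.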

Page rendering and visual processing

For PDFs, every page is rendered to PNG and uploaded to storage (up to `MAX_RENDER_PAGES`, default 300). This enables:

- The `get_page_image` and `visual_search` tools to return page images to agents.
- Visual indexing for image-first retrieval across all pages, which is useful for diagrams, schematics, and charts.

Pages are classified as text, drawing, or table based on layout. Drawing pages get extra treatment when `INGESTION_VLM_ENABLED` is on: a vision model describes the page content, and the descriptions are stored as searchable description chunks. This means agents can find a diagram by what's in it, not just by the surrounding text.

All enrichment is optional and non-fatal — if any of it fails, keyword search still works.

Supported Formats

| Format | Content Type | Notes |
| --- | --- | --- |
| PDF | `application/pdf` | Full text extraction with optional OCR |
| DOCX | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Word documents |
| PPTX | `application/vnd.openxmlformats-officedocument.presentationml.presentation` | PowerPoint |
| HTML | `text/html` | Web pages |
| Markdown | `text/markdown` | Markdown files |
| CSV | `text/csv` | Tabular data |
| Images | `image/png`, `image/jpeg`, `image/tiff`, `image/webp`, `image/bmp` | OCR text extraction |
| JSON | `application/json` | Structured text; read as-is, chunked by character span |
| YAML | `application/yaml`, `application/x-yaml`, `text/yaml`, `text/x-yaml` | Structured text; read as-is, chunked by character span |
| XML | `application/xml`, `text/xml` | Structured text; read as-is, chunked by character span |
| Plain text | `text/plain` | Read as-is, chunked by character span |

!!! info "Structured text formats skip layout analysis"
    JSON, YAML, XML, and plain text content types are decoded as UTF-8 and passed straight to the chunker. They don't go through the document layout pipeline (no page rendering, no OCR, no vision-model description). This is faster (sub-second for most files) and avoids spurious "no parseable content" errors from the layout parser on what is genuinely just text. Pages are reported as 0; chunk count reflects the file's character count divided by the chunk-size token budget.
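As a rough illustration of what "passed straight to the chunker" can mean, here is a sliding-window chunker sketch. It assumes the text has already been tokenized; the default sizes mirror `INGESTION_CHUNK_SIZE` (500) and `INGESTION_CHUNK_OVERLAP` (50), while the function name `chunk_tokens` and the windowing strategy are assumptions, not Martha's actual chunker.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

With 1,000 tokens and the defaults, the windows start at tokens 0, 450, and 900, so neighbouring chunks repeat 50 tokens of context. A real pipeline would count tokens with the configured tokenizer rather than a pre-split list.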

!!! info "Maximum file size"
    The default maximum is 50 MB per document. This can be adjusted via the `INGESTION_MAX_DOC_SIZE` environment variable.
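The size and type checks performed during validation can be sketched as follows. The content types and the 50 MB default come from this page; the function name `validate_upload`, the handling of `; charset=...` parameters, and the wording of the size message are assumptions made for illustration (tenant checks are left out).

```python
SUPPORTED_CONTENT_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html", "text/markdown", "text/csv",
    "image/png", "image/jpeg", "image/tiff", "image/webp", "image/bmp",
    "application/json",
    "application/yaml", "application/x-yaml", "text/yaml", "text/x-yaml",
    "application/xml", "text/xml",
    "text/plain",
}

DEFAULT_MAX_DOC_SIZE = 50 * 1024 * 1024  # 52428800 bytes, the documented default


def validate_upload(content_type, size_bytes, max_size=DEFAULT_MAX_DOC_SIZE):
    """Return None when the upload is acceptable, otherwise a reason string."""
    # Assumption: parameters such as "; charset=utf-8" are ignored when matching.
    base_type = content_type.split(";")[0].strip().lower()
    if base_type not in SUPPORTED_CONTENT_TYPES:
        return f"unsupported content type: {base_type}"
    if size_bytes > max_size:
        # Hypothetical message wording.
        return f"document exceeds maximum size of {max_size} bytes"
    return None
```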

Graceful Degradation

The enrich stage (embedding, VLM descriptions, visual indexing) is not required for a document to be usable. If any enrichment fails:

- The document is still marked as "ready"
- Chunks are stored with full text (keyword search works)
- Semantic search is unavailable if embeddings failed
- Drawing descriptions are unavailable if VLM failed
- Visual retrieval is unavailable if ColPali indexing failed
- All can be backfilled by re-ingesting
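The degradation rules above amount to a small lookup from failed enrichment steps to the capabilities that remain. The sketch below uses hypothetical step and capability names (`embed`, `vlm`, `visual_index`, `keyword_search`, and so on); it only encodes the mapping stated in the list.

```python
# Capability -> enrichment step it depends on (names are illustrative).
REQUIRES = {
    "semantic_search": "embed",
    "drawing_descriptions": "vlm",
    "visual_retrieval": "visual_index",
}


def available_capabilities(failed_steps):
    """Keyword search always works; other capabilities need their step to succeed."""
    capabilities = ["keyword_search"]
    capabilities += [cap for cap, step in REQUIRES.items() if step not in failed_steps]
    return capabilities
```

For example, `available_capabilities({"embed"})` drops only semantic search, while a document whose three enrichment steps all failed still offers keyword search.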

Monitoring Ingestion

Admin UI

The Documents page shows ingestion status for each document:

- Gray badge: Pending (not yet started)
- Blue badge with spinner: Ingesting (workflow running)
- Green badge: Ready (ingestion complete)
- Red badge: Error (with details)

When any document is actively ingesting, the page auto-refreshes every 3 seconds. Click a document to see detailed progress: current stage, percentage, chunk count, and any errors. Hover over a red Error badge to see why that document failed (the parser error).

For bulk uploads, two aggregate views summarize progress without scrolling the table:

- Collection (Documents page): a strip under the collection header shows the subtree totals (N documents · R ready · I in progress · F failed), with a Retry F failed button when there are failures.
- Sync source (Documents → Sources → a source): an Overview panel shows a progress bar, counts, the recent drain rate and a naive ETA, and a table of failed documents with per-document reasons and a Retry action.

Both retry actions are bounded and safe to run on thousands of documents — see Retrying failed documents.

API

Poll the per-document status endpoint for one document:

```bash
GET /api/admin/documents/{document_id}/ingestion-status?tenant_id=your-tenant
```

Response:

```json
{
  "document_id": "a1b2c3d4-...",
  "ingestion_status": "ingesting",
  "stage": "embed",
  "progress_pct": 65,
  "chunk_count": 42,
  "revision_id": "e5f6g7h8-...",
  "error": null,
  "started_at": "2026-02-17T10:30:00Z",
  "completed_at": null
}
```

Stages progress through: validate → create_revision → parse_and_chunk → enrich → finalize → done.
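A client that waits for a document to finish could poll the status endpoint along these lines. This is a sketch under assumptions: the base URL is a placeholder, authentication is omitted because this page does not describe it, the helper names are hypothetical, and `ready` and `error` are treated as the terminal values of `ingestion_status`.

```python
import json
import time
import urllib.parse
import urllib.request


def status_url(base_url, document_id, tenant_id):
    """Build the per-document ingestion-status URL."""
    query = urllib.parse.urlencode({"tenant_id": tenant_id})
    return f"{base_url}/api/admin/documents/{document_id}/ingestion-status?{query}"


def fetch_status(base_url, document_id, tenant_id):
    """Perform one GET and decode the JSON body (no auth headers in this sketch)."""
    with urllib.request.urlopen(status_url(base_url, document_id, tenant_id)) as resp:
        return json.load(resp)


def wait_until_done(fetch, interval=3.0, timeout=600.0, sleep=time.sleep):
    """Call fetch() until the document is ready or errored, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch()
        if status["ingestion_status"] in ("ready", "error"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("ingestion did not finish before the timeout")
        sleep(interval)
```

Typical use would be `wait_until_done(lambda: fetch_status(base, doc_id, tenant))`. The 3-second default interval simply mirrors the admin UI's refresh cadence.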

Aggregate status (collection or sync source)

For bulk ingest, query the aggregate status of a whole collection (and its sub-collections) or a sync source:

```bash
GET /api/admin/collections/{collection_id}/ingestion-status
GET /api/admin/document-sync/sources/{source_id}/status
```

Both return the same shape — counts by status, a recent drain rate, a naive ETA, and the top grouped error reasons:

```json
{
  "collection_id": "a1b2c3d4-...",
  "total": 248,
  "counts": { "pending": 0, "ingesting": 4, "ready": 231, "error": 13, "other": 0 },
  "drain_rate_per_min": 2.1,
  "drain_window_minutes": 15,
  "completed_in_window": 32,
  "eta_seconds": 360,
  "top_errors": [
    { "reason": "docling parse timeout after 120s", "count": 9 },
    { "reason": "unsupported content type: application/x-empty", "count": 4 }
  ]
}
```

- The collection view is subtree-scoped by default (pass `?include_descendants=false` for direct children only).
- `?window_minutes=60` widens the drain-rate window (allowed range: 1–1440).
- `eta_seconds` is `null` when nothing is outstanding or the pipeline is idle.
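This page does not spell out how the drain rate and the naive ETA are computed. One plausible reading, shown here purely as an assumption, is that the rate is completions per minute over the window and the ETA is the outstanding work divided by that rate. Treating `pending` plus `ingesting` as "outstanding" is also an assumption, so the numbers in the sample response need not follow this exact formula.

```python
def drain_rate_per_min(completed_in_window, window_minutes):
    """Documents completed per minute over the observation window."""
    return completed_in_window / window_minutes


def naive_eta_seconds(counts, rate_per_min):
    """Seconds until the queue drains at the current rate, or None if undefined."""
    outstanding = counts.get("pending", 0) + counts.get("ingesting", 0)
    if outstanding == 0 or rate_per_min <= 0:
        return None  # nothing left to do, or the pipeline is idle
    return round(outstanding / rate_per_min * 60)
```

With 32 completions in a 15-minute window the rate rounds to 2.1 documents per minute, matching the `drain_rate_per_min` shown in the sample.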

List the failed documents (with each one's parser error) for either scope:

```bash
GET /api/admin/collections/{collection_id}/ingestion-errors
GET /api/admin/document-sync/sources/{source_id}/errors
```

Re-Ingestion

To re-process a single document (e.g., after changing chunking settings or to backfill embeddings):

```bash
POST /api/admin/documents/{document_id}/reingest?tenant_id=your-tenant
```

This creates a new revision — the previous revision and its chunks are preserved. The read_doc and search_docs tools always use the latest successful revision.

!!! warning "Concurrent re-ingestion"
    A document that is already being ingested cannot be re-ingested simultaneously. The API returns `409 Conflict` until the current workflow completes.

Retrying failed documents

To re-drive every failed document in a collection (and its sub-collections) or a sync source at once:

```bash
POST /api/admin/collections/{collection_id}/retry-ingestion?status=error
POST /api/admin/document-sync/sources/{source_id}/retry-ingestion?status=error
```

Response:

```json
{ "collection_id": "a1b2c3d4-...", "statuses": ["error"], "reset_count": 13 }
```

This is bounded, idempotent, and safe at scale: it flips the matching documents back to pending and lets the ingestion reconciler re-drive them through the per-tenant concurrency limit at a controlled rate — so retrying thousands of documents never stampedes the workers. It never disturbs documents that are currently ingesting or already ready. Pass status=pending to re-drive stuck-pending documents instead.

In the admin UI, the same action is the Retry N failed button on the collection summary strip and on the sync-source panel. From the CLI:

```bash
martha documents retry --collection my-collection --status error --yes
```
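The reset semantics of the bulk retry (flip matching documents back to pending, leave everything else alone) can be modelled in a few lines. This is an in-memory illustration with a hypothetical `retry_ingestion` function and document dictionaries, not the server code; the reconciler that later re-drives pending documents is out of scope.

```python
def retry_ingestion(documents, status="error"):
    """Reset documents in the given status to pending and report how many changed."""
    if status not in ("error", "pending"):
        raise ValueError("status must be 'error' or 'pending'")
    reset_count = 0
    for doc in documents:
        # Documents that are ingesting or ready never match, so they are untouched.
        if doc["ingestion_status"] == status:
            doc["ingestion_status"] = "pending"
            doc["error"] = None
            reset_count += 1
    return {"statuses": [status], "reset_count": reset_count}
```

Running the error retry twice in a row resets the failed documents once and then reports `reset_count: 0`, which is the idempotence property described above.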

Revisions

Each ingestion run creates an immutable revision:

- Stores the raw parsed text and structured metadata (pages, sections)
- Preserves a content hash for deduplication
- Old revisions and their chunks remain in the database
- `read_doc` always serves from the latest successful revision (no S3 download needed)
- Revision number auto-increments per document
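The revision rules listed above can be sketched with a small data model. The class names, the `succeeded` flag, and the choice of SHA-256 for the content hash are assumptions for illustration; the page only states that a content hash exists and that revision numbers auto-increment.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen: a revision cannot be modified once created
class Revision:
    number: int
    content_hash: str
    text: str
    succeeded: bool


@dataclass
class DocumentHistory:
    revisions: list = field(default_factory=list)

    def add(self, text, succeeded=True):
        """Append a new revision; older revisions are kept, never overwritten."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # assumed hash function
        rev = Revision(len(self.revisions) + 1, digest, text, succeeded)
        self.revisions.append(rev)
        return rev

    def latest_successful(self):
        """The revision a reader such as read_doc would serve from."""
        for rev in reversed(self.revisions):
            if rev.succeeded:
                return rev
        return None
```

If the most recent run failed, `latest_successful()` falls back to the previous good revision, and identical text always produces the same hash, which is what makes deduplication possible.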

Backpressure

To prevent a single tenant from overwhelming the ingestion pipeline, Martha enforces a per-tenant concurrency limit (default: 5 concurrent workflows).

When a tenant hits the limit, upload and re-ingest requests return 429 Too Many Requests. The quota is released automatically when a workflow completes (success or failure) or after a TTL safety valve expires (default: 1 hour).
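The quota behaviour described here (a per-tenant slot limit, release on completion, and a TTL safety valve) can be sketched as a small in-memory class. `TenantQuota` and its methods are hypothetical, and a real deployment would presumably keep this state in a shared store rather than in process memory.

```python
import time


class TenantQuota:
    """Tracks in-flight workflows per tenant with a limit and an expiry per slot."""

    def __init__(self, limit=5, ttl=3600.0, clock=time.monotonic):
        self.limit = limit
        self.ttl = ttl
        self.clock = clock
        self._slots = {}  # tenant -> {workflow_id: expiry time}

    def try_acquire(self, tenant, workflow_id):
        """Return True if a slot was taken; False means the caller should answer 429."""
        now = self.clock()
        slots = self._slots.setdefault(tenant, {})
        # Safety valve: drop slots whose TTL has passed without a release.
        for wid in [w for w, expiry in slots.items() if expiry <= now]:
            del slots[wid]
        if len(slots) >= self.limit:
            return False
        slots[workflow_id] = now + self.ttl
        return True

    def release(self, tenant, workflow_id):
        """Free the slot when the workflow ends, whether it succeeded or failed."""
        self._slots.get(tenant, {}).pop(workflow_id, None)
```

The TTL matters when a worker dies without calling `release`: the slot frees itself after the TTL instead of blocking the tenant indefinitely.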

Configuration

All ingestion settings can be overridden via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `INGESTION_CHUNK_SIZE` | `500` | Target tokens per chunk |
| `INGESTION_CHUNK_OVERLAP` | `50` | Token overlap between chunks (~10%) |
| `INGESTION_TOKENIZER` | `cl100k_base` | Tokenizer model for chunk sizing |
| `INGESTION_EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model (any LiteLLM-compatible) |
| `INGESTION_EMBEDDING_DIMS` | `1536` | Expected embedding dimensions |
| `INGESTION_EMBEDDING_BATCH` | `100` | Chunks per embedding API call |
| `INGESTION_EMBEDDING_RETRIES` | `3` | Max retries for embedding failures |
| `INGESTION_MAX_CONCURRENT` | `5` | Max concurrent workflows per tenant |
| `INGESTION_QUOTA_TTL` | `3600` | Quota safety valve TTL in seconds |
| `INGESTION_PARSE_TIMEOUT` | `600` | Parse activity timeout in seconds |
| `INGESTION_CHUNK_TIMEOUT` | `120` | Chunk activity timeout in seconds |
| `INGESTION_EMBED_TIMEOUT` | `300` | Embed activity timeout in seconds |
| `INGESTION_OCR_ENABLED` | `true` | Enable OCR for PDFs and images |
| `INGESTION_MAX_DOC_SIZE` | `52428800` | Maximum document size in bytes (50 MB) |
| `INGESTION_MAX_ACTIVITIES` | `2` | Max concurrent ingestion stages per worker |
| `INGESTION_VLM_ENABLED` | `false` | Enable vision-model descriptions for drawing pages |
| `INGESTION_VLM_MODEL` | `gemini/gemini-3-flash` | Vision model identifier (any LiteLLM-compatible) |
| `INGESTION_VLM_MAX_TOKENS` | `8192` | Max tokens for vision-model responses |
| `INGESTION_MAX_DRAWING_PAGES` | `100` | Max drawing pages to render per document |
| `INGESTION_VISION_RETRIEVAL_ENABLED` | `false` | Enable image-first visual retrieval |
| `INGESTION_PAGE_IMAGE_DPI` | `144` | Page rendering DPI (144 = 2x scale) |
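For tooling that needs to mirror these defaults, a settings loader might look like the sketch below. It covers only a handful of the variables, and the `IngestionSettings` class, the helper names, and the accepted boolean spellings are assumptions rather than Martha's actual parsing rules.

```python
import os
from dataclasses import dataclass


def _env_int(env, name, default):
    return int(env.get(name, default))


def _env_bool(env, name, default):
    raw = env.get(name)
    if raw is None:
        return default
    # Assumed spellings for "true"; the real parser may accept a different set.
    return raw.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class IngestionSettings:
    chunk_size: int
    chunk_overlap: int
    max_concurrent: int
    max_doc_size: int
    ocr_enabled: bool
    vlm_enabled: bool


def load_settings(env=None):
    """Read overrides from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return IngestionSettings(
        chunk_size=_env_int(env, "INGESTION_CHUNK_SIZE", 500),
        chunk_overlap=_env_int(env, "INGESTION_CHUNK_OVERLAP", 50),
        max_concurrent=_env_int(env, "INGESTION_MAX_CONCURRENT", 5),
        max_doc_size=_env_int(env, "INGESTION_MAX_DOC_SIZE", 52428800),
        ocr_enabled=_env_bool(env, "INGESTION_OCR_ENABLED", True),
        vlm_enabled=_env_bool(env, "INGESTION_VLM_ENABLED", False),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"INGESTION_CHUNK_SIZE": "800"})`.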
