Document ingestion
Upload a document and Martha turns it into something agents can search. PDF, markdown, HTML, image — all the same path: parse, chunk, embed, index. When the pipeline finishes, every document tool (keyword, semantic, visual, page-image) sees the result.
How it works
    Upload → Validate → Parse + chunk → Enrich (in parallel) → Finalize
             │          │               │                      │
             │          │               ├─ Vector embed        └─ Mark "ready"
             │          │               ├─ Vision-model describe
             │          │               └─ Visual index
             │          └─ Layout-aware parse + chunk + page classify
             └─ Size, type, and tenant checks

Each stage runs independently with its own retry and timeout policy. The three enrichment steps run in parallel and are non-fatal: failures degrade gracefully (e.g. lose semantic search but keep keyword search) rather than blocking the document from being usable.
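The parallel, non-fatal enrichment described above can be sketched with `asyncio`. This is only an illustration of the pattern, not Martha's workflow code: the three step functions, the `run_enrichment` helper, and the simulated vision-model failure are all hypothetical.

```python
import asyncio


async def vector_embed(doc_id: str) -> str:
    # Hypothetical stand-in for the embedding step.
    return f"embedded {doc_id}"


async def vlm_describe(doc_id: str) -> str:
    # Hypothetical stand-in for the vision-model step; simulates a failure.
    raise RuntimeError("vision model unavailable")


async def visual_index(doc_id: str) -> str:
    # Hypothetical stand-in for the visual indexing step.
    return f"indexed {doc_id}"


async def run_enrichment(doc_id: str, timeout_s: float = 5.0) -> dict:
    steps = {
        "vector_embed": vector_embed,
        "vlm_describe": vlm_describe,
        "visual_index": visual_index,
    }
    # return_exceptions=True collects failures instead of raising,
    # so one failing step does not cancel the other two.
    results = await asyncio.gather(
        *(asyncio.wait_for(step(doc_id), timeout=timeout_s) for step in steps.values()),
        return_exceptions=True,
    )
    failed = [name for name, result in zip(steps, results) if isinstance(result, Exception)]
    # The document is usable either way; failures are only recorded.
    return {"status": "ready", "failed_steps": failed}


outcome = asyncio.run(run_enrichment("doc-1"))
print(outcome)
```

Here the document still ends up "ready" even though one enrichment step raised, which is the behaviour the pipeline description calls non-fatal.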
Page rendering and visual processing
For PDFs, every page is rendered to PNG and uploaded to storage (up to MAX_RENDER_PAGES, default 300). This enables:
- The get_page_image and visual_search tools to return page images to agents.
- Visual indexing for image-first retrieval across all pages, which is useful for diagrams, schematics, and charts.
Pages are classified as text, drawing, or table based on layout. Drawing pages get extra treatment:
- Vision-model descriptions of the page content, stored as searchable description chunks. This means agents can find a diagram by what's in it, not just by surrounding text.
All enrichment is optional and non-fatal — if any of it fails, keyword search still works.
Supported Formats
| Format | Content Type | Notes |
|---|---|---|
| PDF | application/pdf | Full text extraction with optional OCR |
| DOCX | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word documents |
| PPTX | application/vnd.openxmlformats-officedocument.presentationml.presentation | PowerPoint |
| HTML | text/html | Web pages |
| Markdown | text/markdown | Markdown files |
| CSV | text/csv | Tabular data |
| Images | image/png, image/jpeg, image/tiff, image/webp, image/bmp | OCR text extraction |
!!! info "Maximum file size"
    The default maximum is 50 MB per document. This can be adjusted via the INGESTION_MAX_DOC_SIZE environment variable.
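As a rough sketch of the size and type checks in the validate stage, the snippet below tests an upload against the content types in the table and the 50 MB default. The `validate_upload` function and its return shape are hypothetical, and tenant checks are left out because the source does not describe them.

```python
import os
from typing import List, Optional

# Content types taken from the Supported Formats table.
SUPPORTED_CONTENT_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html",
    "text/markdown",
    "text/csv",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/webp",
    "image/bmp",
}

DEFAULT_MAX_DOC_SIZE = 50 * 1024 * 1024  # 52428800 bytes


def validate_upload(content_type: str, size_bytes: int,
                    max_size: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means the upload passes."""
    if max_size is None:
        max_size = int(os.environ.get("INGESTION_MAX_DOC_SIZE", DEFAULT_MAX_DOC_SIZE))
    problems = []
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in SUPPORTED_CONTENT_TYPES:
        problems.append(f"unsupported content type: {media_type}")
    if size_bytes > max_size:
        problems.append(f"document is {size_bytes} bytes, limit is {max_size}")
    return problems


print(validate_upload("application/zip", 60 * 1024 * 1024, max_size=DEFAULT_MAX_DOC_SIZE))
```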
Graceful Degradation
The enrich stage (embedding, VLM descriptions, visual indexing) is not required for a document to be usable. If any enrichment fails:
- The document is still marked as "ready"
- Chunks are stored with full text (keyword search works)
- Semantic search is unavailable if embeddings failed
- Drawing descriptions are unavailable if VLM failed
- Visual retrieval is unavailable if ColPali indexing failed
- All can be backfilled by re-ingesting
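The degradation rules above amount to a small lookup from failed enrichment step to lost capability. The sketch below makes that mapping explicit; the step and capability names are illustrative labels, not identifiers from Martha's code base.

```python
from typing import Iterable, Set

# Hypothetical labels: enrichment step -> capability it unlocks.
ENRICHMENT_TO_CAPABILITY = {
    "embedding": "semantic_search",
    "vlm_description": "drawing_descriptions",
    "visual_index": "visual_retrieval",
}


def available_capabilities(failed_steps: Iterable[str]) -> Set[str]:
    """Capabilities that remain for a 'ready' document after some steps failed."""
    failed = set(failed_steps)
    # Keyword search only needs the stored chunk text, so it is always present.
    capabilities = {"keyword_search"}
    for step, capability in ENRICHMENT_TO_CAPABILITY.items():
        if step not in failed:
            capabilities.add(capability)
    return capabilities


print(sorted(available_capabilities({"embedding"})))
```

A step that is disabled by configuration can be treated the same way as a failed one for this purpose.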
Monitoring Ingestion
Admin UI
The Documents page shows ingestion status for each document:
- Gray badge — Pending (not yet started)
- Blue badge with spinner — Ingesting (workflow running)
- Green badge — Ready (ingestion complete)
- Red badge — Error (with details)
When any document is actively ingesting, the page auto-refreshes every 3 seconds. Click on a document to see detailed progress: current stage, percentage, chunk count, and any errors.
API
Poll the ingestion status endpoint for programmatic monitoring:
GET /api/admin/documents/{document_id}/ingestion-status?tenant_id=your-tenant

Response:
{
"document_id": "a1b2c3d4-...",
"ingestion_status": "ingesting",
"stage": "embed",
"progress_pct": 65,
"chunk_count": 42,
"revision_id": "e5f6g7h8-...",
"error": null,
"started_at": "2026-02-17T10:30:00Z",
"completed_at": null
}

Stages progress through: validate → create_revision → parse_and_chunk → enrich → finalize → done.
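A polling loop against this endpoint might look like the sketch below, which uses only the Python standard library. Several things here are assumptions rather than documented behaviour: the terminal `ingestion_status` values ("ready" and "error" are inferred from the badge names), the base URL, the 3-second interval (borrowed from the Admin UI refresh rate), and the absence of authentication headers, which the source does not describe.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed terminal values for "ingestion_status"; only "ingesting" appears in the sample.
TERMINAL_STATUSES = {"ready", "error"}


def status_url(base_url: str, document_id: str, tenant_id: str) -> str:
    doc = urllib.parse.quote(document_id, safe="")
    query = urllib.parse.urlencode({"tenant_id": tenant_id})
    return f"{base_url.rstrip('/')}/api/admin/documents/{doc}/ingestion-status?{query}"


def fetch_status(url: str) -> dict:
    # Plain GET; add whatever auth headers your deployment requires.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def poll_until_done(url, fetch=fetch_status, interval_s=3.0,
                    max_wait_s=600.0, sleep=time.sleep) -> dict:
    """Poll until the status is terminal or max_wait_s has been slept through."""
    waited = 0.0
    while True:
        status = fetch(url)
        if status.get("ingestion_status") in TERMINAL_STATUSES:
            return status
        if waited >= max_wait_s:
            raise TimeoutError(f"ingestion not finished after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

A call would then be `poll_until_done(status_url("https://martha.example", document_id, "your-tenant"))`, where the host name is a placeholder. The `fetch` and `sleep` parameters exist so the loop can be exercised without a network.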
Re-Ingestion
To re-process a document (e.g., after changing chunking settings or to backfill embeddings):
POST /api/admin/documents/{document_id}/reingest?tenant_id=your-tenant

This creates a new revision; the previous revision and its chunks are preserved. The read_doc and search_docs tools always use the latest successful revision.
!!! warning "Concurrent re-ingestion"
    A document that is already being ingested cannot be re-ingested simultaneously. The API returns 409 Conflict until the current workflow completes.
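A client calling the re-ingest endpoint has to tell a started workflow apart from the 409 case. The sketch below builds the POST request with the standard library and maps status codes to outcomes. The outcome labels and helper names are hypothetical, treating any 2xx response as success is an assumption (the source does not state the success code), and authentication is omitted.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_reingest_request(base_url: str, document_id: str, tenant_id: str) -> urllib.request.Request:
    doc = urllib.parse.quote(document_id, safe="")
    query = urllib.parse.urlencode({"tenant_id": tenant_id})
    url = f"{base_url.rstrip('/')}/api/admin/documents/{doc}/reingest?{query}"
    return urllib.request.Request(url, method="POST")


def interpret_reingest_status(code: int) -> str:
    if 200 <= code < 300:
        return "started"            # assumed: any 2xx means a new revision was queued
    if code == 409:
        return "already_ingesting"  # wait for the current workflow, then retry
    if code == 429:
        return "tenant_at_limit"    # per-tenant concurrency limit reached
    return "failed"


def reingest(base_url: str, document_id: str, tenant_id: str) -> str:
    request = build_reingest_request(base_url, document_id, tenant_id)
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            return interpret_reingest_status(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx; the code still tells us what happened.
        return interpret_reingest_status(exc.code)
```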
Revisions
Each ingestion run creates an immutable revision:
- Stores the raw parsed text and structured metadata (pages, sections)
- Preserves a content hash for deduplication
- Old revisions and their chunks remain in the database
- read_doc always serves from the latest successful revision (no S3 download needed)
- Revision number auto-increments per document
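The revision rules can be modelled in a few lines. The following is a toy in-memory model, not Martha's schema: the `Revision` fields, the helper names, and the choice of SHA-256 for the content hash are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a revision cannot be modified after creation
class Revision:
    document_id: str
    number: int
    content_hash: str
    succeeded: bool


def content_hash(raw_text: str) -> str:
    # SHA-256 is an assumption; the source only says "a content hash".
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()


def next_revision(history: List[Revision], document_id: str,
                  raw_text: str, succeeded: bool) -> Revision:
    # Revision numbers auto-increment per document, starting at 1.
    number = max((r.number for r in history), default=0) + 1
    return Revision(document_id, number, content_hash(raw_text), succeeded)


def latest_successful(history: List[Revision]) -> Optional[Revision]:
    ok = [r for r in history if r.succeeded]
    return max(ok, key=lambda r: r.number) if ok else None


history: List[Revision] = []
history.append(next_revision(history, "doc-1", "first parse", succeeded=True))
history.append(next_revision(history, "doc-1", "second parse", succeeded=True))
history.append(next_revision(history, "doc-1", "third parse", succeeded=False))
print(latest_successful(history).number)
```

Because the third run failed, reads keep coming from revision 2 while all three revisions stay in the history.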
Backpressure
To prevent a single tenant from overwhelming the ingestion pipeline, Martha enforces a per-tenant concurrency limit (default: 5 concurrent workflows).
When a tenant hits the limit, upload and re-ingest requests return 429 Too Many Requests. The quota is released automatically when a workflow completes (success or failure) or after a TTL safety valve expires (default: 1 hour).
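The quota behaviour described here (a per-tenant limit, release on completion, and a TTL that reclaims slots that were never released) can be sketched as a small in-memory class. This is a single-process illustration under assumed names; a real deployment would presumably keep this state in a shared store, and the `TenantQuota` class is not part of Martha's API.

```python
import time
from typing import Callable, Dict


class TenantQuota:
    def __init__(self, max_concurrent: int = 5, ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max = max_concurrent
        self._ttl = ttl_s
        self._clock = clock
        # tenant_id -> {workflow_id: time the slot was acquired}
        self._slots: Dict[str, Dict[str, float]] = {}

    def try_acquire(self, tenant_id: str, workflow_id: str) -> bool:
        now = self._clock()
        slots = self._slots.setdefault(tenant_id, {})
        # TTL safety valve: reclaim slots whose workflow never released them.
        for wf, acquired_at in list(slots.items()):
            if now - acquired_at >= self._ttl:
                del slots[wf]
        if len(slots) >= self._max:
            return False  # the API layer would answer 429 Too Many Requests
        slots[workflow_id] = now
        return True

    def release(self, tenant_id: str, workflow_id: str) -> None:
        # Called when a workflow completes, whether it succeeded or failed.
        self._slots.get(tenant_id, {}).pop(workflow_id, None)
```

The `clock` parameter is injectable so the TTL path can be exercised without waiting an hour.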
Configuration
All ingestion settings can be overridden via environment variables:
| Variable | Default | Description |
|---|---|---|
| INGESTION_CHUNK_SIZE | 500 | Target tokens per chunk |
| INGESTION_CHUNK_OVERLAP | 50 | Token overlap between chunks (~10%) |
| INGESTION_TOKENIZER | cl100k_base | Tokenizer model for chunk sizing |
| INGESTION_EMBEDDING_MODEL | text-embedding-3-small | Embedding model (any LiteLLM-compatible) |
| INGESTION_EMBEDDING_DIMS | 1536 | Expected embedding dimensions |
| INGESTION_EMBEDDING_BATCH | 100 | Chunks per embedding API call |
| INGESTION_EMBEDDING_RETRIES | 3 | Max retries for embedding failures |
| INGESTION_MAX_CONCURRENT | 5 | Max concurrent workflows per tenant |
| INGESTION_QUOTA_TTL | 3600 | Quota safety valve TTL in seconds |
| INGESTION_PARSE_TIMEOUT | 600 | Parse activity timeout in seconds |
| INGESTION_CHUNK_TIMEOUT | 120 | Chunk activity timeout in seconds |
| INGESTION_EMBED_TIMEOUT | 300 | Embed activity timeout in seconds |
| INGESTION_OCR_ENABLED | true | Enable OCR for PDFs and images |
| INGESTION_MAX_DOC_SIZE | 52428800 | Maximum document size in bytes (50 MB) |
| INGESTION_MAX_ACTIVITIES | 2 | Max concurrent ingestion stages per worker |
| INGESTION_VLM_ENABLED | false | Enable vision-model descriptions for drawing pages |
| INGESTION_VLM_MODEL | gemini/gemini-3-flash | Vision model identifier (any LiteLLM-compatible) |
| INGESTION_VLM_MAX_TOKENS | 8192 | Max tokens for vision-model responses |
| INGESTION_MAX_DRAWING_PAGES | 100 | Max drawing pages to render per document |
| INGESTION_VISION_RETRIEVAL_ENABLED | false | Enable image-first visual retrieval |
| INGESTION_PAGE_IMAGE_DPI | 144 | Page rendering DPI (144 = 2x scale) |
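To make the chunk size and overlap settings concrete, the sketch below reads them from the environment with the documented defaults and computes fixed-size token windows. It shows only the overlap arithmetic: Martha's actual chunker is layout-aware, and both helper functions here are hypothetical.

```python
import os
from typing import List, Tuple


def read_int_setting(name: str, default: int) -> int:
    raw = os.environ.get(name)
    return int(raw) if raw else default


def chunk_windows(total_tokens: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Half-open (start, end) token ranges where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    start = 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        windows.append((start, end))
        if end == total_tokens:
            break
        start += step
    return windows


chunk_size = read_int_setting("INGESTION_CHUNK_SIZE", 500)
overlap = read_int_setting("INGESTION_CHUNK_OVERLAP", 50)
print(chunk_windows(1200, chunk_size, overlap))
```

With the defaults, each chunk starts 450 tokens after the previous one, so a 1200-token document yields three windows and the last one is shorter than the target size.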