R2 Folder Sync
Automatically ingest documents uploaded to Cloudflare R2 buckets. Users upload structured folder hierarchies via any S3-compatible tool (rclone, AWS CLI, R2 console), and Martha detects changes, maps folders to collections, and triggers the ingestion pipeline.
How It Works
S3 client uploads to R2
-> R2 event notification -> Cloudflare Queue
-> Martha polls queue (every 5s)
-> Creates collection + document records
-> Triggers DocumentIngestionWorkflow
-> Document is parsed, chunked, embedded

A daily reconciliation workflow catches any events missed due to downtime.
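As an illustration of the step between polling the queue and creating records, the sketch below normalizes one notification payload into a small event object. The field names (`action`, `bucket`, `object.key`, `object.eTag`) and the action names are assumptions based on Cloudflare's documented R2 notification shape; `R2Event` and `parse_r2_event` are hypothetical names, not Martha's actual code.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Assumed action names for R2 event notifications; verify against Cloudflare's docs.
CREATE_ACTIONS = {"PutObject", "CopyObject", "CompleteMultipartUpload"}
DELETE_ACTIONS = {"DeleteObject", "LifecycleDeletion"}


@dataclass(frozen=True)
class R2Event:
    kind: str              # "create" or "delete"
    bucket: str
    key: str
    etag: Optional[str]    # may be absent on delete events


def parse_r2_event(body: Mapping[str, Any]) -> Optional[R2Event]:
    """Turn an already-decoded notification body into an R2Event.

    Returns None for payloads without an object key or with an action
    this sketch does not recognize.
    """
    obj = body.get("object") or {}
    key = obj.get("key")
    if not key:
        return None

    action = body.get("action")
    if action in CREATE_ACTIONS:
        kind = "create"
    elif action in DELETE_ACTIONS:
        kind = "delete"
    else:
        return None

    return R2Event(kind=kind, bucket=body.get("bucket", ""), key=key, etag=obj.get("eTag"))
```

Returning `None` for unrecognized payloads lets a poller acknowledge and drop them instead of retrying forever; whether Martha does this is not stated in this document.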
R2 Key Convention
Files must follow this path structure:
{tenant_id}/{collection_slug}/{filename}

| Segment | Maps to | Example |
|---|---|---|
| tenant_id | Data isolation boundary | acme-corp |
| collection_slug | DocumentCollection (auto-created if needed) | site-surveys |
| filename | Document record | report.pdf |
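The mapping can be sketched as a small parser. This is a minimal illustration, assuming keys are split on the first two slashes so that any extra path segments stay in the filename; `ParsedKey` and `parse_r2_key` are hypothetical names.

```python
from typing import NamedTuple


class ParsedKey(NamedTuple):
    tenant_id: str
    collection_slug: str
    filename: str


def parse_r2_key(key: str) -> ParsedKey:
    """Split an R2 object key into tenant, collection, and filename.

    Only the first two slashes are treated as separators, so nested
    paths are kept inside the filename.
    """
    parts = key.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"key does not match {{tenant_id}}/{{collection_slug}}/{{filename}}: {key!r}")
    return ParsedKey(*parts)


flat = parse_r2_key("acme-corp/site-surveys/report.pdf")
nested = parse_r2_key("acme-corp/site-surveys/photos/tower-north.jpg")
```

Rejecting keys with fewer than three segments (or an empty segment, such as a trailing-slash folder marker) is an assumption of this sketch; the document does not say how such keys are handled.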
Deeper paths are flattened into the filename:
acme-corp/site-surveys/photos/tower-north.jpg
-> tenant: acme-corp
-> collection: site-surveys
-> filename: photos/tower-north.jpg

Configuration
| Setting | Env Var | Default | Description |
|---|---|---|---|
| Enable sync | R2_SYNC_ENABLED | false | Master toggle |
| Account ID | CF_ACCOUNT_ID | required | Cloudflare account ID |
| Queue ID | CF_QUEUE_ID | required | Cloudflare Queue ID |
| API Token | CF_API_TOKEN | required | Token with queues_read + queues_write |
| Poll interval | R2_SYNC_POLL_INTERVAL_SECONDS | 5 | Seconds between queue polls |
| Batch size | R2_SYNC_BATCH_SIZE | 100 | Messages per poll (max 100) |
| Auto-create collections | R2_SYNC_AUTO_CREATE_COLLECTIONS | true | Create collections from folder names |
| Reconciliation cron | R2_SYNC_RECONCILIATION_CRON | 0 3 * * * | Daily reconciliation schedule |
| Visibility timeout | R2_SYNC_VISIBILITY_TIMEOUT_MS | 300000 | Message lock duration (5 min) |
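To show how these settings might fit together, the sketch below loads them from an environment mapping and builds the HTTP request for one queue poll. The endpoint path and body fields follow Cloudflare's documented Queues pull API as best understood here and should be verified; `R2SyncConfig`, `load_config`, and `build_pull_request` are hypothetical names, and the request is only constructed, not sent.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Tuple

REQUIRED = ("CF_ACCOUNT_ID", "CF_QUEUE_ID", "CF_API_TOKEN")


@dataclass(frozen=True)
class R2SyncConfig:
    account_id: str
    queue_id: str
    api_token: str
    enabled: bool = False
    poll_interval_seconds: int = 5
    batch_size: int = 100
    visibility_timeout_ms: int = 300_000


def load_config(env: Mapping[str, str]) -> R2SyncConfig:
    """Read the settings from the table above, applying the listed defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required settings: {', '.join(missing)}")
    return R2SyncConfig(
        account_id=env["CF_ACCOUNT_ID"],
        queue_id=env["CF_QUEUE_ID"],
        api_token=env["CF_API_TOKEN"],
        enabled=env.get("R2_SYNC_ENABLED", "false").strip().lower() == "true",
        poll_interval_seconds=int(env.get("R2_SYNC_POLL_INTERVAL_SECONDS", "5")),
        # The table caps the batch size at 100 messages per poll.
        batch_size=min(int(env.get("R2_SYNC_BATCH_SIZE", "100")), 100),
        visibility_timeout_ms=int(env.get("R2_SYNC_VISIBILITY_TIMEOUT_MS", "300000")),
    )


def build_pull_request(cfg: R2SyncConfig) -> Tuple[str, Dict[str, str], Dict[str, int]]:
    """Return (url, headers, json_body) for one pull; endpoint shape is an assumption."""
    url = (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{cfg.account_id}/queues/{cfg.queue_id}/messages/pull"
    )
    headers = {
        "Authorization": f"Bearer {cfg.api_token}",
        "Content-Type": "application/json",
    }
    body = {
        "batch_size": cfg.batch_size,
        "visibility_timeout_ms": cfg.visibility_timeout_ms,
    }
    return url, headers, body
```

A poller would send this request every `poll_interval_seconds` and acknowledge each message before its visibility timeout expires, otherwise the message becomes visible again and is redelivered.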
Cloudflare Setup (One-Time)
# Create the queue
wrangler queues create martha-r2-events
# Enable HTTP pull consumer
wrangler queues consumer http add martha-r2-events
# Add notification rules on your R2 bucket
wrangler r2 bucket notification create <bucket-name> \
--queue=martha-r2-events --event-type=object-create
wrangler r2 bucket notification create <bucket-name> \
--queue=martha-r2-events --event-type=object-delete

Upload Methods
| Method | Use Case |
|---|---|
| AWS CLI / rclone / s3cmd | Bulk uploads from ops teams |
| R2 console (dashboard) | Ad-hoc uploads |
| Presigned URLs from Martha API | Web/CLI uploads (future) |
| Super Slurper | Initial migration from S3/MinIO |
!!! warning "wrangler uploads don't trigger notifications"
    wrangler r2 object put uses an internal API that does not fire event notifications. Always use S3-compatible tools (AWS CLI, rclone, boto3) for uploads that should trigger sync.
Admin API
Trigger Reconciliation
POST /api/admin/r2-sync/reconcile
Content-Type: application/json
Authorization: Bearer <jwt>
{
"tenant_id": "acme-corp"
}

Starts R2ReconciliationWorkflow, which compares R2 objects against DB records for the given tenant and processes missed creates, changed files, and orphaned records.
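The comparison at the heart of reconciliation can be sketched as a set difference over object keys plus an eTag check. This is an illustration only, assuming both sides can be reduced to a key-to-eTag mapping; `ReconcilePlan` and `plan_reconciliation` are hypothetical names, not the workflow's real interface.

```python
from typing import Dict, List, NamedTuple


class ReconcilePlan(NamedTuple):
    missed_creates: List[str]   # in R2, no DB record
    changed: List[str]          # in both, eTags differ
    orphaned: List[str]         # DB record, no longer in R2


def plan_reconciliation(r2_objects: Dict[str, str], db_records: Dict[str, str]) -> ReconcilePlan:
    """Compare key -> eTag mappings from R2 and from the database."""
    r2_keys = set(r2_objects)
    db_keys = set(db_records)
    return ReconcilePlan(
        missed_creates=sorted(r2_keys - db_keys),
        changed=sorted(k for k in r2_keys & db_keys if r2_objects[k] != db_records[k]),
        orphaned=sorted(db_keys - r2_keys),
    )


plan = plan_reconciliation(
    r2_objects={"acme-corp/site-surveys/a.pdf": "e1", "acme-corp/site-surveys/b.pdf": "e2-new"},
    db_records={"acme-corp/site-surveys/b.pdf": "e2", "acme-corp/site-surveys/c.pdf": "e3"},
)
```

Here `a.pdf` is a missed create, `b.pdf` is a changed file, and `c.pdf` is an orphaned record.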
Change Detection
- New files: eTag recorded, ingestion triggered
- Changed files: different eTag detected, re-ingestion triggered
- Deleted files: document soft-deleted (is_active=false), chunks preserved for citation stability
- Duplicate notifications: same eTag = skip (idempotent)
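The four rules above can be expressed as a single decision function. This is a minimal sketch, not Martha's implementation: `DocRecord`, `Action`, and `decide` are hypothetical names, and the handling of a re-upload for a previously soft-deleted document (re-ingest) is an assumption the document does not cover.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    INGEST = "ingest"
    REINGEST = "reingest"
    SOFT_DELETE = "soft_delete"
    SKIP = "skip"


@dataclass
class DocRecord:
    etag: str
    is_active: bool = True


def decide(kind: str, etag: Optional[str], record: Optional[DocRecord]) -> Action:
    """Map one event ("create" or "delete") plus the stored record to an action."""
    if kind == "delete":
        # Nothing to do if the document is unknown or already soft-deleted.
        if record is None or not record.is_active:
            return Action.SKIP
        return Action.SOFT_DELETE

    if record is None:
        return Action.INGEST          # new file: record eTag, ingest
    if record.is_active and record.etag == etag:
        return Action.SKIP            # duplicate notification: idempotent
    # Different eTag, or (assumption) a re-upload of a soft-deleted document.
    return Action.REINGEST
```

Because the decision depends only on the event and the stored record, replaying the same notification yields `SKIP`, which is what makes at-least-once queue delivery safe.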
Backpressure
Large batch uploads (e.g., 1,000 files via rclone) are naturally throttled by the existing per-tenant ingestion quota (MAX_CONCURRENT_PER_TENANT=5). Documents beyond the quota wait in Temporal and are processed in order.
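The effect of the per-tenant cap can be illustrated with a small asyncio sketch. Martha enforces the quota through Temporal, so this is only an analogy for the behavior, not the actual mechanism; `TenantLimiter` and `fake_ingest` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

MAX_CONCURRENT_PER_TENANT = 5


class TenantLimiter:
    """Allow at most `limit` concurrent jobs per tenant; extra jobs wait their turn."""

    def __init__(self, limit: int = MAX_CONCURRENT_PER_TENANT) -> None:
        self._limit = limit
        self._semaphores: Dict[str, asyncio.Semaphore] = {}

    def _semaphore(self, tenant_id: str) -> asyncio.Semaphore:
        # One semaphore per tenant, created on first use.
        if tenant_id not in self._semaphores:
            self._semaphores[tenant_id] = asyncio.Semaphore(self._limit)
        return self._semaphores[tenant_id]

    async def run(self, tenant_id: str, job: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        async with self._semaphore(tenant_id):
            return await job(*args)


async def fake_ingest(doc_id: int) -> int:
    # Stand-in for the real ingestion work.
    await asyncio.sleep(0.001)
    return doc_id


async def ingest_batch(tenant_id: str, count: int) -> list:
    limiter = TenantLimiter()
    jobs = (limiter.run(tenant_id, fake_ingest, i) for i in range(count))
    return await asyncio.gather(*jobs)
```

Calling `asyncio.run(ingest_batch("acme-corp", 1000))` would accept all 1,000 jobs at once but run no more than five at a time for that tenant, while other tenants get their own independent limit.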