# Rynko Extract — AI Data Extraction Reference for AI/LLM > **Purpose**: Authoritative reference for AI/LLM systems using the Rynko Extract API to extract structured data from unstructured documents. > **Last Updated**: March 25, 2026 | **Version**: 1.0 --- ## QUICK START — READ THIS FIRST ### What is Rynko Extract? Rynko Extract is a schema-driven AI data extraction engine. You provide one or more files (PDF, images, Excel, CSV, JSON, XML, text) plus a JSON Schema defining the desired output structure. Extract returns structured JSON matching the schema, with per-field confidence scores and source attribution. **Core concept**: Define a target schema → Upload files → Get structured data back with confidence scores → Optionally validate via Flow → Optionally generate documents via Render. ### Pipeline Position Extract is the first stage of the Rynko document intelligence pipeline: ``` Extract (unstructured → structured) → Flow (validate + approve) → Render (generate PDF/Excel) ``` Each product works independently or together. You can use Extract without Flow or Render. ### MCP Tool Orchestration **Extracting data from files:** 1. `list_workspaces` — Discover available workspaces 2. `extract_data` — Upload file(s) + schema, get structured JSON 3. `get_extraction_result` — Check extraction job status and results **Extracting and validating with a Flow gate:** 1. `list_flow_gates` — Find the target gate 2. `extract_with_gate` — Extract using a gate's schema + auto-validate 3. `get_flow_run_status` — Check validation result **Checking usage:** 1. `get_extract_usage` — Check remaining extractions for the team ### Intent-to-Action Mapping | User Says | Action | |-----------|--------| | "Extract data from this PDF" | `extract_data` with file + schema | | "Extract invoice details from these files" | `extract_data` with multiple files + invoice schema | | "What fields are in this document?" 
| `extract_data` with discovery schema (or use gate bootstrapping in the webapp) | | "Extract and validate this file against my gate" | `extract_with_gate` with gate ID + file | | "Check my extraction result" | `get_extraction_result` with job ID | | "How many extractions do I have left?" | `get_extract_usage` | | "List my recent extractions" | `list_extraction_jobs` | --- ## SUPPORTED FILE TYPES | Format | Extension | MIME Type | Method | Max Size | |--------|-----------|-----------|--------|----------| | PDF (digital) | .pdf | `application/pdf` | Native document understanding | 32 MB | | PDF (scanned) | .pdf | `application/pdf` | Vision/OCR via AI provider | 32 MB | | PNG | .png | `image/png` | Vision | 20 MB | | JPEG | .jpg, .jpeg | `image/jpeg` | Vision | 20 MB | | WebP | .webp | `image/webp` | Vision | 20 MB | | Excel | .xlsx | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | Convert to text, then extract | 10 MB | | CSV | .csv | `text/csv` | Direct text extraction | 10 MB | | JSON | .json | `application/json` | Direct parsing | 10 MB | | XML | .xml | `application/xml` | Direct parsing | 10 MB | | Plain text | .txt | `text/plain` | Direct text extraction | 1 MB | ### Excel Multi-Sheet Handling Excel files receive special preprocessing: - All formulas are resolved to computed values (cross-sheet lookups like VLOOKUP are pre-resolved) - Sheet structure is preserved as named sections in the text representation - Users can select which sheets to include (optional `sheets` parameter) - Pivot tables are read as cached values; charts/images are skipped - Hidden sheets are included by default (often contain lookup data) --- ## API ENDPOINTS ### Authentication Same auth patterns as Flow and Render: | Auth Type | Header | Use Case | |-----------|--------|----------| | JWT | `Authorization: Bearer <jwt>` | Webapp dashboard users | | API Key | `Authorization: Bearer <api_key>` | External API clients | | PAT | `Authorization: Bearer pat_xxx` | MCP/CLI tools | ### 
Extraction Jobs ``` POST /api/extract/jobs # Create extraction job GET /api/extract/jobs/:jobId # Get job status + result GET /api/extract/jobs # List recent jobs (paginated) DELETE /api/extract/jobs/:jobId # Cancel a queued/processing job ``` ### Extract Configs (Reusable Extraction Schemas) ``` POST /api/extract/configs # Create extract config (draft) GET /api/extract/configs # List extract configs GET /api/extract/configs/:configId # Get an extract config PATCH /api/extract/configs/:configId # Update extract config DELETE /api/extract/configs/:configId # Soft-delete extract config POST /api/extract/configs/:configId/publish # Publish config version POST /api/extract/configs/:configId/rollback # Rollback to previous version ``` ### Discovery (Schema Bootstrapping) ``` POST /api/extract/discover # Upload sample file(s) → AI suggests schema ``` ### Usage ``` GET /api/extract/usage # Get extraction usage for current team ``` ### Flow Gate Integration ``` POST /api/flow/gates/:gateId/extract # Extract + validate in one call POST /api/flow/gates/:gateId/file-run # Submit file(s) as a gate run ``` --- ## CREATE EXTRACTION JOB ### Request ``` POST /api/extract/jobs Authorization: Bearer Content-Type: multipart/form-data ``` Form fields: | Field | Type | Required | Description | |-------|------|----------|-------------| | `schema` | string (JSON) | Yes* | Target JSON Schema defining desired output. *Required if no `configId` or `gateId`. 
| | `configId` | string | No | Reference a saved extract config instead of inline schema | | `gateId` | string | No | Pull schema from a Flow gate | | `instructions` | string | No | Extraction hints (e.g., "Focus on line items table") | | `conflictResolution` | string | No | `flag_conflicts` (default), `prefer_first_file`, `prefer_highest_confidence` | | `provider` | string | No | Force a specific AI provider (`anthropic`, `google`, `openai`, `openrouter`) | | `files[]` | File | Yes | 1-10 files (multipart upload) | ### Response ```json { "id": "ejob_a1b2c3d4", "status": "QUEUED", "createdAt": "2026-03-20T10:30:00Z", "fileCount": 3, "schemaSource": "gate:fgate_x1y2z3", "estimatedDurationMs": 5000 } ``` ### Schema Source Priority When multiple schema sources are provided: 1. `gateId` takes precedence (fetches the gate's published schema) 2. `configId` is used if no `gateId` 3. Inline `schema` is used if neither `gateId` nor `configId` --- ## GET JOB STATUS & RESULT ``` GET /api/extract/jobs/:jobId Authorization: Bearer ``` ### While Processing ```json { "id": "ejob_a1b2c3d4", "status": "PROCESSING", "progress": { "filesProcessed": 1, "filesTotal": 3, "currentFile": "item_list.xlsx" }, "createdAt": "2026-03-20T10:30:00Z" } ``` ### When Completed ```json { "id": "ejob_a1b2c3d4", "status": "COMPLETED", "createdAt": "2026-03-20T10:30:00Z", "completedAt": "2026-03-20T10:30:07Z", "durationMs": 7200, "result": { "data": { "exporter": { "name": "Acme Corp", "country": "IN" }, "line_items": [ { "description": "Ceramic mugs", "hs_code": "6912.00", "quantity": 500 } ], "total_value": 15230.00 }, "confidence": { "exporter.name": { "level": "HIGH", "score": 0.98, "method": "explicit", "sourceFile": "email_body.txt" }, "line_items[0].hs_code": { "level": "LOW", "score": 0.45, "method": "suggested", "sourceFile": "purchase_order.pdf" }, "importer.tax_id": { "level": "NULL", "score": 0, "method": "not_found", "sourceFile": null } }, "missingFields": ["importer.tax_id", 
"route.vessel_name"], "conflicts": [ { "fieldPath": "total_value", "values": [ { "file": "purchase_order.pdf", "value": 15230.00, "confidence": { "level": "HIGH", "score": 0.95 } }, { "file": "item_list.xlsx", "value": 15330.00, "confidence": { "level": "HIGH", "score": 0.92 } } ], "resolvedValue": null, "resolvedBy": null } ], "fileSummaries": [ { "filename": "purchase_order.pdf", "fieldsExtracted": 18, "durationMs": 3200 }, { "filename": "item_list.xlsx", "fieldsExtracted": 12, "durationMs": 2100 }, { "filename": "email_body.txt", "fieldsExtracted": 5, "durationMs": 1900 } ], "notes": [ { "type": "info", "message": "HS code 6912.00 suggested from product description 'ceramic mugs'" } ] }, "metadata": { "provider": "anthropic", "model": "claude-sonnet-4-6", "totalTokens": 4500, "totalCostEstimate": 0.018 } } ``` --- ## JOB STATUSES ### Standard Statuses | Status | Description | |--------|-------------| | `QUEUED` | Job created, waiting in BullMQ queue | | `PROCESSING` | AI provider is actively extracting data from files | | `COMPLETED` | Extraction finished successfully | | `FAILED` | Extraction failed (provider error, timeout, invalid file) | | `CANCELLED` | Job cancelled by user before completion | ### Pipeline Statuses (Gate Integration) When an extraction job is part of a Flow gate pipeline (submitted via `/api/flow/gates/:gateId/extract` or `/api/flow/gates/:gateId/file-run`), the Flow run tracks additional pipeline stages: | Status | Description | |--------|-------------| | `extracting` | Files are being processed by the AI extraction provider | | `extract_review` | Extraction completed but has LOW confidence fields or conflicts requiring human review | | `validating` | Extracted data is being validated against the Flow gate schema and business rules | | `validated` | Extraction + validation both passed | | `validation_failed` | Extracted data failed gate validation | | `review_required` | Gate approval mode requires human review | | `approved` | Human 
reviewer approved the run | | `rejected` | Human reviewer rejected the run | | `delivered` | Results delivered via webhook | --- ## EXTRACT CONFIG LIFECYCLE Extract Configs follow the same draft/publish/version lifecycle as Flow Gates: ### States | State | Description | |-------|-------------| | **Draft** | Config has unpublished changes. Can be edited freely. | | **Published** | Config has a published version. Extraction jobs use the published version. | | **Versioned** | Multiple published versions exist. Can rollback to any previous version. | ### Create Config (Draft) ``` POST /api/extract/configs Authorization: Bearer Content-Type: application/json ``` ```json { "name": "Trade Document Schema", "description": "Extracts shipping and customs data from trade documents", "schema": { "type": "object", "properties": { "exporter": { "type": "object", "required": true, "properties": { "name": { "type": "string", "required": true }, "country": { "type": "string", "required": true }, "tax_id": { "type": "string" } } }, "line_items": { "type": "array", "itemType": "object", "schema": { "type": "object", "properties": { "description": { "type": "string", "required": true }, "hs_code": { "type": "string", "pattern": "^[0-9]{4}\\.[0-9]{2}$" }, "quantity": { "type": "number", "min": 1 }, "unit_price": { "type": "number", "min": 0 } } } }, "total_value": { "type": "number", "required": true, "min": 0 } } }, "instructions": "Focus on commercial invoice fields. For HS codes, use 6-digit format." 
} ``` ### Publish Config ``` POST /api/extract/configs/:configId/publish Authorization: Bearer Content-Type: application/json ``` ```json { "versionName": "v1.0 — initial release", "changeNotes": "First version of trade document extraction schema" } ``` ### Rollback Version ``` POST /api/extract/configs/:configId/rollback Authorization: Bearer Content-Type: application/json ``` ```json { "targetVersion": 1 } ``` --- ## CONFIDENCE SCORING Every extracted field includes a confidence assessment: ### Confidence Levels | Level | Score Range | Badge Color | Meaning | |-------|------------|-------------|---------| | `HIGH` | 0.8 - 1.0 | Green | Value explicitly stated in document | | `MEDIUM` | 0.5 - 0.79 | Amber | Value inferred from context | | `LOW` | 0.01 - 0.49 | Red | Best guess, needs human review | | `NULL` | 0 | Gray | Field not found in any file | ### Extraction Methods | Method | Description | |--------|-------------| | `explicit` | Value directly stated in the document text | | `inferred` | Value derived from surrounding context | | `suggested` | AI's best guess based on domain knowledge | | `not_found` | Field could not be located in any file | ### Confidence Structure ```json { "level": "HIGH", "score": 0.95, "method": "explicit", "reason": "Invoice number clearly printed in header", "sourceFile": "invoice.pdf" } ``` For multi-file extractions, confidence includes source attribution: ```json { "level": "HIGH", "score": 0.95, "method": "explicit", "sourceFile": "invoice.pdf", "corroboratedBy": ["email_body.txt"] } ``` --- ## MULTI-FILE EXTRACTION ### How It Works 1. Each file is extracted **independently** by the AI provider (runs in parallel) 2. Results are merged by the merge utility 3. For each schema field, the highest-confidence value across files is selected 4. Conflicts are detected when two files provide the same field with different values ### Merge Algorithm ``` For each field in the schema: 1. Collect all extractions of this field across files 2. 
Filter out NULL confidence entries 3. If 0 results → field goes to missingFields 4. If 1 result → use it (no conflict possible) 5. If 2+ results with same value → use it, mark others as corroborating 6. If 2+ results with different values → apply conflict resolution strategy ``` ### Conflict Resolution Strategies | Strategy | Behavior | |----------|----------| | `flag_conflicts` | Add conflicting fields to `conflicts[]` array. Resolved value is `null`. User must resolve manually. This is the default. | | `prefer_first_file` | Use the value from the first file in the upload order. | | `prefer_highest_confidence` | Use the value with the highest confidence score. | ### File Limits by Tier | Tier | Max Files per Job | Max Total Size | |------|-------------------|----------------| | Free (beta) | 3 | 10 MB | | Lite Pack | 3 | 10 MB | | Standard Pack | 5 | 25 MB | | Pro Pack | 10 | 50 MB | | Enterprise Pack | 10 | 100 MB | --- ## FLOW GATE INTEGRATION (Stage 0) ### Enable/Disable Extract on a Gate A Flow gate can have Extract enabled, linking it to an Extract Config. When enabled, the gate accepts file submissions in addition to JSON payloads. ``` PATCH /api/flow/gates/:gateId Authorization: Bearer Content-Type: application/json ``` ```json { "extractEnabled": true, "extractId": "extr_a1b2c3d4" } ``` To disable: ```json { "extractEnabled": false, "extractId": null } ``` ### File Run (Submit Files to a Gate) When Extract is enabled on a gate, you can submit files directly. The system extracts data from the files, then validates the extracted data against the gate's schema and business rules. 
``` POST /api/flow/gates/:gateId/file-run Authorization: Bearer Content-Type: multipart/form-data files[]: File (1-10 files) instructions: string (optional) metadata: string (JSON, optional) ``` Response: ```json { "extraction": { "jobId": "ejob_a1b2c3d4", "data": { "exporter": { "name": "Acme Corp", "country": "IN" }, "line_items": [...], "total_value": 15230.00 }, "confidence": { "exporter.name": { "level": "HIGH", "score": 0.98 }, "total_value": { "level": "HIGH", "score": 0.95 } }, "conflicts": [] }, "validation": { "runId": "frun_x1y2z3", "status": "validated", "validationId": "hmac_abc...", "layers": { "schema": "pass", "business_rules": "pass" } } } ``` ### Extract + Validate Combo Endpoint ``` POST /api/flow/gates/:gateId/extract Authorization: Bearer Content-Type: multipart/form-data files[]: File (1-10 files) instructions: string (optional) auto_validate: boolean (default: true) ``` When `auto_validate` is `true`, the extracted data is automatically submitted as a Flow run for validation. When `false`, only extraction is performed and the data is returned without validation. ### Review Flow Modes When extraction produces LOW confidence fields or unresolved conflicts, the system behavior depends on the gate's approval mode: | Mode | Behavior | |------|----------| | `continue` | Proceed with validation regardless of confidence. Low confidence fields are logged but do not block. | | `review` | Route to human review when LOW confidence fields or unresolved conflicts exist. Reviewer sees full extraction context. | | `fail` | Reject the run immediately if any field has LOW or NULL confidence. 
| ### Extraction Metadata on Flow Runs When a Flow run is created via extraction, the run's metadata includes full extraction context: ```json { "extraction": { "jobId": "ejob_a1b2c3d4", "provider": "anthropic", "model": "claude-sonnet-4-6", "filesProcessed": 3, "files": [ { "filename": "invoice.pdf", "label": "Commercial Invoice", "fieldsExtracted": 18 }, { "filename": "items.xlsx", "label": "Item List", "fieldsExtracted": 12 }, { "filename": "email.txt", "label": "Shipping Instructions", "fieldsExtracted": 5 } ], "confidence": { "exporter.name": { "level": "HIGH", "sourceFile": "email.txt" }, "line_items[0].hs_code": { "level": "LOW", "sourceFile": "invoice.pdf" } }, "conflicts": [], "durationMs": 7200 } } ``` --- ## SCHEMA DISCOVERY (Gate Bootstrapping) ### The Cold-Start Problem Users cannot do high-accuracy extraction without a schema, and cannot create a schema without knowing what is in their files. Discovery solves this. ### Two-Pass Solution ``` Phase 0: Discovery — Upload sample file → AI suggests draft schema + rules Phase 1: Hardening — User reviews, renames fields, adjusts rules, publishes Phase 2: Production — Files arrive → Extract uses the schema → Flow validates ``` ### Discovery Endpoint ``` POST /api/extract/discover Authorization: Bearer Content-Type: multipart/form-data files[]: File (1-3 sample files) instructions: string (optional hints about what to look for) ``` Response: ```json { "schema": { "type": "object", "properties": { "invoice_number": { "type": "string", "required": true, "pattern": "^INV-[0-9]+$" }, "vendor_name": { "type": "string", "required": true }, "amount": { "type": "number", "required": true, "min": 0 }, "line_items": { "type": "array", "itemType": "object", "schema": { "type": "object", "properties": { "description": { "type": "string", "required": true }, "quantity": { "type": "number", "min": 1 }, "unit_price": { "type": "number", "min": 0 } } } } } }, "businessRules": [ { "name": "Total matches line items", 
"expression": "amount == sum(line_items, 'quantity' * 'unit_price')", "errorMessage": "Total does not match sum of line items" } ], "warnings": [ { "field": "vendor_tax_id", "message": "Found in document but format unclear — verify type" } ] } ``` ### Reference Extraction — Zero-Cost Schema Iteration The first discovery pass does a full document analysis. The raw extraction output is persisted as a "reference extraction" and reused for all subsequent schema iterations: | Action | AI Call? | Credits | Time | |--------|----------|---------|------| | Bootstrap from sample file | Yes (1 call) | 1 | 3-8s | | Rename a schema field | No | 0 | <500ms | | Add/remove/change a field | No | 0 | <500ms | | Add/edit a business rule and test it | No | 0 | <500ms | | Preview with test data | No | 0 | <500ms | | Upload a new test file (explicit) | Yes | 1 | 3-8s | Total cost for full schema setup: **1 credit**, regardless of how many schema iterations. ### Files That Need AI vs Don't | File Type | Needs AI? | Reason | |-----------|-----------|--------| | PDF | Yes | Unstructured — AI must understand layout | | Images | Yes | Requires vision model | | Excel (.xlsx) | Yes | Column headers need interpretation | | CSV | Partial | Headers are explicit, AI helps with type inference | | JSON | No | Structure already defined — schema inferred directly | | XML | No | Structure already defined — schema inferred directly | | Plain text | Yes | No structure — AI must identify entities | --- ## SDK METHODS ### Node.js (`@rynko/sdk`) ```javascript const rynko = new RynkoClient({ apiKey: 'YOUR_API_KEY' }); // Create extraction job const job = await rynko.extract.create({ schema: { type: 'object', properties: { /* ... 
*/ } }, files: [ { filename: 'invoice.pdf', content: pdfBuffer }, { filename: 'items.xlsx', content: xlsxBuffer }, ], instructions: 'Focus on line items and totals', conflictResolution: 'flag_conflicts', }); // Returns: { id: 'ejob_...', status: 'QUEUED' } // Get job status const status = await rynko.extract.get(job.id); // Returns: { id, status, progress?, result?, metadata? } // Wait for completion (polls automatically) const result = await rynko.extract.waitForCompletion(job.id, { pollingIntervalMs: 1000, // default: 1000 timeoutMs: 120000, // default: 120000 }); // Returns: completed job with result // List recent jobs const jobs = await rynko.extract.list({ page: 1, limit: 20 }); // Discover schema from sample files const discovery = await rynko.extract.discover({ files: [{ filename: 'sample.pdf', content: sampleBuffer }], instructions: 'This is a commercial invoice', }); // Returns: { schema, businessRules, warnings } // Submit file run to a Flow gate (extract + validate) const fileRun = await rynko.extract.submitFileRun({ gateId: 'fgate_x1y2z3', files: [{ filename: 'invoice.pdf', content: pdfBuffer }], }); // Returns: { extraction: {...}, validation: {...} } ``` ### Python (`rynko`) ```python from rynko import RynkoClient client = RynkoClient(api_key="YOUR_API_KEY") # Create extraction job job = client.extract.create( schema={"type": "object", "properties": {... 
}}, files=[ {"filename": "invoice.pdf", "content": pdf_bytes}, {"filename": "items.xlsx", "content": xlsx_bytes}, ], instructions="Focus on line items and totals", conflict_resolution="flag_conflicts", ) # Returns: ExtractJob(id='ejob_...', status='QUEUED') # Get job status status = client.extract.get(job.id) # Wait for completion result = client.extract.wait_for_completion( job.id, polling_interval_ms=1000, timeout_ms=120000, ) # List recent jobs jobs = client.extract.list(page=1, limit=20) # Discover schema from sample files discovery = client.extract.discover( files=[{"filename": "sample.pdf", "content": sample_bytes}], instructions="This is a commercial invoice", ) # Submit file run to a Flow gate file_run = client.extract.submit_file_run( gate_id="fgate_x1y2z3", files=[{"filename": "invoice.pdf", "content": pdf_bytes}], ) # Async client also available from rynko import AsyncRynkoClient async_client = AsyncRynkoClient(api_key="YOUR_API_KEY") result = await async_client.extract.wait_for_completion(job.id) ``` ### Java (`dev.rynko:sdk`) ```java RynkoClient client = new RynkoClient("YOUR_API_KEY"); // Create extraction job ExtractJob job = client.extract().create( ExtractJobRequest.builder() .schema(schema) .addFile("invoice.pdf", pdfBytes) .addFile("items.xlsx", xlsxBytes) .instructions("Focus on line items and totals") .conflictResolution(ConflictResolution.FLAG_CONFLICTS) .build() ); // Get job status ExtractJob status = client.extract().get(job.getId()); // Wait for completion ExtractJob result = client.extract().waitForCompletion( job.getId(), WaitOptions.builder() .pollingIntervalMs(1000) .timeoutMs(120000) .build() ); // List recent jobs PaginatedResponse<ExtractJob> jobs = client.extract().list(1, 20); // Discover schema from sample files DiscoveryResult discovery = client.extract().discover( DiscoverRequest.builder() .addFile("sample.pdf", sampleBytes) .instructions("This is a commercial invoice") .build() ); // Submit file run to a Flow gate FileRunResult fileRun = 
client.extract().submitFileRun( "fgate_x1y2z3", FileRunRequest.builder() .addFile("invoice.pdf", pdfBytes) .build() ); ``` --- ## MCP TOOLS ### extract_data Upload file(s) and a target schema to extract structured JSON data. **Parameters:** | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `workspace_id` | string | Yes | Workspace context | | `schema` | object | Yes* | Target JSON Schema (*required if no `config_id` or `gate_id`) | | `config_id` | string | No | Use a saved extract config | | `gate_id` | string | No | Use a Flow gate's schema | | `files` | array | Yes | Array of `{ filename, content_base64 }` objects | | `instructions` | string | No | Extraction hints | | `conflict_resolution` | string | No | `flag_conflicts`, `prefer_first_file`, or `prefer_highest_confidence` | | `wait` | boolean | No | If true, polls until completion (default: true) | ### extract_with_gate Extract data from files using a Flow gate's schema and auto-validate the result. **Parameters:** | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `workspace_id` | string | Yes | Workspace context | | `gate_id` | string | Yes | Flow gate to extract and validate against | | `files` | array | Yes | Array of `{ filename, content_base64 }` objects | | `instructions` | string | No | Extraction hints | | `auto_validate` | boolean | No | Submit extracted data as a Flow run (default: true) | ### get_extraction_result Get the status and result of an extraction job. **Parameters:** | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `workspace_id` | string | Yes | Workspace context | | `job_id` | string | Yes | Extraction job ID (ejob_...) | ### list_extraction_jobs List recent extraction jobs for the workspace. 
**Parameters:** | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `workspace_id` | string | Yes | Workspace context | | `limit` | number | No | Number of results (default: 20, max: 100) | | `status` | string | No | Filter by status | ### get_extract_usage Get extraction usage statistics for the current team. **Parameters:** | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `workspace_id` | string | Yes | Workspace context | --- ## BILLING & PRICING ### Beta (Current) During the founders preview beta, each team receives **100 free extraction jobs**. This is a one-time pool (not per month). When exhausted, extractions are disabled until Extract packs launch post-beta. Beta restrictions: - Default provider only (Anthropic/Claude) - No provider selection - No Extract packs or credit packs ### Extract Packs (Post-Beta, Recurring Monthly Add-On) Available on all tiers including Free. Same model as Render Packs — subscribe to a pack, get a monthly extraction allowance that resets each billing cycle. Unused extractions do not roll over. 
| Pack | Extractions/Month | Monthly Price | Max Files/Job | Max File Size | |------|-------------------|---------------|---------------|---------------| | Lite | 100 | +$9/mo | 3 | 10 MB | | Standard | 500 | +$29/mo | 5 | 25 MB | | Pro | 2,000 | +$79/mo | 10 | 50 MB | | Enterprise | 10,000 | +$199/mo | 10 | 100 MB | ### Extract Credit Packs (One-Time, Overflow) When the monthly allowance is exhausted, purchase one-time credit packs (non-recurring, no expiry): | Pack | Extractions | Price | Per Extraction | |------|------------|-------|----------------| | Small | 50 | $10 | $0.20 | | Medium | 200 | $30 | $0.15 | | Large | 500 | $60 | $0.12 | --- ## ERROR CODES | Code | HTTP | Description | |------|------|-------------| | `ERR_EXTRACT_001` | 400 | Invalid schema — must be a valid JSON Schema | | `ERR_EXTRACT_002` | 400 | No files provided | | `ERR_EXTRACT_003` | 400 | File too large (exceeds tier limit) | | `ERR_EXTRACT_004` | 400 | Too many files (exceeds tier limit) | | `ERR_EXTRACT_005` | 400 | Unsupported file type | | `ERR_EXTRACT_006` | 400 | Schema too deep (>10 levels) or too wide (>200 properties) | | `ERR_EXTRACT_007` | 404 | Extraction job not found | | `ERR_EXTRACT_008` | 404 | Saved schema / extract config not found | | `ERR_EXTRACT_009` | 404 | Flow gate not found (when using gateId) | | `ERR_EXTRACT_010` | 429 | Extraction quota exceeded | | `ERR_EXTRACT_011` | 429 | Rate limit exceeded | | `ERR_EXTRACT_012` | 500 | Provider error — AI API call failed | | `ERR_EXTRACT_013` | 500 | Provider returned invalid/unparseable response | | `ERR_EXTRACT_014` | 503 | No extraction provider configured | | `ERR_EXTRACT_015` | 503 | Requested provider not available | | `ERR_EXTRACT_016` | 400 | File validation failed (corrupt, encrypted, or empty file) | | `ERR_EXTRACT_017` | 400 | Extract config has no published version | ### Error Response Format ```json { "success": false, "error": { "code": "ERR_EXTRACT_003", "message": "File 'large_scan.pdf' exceeds the 
maximum file size of 10 MB for your current plan", "details": { "filename": "large_scan.pdf", "fileSize": 15728640, "maxSize": 10485760 } } } ``` ### Rate Limit Response ```json { "success": false, "error": { "code": "ERR_EXTRACT_011", "message": "Extraction rate limit exceeded" }, "retryAfter": 30 } ``` --- ## DATA RETENTION | Data | Retention | Reason | |------|-----------|--------| | ExtractJob record (metadata only) | 30 days active, then archived | Job history for user reference | | ExtractJob.result (extracted JSON) | 5 days | User can review/download; auto-purged | | Uploaded files | In-memory only | Never persisted to disk or object storage | | Extract Configs | Until deleted by user | User-controlled persistence | | ExtractUsageRecord | Account lifetime | Billing and quota tracking | | Reference extractions (discovery) | Until gate/config is deleted | Stored in S3/R2 for zero-cost schema iteration | --- ## SECURITY & PRIVACY - Uploaded files are processed **in-memory only** — never written to disk or object storage - Files are sent to the AI provider via API, then immediately discarded - After job completion, file content is removed from the BullMQ job payload (only the result remains) - AI provider API keys are stored as server-side environment variables, never exposed to clients - User-provided schemas are validated for structure and size (max depth: 10 levels, max properties: 200) - Per-team rate limiting and BullMQ concurrency limits prevent resource exhaustion --- ## COMPLETE EXAMPLE — INVOICE EXTRACTION ### Step 1: Create an extraction job ```bash curl -X POST https://api.rynko.dev/api/extract/jobs \ -H "Authorization: Bearer YOUR_API_KEY" \ -F 
'schema={"type":"object","properties":{"vendor_name":{"type":"string","required":true},"invoice_number":{"type":"string","required":true},"date":{"type":"date","required":true},"line_items":{"type":"array","itemType":"object","schema":{"type":"object","properties":{"description":{"type":"string","required":true},"quantity":{"type":"number","min":1},"unit_price":{"type":"number","min":0},"amount":{"type":"number","min":0}}}},"subtotal":{"type":"number","min":0},"tax":{"type":"number","min":0},"total":{"type":"number","required":true,"min":0}}}' \ -F 'instructions=Extract all line items. Tax is calculated at the document level.' \ -F 'files[]=@invoice.pdf' ``` ### Step 2: Poll for result ```bash curl https://api.rynko.dev/api/extract/jobs/ejob_a1b2c3d4 \ -H "Authorization: Bearer YOUR_API_KEY" ``` ### Step 3: Use the result The extracted `data` object matches your schema: ```json { "vendor_name": "Office Supplies Inc.", "invoice_number": "INV-2026-0042", "date": "2026-03-15", "line_items": [ { "description": "A4 Paper (5 reams)", "quantity": 5, "unit_price": 8.99, "amount": 44.95 }, { "description": "Black Ink Cartridge", "quantity": 2, "unit_price": 24.50, "amount": 49.00 } ], "subtotal": 93.95, "tax": 7.52, "total": 101.47 } ``` --- ## TIPS FOR LLMs USING EXTRACT 1. **Start with a clear schema.** The more specific your schema (types, constraints, descriptions), the better the extraction quality. Field descriptions help the AI understand what to look for. 2. **Use instructions for ambiguous documents.** If the document has multiple tables or sections, tell the AI which one to focus on (e.g., "Extract from the 'Order Details' table, not the 'Shipping' section"). 3. **Label multi-file uploads.** When uploading multiple files, use the `label` field to help the merge algorithm understand each file's role (e.g., "Commercial Invoice", "Packing List", "Bill of Lading"). 4. 
**Choose the right conflict resolution.** Use `flag_conflicts` when accuracy matters and you can review conflicts. Use `prefer_highest_confidence` for automated pipelines where human review is not available. 5. **Check confidence before trusting values.** Fields with LOW or NULL confidence should be verified. Use the `sourceFile` attribution to trace where values came from. 6. **Use discovery for new document types.** If you do not know the structure of a document, use the `/discover` endpoint to get a suggested schema, then refine it. 7. **Combine with Flow for validation.** After extraction, submit the data to a Flow gate to validate business rules (e.g., "total must equal sum of line items"). This catches extraction errors. 8. **Poll efficiently.** Use `waitForCompletion` in the SDKs instead of manual polling. For large files, expect 3-8 seconds per file. 9. **Respect file limits.** Check your tier's file count and size limits before uploading. ERR_EXTRACT_003 and ERR_EXTRACT_004 indicate tier limit violations. 10. **Use saved configs for repeated extractions.** If you extract from the same document type repeatedly, save the schema as an Extract Config. This ensures consistency and enables version management. --- ## LINKS - Website: https://rynko.dev/extract - API Reference: https://docs.rynko.dev/developer-guide/extract-api-reference - Quickstart: https://docs.rynko.dev/getting-started/extract-quickstart - Flow Integration Guide: https://docs.rynko.dev/developer-guide/extract-flow-integration - Node.js SDK: https://www.npmjs.com/package/@rynko/sdk - Python SDK: https://pypi.org/project/rynko/ - Java SDK: https://central.sonatype.com/artifact/dev.rynko/sdk - Full Platform Reference: https://rynko.dev/llms.txt - Flow Reference: https://rynko.dev/llms-flow.txt