DAG Orchestration¶
DAGs (Directed Acyclic Graphs) in Archivus aren't "document workflows"—they're intelligence pipelines that orchestrate complex multi-step verification and enrichment processes.
The Concept¶
Traditional workflow: Task A → Task B → Task C (fixed sequence)
DAG Intelligence Pipeline: Complex graph where each node can be:
- AI processing (analysis, summarization, extraction)
- Human approval (gates for critical decisions)
- External tools (MCP integrations)
- Control flow (branching, merging, delays)
- Actions (notifications, data transforms, exports)
Architecture¶
graph LR
START([Start]) --> INGEST[Ingest Document]
INGEST --> EXTRACT[Extract Entities]
EXTRACT --> VERIFY[Verify Claims]
VERIFY --> ENRICH[Enrich Context]
ENRICH --> ANCHOR[Anchor to Hedera]
ANCHOR --> END([End])
VERIFY -.requires approval.-> APPROVAL{Human Review}
APPROVAL -.approved.-> ENRICH
APPROVAL -.rejected.-> REJECT[Mark Invalid]
Node Types (24 Production-Ready)¶
Control Flow Nodes¶
- Start: Entry point for the pipeline
- End: Terminal node
- Condition: Branch based on data or confidence scores
- Merge: Combine multiple paths
- Delay: Time-based pause (schedule dependencies)
AI Processing Nodes¶
Real LLM integration with multi-provider routing:
- AI Analyze: Comprehensive analysis with themes, sentiment, insights
- AI Summarize: Structured summaries with configurable style
- AI Classify: Document type classification with confidence scores
- AI Extract: Entity extraction (dates, amounts, names, addresses, emails)
- AI Tag: Auto-tag generation with confidence and categories
- AI Compare: Multi-document comparison with similarity scores
Provider Routing:
- Claude for complex reasoning tasks
- Gemini for high-volume bulk processing (99% cost savings compared with Claude)
- OpenAI for embeddings and structured extraction
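The routing rules above can be pictured as a simple lookup from task kind to provider. The following is a minimal sketch, not Archivus's actual API: the task labels, `ROUTING_TABLE`, `route_provider`, and the choice of default are all hypothetical names introduced for illustration.

```python
from enum import Enum


class Provider(str, Enum):
    CLAUDE = "claude"
    GEMINI = "gemini"
    OPENAI = "openai"


# Hypothetical task labels mapped to the providers named in the routing list.
ROUTING_TABLE = {
    "complex_reasoning": Provider.CLAUDE,
    "bulk_processing": Provider.GEMINI,
    "embeddings": Provider.OPENAI,
    "structured_extraction": Provider.OPENAI,
}


def route_provider(task_kind: str, default: Provider = Provider.CLAUDE) -> Provider:
    """Return the provider for a task kind, falling back to a default.

    The fallback to Claude is an assumption made for this sketch.
    """
    return ROUTING_TABLE.get(task_kind, default)
```

A table-driven router keeps the policy in one place, so changing which provider handles a task kind is a data change rather than a code change.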
Transform Nodes¶
- Transform: Data manipulation and field mapping
- Filter: Conditional data filtering
- Aggregate: Combine multiple inputs
- Split: Divide data for parallel processing
Action Nodes¶
- Notify: Email, WebSocket, Slack notifications
- Webhook: Call external APIs with retry logic
- Set Field: Update document metadata
- Tag: Apply tags to documents
- Move: Relocate documents to folders
- Archive: Archive to long-term storage
- Export: Export to external systems
Human-in-the-Loop Nodes¶
Critical for Team+ tier:
- Approval: Request approval with escalation
  - 24-hour timeout → escalate to manager
  - 48-hour timeout → auto-reject
  - WebSocket notifications for real-time alerts
- Review: Assign document review tasks
  - Assignee notification
  - Status tracking
  - Comment threads
- Assign: Delegate tasks with deadlines
  - Priority levels
  - Follow-up reminders
  - Completion tracking
Execution Model¶
Dependency Resolution¶
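Parallel Execution: Nodes are ordered by topological sort. Each node starts only after all of its upstream dependencies have completed, and nodes that do not depend on each other run concurrently.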
Example:
      ┌─ Extract Entities ─┐
Start ┤                    ├─ Merge → Verify → End
      └─ Extract Metadata ─┘
Extract Entities and Extract Metadata run in parallel.
Merge waits for both to complete.
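One way to derive this schedule is Kahn's algorithm, grouping nodes into levels whose members have no unmet dependencies. The sketch below is an illustration under that assumption, not the Archivus scheduler itself; `execution_levels` and the edge-list representation are hypothetical.

```python
from collections import defaultdict


def execution_levels(edges: list[tuple[str, str]]) -> list[list[str]]:
    """Group nodes into levels using Kahn's algorithm.

    Nodes in the same level do not depend on each other and could run
    concurrently; each level starts after the previous one finishes.
    """
    indegree: dict[str, int] = defaultdict(int)
    children: dict[str, list[str]] = defaultdict(list)
    nodes: set[str] = set()
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    # Start with nodes that have no upstream dependencies.
    ready = sorted(n for n in nodes if indegree[n] == 0)
    levels: list[list[str]] = []
    seen = 0
    while ready:
        levels.append(ready)
        seen += len(ready)
        next_ready = []
        for node in ready:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all dependencies satisfied
                    next_ready.append(child)
        ready = sorted(next_ready)
    if seen != len(nodes):
        raise ValueError("graph contains a cycle")
    return levels


edges = [
    ("Start", "Extract Entities"),
    ("Start", "Extract Metadata"),
    ("Extract Entities", "Merge"),
    ("Extract Metadata", "Merge"),
    ("Merge", "Verify"),
    ("Verify", "End"),
]
levels = execution_levels(edges)
```

For the example graph, the two extraction nodes share the second level and Merge sits alone in the third. Level-by-level scheduling is a simplification: a scheduler could also start each node the moment its own dependencies finish, without waiting for the rest of the level.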
Shared Execution Context¶
All node outputs flow into a shared context (JSONB):
{
"document_id": "...",
"entities": ["John Smith", "Acme Corp"],
"classification": {
"type": "INVOICE",
"confidence": 0.92
},
"metadata": {
"total_amount": 1500.00,
"vendor": "Acme Corp"
}
}
Downstream nodes reference upstream outputs by reading keys from this shared context.
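As an illustration of how a downstream node might read an upstream value, the sketch below resolves a dotted path against the context shown above. The dotted-path syntax and the `resolve` helper are assumptions made for this example; the source does not specify the actual reference syntax.

```python
from typing import Any


def resolve(context: dict[str, Any], path: str) -> Any:
    """Look up a dotted path such as 'classification.confidence'.

    Hypothetical helper: the real reference syntax may differ.
    """
    value: Any = context
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"missing context key: {path}")
        value = value[key]
    return value


# The shared context from the example above.
context = {
    "document_id": "...",
    "entities": ["John Smith", "Acme Corp"],
    "classification": {"type": "INVOICE", "confidence": 0.92},
    "metadata": {"total_amount": 1500.00, "vendor": "Acme Corp"},
}

confidence = resolve(context, "classification.confidence")
vendor = resolve(context, "metadata.vendor")
```

Raising on a missing key, rather than returning a default, surfaces wiring mistakes early, for example a Condition node that reads a field no upstream node produces.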
Failure Handling¶
Fail-fast behavior: By default, an error in one node propagates to all pending nodes and stops the pipeline. Nodes marked optional are the exception, as described under Graceful Degradation.
Retry Logic:
- Configurable retry count (default: 3)
- Exponential backoff (1s, 2s, 4s, 8s...)
- Error categorization (transient vs permanent)
Graceful Degradation:
- Optional nodes can fail without stopping the pipeline
- Error outputs are captured in the context
- Notification on critical failures
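The retry rules can be sketched as follows. This is a minimal illustration assuming that only transient errors are retried and that permanent errors surface immediately; `TransientError`, `PermanentError`, `backoff_delays`, and `run_with_retry` are hypothetical names, not part of any Archivus API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. timeouts)."""


class PermanentError(Exception):
    """Hypothetical marker for errors that retrying cannot fix."""


def backoff_delays(max_retries: int = 3, base_seconds: float = 1.0) -> list[float]:
    """Delays before each retry, doubling each time: 1s, 2s, 4s, ..."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


def run_with_retry(
    task: Callable[[], T],
    max_retries: int = 3,
    base_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying transient failures with exponential backoff.

    Any other exception, including PermanentError, propagates at once.
    """
    delays = backoff_delays(max_retries, base_seconds)
    for attempt in range(max_retries + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets tests substitute a recorder for `time.sleep`, so the backoff schedule can be checked without waiting.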
White-Label AI Integration¶
DAGs support BYOB (Bring Your Own Brain) for Enterprise tier:
Per-Tenant AI Provider:
- Platform default (Archivus-managed)
- Anthropic Claude (customer API key)
- OpenAI GPT (customer API key)
- Ollama (self-hosted for air-gapped)
- Custom providers via MCP
Benefits:
- Data sovereignty (models can run on customer infrastructure, for example self-hosted Ollama)
- Cost control (customer billing)
- Compliance (regulated industries)
Configuration:
{
"ai_provider": "anthropic",
"api_key": "<encrypted>",
"model": "claude-3-5-sonnet-20241022",
"custom_prompt_prefix": "[Company Policy Context]"
}
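A sketch of how such a configuration might be consumed, assuming a tenant without BYOB settings falls back to the platform default and that `custom_prompt_prefix` is prepended to each node's prompt. Both behaviors are assumptions; `ProviderConfig`, `resolve_provider`, `build_prompt`, and the `"platform"` identifier are hypothetical. The API key is left out on purpose, since secrets would normally be handled separately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderConfig:
    ai_provider: str
    model: Optional[str] = None
    custom_prompt_prefix: str = ""


# Assumption: "platform" stands for the Archivus-managed default.
PLATFORM_DEFAULT = ProviderConfig(ai_provider="platform")


def resolve_provider(tenant_config: Optional[dict]) -> ProviderConfig:
    """Use the tenant's BYOB settings if present, else the platform default."""
    if not tenant_config or "ai_provider" not in tenant_config:
        return PLATFORM_DEFAULT
    return ProviderConfig(
        ai_provider=tenant_config["ai_provider"],
        model=tenant_config.get("model"),
        custom_prompt_prefix=tenant_config.get("custom_prompt_prefix", ""),
    )


def build_prompt(config: ProviderConfig, task_prompt: str) -> str:
    """Prepend the tenant's prefix, if any, to a node's prompt."""
    if not config.custom_prompt_prefix:
        return task_prompt
    return f"{config.custom_prompt_prefix}\n\n{task_prompt}"
```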
Human-in-the-Loop Workflow¶
Approval Node Example¶
Document Classification (AI Confidence: 0.75)
↓
Approval Request Created
↓
WebSocket Notification → User's browser
↓
User Reviews:
- View document
- See AI reasoning
- Check confidence score
↓
User Decides:
├─ Approve → Pipeline continues
├─ Reject → Pipeline terminates
└─ Timeout (24h) → Escalate to manager
   └─ Timeout (48h) → Auto-reject
Escalation Logic¶
Level 1: Assignee (24-hour timeout)
↓ timeout
Level 2: Manager (24-hour timeout)
↓ timeout
Level 3: Auto-reject + notify admins
All escalations are logged for compliance audit.
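The ladder above reduces to a function of the time elapsed since the approval request was created. A minimal sketch, assuming each boundary is inclusive on the later level (exactly 24 hours already counts as escalated); `escalation_state` and its return labels are hypothetical.

```python
from datetime import timedelta

# Each level waits 24 hours before handing off to the next one.
LEVEL_TIMEOUT = timedelta(hours=24)


def escalation_state(elapsed: timedelta) -> str:
    """Map time since the approval request to the escalation ladder."""
    if elapsed < LEVEL_TIMEOUT:
        return "assignee"      # Level 1
    if elapsed < 2 * LEVEL_TIMEOUT:
        return "manager"       # Level 2
    return "auto_reject"       # Level 3: auto-reject and notify admins
```

Because the state is derived from elapsed time alone, a periodic job can recompute it for every open request without storing per-level timers.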
Real-World Pipeline: Contract Intelligence¶
graph TB
START([Contract Upload]) --> EXTRACT[AI Extract Entities]
EXTRACT --> CLASSIFY[AI Classify Document Type]
CLASSIFY --> CHECK{Confidence >= 0.9?}
CHECK -->|Yes| ENRICH[Enrich with Wikidata]
CHECK -->|No| APPROVAL[Request Human Review]
APPROVAL -->|Approved| ENRICH
APPROVAL -->|Rejected| REJECT[Mark Invalid]
ENRICH --> VERIFY[Verify Claims]
VERIFY --> CONTRA[Detect Contradictions]
CONTRA --> HASCONTRA{Has Contradictions?}
HASCONTRA -->|Yes| NOTIFY[Notify Legal Team]
HASCONTRA -->|No| ROUTE[Route to Folder]
NOTIFY --> ROUTE
ROUTE --> ANCHOR[Anchor to Hedera]
ANCHOR --> END([Complete])
REJECT --> END
Nodes:
1. AI Extract Entities (Gemini - cost-effective bulk processing)
2. AI Classify (Claude - high-accuracy reasoning)
3. Condition (check confidence threshold)
4. Approval (human-in-the-loop gate)
5. Enrich (external Wikidata lookup)
6. Verify Claims (GOLAG agent verification)
7. Detect Contradictions (symbolic graph query)
8. Condition (check contradiction status)
9. Notify (WebSocket + email alert)
10. Route (move to appropriate folder)
11. Anchor (Hedera consensus anchoring)
Duration: ~2-5 minutes (with parallel execution)
Cost: ~0.15 AI credits (Gemini bulk + Claude reasoning)
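To make the branching concrete, the sketch below traces which nodes of the Contract Intelligence graph a document visits for a given confidence score, approval decision, and contradiction result. It uses the node IDs from the diagram; `trace_path` is a hypothetical helper for illustration and does not execute any node.

```python
def trace_path(
    confidence: float,
    approved: bool,
    has_contradictions: bool,
    threshold: float = 0.9,
) -> list[str]:
    """Return the node IDs visited, following the diagram's two Condition nodes."""
    path = ["START", "EXTRACT", "CLASSIFY", "CHECK"]
    if confidence < threshold:
        # Low confidence: a human must review before enrichment.
        path.append("APPROVAL")
        if not approved:
            return path + ["REJECT", "END"]
    path += ["ENRICH", "VERIFY", "CONTRA", "HASCONTRA"]
    if has_contradictions:
        path.append("NOTIFY")  # legal team is alerted, then routing continues
    return path + ["ROUTE", "ANCHOR", "END"]
```

A document classified at 0.75 confidence takes the APPROVAL branch, while one at 0.95 skips it and goes straight to ENRICH.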
Statistics Tracking¶
Every DAG execution records:
- Success rate per node type
- Average duration per node
- Error frequency and categorization
- Approval wait times
- Escalation rates
Dashboard Metrics:
DAG: Contract Intelligence
├─ Executions: 1,247
├─ Success Rate: 94.3%
├─ Avg Duration: 3m 22s
├─ Approval Rate: 12.4%
├─ Escalation Rate: 2.1%
└─ Cost per Execution: 0.14 credits
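Metrics like these can be derived from per-execution records. The sketch below assumes that Approval Rate and Escalation Rate mean the fraction of executions that needed approval or were escalated; the source does not define them. The `Execution` record and `summarize` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Execution:
    succeeded: bool
    duration_seconds: float
    needed_approval: bool
    escalated: bool
    credits: float


def summarize(executions: list[Execution]) -> dict[str, float]:
    """Aggregate per-execution records into dashboard-style metrics."""
    n = len(executions)
    if n == 0:
        raise ValueError("no executions to summarize")
    return {
        "executions": n,
        "success_rate": sum(e.succeeded for e in executions) / n,
        "avg_duration_seconds": sum(e.duration_seconds for e in executions) / n,
        "approval_rate": sum(e.needed_approval for e in executions) / n,
        "escalation_rate": sum(e.escalated for e in executions) / n,
        "cost_per_execution": sum(e.credits for e in executions) / n,
    }
```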
Tier Gating¶
| Tier | DAG Access | Features |
|---|---|---|
| Free/Starter | None | No DAG orchestration |
| Pro | Read-only | View sample DAGs, can't create |
| Team | 5 DAGs, 2000 executions/mo | All node types except BYOB AI |
| Enterprise | Unlimited | All features + BYOB AI + Custom nodes |
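The table can be encoded as data and checked before a DAG is created or run. This is a sketch only: the tier keys, `TierLimits`, and the helper functions are hypothetical, and it assumes "Unlimited" on Enterprise covers both DAG count and monthly executions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierLimits:
    can_view: bool
    max_dags: Optional[int]                  # None means unlimited
    max_executions_per_month: Optional[int]  # None means unlimited
    byob_ai: bool


# Values taken from the tier table; the key names are assumptions.
TIERS = {
    "free": TierLimits(False, 0, 0, False),
    "starter": TierLimits(False, 0, 0, False),
    "pro": TierLimits(True, 0, 0, False),  # read-only: view sample DAGs
    "team": TierLimits(True, 5, 2000, False),
    "enterprise": TierLimits(True, None, None, True),
}


def can_create_dag(tier: str, existing_dags: int) -> bool:
    """True if the tenant may create one more DAG."""
    limit = TIERS[tier].max_dags
    return limit is None or existing_dags < limit


def can_execute(tier: str, executions_this_month: int) -> bool:
    """True if the tenant has monthly execution quota left."""
    limit = TIERS[tier].max_executions_per_month
    return limit is None or executions_this_month < limit
```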
Use Cases¶
1. Invoice Processing¶
Upload → Extract Fields → Verify Amounts → Check Duplicates → Approval (>$10k) → Route → Export to Accounting
2. Compliance Scanning¶
Upload → Scan for PII → Classify Sensitivity → High Risk? → Approval → Tag → Notify Compliance Team → Archive
3. Knowledge Graph Enrichment¶
Upload → Extract Entities → Enrich with Wikidata → Verify Claims → Detect Contradictions → Anchor to Hedera
4. Multi-Document Analysis¶
Upload Batch → Extract from All → Aggregate Data → AI Compare → Generate Report → Notify Stakeholders
Benefits¶
1. Consistency¶
Same pipeline runs on every document—no human variance.
2. Auditability¶
Every step is logged with:
- Input/output data
- Execution time
- AI confidence scores
- Approval decisions
- Error traces
3. Scalability¶
Parallel execution + worker pools = hundreds of documents per hour.
4. Flexibility¶
Mix AI processing, human judgment, and external tools in any sequence.
5. Cost Control¶
- Gemini for bulk tasks (99% cheaper than Claude)
- Claude for reasoning tasks (when accuracy matters)
- Caching to avoid redundant AI calls
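Caching of AI calls could work by keying each result on the node type, its configuration, and the input content, so an identical request is served from the cache instead of calling the model again. The sketch below is one possible approach, not the Archivus implementation; `cache_key` and `AICallCache` are hypothetical, and the in-memory dict stands in for whatever store is actually used.

```python
import hashlib
import json
from typing import Any, Callable


def cache_key(node_type: str, config: dict[str, Any], content: str) -> str:
    """Stable SHA-256 key over node type, config, and input content."""
    payload = json.dumps(
        {"node": node_type, "config": config, "content": content},
        sort_keys=True,  # key must not depend on dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AICallCache:
    """In-memory cache that skips repeated, identical AI calls."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(
        self,
        node_type: str,
        config: dict[str, Any],
        content: str,
        call: Callable[[str], Any],
    ) -> Any:
        key = cache_key(node_type, config, content)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call(content)  # the (expensive) model call
        self._store[key] = result
        return result
```

Including the node configuration in the key matters: the same document summarized in two different styles should produce two cache entries, not one.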
What This Enables¶
| Manual Processing | DAG Orchestration |
|---|---|
| Human does every step | Human approves key decisions only |
| Inconsistent logic | Same pipeline every time |
| Hours per document | Minutes per document |
| No audit trail | Complete execution logs |
| Can't scale | Parallel execution at scale |
| Opaque reasoning | Transparent node outputs |
The Result¶
Complex multi-step intelligence processing that:
- Scales (hundreds of documents per hour)
- Adapts (human-in-the-loop for edge cases)
- Verifies (GOLAG agents check AI outputs)
- Audits (every step logged for compliance)
- Integrates (external tools via MCP)
Not "document automation"—intelligence orchestration.
Workflows that think, not just execute.