DAG Orchestration

DAGs (Directed Acyclic Graphs) in Archivus aren't "document workflows"—they're intelligence pipelines that orchestrate complex multi-step verification and enrichment processes.

The Concept

Traditional workflow: Task A → Task B → Task C (fixed sequence)

DAG Intelligence Pipeline: a complex graph where each node can be:

  • AI processing (analysis, summarization, extraction)
  • Human approval (gates for critical decisions)
  • External tools (MCP integrations)
  • Control flow (branching, merging, delays)
  • Actions (notifications, data transforms, exports)

Architecture

graph LR
    START([Start]) --> INGEST[Ingest Document]
    INGEST --> EXTRACT[Extract Entities]
    EXTRACT --> VERIFY[Verify Claims]
    VERIFY --> ENRICH[Enrich Context]
    ENRICH --> ANCHOR[Anchor to Hedera]
    ANCHOR --> END([End])

    VERIFY -.requires approval.-> APPROVAL{Human Review}
    APPROVAL -.approved.-> ENRICH
    APPROVAL -.rejected.-> REJECT[Mark Invalid]
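
Expressed as data, the same pipeline might look like the following TypeScript sketch. The node and edge shapes here are illustrative assumptions, not Archivus's actual schema.

// Hypothetical DAG definition: nodes plus labeled edges (illustrative shapes).
type NodeType = "start" | "end" | "ai" | "approval" | "action";

interface DagNode {
  id: string;
  type: NodeType;
  config?: Record<string, unknown>; // node-specific settings
}

interface DagEdge {
  from: string;
  to: string;
  label?: string; // e.g. "approved", "rejected"
}

const pipeline: { nodes: DagNode[]; edges: DagEdge[] } = {
  nodes: [
    { id: "start", type: "start" },
    { id: "ingest", type: "action" },
    { id: "extract", type: "ai", config: { task: "extract_entities" } },
    { id: "verify", type: "ai", config: { task: "verify_claims" } },
    { id: "review", type: "approval" },
    { id: "enrich", type: "ai", config: { task: "enrich_context" } },
    { id: "reject", type: "action", config: { action: "mark_invalid" } },
    { id: "anchor", type: "action", config: { target: "hedera" } },
    { id: "end", type: "end" },
  ],
  edges: [
    { from: "start", to: "ingest" },
    { from: "ingest", to: "extract" },
    { from: "extract", to: "verify" },
    { from: "verify", to: "review" },
    { from: "review", to: "enrich", label: "approved" },
    { from: "review", to: "reject", label: "rejected" },
    { from: "enrich", to: "anchor" },
    { from: "anchor", to: "end" },
  ],
};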

Node Types (25 Production-Ready)

Control Flow Nodes

  • Start: Entry point for the pipeline
  • End: Terminal node
  • Condition: Branch based on data or confidence scores
  • Merge: Combine multiple paths
  • Delay: Time-based pause (schedule dependencies)

AI Processing Nodes

Real LLM integration with multi-provider routing:

  • AI Analyze: Comprehensive analysis with themes, sentiment, insights
  • AI Summarize: Structured summaries with configurable style
  • AI Classify: Document type classification with confidence scores
  • AI Extract: Entity extraction (dates, amounts, names, addresses, emails)
  • AI Tag: Auto-tag generation with confidence and categories
  • AI Compare: Multi-document comparison with similarity scores

Provider Routing:

  • Claude for complex reasoning tasks
  • Gemini for high-volume bulk processing (99% cost savings)
  • OpenAI for embeddings and structured extraction
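
A minimal routing sketch under that policy; the heuristic and return values are illustrative assumptions, not the actual router.

// Illustrative provider routing: choose by task profile and volume.
type AiTask = "analyze" | "summarize" | "classify" | "extract" | "tag" | "compare";

function routeProvider(task: AiTask, highVolume: boolean): "claude" | "gemini" | "openai" {
  if (highVolume) return "gemini";          // bulk processing, cost-sensitive
  if (task === "extract") return "openai";  // embeddings and structured extraction
  return "claude";                          // complex reasoning by default
}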

Transform Nodes

  • Transform: Data manipulation and field mapping
  • Filter: Conditional data filtering
  • Aggregate: Combine multiple inputs
  • Split: Divide data for parallel processing
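
As a sketch, a Transform node might reshape upstream outputs in the shared context like this; the handler signature and field names are assumptions for illustration.

// Hypothetical Transform handler: map extracted metadata into a normalized shape.
type Context = Record<string, any>;

function transformNode(ctx: Context): Context {
  return {
    ...ctx,
    invoice: {
      vendor: ctx.metadata?.vendor,        // from the upstream Extract step
      total: ctx.metadata?.total_amount,
    },
  };
}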

Action Nodes

  • Notify: Email, WebSocket, Slack notifications
  • Webhook: Call external APIs with retry logic
  • Set Field: Update document metadata
  • Tag: Apply tags to documents
  • Move: Relocate documents to folders
  • Archive: Archive to long-term storage
  • Export: Export to external systems

Human-in-the-Loop Nodes

Critical for Team+ tier:

  • Approval: Request approval with escalation
      • 24-hour timeout → escalate to manager
      • 48-hour timeout → auto-reject
      • WebSocket notifications for real-time alerts

  • Review: Assign document review tasks
      • Assignee notification
      • Status tracking
      • Comment threads

  • Assign: Delegate tasks with deadlines
      • Priority levels
      • Follow-up reminders
      • Completion tracking
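
A hypothetical Approval node configuration tying the escalation ladder together (all field names are illustrative, not the actual schema):

{
  "type": "approval",
  "assignee": "reviewer@example.com",
  "escalation": [
    { "after_hours": 24, "escalate_to": "manager" },
    { "after_hours": 48, "action": "auto_reject" }
  ],
  "notify": ["websocket", "email"]
}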

Execution Model

Dependency Resolution

Node A completes → Checks dependencies → Queues Node B if all upstream complete

Parallel Execution: Independent nodes run concurrently via topological sorting.

Example:

      ┌─ Extract Entities ─┐
Start ┤                     ├─ Merge → Verify → End
      └─ Extract Metadata ─┘

Extract Entities and Extract Metadata run in parallel.
Merge waits for both to complete.
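
A minimal scheduler sketch of this model: Kahn-style level-by-level execution, where every node whose upstream dependencies are complete runs concurrently. The graph shape and runNode callback are assumptions for illustration.

// Run a DAG level by level; each ready "wave" executes in parallel.
async function executeDag(
  deps: Map<string, string[]>,             // node id -> upstream node ids
  runNode: (id: string) => Promise<void>,
): Promise<void> {
  const done = new Set<string>();
  const pending = new Set(deps.keys());

  while (pending.size > 0) {
    // Ready = pending nodes whose upstream nodes have all completed.
    const ready = [...pending].filter((id) =>
      (deps.get(id) ?? []).every((up) => done.has(up)),
    );
    if (ready.length === 0) throw new Error("cycle or unmet dependency");

    await Promise.all(ready.map((id) => runNode(id))); // parallel wave
    for (const id of ready) {
      pending.delete(id);
      done.add(id);
    }
  }
}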

Shared Execution Context

All node outputs flow into a shared context (JSONB):

{
  "document_id": "...",
  "entities": ["John Smith", "Acme Corp"],
  "classification": {
    "type": "INVOICE",
    "confidence": 0.92
  },
  "metadata": {
    "total_amount": 1500.00,
    "vendor": "Acme Corp"
  }
}

Downstream nodes reference upstream outputs:

Condition node: IF context.classification.confidence < 0.8 THEN request_review
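
A sketch of how a Condition node might resolve that expression against the shared context; the dotted-path helper is an assumption, not the actual syntax.

// Resolve a dotted path like "classification.confidence" against the context.
function getPath(ctx: Record<string, any>, path: string): any {
  return path.split(".").reduce((value, key) => value?.[key], ctx);
}

const context = { classification: { type: "INVOICE", confidence: 0.75 } };
const needsReview = getPath(context, "classification.confidence") < 0.8; // true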

Failure Handling

Fail-fast behavior: an error in one node propagates, cancelling all pending nodes rather than executing them.

Retry Logic:

  • Configurable retry count (default: 3)
  • Exponential backoff (1s, 2s, 4s, 8s...)
  • Error categorization (transient vs. permanent)

Graceful Degradation:

  • Optional nodes can fail without stopping the pipeline
  • Error outputs captured in the shared context
  • Notification sent on critical failures
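
A minimal retry helper matching that policy, assuming the caller supplies the transient-vs-permanent check:

// Retry with exponential backoff (1s, 2s, 4s, 8s...); rethrow permanent errors.
async function withRetry<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransient(err) || attempt >= maxRetries) throw err;
      const delayMs = 1000 * 2 ** attempt;  // 1s, 2s, 4s...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}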

White-Label AI Integration

DAGs support BYOB (Bring Your Own Brain) for Enterprise tier:

Per-Tenant AI Provider:

  • Platform default (Archivus-managed)
  • Anthropic Claude (customer API key)
  • OpenAI GPT (customer API key)
  • Ollama (self-hosted, for air-gapped environments)
  • Custom providers via MCP

Benefits:

  • Data sovereignty (models run on customer infrastructure)
  • Cost control (customer billing)
  • Compliance (regulated industries)

Configuration:

{
  "ai_provider": "anthropic",
  "api_key": "<encrypted>",
  "model": "claude-3-5-sonnet-20241022",
  "custom_prompt_prefix": "[Company Policy Context]"
}
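
A sketch of per-tenant provider resolution under such a config; the types and fallback behavior are assumptions for illustration (key decryption is elided).

// Resolve a tenant's AI provider, falling back to the platform default.
interface TenantAiConfig {
  ai_provider?: "anthropic" | "openai" | "ollama" | "custom";
  model?: string;       // api_key stays encrypted until the call site
}

function resolveProvider(cfg: TenantAiConfig): { provider: string; model: string } {
  if (!cfg.ai_provider) {
    return { provider: "platform", model: "default" }; // Archivus-managed
  }
  return { provider: cfg.ai_provider, model: cfg.model ?? "default" };
}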

Human-in-the-Loop Workflow

Approval Node Example

Document Classification (AI Confidence: 0.75)
Approval Request Created
WebSocket Notification → User's browser
User Reviews:
  - View document
  - See AI reasoning
  - Check confidence score
User Decides:
  ├─ Approve → Pipeline continues
  ├─ Reject → Pipeline terminates
  └─ Timeout (24h) → Escalate to manager
        └─ Timeout (48h) → Auto-reject

Escalation Logic

Level 1: Assignee (24-hour timeout)
    ↓ timeout
Level 2: Manager (24-hour timeout)
    ↓ timeout
Level 3: Auto-reject + notify admins

All escalations logged for compliance audit.
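
The same ladder as a small check, with thresholds matching the timeouts above (names are illustrative):

// Map elapsed wait time to an escalation level.
function escalationLevel(hoursWaiting: number): "assignee" | "manager" | "auto_reject" {
  if (hoursWaiting < 24) return "assignee";
  if (hoursWaiting < 48) return "manager";
  return "auto_reject";  // logged and admins notified
}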

Real-World Pipeline: Contract Intelligence

graph TB
    START([Contract Upload]) --> EXTRACT[AI Extract Entities]
    EXTRACT --> CLASSIFY[AI Classify Document Type]
    CLASSIFY --> CHECK{Confidence >= 0.9?}

    CHECK -->|Yes| ENRICH[Enrich with Wikidata]
    CHECK -->|No| APPROVAL[Request Human Review]

    APPROVAL -->|Approved| ENRICH
    APPROVAL -->|Rejected| REJECT[Mark Invalid]

    ENRICH --> VERIFY[Verify Claims]
    VERIFY --> CONTRA[Detect Contradictions]
    CONTRA --> HASCONTRA{Has Contradictions?}

    HASCONTRA -->|Yes| NOTIFY[Notify Legal Team]
    HASCONTRA -->|No| ROUTE[Route to Folder]

    NOTIFY --> ROUTE
    ROUTE --> ANCHOR[Anchor to Hedera]
    ANCHOR --> END([Complete])

    REJECT --> END

Nodes:

  1. AI Extract Entities (Gemini - cost-effective bulk processing)
  2. AI Classify (Claude - high-accuracy reasoning)
  3. Condition (check confidence threshold)
  4. Approval (human-in-the-loop gate)
  5. Enrich (external Wikidata lookup)
  6. Verify Claims (GOLAG agent verification)
  7. Detect Contradictions (symbolic graph query)
  8. Condition (check contradiction status)
  9. Notify (WebSocket + email alert)
  10. Route (move to appropriate folder)
  11. Anchor (Hedera consensus anchoring)

Duration: ~2-5 minutes (with parallel execution)
Cost: ~0.15 AI credits (Gemini bulk + Claude reasoning)

Statistics Tracking

Every DAG execution records:

  • Success rate per node type
  • Average duration per node
  • Error frequency and categorization
  • Approval wait times
  • Escalation rates

Dashboard Metrics:

DAG: Contract Intelligence
├─ Executions: 1,247
├─ Success Rate: 94.3%
├─ Avg Duration: 3m 22s
├─ Approval Rate: 12.4%
├─ Escalation Rate: 2.1%
└─ Cost per Execution: 0.14 credits

Tier Gating

Tier           DAG Access                     Features
Free/Starter   None                           No DAG orchestration
Pro            Read-only                      View sample DAGs; can't create
Team           5 DAGs, 2,000 executions/mo    All node types except BYOB AI
Enterprise     Unlimited                      All features + BYOB AI + custom nodes

Use Cases

1. Invoice Processing

Upload → Extract Fields → Verify Amounts → Check Duplicates → Approval (>$10k) → Route → Export to Accounting

2. Compliance Scanning

Upload → Scan for PII → Classify Sensitivity → High Risk? → Approval → Tag → Notify Compliance Team → Archive

3. Knowledge Graph Enrichment

Upload → Extract Entities → Enrich with Wikidata → Verify Claims → Detect Contradictions → Anchor to Hedera

4. Multi-Document Analysis

Upload Batch → Extract from All → Aggregate Data → AI Compare → Generate Report → Notify Stakeholders

Benefits

1. Consistency

Same pipeline runs on every document—no human variance.

2. Auditability

Every step logged with:

  • Input/output data
  • Execution time
  • AI confidence scores
  • Approval decisions
  • Error traces

3. Scalability

Parallel execution + worker pools = hundreds of documents per hour.

4. Flexibility

Mix AI processing, human judgment, external tools in any sequence.

5. Cost Control

  • Gemini for bulk tasks (99% cheaper than Claude)
  • Claude for reasoning tasks (when accuracy matters)
  • Caching to avoid redundant AI calls

What This Enables

Manual Processing        DAG Orchestration
Human does every step    Human approves key decisions only
Inconsistent logic       Same pipeline every time
Hours per document       Minutes per document
No audit trail           Complete execution logs
Can't scale              Parallel execution at scale
Opaque reasoning         Transparent node outputs

The Result

Complex multi-step intelligence processing that:

  • Scales (hundreds of documents per hour)
  • Adapts (human-in-the-loop for edge cases)
  • Verifies (GOLAG agents check AI outputs)
  • Audits (every step logged for compliance)
  • Integrates (external tools via MCP)

Not "document automation"—intelligence orchestration.


Workflows that think, not just execute.