DAG Orchestration

DAGs (Directed Acyclic Graphs) in Archivus aren't "document workflows"—they're intelligence pipelines that orchestrate complex multi-step verification and enrichment processes.

The Concept

Traditional workflow: Task A → Task B → Task C (fixed sequence)

DAG Intelligence Pipeline: a complex graph where each node can be:

  • AI processing (analysis, summarization, extraction)
  • Human approval (gates for critical decisions)
  • External tools (MCP integrations)
  • Control flow (branching, merging, delays)
  • Actions (notifications, data transforms, exports)

Architecture

graph LR
    START([Start]) --> INGEST[Ingest Document]
    INGEST --> EXTRACT[Extract Entities]
    EXTRACT --> VERIFY[Verify Claims]
    VERIFY --> ENRICH[Enrich Context]
    ENRICH --> ANCHOR[Anchor to Hedera]
    ANCHOR --> END([End])

    VERIFY -.requires approval.-> APPROVAL{Human Review}
    APPROVAL -.approved.-> ENRICH
    APPROVAL -.rejected.-> REJECT[Mark Invalid]
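
Expressed as data, the same pipeline might look like the following TypeScript sketch. The node and edge shapes here are illustrative assumptions, not Archivus's actual schema.

// Hypothetical DAG definition: nodes plus labeled edges (illustrative shapes).
type NodeType = "start" | "end" | "ai" | "approval" | "action";

interface DagNode {
  id: string;
  type: NodeType;
  config?: Record<string, unknown>; // node-specific settings
}

interface DagEdge {
  from: string;
  to: string;
  label?: string; // e.g. "approved", "rejected"
}

const pipeline: { nodes: DagNode[]; edges: DagEdge[] } = {
  nodes: [
    { id: "start", type: "start" },
    { id: "ingest", type: "action" },
    { id: "extract", type: "ai", config: { task: "extract_entities" } },
    { id: "verify", type: "ai", config: { task: "verify_claims" } },
    { id: "review", type: "approval" },
    { id: "enrich", type: "ai", config: { task: "enrich_context" } },
    { id: "reject", type: "action", config: { action: "mark_invalid" } },
    { id: "anchor", type: "action", config: { target: "hedera" } },
    { id: "end", type: "end" },
  ],
  edges: [
    { from: "start", to: "ingest" },
    { from: "ingest", to: "extract" },
    { from: "extract", to: "verify" },
    { from: "verify", to: "review" },
    { from: "review", to: "enrich", label: "approved" },
    { from: "review", to: "reject", label: "rejected" },
    { from: "enrich", to: "anchor" },
    { from: "anchor", to: "end" },
  ],
};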

Node Types (25 Production-Ready)

Control Flow Nodes

  • Start: Entry point for the pipeline
  • End: Terminal node
  • Condition: Branch based on data or confidence scores
  • Merge: Combine multiple paths
  • Delay: Time-based pause (schedule dependencies)

AI Processing Nodes

Real LLM integration with multi-provider routing:

  • AI Analyze: Comprehensive analysis with themes, sentiment, insights
  • AI Summarize: Structured summaries with configurable style
  • AI Classify: Document type classification with confidence scores
  • AI Extract: Entity extraction (dates, amounts, names, addresses, emails)
  • AI Tag: Auto-tag generation with confidence and categories
  • AI Compare: Multi-document comparison with similarity scores

Provider Routing:

  • Claude for complex reasoning tasks
  • Gemini for high-volume bulk processing (99% cost savings)
  • OpenAI for embeddings and structured extraction
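
A minimal routing sketch under that policy; the heuristic and return values are illustrative assumptions, not the actual router.

// Illustrative provider routing: choose by task profile and volume.
type AiTask = "analyze" | "summarize" | "classify" | "extract" | "tag" | "compare";

function routeProvider(task: AiTask, highVolume: boolean): "claude" | "gemini" | "openai" {
  if (highVolume) return "gemini";          // bulk processing, cost-sensitive
  if (task === "extract") return "openai";  // embeddings and structured extraction
  return "claude";                          // complex reasoning by default
}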

Transform Nodes

  • Transform: Data manipulation and field mapping
  • Filter: Conditional data filtering
  • Aggregate: Combine multiple inputs
  • Split: Divide data for parallel processing
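
As a sketch, a Transform node might reshape upstream outputs in the shared context like this; the handler signature and field names are assumptions for illustration.

// Hypothetical Transform handler: map extracted metadata into a normalized shape.
type Context = Record<string, any>;

function transformNode(ctx: Context): Context {
  return {
    ...ctx,
    invoice: {
      vendor: ctx.metadata?.vendor,        // from the upstream Extract step
      total: ctx.metadata?.total_amount,
    },
  };
}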

Action Nodes

  • Notify: Email, WebSocket, Slack notifications
  • Webhook: Call external APIs with retry logic
  • Set Field: Update document metadata
  • Tag: Apply tags to documents
  • Move: Relocate documents to folders
  • Archive: Archive to long-term storage
  • Export: Export to external systems

Human-in-the-Loop Nodes

Critical for Team+ tier:

  • Approval: Request approval with escalation
      • 24-hour timeout → escalate to manager
      • 48-hour timeout → auto-reject
      • WebSocket notifications for real-time alerts

  • Review: Assign document review tasks
      • Assignee notification
      • Status tracking
      • Comment threads

  • Assign: Delegate tasks with deadlines
      • Priority levels
      • Follow-up reminders
      • Completion tracking
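
A hypothetical Approval node configuration tying the escalation ladder together (all field names are illustrative, not the actual schema):

{
  "type": "approval",
  "assignee": "reviewer@example.com",
  "escalation": [
    { "after_hours": 24, "escalate_to": "manager" },
    { "after_hours": 48, "action": "auto_reject" }
  ],
  "notify": ["websocket", "email"]
}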

Execution Model

Dependency Resolution

Node A completes → Checks dependencies → Queues Node B if all upstream complete

Parallel Execution: Independent nodes run concurrently via topological sorting.

Example:

      ┌─ Extract Entities ─┐
Start ┤                     ├─ Merge → Verify → End
      └─ Extract Metadata ─┘

Extract Entities and Extract Metadata run in parallel.
Merge waits for both to complete.
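
A minimal scheduler sketch of this model: Kahn-style level-by-level execution, where every node whose upstream dependencies are complete runs concurrently. The graph shape and runNode callback are assumptions for illustration.

// Run a DAG level by level; each ready "wave" executes in parallel.
async function executeDag(
  deps: Map<string, string[]>,             // node id -> upstream node ids
  runNode: (id: string) => Promise<void>,
): Promise<void> {
  const done = new Set<string>();
  const pending = new Set(deps.keys());

  while (pending.size > 0) {
    // Ready = pending nodes whose upstream nodes have all completed.
    const ready = [...pending].filter((id) =>
      (deps.get(id) ?? []).every((up) => done.has(up)),
    );
    if (ready.length === 0) throw new Error("cycle or unmet dependency");

    await Promise.all(ready.map((id) => runNode(id))); // parallel wave
    for (const id of ready) {
      pending.delete(id);
      done.add(id);
    }
  }
}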

Shared Execution Context

All node outputs flow into a shared context (JSONB):

{
  "document_id": "...",
  "entities": ["John Smith", "Acme Corp"],
  "classification": {
    "type": "INVOICE",
    "confidence": 0.92
  },
  "metadata": {
    "total_amount": 1500.00,
    "vendor": "Acme Corp"
  }
}

Downstream nodes reference upstream outputs:

Condition node: IF context.classification.confidence < 0.8 THEN request_review
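
A sketch of how a Condition node might resolve that expression against the shared context; the dotted-path helper is an assumption, not the actual syntax.

// Resolve a dotted path like "classification.confidence" against the context.
function getPath(ctx: Record<string, any>, path: string): any {
  return path.split(".").reduce((value, key) => value?.[key], ctx);
}

const context = { classification: { type: "INVOICE", confidence: 0.75 } };
const needsReview = getPath(context, "classification.confidence") < 0.8; // true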

Failure Handling

Fail-fast behavior: an error in one node propagates, cancelling all pending nodes rather than executing them.

Retry Logic:

  • Configurable retry count (default: 3)
  • Exponential backoff (1s, 2s, 4s, 8s...)
  • Error categorization (transient vs. permanent)

Graceful Degradation:

  • Optional nodes can fail without stopping the pipeline
  • Error outputs captured in the shared context
  • Notification sent on critical failures
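
A minimal retry helper matching that policy, assuming the caller supplies the transient-vs-permanent check:

// Retry with exponential backoff (1s, 2s, 4s, 8s...); rethrow permanent errors.
async function withRetry<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransient(err) || attempt >= maxRetries) throw err;
      const delayMs = 1000 * 2 ** attempt;  // 1s, 2s, 4s...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}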

White-Label AI Integration

DAGs support BYOB (Bring Your Own Brain) for Enterprise tier:

Per-Tenant AI Provider:

  • Platform default (Archivus-managed)
  • Anthropic Claude (customer API key)
  • OpenAI GPT (customer API key)
  • Ollama (self-hosted, for air-gapped environments)
  • Custom providers via MCP

Benefits:

  • Data sovereignty (models run on customer infrastructure)
  • Cost control (customer billing)
  • Compliance (regulated industries)

Configuration:

{
  "ai_provider": "anthropic",
  "api_key": "<encrypted>",
  "model": "claude-3-5-sonnet-20241022",
  "custom_prompt_prefix": "[Company Policy Context]"
}
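
A sketch of per-tenant provider resolution under such a config; the types and fallback behavior are assumptions for illustration (key decryption is elided).

// Resolve a tenant's AI provider, falling back to the platform default.
interface TenantAiConfig {
  ai_provider?: "anthropic" | "openai" | "ollama" | "custom";
  model?: string;       // api_key stays encrypted until the call site
}

function resolveProvider(cfg: TenantAiConfig): { provider: string; model: string } {
  if (!cfg.ai_provider) {
    return { provider: "platform", model: "default" }; // Archivus-managed
  }
  return { provider: cfg.ai_provider, model: cfg.model ?? "default" };
}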

Human-in-the-Loop Workflow

Approval Node Example

Document Classification (AI Confidence: 0.75)
Approval Request Created
WebSocket Notification → User's browser
User Reviews:
  - View document
  - See AI reasoning
  - Check confidence score
User Decides:
  ├─ Approve → Pipeline continues
  ├─ Reject → Pipeline terminates
  └─ Timeout (24h) → Escalate to manager
        └─ Timeout (48h) → Auto-reject

Escalation Logic

Level 1: Assignee (24-hour timeout)
    ↓ timeout
Level 2: Manager (24-hour timeout)
    ↓ timeout
Level 3: Auto-reject + notify admins

All escalations logged for compliance audit.
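
The same ladder as a small check, with thresholds matching the timeouts above (names are illustrative):

// Map elapsed wait time to an escalation level.
function escalationLevel(hoursWaiting: number): "assignee" | "manager" | "auto_reject" {
  if (hoursWaiting < 24) return "assignee";
  if (hoursWaiting < 48) return "manager";
  return "auto_reject";  // logged and admins notified
}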

Real-World Pipeline: Contract Intelligence

graph TB
    START([Contract Upload]) --> EXTRACT[AI Extract Entities]
    EXTRACT --> CLASSIFY[AI Classify Document Type]
    CLASSIFY --> CHECK{Confidence >= 0.9?}

    CHECK -->|Yes| ENRICH[Enrich with Wikidata]
    CHECK -->|No| APPROVAL[Request Human Review]

    APPROVAL -->|Approved| ENRICH
    APPROVAL -->|Rejected| REJECT[Mark Invalid]

    ENRICH --> VERIFY[Verify Claims]
    VERIFY --> CONTRA[Detect Contradictions]
    CONTRA --> HASCONTRA{Has Contradictions?}

    HASCONTRA -->|Yes| NOTIFY[Notify Legal Team]
    HASCONTRA -->|No| ROUTE[Route to Folder]

    NOTIFY --> ROUTE
    ROUTE --> ANCHOR[Anchor to Hedera]
    ANCHOR --> END([Complete])

    REJECT --> END

Nodes:

  1. AI Extract Entities (Gemini - cost-effective bulk processing)
  2. AI Classify (Claude - high-accuracy reasoning)
  3. Condition (check confidence threshold)
  4. Approval (human-in-the-loop gate)
  5. Enrich (external Wikidata lookup)
  6. Verify Claims (GOLAG agent verification)
  7. Detect Contradictions (symbolic graph query)
  8. Condition (check contradiction status)
  9. Notify (WebSocket + email alert)
  10. Route (move to appropriate folder)
  11. Anchor (Hedera consensus anchoring)

Duration: ~2-5 minutes (with parallel execution)
Cost: ~0.15 AI credits (Gemini bulk + Claude reasoning)

Statistics Tracking

Every DAG execution records:

  • Success rate per node type
  • Average duration per node
  • Error frequency and categorization
  • Approval wait times
  • Escalation rates

Dashboard Metrics:

DAG: Contract Intelligence
├─ Executions: 1,247
├─ Success Rate: 94.3%
├─ Avg Duration: 3m 22s
├─ Approval Rate: 12.4%
├─ Escalation Rate: 2.1%
└─ Cost per Execution: 0.14 credits

Tier Gating

Tier           DAG Access                     Features
Free/Starter   None                           No DAG orchestration
Pro            Read-only                      View sample DAGs; can't create
Team           5 DAGs, 2,000 executions/mo    All node types except BYOB AI
Enterprise     Unlimited                      All features + BYOB AI + custom nodes

Use Cases

1. Invoice Processing

Upload → Extract Fields → Verify Amounts → Check Duplicates → Approval (>$10k) → Route → Export to Accounting

2. Compliance Scanning

Upload → Scan for PII → Classify Sensitivity → High Risk? → Approval → Tag → Notify Compliance Team → Archive

3. Knowledge Graph Enrichment

Upload → Extract Entities → Enrich with Wikidata → Verify Claims → Detect Contradictions → Anchor to Hedera

4. Multi-Document Analysis

Upload Batch → Extract from All → Aggregate Data → AI Compare → Generate Report → Notify Stakeholders

Benefits

1. Consistency

Same pipeline runs on every document—no human variance.

2. Auditability

Every step logged with:

  • Input/output data
  • Execution time
  • AI confidence scores
  • Approval decisions
  • Error traces

3. Scalability

Parallel execution + worker pools = hundreds of documents per hour.

4. Flexibility

Mix AI processing, human judgment, external tools in any sequence.

5. Cost Control

  • Gemini for bulk tasks (99% cheaper than Claude)
  • Claude for reasoning tasks (when accuracy matters)
  • Caching to avoid redundant AI calls

What This Enables

Manual Processing        DAG Orchestration
Human does every step    Human approves key decisions only
Inconsistent logic       Same pipeline every time
Hours per document       Minutes per document
No audit trail           Complete execution logs
Can't scale              Parallel execution at scale
Opaque reasoning         Transparent node outputs

The Result

Complex multi-step intelligence processing that:

  • Scales (hundreds of documents per hour)
  • Adapts (human-in-the-loop for edge cases)
  • Verifies (GOLAG agents check AI outputs)
  • Audits (every step logged for compliance)
  • Integrates (external tools via MCP)

Not "document automation"—intelligence orchestration.


Workflows that think, not just execute.