Hybrid Analytics

DuckDB-Powered Analytics for Real-Time Insights (Team+)


Overview

Archivus implements a hybrid analytics architecture that combines PostgreSQL (OLTP) with DuckDB (OLAP) to deliver 10-100x faster analytical queries without impacting operational workloads.


Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Archivus Data Layer                          │
│                                                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  Operational Data (OLTP)                   │  │
│  │                      PostgreSQL                            │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐         │  │
│  │  │Documents│ │  Chats  │ │ Users   │ │ Folders │         │  │
│  │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘         │  │
│  └───────┼───────────┼───────────┼───────────┼───────────────┘  │
│          │           │           │           │                   │
│          └───────────┴─────┬─────┴───────────┘                   │
│                            │                                      │
│                     ┌──────▼──────┐                              │
│                     │   ETL Sync  │  (Background workers)        │
│                     │  (Periodic) │                              │
│                     └──────┬──────┘                              │
│                            │                                      │
│  ┌─────────────────────────▼─────────────────────────────────┐  │
│  │                  Analytical Data (OLAP)                    │  │
│  │                   DuckDB + S3 Parquet                      │  │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐         │  │
│  │  │ document_   │ │   chat_     │ │   usage_    │         │  │
│  │  │  analytics  │ │  analytics  │ │  analytics  │         │  │
│  │  └─────────────┘ └─────────────┘ └─────────────┘         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Performance Comparison

Query Examples

Query Type PostgreSQL DuckDB Improvement
Daily document counts (30 days) 450ms 12ms 37x faster
Monthly active users 890ms 25ms 35x faster
Storage growth trend 1.2s 45ms 27x faster
Top tags analysis 2.1s 78ms 27x faster
AI credit usage breakdown 3.4s 120ms 28x faster
Cross-tenant aggregation* 8.5s 340ms 25x faster

*Super admin only

Why DuckDB?

Feature PostgreSQL DuckDB
Primary use Transactions (OLTP) Analytics (OLAP)
Storage Row-based Column-based
Aggregations Moderate Excellent
Concurrent writes Excellent Limited
Query latency Low for CRUD Low for analytics

Data Model

Parquet Files on S3

s3://archivus-analytics/
├── tenant_{id}/
│   ├── documents/
│   │   ├── 2026-01-18.parquet
│   │   ├── 2026-01-17.parquet
│   │   └── ...
│   ├── chats/
│   │   └── ...
│   ├── usage/
│   │   └── ...
│   └── ai_traffic/
│       └── ...

Schema Example: Document Analytics

-- DuckDB table backed by Parquet
CREATE TABLE document_analytics AS
SELECT * FROM read_parquet('s3://archivus-analytics/tenant_*/documents/*.parquet');

-- Schema
document_id     UUID
tenant_id       UUID
created_at      TIMESTAMP
file_type       VARCHAR
file_size_bytes BIGINT
page_count      INTEGER
has_ai_analysis BOOLEAN
embedding_count INTEGER
tag_count       INTEGER
view_count      INTEGER
chat_count      INTEGER

Available Analytics

Document Analytics

GET /api/v1/analytics/documents?period=30d

Response:
{
  "period": "30d",
  "summary": {
    "total_documents": 1523,
    "total_size_gb": 12.4,
    "documents_added": 234,
    "documents_analyzed": 189
  },
  "by_type": [
    {"type": "pdf", "count": 890, "size_gb": 8.2},
    {"type": "docx", "count": 412, "size_gb": 2.8},
    {"type": "xlsx", "count": 221, "size_gb": 1.4}
  ],
  "daily_trend": [
    {"date": "2026-01-18", "added": 12, "deleted": 2},
    {"date": "2026-01-17", "added": 8, "deleted": 0}
  ]
}

AI Usage Analytics

GET /api/v1/analytics/ai-usage?period=30d

Response:
{
  "period": "30d",
  "summary": {
    "total_credits_used": 4520,
    "total_requests": 2890,
    "avg_credits_per_day": 150.7
  },
  "by_operation": [
    {"operation": "chat", "credits": 2100, "requests": 1050},
    {"operation": "summary", "credits": 890, "requests": 445},
    {"operation": "analysis", "credits": 1200, "requests": 400},
    {"operation": "search", "credits": 330, "requests": 660}
  ],
  "by_user": [
    {"user_id": "user_123", "credits": 1200, "requests": 450},
    {"user_id": "user_456", "credits": 980, "requests": 320}
  ]
}

Storage Analytics

GET /api/v1/analytics/storage?period=90d

Response:
{
  "period": "90d",
  "current": {
    "total_size_gb": 45.2,
    "quota_gb": 100,
    "utilization_percent": 45.2
  },
  "growth": {
    "last_30_days_gb": 8.5,
    "projected_30_days_gb": 9.2,
    "days_until_quota": 183
  },
  "by_folder": [
    {"folder": "/contracts", "size_gb": 12.4, "documents": 456},
    {"folder": "/invoices", "size_gb": 8.9, "documents": 2341}
  ]
}

Search Analytics

GET /api/v1/analytics/search?period=30d

Response:
{
  "period": "30d",
  "summary": {
    "total_searches": 3420,
    "avg_results": 8.5,
    "avg_latency_ms": 340
  },
  "popular_queries": [
    {"query": "contract terms", "count": 89},
    {"query": "invoice 2025", "count": 67}
  ],
  "search_success_rate": 0.92,
  "zero_result_queries": 274
}

ETL Process

Sync Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       ETL Pipeline                               │
│                                                                   │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │
│  │  PostgreSQL │────▶│   Worker    │────▶│  S3 Parquet │       │
│  │   (Source)  │     │  (Extract)  │     │   (Target)  │       │
│  └─────────────┘     └──────┬──────┘     └─────────────┘       │
│                             │                                    │
│                      ┌──────▼──────┐                            │
│                      │   DuckDB    │                            │
│                      │  (Query)    │                            │
│                      └─────────────┘                            │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Sync Schedule

Data Type Sync Frequency Latency
Documents Every 5 minutes ~5 min
Chat sessions Every 15 minutes ~15 min
AI usage Hourly ~1 hour
Storage metrics Daily ~24 hours

Incremental Updates

-- Only sync changed records
SELECT * FROM documents
WHERE updated_at > :last_sync_time
  AND tenant_id = :tenant_id;

Query API

Custom Analytics Queries

POST /api/v1/analytics/query
Content-Type: application/json

{
  "query": "document_growth",
  "parameters": {
    "start_date": "2026-01-01",
    "end_date": "2026-01-18",
    "group_by": "week"
  }
}

Response:
{
  "query": "document_growth",
  "results": [
    {"week": "2026-W01", "documents_added": 45, "total": 1450},
    {"week": "2026-W02", "documents_added": 52, "total": 1502},
    {"week": "2026-W03", "documents_added": 21, "total": 1523}
  ],
  "execution_time_ms": 23
}

Available Query Types

Query Type Description
document_growth Document count over time
storage_trend Storage usage trend
ai_usage AI credit consumption
user_activity User engagement metrics
search_patterns Search behavior analysis
tag_distribution Tag usage analysis

Cost Efficiency

Pricing Model

Component Cost
S3 Storage ~$0.023/GB/month
DuckDB Compute Ephemeral (no always-on)
ETL Workers Shared infrastructure

vs Traditional Data Warehouse

Aspect Traditional DW Archivus Hybrid
Fixed costs $500-5000/month $0
Storage $0.10-0.50/GB $0.023/GB
Query pricing Per-query or reserved Included
Setup complexity High Built-in

Tenant Isolation

Analytics data maintains tenant isolation:

-- All analytics queries filtered by tenant
SELECT * FROM document_analytics
WHERE tenant_id = current_setting('app.tenant_id');

-- Parquet files organized by tenant
s3://archivus-analytics/tenant_{tenant_id}/...

API Reference

Analytics Endpoints

Endpoint Method Description
/analytics/documents GET Document analytics
/analytics/ai-usage GET AI credit usage
/analytics/storage GET Storage metrics
/analytics/search GET Search analytics
/analytics/users GET User activity
/analytics/query POST Custom queries
/analytics/export POST Export analytics data

Tier Availability

Feature Pro Team Enterprise
Basic analytics Dashboard only Full API Full API
Custom queries - Limited Unlimited
Data retention 30 days 90 days Custom
Export - CSV CSV, Parquet
Real-time sync - 15 min 5 min

Best Practices

Query Optimization

  1. Use date ranges - Limit query scope
  2. Specify aggregation level - Daily, weekly, monthly
  3. Filter early - Use query parameters

Data Management

  1. Monitor sync status - Ensure ETL healthy
  2. Review data freshness - Understand latency
  3. Archive old data - Manage storage costs


Ready to analyze? Explore Analytics Dashboard