Hybrid Analytics

DuckDB-Powered Analytics for Real-Time Insights (Team+)

Overview

Archivus implements a hybrid analytics architecture that combines PostgreSQL (OLTP) with DuckDB (OLAP) to deliver 10-100x faster analytical queries without impacting operational workloads.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Archivus Data Layer                          │
│                                                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  Operational Data (OLTP)                   │  │
│  │                      PostgreSQL                            │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐         │  │
│  │  │Documents│ │  Chats  │ │ Users   │ │ Folders │         │  │
│  │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘         │  │
│  └───────┼───────────┼───────────┼───────────┼───────────────┘  │
│          │           │           │           │                   │
│          └───────────┴─────┬─────┴───────────┘                   │
│                            │                                      │
│                     ┌──────▼──────┐                              │
│                     │   ETL Sync  │  (Background workers)        │
│                     │  (Periodic) │                              │
│                     └──────┬──────┘                              │
│                            │                                      │
│  ┌─────────────────────────▼─────────────────────────────────┐  │
│  │                  Analytical Data (OLAP)                    │  │
│  │                   DuckDB + S3 Parquet                      │  │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐         │  │
│  │  │ document_   │ │   chat_     │ │   usage_    │         │  │
│  │  │  analytics  │ │  analytics  │ │  analytics  │         │  │
│  │  └─────────────┘ └─────────────┘ └─────────────┘         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Performance Comparison

Query Examples

Query Type	PostgreSQL	DuckDB	Improvement
Daily document counts (30 days)	450ms	12ms	37x faster
Monthly active users	890ms	25ms	35x faster
Storage growth trend	1.2s	45ms	27x faster
Top tags analysis	2.1s	78ms	27x faster
AI credit usage breakdown	3.4s	120ms	28x faster
Cross-tenant aggregation*	8.5s	340ms	25x faster

*Super admin only

Why DuckDB?

Feature	PostgreSQL	DuckDB
Primary use	Transactions (OLTP)	Analytics (OLAP)
Storage	Row-based	Column-based
Aggregations	Moderate	Excellent
Concurrent writes	Excellent	Limited
Query latency	Low for CRUD	Low for analytics

Data Model

Parquet Files on S3

s3://archivus-analytics/
├── tenant_{id}/
│   ├── documents/
│   │   ├── 2026-01-18.parquet
│   │   ├── 2026-01-17.parquet
│   │   └── ...
│   ├── chats/
│   │   └── ...
│   ├── usage/
│   │   └── ...
│   └── ai_traffic/
│       └── ...

Schema Example: Document Analytics

-- DuckDB table backed by Parquet
CREATE TABLE document_analytics AS
SELECT * FROM read_parquet('s3://archivus-analytics/tenant_*/documents/*.parquet');

-- Schema
document_id     UUID
tenant_id       UUID
created_at      TIMESTAMP
file_type       VARCHAR
file_size_bytes BIGINT
page_count      INTEGER
has_ai_analysis BOOLEAN
embedding_count INTEGER
tag_count       INTEGER
view_count      INTEGER
chat_count      INTEGER

Available Analytics

Document Analytics

GET /api/v1/analytics/documents?period=30d

Response:
{
  "period": "30d",
  "summary": {
    "total_documents": 1523,
    "total_size_gb": 12.4,
    "documents_added": 234,
    "documents_analyzed": 189
  },
  "by_type": [
    {"type": "pdf", "count": 890, "size_gb": 8.2},
    {"type": "docx", "count": 412, "size_gb": 2.8},
    {"type": "xlsx", "count": 221, "size_gb": 1.4}
  ],
  "daily_trend": [
    {"date": "2026-01-18", "added": 12, "deleted": 2},
    {"date": "2026-01-17", "added": 8, "deleted": 0}
  ]
}

AI Usage Analytics

GET /api/v1/analytics/ai-usage?period=30d

Response:
{
  "period": "30d",
  "summary": {
    "total_credits_used": 4520,
    "total_requests": 2890,
    "avg_credits_per_day": 150.7
  },
  "by_operation": [
    {"operation": "chat", "credits": 2100, "requests": 1050},
    {"operation": "summary", "credits": 890, "requests": 445},
    {"operation": "analysis", "credits": 1200, "requests": 400},
    {"operation": "search", "credits": 330, "requests": 660}
  ],
  "by_user": [
    {"user_id": "user_123", "credits": 1200, "requests": 450},
    {"user_id": "user_456", "credits": 980, "requests": 320}
  ]
}

Storage Analytics

GET /api/v1/analytics/storage?period=90d

Response:
{
  "period": "90d",
  "current": {
    "total_size_gb": 45.2,
    "quota_gb": 100,
    "utilization_percent": 45.2
  },
  "growth": {
    "last_30_days_gb": 8.5,
    "projected_30_days_gb": 9.2,
    "days_until_quota": 183
  },
  "by_folder": [
    {"folder": "/contracts", "size_gb": 12.4, "documents": 456},
    {"folder": "/invoices", "size_gb": 8.9, "documents": 2341}
  ]
}

Search Analytics

GET /api/v1/analytics/search?period=30d

Response:
{
  "period": "30d",
  "summary": {
    "total_searches": 3420,
    "avg_results": 8.5,
    "avg_latency_ms": 340
  },
  "popular_queries": [
    {"query": "contract terms", "count": 89},
    {"query": "invoice 2025", "count": 67}
  ],
  "search_success_rate": 0.92,
  "zero_result_queries": 274
}

ETL Process

Sync Architecture

┌─────────────────────────────────────────────────────────────────┐
│                       ETL Pipeline                               │
│                                                                   │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │
│  │  PostgreSQL │────▶│   Worker    │────▶│  S3 Parquet │       │
│  │   (Source)  │     │  (Extract)  │     │   (Target)  │       │
│  └─────────────┘     └──────┬──────┘     └─────────────┘       │
│                             │                                    │
│                      ┌──────▼──────┐                            │
│                      │   DuckDB    │                            │
│                      │  (Query)    │                            │
│                      └─────────────┘                            │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Sync Schedule

Data Type	Sync Frequency	Latency
Documents	Every 5 minutes	~5 min
Chat sessions	Every 15 minutes	~15 min
AI usage	Hourly	~1 hour
Storage metrics	Daily	~24 hours

Incremental Updates

-- Only sync changed records
SELECT * FROM documents
WHERE updated_at > :last_sync_time
  AND tenant_id = :tenant_id;

Query API

Custom Analytics Queries

POST /api/v1/analytics/query
Content-Type: application/json

{
  "query": "document_growth",
  "parameters": {
    "start_date": "2026-01-01",
    "end_date": "2026-01-18",
    "group_by": "week"
  }
}

Response:
{
  "query": "document_growth",
  "results": [
    {"week": "2026-W01", "documents_added": 45, "total": 1450},
    {"week": "2026-W02", "documents_added": 52, "total": 1502},
    {"week": "2026-W03", "documents_added": 21, "total": 1523}
  ],
  "execution_time_ms": 23
}

Available Query Types

Query Type	Description
`document_growth`	Document count over time
`storage_trend`	Storage usage trend
`ai_usage`	AI credit consumption
`user_activity`	User engagement metrics
`search_patterns`	Search behavior analysis
`tag_distribution`	Tag usage analysis

Cost Efficiency

Pricing Model

Component	Cost
S3 Storage	~$0.023/GB/month
DuckDB Compute	Ephemeral (no always-on)
ETL Workers	Shared infrastructure

vs Traditional Data Warehouse

Aspect	Traditional DW	Archivus Hybrid
Fixed costs	$500-5000/month	$0
Storage	$0.10-0.50/GB	$0.023/GB
Query pricing	Per-query or reserved	Included
Setup complexity	High	Built-in

Tenant Isolation

Analytics data maintains tenant isolation:

-- All analytics queries filtered by tenant
SELECT * FROM document_analytics
WHERE tenant_id = current_setting('app.tenant_id');

-- Parquet files organized by tenant
s3://archivus-analytics/tenant_{tenant_id}/...

API Reference

Analytics Endpoints

Endpoint	Method	Description
`/analytics/documents`	GET	Document analytics
`/analytics/ai-usage`	GET	AI credit usage
`/analytics/storage`	GET	Storage metrics
`/analytics/search`	GET	Search analytics
`/analytics/users`	GET	User activity
`/analytics/query`	POST	Custom queries
`/analytics/export`	POST	Export analytics data

Tier Availability

Feature	Pro	Team	Enterprise
Basic analytics	Dashboard only	Full API	Full API
Custom queries	-	Limited	Unlimited
Data retention	30 days	90 days	Custom
Export	-	CSV	CSV, Parquet
Real-time sync	-	15 min	5 min

Best Practices

Query Optimization

Use date ranges - Limit query scope
Specify aggregation level - Daily, weekly, monthly
Filter early - Use query parameters

Data Management

Monitor sync status - Ensure ETL healthy
Review data freshness - Understand latency
Archive old data - Manage storage costs

Ready to analyze? Explore Analytics Dashboard