
BYOB AI

Bring Your Own AI allows enterprises to use their own language models and AI infrastructure with Archivus, maintaining full control over AI processing.


Why BYOB AI?

Data Privacy

Sensitive documents never leave your infrastructure for AI processing. Process everything locally with no external API calls.

Model Control

Use models fine-tuned for your industry, terminology, or compliance requirements.

Cost Optimization

High-volume users can reduce costs by running models on their own infrastructure.

Air-Gapped Compliance

Required for defense, government, and highly regulated industries where external AI services are prohibited.


Supported AI Backends

Local LLM Providers

| Provider | Models | GPU Required | Notes |
| --- | --- | --- | --- |
| Ollama | Llama, Mistral, Qwen, Gemma | Recommended | Easy setup, production-ready |
| vLLM | Most HuggingFace models | Yes | High-throughput serving |
| TGI | Most HuggingFace models | Yes | Hugging Face's inference server |
| LocalAI | Various | Optional | CPU-friendly option |
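
Most of these backends expose an OpenAI-compatible HTTP endpoint that Archivus can be pointed at. As a rough sketch of standing up vLLM (the model name, port, and flags below are examples and may vary by vLLM version):

# Serve a HuggingFace model behind vLLM's OpenAI-compatible API
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000

# Smoke test: list the models the server exposes
curl http://localhost:8000/v1/models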

Cloud AI Providers

| Provider | Models | Use Case |
| --- | --- | --- |
| OpenAI | GPT-4, GPT-4 Turbo | Your API key, your billing |
| Anthropic | Claude 3.5, Claude 3 | Your API key, your billing |
| Azure OpenAI | GPT models | Enterprise Azure deployment |
| Google Vertex AI | Gemini, PaLM | GCP-hosted models |
| AWS Bedrock | Various | AWS-managed AI |

Custom Models

Deploy your own fine-tuned models:

  • Custom document classification models
  • Industry-specific entity extraction
  • Domain terminology embeddings
  • Compliance-trained summarization
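
If your fine-tune is exported as a GGUF file, one lightweight way to serve it locally is to register it with Ollama; the file name, model name, and parameter below are placeholders:

# Package a fine-tuned GGUF as an Ollama model
cat > Modelfile <<'EOF'
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.2
EOF

ollama create my-doc-classifier -f Modelfile
ollama run my-doc-classifier "Classify this document: ..."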

Architecture

graph LR
    subgraph "Your Infrastructure"
        M[Your LLM Server]
        E[Your Embedding Server]
    end
    subgraph "Archivus"
        A[Document Processor]
        B[AI Router]
        C[Knowledge Graph]
    end

    A --> B
    B --> M
    B --> E
    M --> C
    E --> C

Processing Flow

  1. Document Upload - New document enters the system
  2. AI Router - Archivus routes AI requests to your configured backend
  3. Your Infrastructure - Processing happens on your models
  4. Results Return - AI outputs are returned to Archivus for storage
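
For an OpenAI-compatible backend, steps 2-4 amount to ordinary API calls from Archivus to your server, so the same call can be made by hand to verify the backend independently (endpoint and model name are examples):

# The kind of chat-completion request that gets routed to your backend
curl https://your-llm-server.internal/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "system", "content": "Summarize the following document."},
      {"role": "user", "content": "..."}
    ]
  }'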

What Your AI Handles

| Task | Description |
| --- | --- |
| Text Extraction | OCR and content extraction |
| Classification | Document type identification |
| Summarization | Executive summaries |
| Entity Extraction | People, organizations, dates, amounts |
| Embeddings | Semantic search vectors |
| Q&A | Natural language document queries |
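
Each task resolves to either a completion call or an embedding call against your backend. For example, producing a semantic search vector with Ollama looks roughly like this (the model matches the configuration example below):

# Generate an embedding vector with Ollama's native API
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "Quarterly revenue report for FY2024"}'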

Configuration

Ollama

ai:
  provider: ollama
  endpoint: http://localhost:11434
  models:
    chat: llama3.2
    embedding: nomic-embed-text
  timeout: 300s
  max_tokens: 4096

OpenAI (Your Key)

ai:
  provider: openai
  api_key: ${OPENAI_API_KEY}  # Your organization's key
  models:
    chat: gpt-4-turbo
    embedding: text-embedding-3-large
  organization: org-xxxxx  # Optional

Azure OpenAI

ai:
  provider: azure-openai
  endpoint: https://your-instance.openai.azure.com
  api_key: ${AZURE_OPENAI_KEY}
  api_version: "2024-02-01"
  deployments:
    chat: gpt-4-deployment
    embedding: embedding-deployment

Custom Endpoint

For any OpenAI-compatible API:

ai:
  provider: openai-compatible
  endpoint: https://your-llm-server.internal/v1
  api_key: ${YOUR_API_KEY}
  models:
    chat: your-model-name
    embedding: your-embedding-model
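
Before pointing Archivus at a custom endpoint, confirm that it actually speaks the OpenAI API; a quick check (URL and key are placeholders) is:

# List the models exposed by the OpenAI-compatible server
curl https://your-llm-server.internal/v1/models \
  -H "Authorization: Bearer $YOUR_API_KEY"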

Recommended Models for Document Intelligence

| Task | Recommended | Alternative |
| --- | --- | --- |
| General Chat | Llama 3.2 (8B) | Mistral 7B |
| Long Documents | Qwen2.5 (32K context) | Llama 3.2 |
| Embeddings | nomic-embed-text | mxbai-embed-large |
| Fast Classification | Gemma 2B | Phi-3-mini |
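
If you are running Ollama, the models above can be pulled directly from the Ollama library (exact tags may differ between releases):

ollama pull llama3.2
ollama pull qwen2.5
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull gemma2:2b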

Hardware Requirements

CPU Only

Suitable for lower-volume or non-time-sensitive processing:

  • 16+ CPU cores
  • 32+ GB RAM
  • Models: Gemma 2B, Phi-3-mini

Single GPU

Good balance of performance and cost:

  • NVIDIA RTX 4090 (24GB) or better
  • Supports 7-8B parameter models
  • Models: Llama 3.2, Mistral 7B

Multi-GPU

For high throughput or larger models:

  • 2-8x NVIDIA A100 or H100
  • Supports 70B+ parameter models
  • Enables batch processing

Performance Tuning

Ollama Optimization

# Environment variables for Ollama
OLLAMA_NUM_PARALLEL=4      # Concurrent requests
OLLAMA_MAX_LOADED_MODELS=2 # Models kept in memory
OLLAMA_KEEP_ALIVE=24h      # Keep models loaded
OLLAMA_GPU_LAYERS=33       # GPU offloading (adjust for VRAM)
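
How these are applied depends on how Ollama is deployed; one common pattern is passing them to the official Docker image (the volume and port follow Ollama's standard Docker setup):

# Run Ollama with tuning variables set
docker run -d --gpus=all \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -e OLLAMA_KEEP_ALIVE=24h \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama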

Batch Processing

For high-volume document processing:

  • Enable request batching for embeddings (see the example below)
  • Use async processing for non-blocking operations
  • Configure queue priorities for different document types
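
Batching matters most for embeddings, where many document chunks can be embedded in a single request. Against an OpenAI-compatible endpoint this simply means passing an array of inputs (endpoint and model are examples):

# Embed several document chunks in one request
curl https://your-llm-server.internal/v1/embeddings \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embedding-model",
    "input": ["chunk one ...", "chunk two ...", "chunk three ..."]
  }'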

Caching

AI responses are cached to reduce redundant processing:

  • Identical queries return cached results
  • Embeddings cached by content hash
  • Classification results cached per document

Fallback Configuration

Configure fallback providers for resilience:

ai:
  primary:
    provider: ollama
    endpoint: http://primary-gpu:11434
  fallback:
    provider: openai
    api_key: ${OPENAI_API_KEY}
  fallback_on:
    - timeout
    - server_error
    - rate_limit

When the primary provider fails, Archivus automatically falls back to the secondary.


Security Considerations

Network Isolation

  • Run AI servers on private networks
  • Use internal DNS for service discovery
  • No internet access required for local LLMs

API Key Security

  • Store keys in secret management (Vault, AWS Secrets Manager); see the example below
  • Rotate keys periodically
  • Audit API key usage
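
As one example, with HashiCorp Vault's KV store the key never needs to appear in a config file; it is stored once and injected into the environment at deploy time (the secret path is a placeholder):

# Store the key once
vault kv put secret/archivus/ai openai_api_key="sk-..."

# Inject it into the environment that runs Archivus
export OPENAI_API_KEY=$(vault kv get -field=openai_api_key secret/archivus/ai)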

Model Security

  • Verify model checksums before deployment (example below)
  • Use official model sources
  • Monitor for model drift or tampering
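
Checksum verification can be as simple as comparing the downloaded weights against the digest published by the model source (file names below are placeholders):

# Compute the digest of the downloaded weights
sha256sum my-finetuned-model.gguf

# Or verify against a published checksum file
sha256sum --check my-finetuned-model.gguf.sha256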

Cost Comparison

Example: 10,000 Documents/Month

| Approach | Cost | Notes |
| --- | --- | --- |
| Archivus AI | ~$200/month | Platform credits, fully managed |
| Your OpenAI Key | ~$150/month | Your billing, OpenAI rates |
| Local Ollama | ~$50/month | Electricity + amortized hardware |

Break-even for local LLM hardware typically occurs at 20,000-50,000 documents/month depending on infrastructure costs.


Getting Started

1. Choose Your Backend

  • Quick Start: Ollama with Llama 3.2
  • Enterprise Cloud: Azure OpenAI or AWS Bedrock
  • Maximum Control: Self-hosted vLLM

2. Deploy and Test

# Ollama quick start
ollama pull llama3.2
ollama pull nomic-embed-text
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Hello"}'

3. Configure Archivus

Update your Archivus configuration with AI backend settings.

4. Validate

Run test documents through the pipeline to verify AI processing.


Next Steps