BYOB AI¶
Bring Your Own AI allows enterprises to use their own language models and AI infrastructure with Archivus, maintaining full control over AI processing.
Why BYOB AI?¶
Data Privacy¶
Sensitive documents never leave your infrastructure for AI processing. Process everything locally with no external API calls.
Model Control¶
Use models fine-tuned for your industry, terminology, or compliance requirements.
Cost Optimization¶
High-volume users can reduce costs by running models on their own infrastructure.
Air-Gapped Compliance¶
Required for defense, government, and highly regulated industries where external AI services are prohibited.
Supported AI Backends¶
Local LLM Providers¶
| Provider | Models | GPU Required | Notes |
|---|---|---|---|
| Ollama | Llama, Mistral, Qwen, Gemma | Recommended | Easy setup, production-ready |
| vLLM | Most HuggingFace models | Yes | High-throughput serving |
| TGI | Most HuggingFace models | Yes | Hugging Face's inference server |
| LocalAI | Various | Optional | CPU-friendly option |
Cloud AI Providers¶
| Provider | Models | Use Case |
|---|---|---|
| OpenAI | GPT-4, GPT-4 Turbo | Your API key, your billing |
| Anthropic | Claude 3.5, Claude 3 | Your API key, your billing |
| Azure OpenAI | GPT models | Enterprise Azure deployment |
| Google Vertex AI | Gemini, PaLM | GCP-hosted models |
| AWS Bedrock | Various | AWS-managed AI |
Custom Models¶
Deploy your own fine-tuned models:
- Custom document classification models
- Industry-specific entity extraction
- Domain terminology embeddings
- Compliance-trained summarization
Architecture¶
```mermaid
graph LR
    subgraph "Your Infrastructure"
        M[Your LLM Server]
        E[Your Embedding Server]
    end
    subgraph "Archivus"
        A[Document Processor]
        B[AI Router]
        C[Knowledge Graph]
    end
    A --> B
    B --> M
    B --> E
    M --> C
    E --> C
```

Processing Flow¶
1. Document Upload - A new document enters the system
2. AI Router - Archivus routes AI requests to your configured backend
3. Your Infrastructure - Processing happens on your models
4. Results Return - AI outputs are returned to Archivus for storage
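The routing step can be pictured as a thin dispatch layer that sends each task type to whichever backend your configuration names. Below is a minimal sketch in Python, assuming OpenAI-compatible and Ollama-style endpoints on your own servers; the hostnames, model names, and the `route` helper are illustrative, not part of the Archivus API.

```python
import requests

# Toy illustration of the AI Router idea: each task type is dispatched to the
# backend configured for it. Endpoints and model names are placeholders for
# your own infrastructure; the real router lives inside Archivus.
ROUTES = {
    "chat": {
        "url": "http://your-llm-server:11434/v1/chat/completions",   # OpenAI-compatible
        "model": "llama3.2",
    },
    "embedding": {
        "url": "http://your-embedding-server:11434/api/embeddings",  # Ollama-native
        "model": "nomic-embed-text",
    },
}

def route(task: str, payload: dict) -> dict:
    """Send an AI request to the backend configured for this task type."""
    backend = ROUTES[task]
    resp = requests.post(backend["url"], json={"model": backend["model"], **payload}, timeout=300)
    resp.raise_for_status()
    return resp.json()

summary = route("chat", {"messages": [{"role": "user", "content": "Summarize: ..."}]})
vector = route("embedding", {"prompt": "text to embed"})["embedding"]
```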
What Your AI Handles¶
| Task | Description |
|---|---|
| Text Extraction | OCR and content extraction |
| Classification | Document type identification |
| Summarization | Executive summaries |
| Entity Extraction | People, organizations, dates, amounts |
| Embeddings | Semantic search vectors |
| Q&A | Natural language document queries |
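For instance, entity extraction can be expressed as a structured prompt against your own chat model. A hedged sketch assuming an OpenAI-compatible endpoint; the prompt wording and the `extract_entities` helper are illustrative, not the exact prompts Archivus uses.

```python
import json
import requests

def extract_entities(text: str,
                     endpoint: str = "http://localhost:11434/v1",
                     model: str = "llama3.2") -> dict:
    """Ask the configured chat model for people, organizations, dates, and amounts as JSON."""
    prompt = (
        "Extract entities from the document below. Respond with JSON only, "
        'using the keys "people", "organizations", "dates", and "amounts".\n\n' + text
    )
    resp = requests.post(
        f"{endpoint}/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```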
Configuration¶
Ollama (Recommended for On-Premises)¶
```yaml
ai:
  provider: ollama
  endpoint: http://localhost:11434
  models:
    chat: llama3.2
    embedding: nomic-embed-text
  timeout: 300s
  max_tokens: 4096
```
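As a quick sanity check before pointing Archivus at the server, you can confirm the configured models are actually pulled. A small sketch assuming a default local Ollama install; the model names mirror the configuration above.

```python
import requests

OLLAMA = "http://localhost:11434"

# List the models the Ollama server currently has pulled.
tags = requests.get(f"{OLLAMA}/api/tags", timeout=10).json()
available = {m["name"] for m in tags.get("models", [])}
print("pulled models:", sorted(available))

# Warn if the chat or embedding model from the config above is missing.
for required in ("llama3.2", "nomic-embed-text"):
    if not any(name.startswith(required) for name in available):
        print(f"missing model: run `ollama pull {required}`")
```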
OpenAI (Your Key)¶
```yaml
ai:
  provider: openai
  api_key: ${OPENAI_API_KEY}   # Your organization's key
  models:
    chat: gpt-4-turbo
    embedding: text-embedding-3-large
  organization: org-xxxxx      # Optional
```
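The equivalent client-side call with the official openai Python SDK looks like the sketch below; the environment variable names are assumptions, and billing goes to your own OpenAI account.

```python
import os
from openai import OpenAI

# Uses your organization's key; Archivus never proxies these calls through its own account.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    organization=os.environ.get("OPENAI_ORG_ID"),  # optional, as in the config above
)

vector = client.embeddings.create(
    model="text-embedding-3-large",
    input="quarterly revenue summary",
).data[0].embedding
```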
Azure OpenAI¶
```yaml
ai:
  provider: azure-openai
  endpoint: https://your-instance.openai.azure.com
  api_key: ${AZURE_OPENAI_KEY}
  api_version: "2024-02-01"
  deployments:
    chat: gpt-4-deployment
    embedding: embedding-deployment
```
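Note that Azure OpenAI requests name the deployment rather than the underlying model. A short sketch with the openai SDK, mirroring the configuration above; the instance name and deployment names are placeholders.

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-instance.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

reply = client.chat.completions.create(
    model="gpt-4-deployment",  # the deployment name, not "gpt-4"
    messages=[{"role": "user", "content": "Classify this document: ..."}],
)
print(reply.choices[0].message.content)
```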
Custom Endpoint¶
For any OpenAI-compatible API:
```yaml
ai:
  provider: openai-compatible
  endpoint: https://your-llm-server.internal/v1
  api_key: ${YOUR_API_KEY}
  models:
    chat: your-model-name
    embedding: your-embedding-model
```
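Servers such as vLLM, TGI, and LocalAI expose OpenAI-compatible endpoints, so the same SDK works by overriding the base URL. A sketch using the placeholder names from the configuration above.

```python
import os
from openai import OpenAI

# Point the standard SDK at your internal OpenAI-compatible server.
client = OpenAI(
    base_url="https://your-llm-server.internal/v1",
    api_key=os.environ["YOUR_API_KEY"],
)

reply = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)
```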
Recommended Models¶
For Document Intelligence¶
| Task | Recommended | Alternative |
|---|---|---|
| General Chat | Llama 3.2 (8B) | Mistral 7B |
| Long Documents | Qwen2.5 (32K context) | Llama 3.2 |
| Embeddings | nomic-embed-text | mxbai-embed-large |
| Fast Classification | Gemma 2B | Phi-3-mini |
Hardware Requirements¶
CPU-Only¶
Suitable for lower volume or non-time-sensitive processing:
- 16+ CPU cores
- 32+ GB RAM
- Models: Gemma 2B, Phi-3-mini
Single GPU¶
Good balance of performance and cost:
- NVIDIA RTX 4090 (24GB) or better
- Supports 7-8B parameter models
- Models: Llama 3.2, Mistral 7B
Multi-GPU Cluster¶
For high throughput or larger models:
- 2-8x NVIDIA A100 or H100
- Supports 70B+ parameter models
- Enables batch processing
Performance Tuning¶
Ollama Optimization¶
```bash
# Environment variables for Ollama
OLLAMA_NUM_PARALLEL=4         # Concurrent requests
OLLAMA_MAX_LOADED_MODELS=2    # Models kept in memory
OLLAMA_KEEP_ALIVE=24h         # Keep models loaded
OLLAMA_GPU_LAYERS=33          # GPU offloading (adjust for VRAM)
```
Batch Processing¶
For high-volume document processing:
- Enable request batching for embeddings
- Use async processing for non-blocking operations
- Configure queue priorities for different document types
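A rough sketch of client-side request batching for embeddings, assuming a local Ollama server; the worker count and chunking are illustrative and should be matched to your hardware and the OLLAMA_NUM_PARALLEL setting above.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    """Embed one chunk of text with the local embedding model."""
    resp = requests.post(
        f"{OLLAMA}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

chunks = ["First document chunk...", "Second document chunk...", "Third chunk..."]
with ThreadPoolExecutor(max_workers=4) as pool:  # match OLLAMA_NUM_PARALLEL
    vectors = list(pool.map(embed, chunks))
```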
Caching¶
AI responses are cached to reduce redundant processing:
- Identical queries return cached results
- Embeddings cached by content hash
- Classification results cached per document
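Conceptually, the embedding cache keys on a hash of the content, so identical text never hits the model twice. A minimal sketch; the in-memory dict stands in for whatever cache store is actually used, and `embed()` is the helper from the batching sketch above.

```python
import hashlib

_cache: dict[str, list[float]] = {}

def cached_embed(text: str) -> list[float]:
    """Return a cached embedding when the exact same content was seen before."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)  # embed() as defined in the batching sketch
    return _cache[key]
```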
Fallback Configuration¶
Configure fallback providers for resilience:
```yaml
ai:
  primary:
    provider: ollama
    endpoint: http://primary-gpu:11434
  fallback:
    provider: openai
    api_key: ${OPENAI_API_KEY}
  fallback_on:
    - timeout
    - server_error
    - rate_limit
```
When the primary provider fails, Archivus automatically falls back to the secondary.
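The behaviour amounts to trying the primary and retrying against the fallback on the configured error classes. The following is a conceptual sketch only; Archivus implements this internally, and the endpoints, models, and status-code mapping below are assumptions.

```python
import os
import requests

BACKENDS = [
    {"endpoint": "http://primary-gpu:11434/v1", "model": "llama3.2", "api_key": None},
    {"endpoint": "https://api.openai.com/v1", "model": "gpt-4-turbo",
     "api_key": os.environ.get("OPENAI_API_KEY")},
]

def chat_with_fallback(prompt: str) -> str:
    """Try the primary backend first; fall back on timeout, server error, or rate limit."""
    last_error = None
    for backend in BACKENDS:
        headers = {"Authorization": f"Bearer {backend['api_key']}"} if backend["api_key"] else {}
        try:
            resp = requests.post(
                f"{backend['endpoint']}/chat/completions",
                headers=headers,
                json={"model": backend["model"],
                      "messages": [{"role": "user", "content": prompt}]},
                timeout=300,
            )
            if resp.status_code in (429, 500, 502, 503):  # rate_limit / server_error
                raise RuntimeError(f"backend returned {resp.status_code}")
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except (requests.Timeout, requests.ConnectionError, RuntimeError) as err:
            last_error = err  # move on to the next backend
    raise RuntimeError(f"all AI backends failed: {last_error}")
```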
Security Considerations¶
Network Isolation¶
- Run AI servers on private networks
- Use internal DNS for service discovery
- No internet access required for local LLMs
API Key Security¶
- Store keys in secret management (Vault, AWS Secrets Manager)
- Rotate keys periodically
- Audit API key usage
Model Security¶
- Verify model checksums before deployment
- Use official model sources
- Monitor for model drift or tampering
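Checksum verification can be as simple as hashing the downloaded weights and comparing against the digest published by the model source. A small sketch; the file path and expected digest are placeholders.

```python
import hashlib

def verify_model(path: str, expected_sha256: str) -> None:
    """Refuse to deploy a model file whose SHA-256 does not match the published digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB blocks
            digest.update(block)
    if digest.hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}: refusing to deploy")

# verify_model("models/llama-3.2.gguf", "<published sha256 digest>")
```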
Cost Comparison¶
Example: 10,000 Documents/Month¶
| Approach | Cost | Notes |
|---|---|---|
| Archivus AI | ~$200/month | Platform credits, fully managed |
| Your OpenAI Key | ~$150/month | Your billing, OpenAI rates |
| Local Ollama | ~$50/month | Electricity + amortized hardware |
Break-even for local LLM hardware typically occurs at 20,000-50,000 documents/month depending on infrastructure costs.
Getting Started¶
1. Choose Your Backend¶
- Quick Start: Ollama with Llama 3.2
- Enterprise Cloud: Azure OpenAI or AWS Bedrock
- Maximum Control: Self-hosted vLLM
2. Deploy and Test¶
```bash
# Ollama quick start
ollama pull llama3.2
ollama pull nomic-embed-text
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Hello"}'
```
3. Configure Archivus¶
Update your Archivus configuration with AI backend settings.
4. Validate¶
Run test documents through the pipeline to verify AI processing.
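Before running full documents, a quick smoke test of both models Archivus will call can save time. A sketch assuming the local Ollama setup from step 2; swap in your own endpoint and model names as needed.

```python
import requests

OLLAMA = "http://localhost:11434"

# One non-streaming completion against the configured chat model.
chat = requests.post(
    f"{OLLAMA}/api/generate",
    json={"model": "llama3.2", "prompt": "Reply with OK", "stream": False},
    timeout=120,
)
chat.raise_for_status()
print("chat model:", chat.json()["response"][:40])

# One embedding against the configured embedding model.
emb = requests.post(
    f"{OLLAMA}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "test"},
    timeout=60,
)
emb.raise_for_status()
print("embedding dimensions:", len(emb.json()["embedding"]))
```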
Next Steps¶
- BYOB Storage - Bring your own storage
- Deployment Options - Deployment models
- Compliance - Security certifications