Knowledge Graph Architecture¶
The Knowledge Graph is the substrate layer that transforms unstructured data into queryable, verifiable intelligence.
The Problem¶
LLMs alone have no memory. They hallucinate connections. They sound confident but don't know anything.
Enterprises need answers they can trust and prove.
The Solution: Contextual Knowledge Graph¶
A tenant-scoped graph where every piece of intelligence—extracted from documents, voice, connectors, or research—becomes a verifiable node with rich context.
```mermaid
graph LR
    subgraph Sources["Intelligence Sources"]
        DOC[Documents]
        VOICE[Voice Conversations]
        CONN[Data Connectors]
        WEB[Web Research]
    end
    subgraph Graph["Knowledge Graph"]
        ENT[Entities]
        REL[Relationships]
        CLAIMS[Claims]
        PROV[Provenance]
    end
    subgraph Output["Grounded AI"]
        QUERY[Query Graph]
        VERIFY[Verify Facts]
        RESPOND[Generate Response]
    end
    DOC --> Graph
    VOICE --> Graph
    CONN --> Graph
    WEB --> Graph
    Graph --> QUERY
    QUERY --> VERIFY
    VERIFY --> RESPOND
```

The Quadruple Model¶
Traditional knowledge graphs use triples:

```
(Entity, Relationship, Entity)
```

Archivus uses quadruples with rich context:

```
(Entity, Relationship, Entity, CONTEXT)

Context = {
  temporal:   { from: "2009-01-20", until: "2017-01-20" },
  geographic: { country: "USA" },
  provenance: { source: "inauguration.pdf", confidence: 0.99 },
  supporting: ["Barack Obama was inaugurated as the 44th President..."]
}
```
Why this matters: "Who is the president?" has different answers at different times. Without temporal context, the graph cannot reason about change.
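As a minimal TypeScript sketch, a quadruple might be shaped like this. The field names are illustrative, not the production schema:

```typescript
// Illustrative quadruple shape; names are hypothetical, not the actual schema.
interface Quadruple {
  subject: string;       // entity ID, e.g. "ent_barack_obama"
  predicate: string;     // relationship type, e.g. "holds_office"
  object: string;        // entity ID, e.g. "ent_us_president"
  context: QuadContext;  // the fourth element that triples lack
}

interface QuadContext {
  temporal?: { from?: string; until?: string };        // ISO 8601 dates
  geographic?: { country?: string; region?: string };
  provenance: { source: string; confidence: number };  // confidence in 0.0 - 1.0
  supporting: string[];                                 // evidence sentences
}
```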
Core Components¶
1. Entities¶
People, organizations, concepts, products, events—anything with identity.
Properties:

- Name and aliases
- Entity type (person, organization, location, concept, etc.)
- Description (long-form context for LLM injection)
- External identifiers (Wikidata QID, Wikipedia URL)
- Confidence score
- Source provenance
Key insight: Entity descriptions improve LLM accuracy by 11-25% (arXiv:2406.11160v3). The LLM performs better when it knows about the entity, not just its name.
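A hypothetical TypeScript shape for an entity, mirroring the properties listed above (field names are assumptions for illustration):

```typescript
// Hypothetical Entity record; mirrors the properties above, not a real schema.
interface Entity {
  id: string;
  tenantId: string;  // tenant-scoped isolation
  name: string;
  aliases: string[];
  type: "person" | "organization" | "location" | "concept" | "product" | "event";
  description?: string;  // long-form context injected into LLM prompts
  externalIds?: { wikidataQid?: string; wikipediaUrl?: string };
  confidence: number;    // 0.0 - 1.0
  provenance: { sourceId: string; sourceType: string };
}
```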
2. Relationships¶
Connections between entities with temporal and geographic validity.
Types:

- employs, employed_by
- authored, published_by
- located_at, part_of
- mentions, references
- supports, contradicts (for claims)
- supersedes (for versioning)

Context fields:

- Valid from / valid until (temporal bounds)
- Geographic scope
- Confidence score
- Evidence sentences
- Source provenance
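Sketched in TypeScript, again with illustrative field names rather than the real schema:

```typescript
// Hypothetical Relationship edge with temporal and geographic validity.
interface Relationship {
  id: string;
  fromEntityId: string;
  toEntityId: string;
  type:
    | "employs" | "employed_by"
    | "authored" | "published_by"
    | "located_at" | "part_of"
    | "mentions" | "references"
    | "supports" | "contradicts"
    | "supersedes";
  validFrom?: string;      // ISO 8601; open-ended when absent
  validUntil?: string;
  geographicScope?: string;
  confidence: number;      // 0.0 - 1.0
  evidence: string[];      // sentences that justify this edge
  provenance: { sourceId: string; sourceType: string };
}
```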
3. Claims¶
Factual statements extracted from sources.
Properties:

- Claim text (the assertion)
- Source type (document, web, connector, research)
- Confidence (0.0 - 1.0)
- Validation status (unverified, verified, disputed)
- About entities (linked)
- Supporting evidence
- Content hash (for verification)
Claim Network: Claims can support or contradict other claims. This enables epistemic reasoning—reasoning about knowledge itself.
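A claim, including the support/contradict links that form the claim network, might look like this (hypothetical shape):

```typescript
// Hypothetical Claim record; the supports/contradicts arrays form the claim network.
interface Claim {
  id: string;
  text: string;  // the assertion itself
  sourceType: "document" | "web" | "connector" | "research";
  confidence: number;  // 0.0 - 1.0
  validationStatus: "unverified" | "verified" | "disputed";
  aboutEntityIds: string[];      // linked entities
  supportingEvidence: string[];  // evidence sentences
  contentHash: string;           // e.g. a SHA-256 of the claim text, for verification
  supports?: string[];           // IDs of claims this one supports
  contradicts?: string[];        // IDs of claims this one disputes
}
```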
4. Provenance Tracking¶
Every node tracks:

- Source type (primary/secondary/tertiary)
- Source ID (document, connector, research finding)
- Extraction method (AI, schema mapping, manual)
- First seen / last updated timestamps
- Confidence scoring
This is what makes "Show me the sources" answerable.
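As a sketch, the provenance fields above map to a shape like this (names assumed):

```typescript
// Hypothetical provenance record attached to every node.
interface Provenance {
  sourceType: "primary" | "secondary" | "tertiary";
  sourceId: string;  // document, connector, or research finding ID
  extractionMethod: "ai" | "schema_mapping" | "manual";
  firstSeen: string;    // ISO 8601 timestamps
  lastUpdated: string;
  confidence: number;   // 0.0 - 1.0
}
```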
Intelligence Flow¶
From Documents¶
```
1. User uploads PDF
        ↓
2. Text extraction (OCR if needed)
        ↓
3. AI entity extraction (Claude)
   → People, organizations, dates, amounts
        ↓
4. Relationship extraction
   → "John Smith works at Acme Corp"
        ↓
5. Claim extraction with evidence
   → "Revenue increased 20%" + [supporting sentence]
        ↓
6. Knowledge Graph insertion
   → Entities deduplicated, relations created, claims stored
```
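A hypothetical orchestration of this pipeline in TypeScript. Every helper here is an assumed internal function, not a real library API; payload shapes would reuse the Entity, Relationship, and Claim sketches above:

```typescript
// Assumed internal helpers, declared for the sketch; not real APIs.
type Extracted = { entities: unknown[]; relationships: unknown[]; claims: unknown[] };

declare function extractText(pdf: Uint8Array): Promise<string>;      // step 2, OCR fallback inside
declare function extractEntities(text: string): Promise<unknown[]>;  // step 3, LLM-backed
declare function extractRelationships(text: string, entities: unknown[]): Promise<unknown[]>; // step 4
declare function extractClaims(text: string, entities: unknown[]): Promise<unknown[]>;        // step 5
declare function insertIntoGraph(tenantId: string, payload: Extracted): Promise<void>;        // step 6, dedup inside

async function ingestDocument(tenantId: string, pdf: Uint8Array): Promise<void> {
  const text = await extractText(pdf);
  const entities = await extractEntities(text);
  const relationships = await extractRelationships(text, entities);
  const claims = await extractClaims(text, entities);
  await insertIntoGraph(tenantId, { entities, relationships, claims });
}
```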
From Voice¶
```
1. Real-time transcription (Deepgram)
        ↓
2. AI extraction during conversation
   → Extract entities, claims, relationships
        ↓
3. Knowledge Graph update
   → All intelligence captured and linked
```
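The key difference from the document flow is that extraction runs incrementally. A hypothetical handler for finalized transcript segments (the extractor and graph writer are assumed internals, not the Deepgram SDK surface):

```typescript
// Assumed internals for the sketch; not the Deepgram SDK.
type Extracted = { entities: unknown[]; relationships: unknown[]; claims: unknown[] };

declare function extractIntelligence(segment: string): Promise<Extracted>;
declare function insertIntoGraph(tenantId: string, payload: Extracted): Promise<void>;

async function onTranscriptSegment(tenantId: string, segment: string): Promise<void> {
  // Extract from each finalized segment so the graph updates while the
  // conversation is still in progress, rather than only after it ends.
  const extracted = await extractIntelligence(segment);
  await insertIntoGraph(tenantId, extracted);
}
```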
From Data Connectors¶
```
1. Connector syncs structured data (e.g., Google Reviews, POS)
        ↓
2. Schema mapping to entities
   → Direct entity creation (no LLM needed)
        ↓
3. Knowledge Graph insertion
   → Linked to existing entities
```
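A minimal schema-mapping sketch: a structured connector record becomes an entity and a claim deterministically, with no LLM call. All field names and the default confidence are assumptions for illustration:

```typescript
// Hypothetical connector record; field names are assumed.
interface GoogleReview {
  author: string;
  rating: number;      // 1-5
  text: string;
  reviewedAt: string;  // ISO 8601
}

function mapReview(review: GoogleReview) {
  const entity = {
    name: review.author,
    type: "person",
    provenance: { sourceType: "connector", extractionMethod: "schema_mapping" },
  };
  const claim = {
    text: `${review.author} rated the business ${review.rating}/5`,
    sourceType: "connector",
    confidence: 0.95,  // structured data gets a high default confidence (assumed value)
    supportingEvidence: [review.text],
    validFrom: review.reviewedAt,
  };
  return { entity, claim };
}
```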
Query-Time: CGR3 Pipeline¶
CGR3 = Context Graph Reasoning (Retrieve → Rank → Reason)
Stage 1: Retrieve¶
- Extract entities from user's question
- Semantic search using embeddings
- Text search on names and descriptions
- Retrieve connected claims
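A hypothetical retrieval step combining these signals; the vector, text, and claim lookups are assumed internals, not a specific database client:

```typescript
// Assumed internal search functions, declared for the sketch.
declare function embed(text: string): Promise<number[]>;
declare function vectorSearch(tenantId: string, vector: number[], limit: number): Promise<string[]>; // entity IDs
declare function textSearch(tenantId: string, query: string, limit: number): Promise<string[]>;      // entity IDs
declare function claimsForEntities(tenantId: string, entityIds: string[]): Promise<unknown[]>;

async function retrieve(tenantId: string, question: string): Promise<unknown[]> {
  // Run semantic and lexical search in parallel, then union the entity hits.
  const [semantic, lexical] = await Promise.all([
    embed(question).then(v => vectorSearch(tenantId, v, 20)),
    textSearch(tenantId, question, 20),
  ]);
  const entityIds = [...new Set([...semantic, ...lexical])];
  return claimsForEntities(tenantId, entityIds);
}
```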
Stage 2: Rank¶
Weight signals:

- Similarity: How relevant to the query?
- Confidence: How certain is this claim?
- Recency: When was this last updated?
- Authority: What's the source credibility?
- Corroboration: How many sources agree?
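One way to combine these signals is a weighted sum. The weights below are purely illustrative, not the tuned production values:

```typescript
// Illustrative ranking; weights are assumptions, not production values.
interface RankSignals {
  similarity: number;     // 0-1, embedding similarity to the query
  confidence: number;     // 0-1, claim confidence
  recency: number;        // 0-1, decays with time since last update
  authority: number;      // 0-1, tenant-configurable source authority
  corroboration: number;  // 0-1, scaled count of agreeing sources
}

function rankScore(s: RankSignals): number {
  return (
    0.35 * s.similarity +
    0.25 * s.confidence +
    0.15 * s.recency +
    0.15 * s.authority +
    0.10 * s.corroboration
  );
}
```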
Stage 3: Reason¶
Build structured context for LLM:
```
VERIFIED FACTS (confidence >= 0.8):
- Claim A (source: contract.pdf, 98% confidence)
- Claim B (source: Wikidata, 95% confidence)

CONFLICTING INFORMATION:
- Claim C says X (source: old_report.pdf, 75% confidence)
- Claim D says Y (source: new_report.pdf, 92% confidence)
  [TEMPORAL DIFFERENCE]: D supersedes C (newer date)

UNCERTAIN (low confidence):
- Claim E (source: unverified_email.pdf, 45% confidence)
```

The LLM receives structured truth instead of raw documents.
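A simplified sketch of assembling that context block; the thresholds and the `RankedClaim` shape are assumptions for illustration:

```typescript
// Hypothetical input shape for the context builder.
interface RankedClaim {
  text: string;
  source: string;
  confidence: number;
  contradicts?: string[];  // IDs of claims this one disputes
}

function buildContext(claims: RankedClaim[]): string {
  const line = (c: RankedClaim) =>
    `- ${c.text} (source: ${c.source}, ${Math.round(c.confidence * 100)}% confidence)`;

  // Partition claims into the three buckets shown above.
  const conflicting = claims.filter(c => c.contradicts?.length);
  const verified = claims.filter(c => c.confidence >= 0.8 && !c.contradicts?.length);
  const uncertain = claims.filter(c => c.confidence < 0.8 && !c.contradicts?.length);

  return [
    "VERIFIED FACTS (confidence >= 0.8):", ...verified.map(line),
    "CONFLICTING INFORMATION:", ...conflicting.map(line),
    "UNCERTAIN (low confidence):", ...uncertain.map(line),
  ].join("\n");
}
```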
Contradiction Detection¶
The graph symbolically detects conflicts before the LLM generates a response.
Types:

- Temporal: Same claim, different dates
- Direct: Claim A contradicts Claim B
- Semantic: Similar claims with different values

Resolution:

- Surface all conflicting claims to the LLM
- Label the contradiction type
- Provide resolution hints from AI analysis
- Let the LLM explain the conflict in fluent language
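One of these checks, sketched symbolically: two claims about the same entity and predicate conflict when their validity windows overlap but their values differ. The shape and open-ended-window handling are assumptions:

```typescript
// Hypothetical claim shape for the temporal conflict check.
interface TimedClaim {
  entityId: string;
  predicate: string;
  value: string;
  validFrom: string;    // ISO 8601
  validUntil?: string;  // open-ended when absent
}

function temporalConflict(a: TimedClaim, b: TimedClaim): boolean {
  if (a.entityId !== b.entityId || a.predicate !== b.predicate) return false;
  if (a.value === b.value) return false;  // same value: corroboration, not conflict
  // Treat missing end dates as still-valid.
  const aEnd = a.validUntil ?? "9999-12-31";
  const bEnd = b.validUntil ?? "9999-12-31";
  // ISO 8601 date strings compare correctly as plain strings.
  return a.validFrom <= bEnd && b.validFrom <= aEnd;
}
```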
Entity Enrichment¶
External knowledge providers contribute to the graph without replacing existing data.
Sources:

- Wikidata: Structured properties, canonical QIDs
- Wikipedia: Prose context, descriptions, categories
- Industry databases: Domain-specific enrichment

Source Merging:

- Duplicate claims are merged, not rejected
- Sources accumulate with corroboration bonus (+5% per source)
- Temporal conflicts tracked when sources disagree
- Everything maintains provenance
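The corroboration bonus reduces to simple arithmetic. A sketch, assuming the bonus is capped at 1.0:

```typescript
// +5% per additional agreeing source, capped at 1.0 (the cap is an assumption).
function mergedConfidence(baseConfidence: number, totalSources: number): number {
  const bonus = 0.05 * Math.max(0, totalSources - 1);
  return Math.min(1.0, baseConfidence + bonus);
}

// e.g. a 0.80 claim corroborated by two additional sources -> 0.90
```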
Source Authority¶
Tenant-configurable trust multipliers:
| Source Type | Default Authority |
|---|---|
| Internal Documents | 0.85 |
| Wikidata | 0.80 |
| Wikipedia | 0.75 |
| Data Connectors | 0.70 |
| Web Research | 0.60 |
| User Input | 0.50 |
Tenants can override these to match their trust models.
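A hypothetical shape for those overrides, with keys mirroring the table above and values replacing the defaults:

```typescript
// Hypothetical per-tenant override map; keys and values are illustrative.
const authorityOverrides: Record<string, number> = {
  internal_documents: 0.95,  // this tenant trusts its own documents more
  web_research: 0.40,        // ...and the open web less
};
```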
Temporal Reasoning¶
The graph supports queries like:

- "Who was the CEO in Q3 2024?"
- "What contracts were active on January 1, 2025?"
- "Show me claims that changed between March and June"
All relationships and claims track valid_from and valid_until timestamps.
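A minimal "as of" filter over those bounds, treating a missing bound as open-ended:

```typescript
// Anything with valid_from / valid_until bounds can be filtered to a point in time.
interface Timed {
  validFrom?: string;   // ISO 8601; open-ended when absent
  validUntil?: string;
}

function activeAt<T extends Timed>(items: T[], isoDate: string): T[] {
  return items.filter(i =>
    (!i.validFrom || i.validFrom <= isoDate) &&
    (!i.validUntil || i.validUntil >= isoDate)
  );
}

// e.g. activeAt(contracts, "2025-01-01") answers "active on January 1, 2025"
```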
Federation-Ready Design¶
All entities and claims are designed to flow across organizational boundaries:
What federates:

- Entity references (canonical IDs)
- Verified claims (with full provenance)
- Relationship signals (anonymized if needed)
- Trust scores

What stays home:

- Documents (source material)
- Raw data (never leaves)
- PII (protected)
Security & Isolation¶
Every Knowledge Graph table enforces:

- Row-Level Security (RLS) with tenant_id
- Service role bypass for background enrichment
- Audit logging on all mutations
A tenant can never see another tenant's graph.
Performance¶
- 100% embedding coverage for semantic search
- 86.7% entity enrichment rate from external sources
- 41+ specialized indexes for graph traversal
- Redis caching for frequently accessed entities
What This Enables¶
| Without Knowledge Graph | With Knowledge Graph |
|---|---|
| "AI said X" | "AI said X, here's the source" |
| No temporal context | "This was true in Q3 2024" |
| Hallucinated connections | "No evidence for this relationship" |
| Black box reasoning | "Here's the chain of inference" |
| Session-based memory | Persistent organizational knowledge |
| No contradiction detection | "Sources A and B disagree" |
The Result¶
Every question Archie answers is grounded in verified facts from your organization's Knowledge Graph.
Not "here's what I think"—here's what I know, and why I know it.
The foundation of verifiable intelligence.