
Evolutionary Verification

Most AI systems treat confidence as a single number from a single model. Archivus uses evolutionary populations of verification agents that compete to be right.

The Problem

Enterprise AI has a trust problem:

  • LLMs sound confident even when wrong
  • No mechanism to verify AI decisions
  • Black-box reasoning that can't explain "why"
  • No learning from mistakes

The result: 85% of enterprise AI projects fail (Gartner). Root cause: lack of trust.

The Solution: GOLAG

GOLAG = Game-Oriented Lagrangian Agent Governance

A population of specialized agents that:

  • Have finite confidence budgets (not unlimited optimism)
  • Vote on contested decisions using quadratic voting (cost = votes²)
  • Evolve over time based on accuracy
  • Know what they don't know (epistemic humility)

Core Concept

Finite Confidence Budgets

Every agent starts with a confidence budget (default: 100 credits).

Quadratic Voting Cost:
1 vote  → costs 1 credit
2 votes → costs 4 credits
3 votes → costs 9 credits
10 votes → costs 100 credits (entire budget)

The mechanism: Overconfident agents exhaust their budgets quickly. Well-calibrated agents accumulate influence over time.

The system learns epistemic honesty.
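
A minimal sketch of how quadratic pricing could be charged against a finite budget; the ConfidenceBudget class and its method names are illustrative, not an Archivus API:

class ConfidenceBudget:
    """Tracks an agent's finite confidence budget under quadratic vote pricing."""

    def __init__(self, credits: int = 100):
        self.credits = credits

    @staticmethod
    def vote_cost(votes: int) -> int:
        # Quadratic pricing: casting n votes costs n^2 credits.
        return votes ** 2

    def cast(self, votes: int) -> int:
        """Spend credits for a vote allocation; returns the cost charged."""
        cost = self.vote_cost(votes)
        if cost > self.credits:
            raise ValueError(f"{votes} votes cost {cost} credits; only {self.credits} left")
        self.credits -= cost
        return cost


budget = ConfidenceBudget()   # 100 credits
budget.cast(3)                # costs 9 credits, 91 remain
# budget.cast(10)             # would cost 100 credits > 91 remaining, raises ValueError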

The Lagrangian

Every decision is optimized via:

L = (Confidence × ContextMatch) / Risk

Components:

  • Confidence: Agent's assessed certainty (calibrated via Expected Calibration Error)
  • ContextMatch: How well the situation matches learned patterns
  • Risk: Potential impact of an incorrect decision

Low Lagrangian → Escalate to human
High Lagrangian → Act autonomously
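
A minimal sketch of that routing rule, assuming a simple cutoff; the threshold value of 1.0 and the function names are illustrative, not documented Archivus parameters:

def lagrangian(confidence: float, context_match: float, risk: float) -> float:
    # L = (Confidence × ContextMatch) / Risk; higher values mean it is safer to act.
    return (confidence * context_match) / max(risk, 1e-9)


def route(confidence: float, context_match: float, risk: float,
          threshold: float = 1.0) -> str:
    # Below the (illustrative) threshold we escalate; at or above it the agent acts.
    score = lagrangian(confidence, context_match, risk)
    return "act_autonomously" if score >= threshold else "escalate_to_human"


route(0.9, 0.8, 0.3)   # L = 2.4  -> act autonomously
route(0.6, 0.5, 0.7)   # L ≈ 0.43 -> escalate to human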

Decision Domains

GOLAG governs decisions across 13 domains, including:

Domain | What It Decides | Example
entity_dedup | Are these two entities the same? | "John Smith" vs "J. Smith"
claim_verification | Is this claim supported by evidence? | "Revenue grew 20%"
contradiction_resolution | Which conflicting claim is correct? | Old vs new data
field_mapping | How should this source field map to KG? | "total" → total_amount
qv_voting | Multi-agent consensus on contested claims | High-stakes decisions
document_classification | What type of document is this? | Invoice vs receipt
sensitivity_detection | Does this contain sensitive information? | PII scanning
source_authority | How trustworthy is this source? | Wikipedia vs unverified email
relationship_inference | What relationships exist between entities? | Employment, authorship
hallucination_detection | Is this LLM output grounded? | Fact-checking
federation_trust | Should we trust this federated claim? | Cross-org verification

Each domain has a population of agents competing to be the best.

Evolutionary Dynamics

Replicator Dynamics

Agents improve over time:

AC(generation+1) = AC(generation) + α(AC_best - AC_average)

What this means:

  • Agents with high accuracy survive and propagate their strategies
  • Agents with low accuracy die and are replaced
  • Wisdom patterns transfer to the next generation
  • The system gets better without human intervention
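
Here AC is each agent's accuracy score. A minimal sketch of one generation of this update, with an illustrative learning rate α = 0.1 and an illustrative culling rule; neither value is an Archivus default:

def evolve(population: dict[str, float], alpha: float = 0.1,
           cull_below: float = 0.5) -> dict[str, float]:
    """One generation of replicator dynamics over {agent_id: accuracy}."""
    best = max(population.values())
    average = sum(population.values()) / len(population)
    next_gen = {}
    for agent_id, accuracy in population.items():
        if accuracy < cull_below:
            # Low performers die; a successor inherits the best performer's strategy.
            next_gen[agent_id] = best
        else:
            # AC(generation+1) = AC(generation) + α(AC_best - AC_average)
            next_gen[agent_id] = accuracy + alpha * (best - average)
    return next_gen


evolve({"verifier_1": 0.92, "verifier_2": 0.78, "verifier_3": 0.41})
# verifier_3 is replaced; the survivors drift upward toward the best performer.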

Expert Agents

Agents that achieve 95%+ accuracy over 20+ decisions:

  • Become experts (+50 budget bonus)
  • Gain influence in contested decisions
  • Contribute to the "wisdom library"

Dying Agents

When an agent's budget drops below 5 credits, it:

  • Transfers wisdom patterns to successors
  • Records failure modes for future agents
  • Gracefully retires from the population
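
A sketch of these lifecycle rules as they might look in code; the Agent record and its field names are illustrative assumptions:

from dataclasses import dataclass, field


@dataclass
class Agent:
    budget: float = 100.0
    decisions: int = 0
    correct: int = 0
    is_expert: bool = False
    wisdom: list[str] = field(default_factory=list)


def review_lifecycle(agent: Agent, successors: list["Agent"]) -> str:
    accuracy = agent.correct / agent.decisions if agent.decisions else 0.0

    # Expert promotion: 95%+ accuracy over 20+ decisions earns a +50 budget bonus.
    if not agent.is_expert and agent.decisions >= 20 and accuracy >= 0.95:
        agent.is_expert = True
        agent.budget += 50
        return "promoted_to_expert"

    # Graceful retirement: below 5 credits, pass wisdom patterns on and leave.
    if agent.budget < 5:
        for successor in successors:
            successor.wisdom.extend(agent.wisdom)
        return "retired"

    return "active"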

Multi-Agent Adjudication

For contested claims, multiple agents vote:

graph TD
    CLAIM[Contested Claim]

    CLAIM --> V1[Verifier Agent]
    CLAIM --> T1[Temporal Agent]
    CLAIM --> R1[Reasoner Agent]
    CLAIM --> S1[Sentinel Agent]

    V1 --> VOTE[Quadratic Voting]
    T1 --> VOTE
    R1 --> VOTE
    S1 --> VOTE

    VOTE --> DECISION[Consensus Decision]

The process:

  1. Each agent evaluates the claim
  2. Agents allocate votes from their budgets
  3. Quadratic cost enforces honesty
  4. Consensus emerges from weighted votes
  5. Outcome recorded for future learning
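
A simplified sketch of steps 2 through 4: each ballot is charged its quadratic cost and the verdicts are tallied with a confidence-weighted sum (the weighting scheme is an illustrative assumption, not a documented formula):

def adjudicate(ballots: list[dict]) -> dict:
    """Each ballot: {"agent": str, "verdict": str, "votes": int, "confidence": float}."""
    tally: dict[str, float] = {}
    credits_spent = 0
    for ballot in ballots:
        credits_spent += ballot["votes"] ** 2              # quadratic cost enforces honesty
        weight = ballot["votes"] * ballot["confidence"]    # illustrative weighting
        tally[ballot["verdict"]] = tally.get(ballot["verdict"], 0.0) + weight
    decision = max(tally, key=tally.get)
    return {"decision": decision, "tally": tally, "credits_spent": credits_spent}


adjudicate([
    {"agent": "verifier", "verdict": "supported", "votes": 3, "confidence": 0.85},
    {"agent": "temporal", "verdict": "supported", "votes": 2, "confidence": 0.90},
    {"agent": "sentinel", "verdict": "escalate",  "votes": 1, "confidence": 0.60},
])
# "supported" wins (2.55 + 1.80 = 4.35 vs 0.60); 9 + 4 + 1 = 14 credits are spent in total.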

Calibration Scoring

Agents are evaluated using:

Expected Calibration Error (ECE):

  • Measures how well confidence matches actual accuracy
  • Agent says "90% confident" → should be right 90% of the time
  • Poor calibration → budget penalties

Brier Score:

  • Measures prediction accuracy
  • Lower is better (0 = perfect prediction)

Weighted Accuracy:

  • Recent decisions weighted more heavily
  • Accounts for changing environments
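
Both scores can be computed with the standard textbook formulations; the sketch below uses ten equal-width confidence bins for ECE, which is a common convention rather than an Archivus-specific choice:

import numpy as np


def brier_score(confidences: np.ndarray, outcomes: np.ndarray) -> float:
    # Mean squared error between predicted confidence and the 0/1 outcome; 0 is perfect.
    return float(np.mean((confidences - outcomes) ** 2))


def expected_calibration_error(confidences: np.ndarray, outcomes: np.ndarray,
                               n_bins: int = 10) -> float:
    # Group predictions into confidence bins, then compare each bin's average
    # confidence to its actual accuracy; the gaps are weighted by bin size.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - outcomes[in_bin].mean())
            ece += (in_bin.sum() / len(confidences)) * gap
    return float(ece)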

Wisdom Accumulation

Agents don't just learn from success—they accumulate wisdom patterns:

Pattern Type | Example | Usage
exact_match | "This exact situation before" | High confidence
semantic_similar | "Similar context" | Moderate confidence
heuristic | "This rule usually works" | Pattern matching
failure_avoidance | "This failed before" | Negative learning
success_heuristic | "This worked well" | Positive reinforcement
domain_rule | "Domain-specific logic" | Expert knowledge
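
One plausible shape for a stored pattern; the field names and example values below are illustrative, not the actual schema:

from dataclasses import dataclass


@dataclass
class WisdomPattern:
    pattern_type: str        # e.g. "exact_match", "failure_avoidance", "domain_rule"
    domain: str              # decision domain it applies to, e.g. "entity_dedup"
    lesson: str              # human-readable description of what was learned
    confidence_shift: float  # how strongly a match should move the agent's confidence


# A negative-learning pattern recorded after a failed decision:
WisdomPattern(
    pattern_type="failure_avoidance",
    domain="entity_dedup",
    lesson="Matching entities on surname alone caused a bad merge",
    confidence_shift=-0.2,
)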

Swarm Coordination

For complex tasks, agents form swarms:

Task: Verify company merger claim

Swarm Members:
├─ Document Analyst Agent (searches for evidence)
├─ Temporal Reasoning Agent (checks timeline consistency)
├─ Entity Dedup Agent (verifies company identities)
└─ Source Authority Agent (evaluates document credibility)

Swarm Budget: Pooled from members
Coordination: Delegated budget allocation
Outcome: Consensus verdict with provenance chain
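
A minimal sketch of pooling member budgets and delegating credits to sub-tasks; the function names, contribution amounts, and allocation rule are all illustrative assumptions:

def form_swarm(contributions: dict[str, float]) -> dict:
    """Pool each member's contributed credits into a shared swarm budget."""
    return {"members": list(contributions), "pool": sum(contributions.values()), "spent": 0.0}


def delegate(swarm: dict, subtask: str, credits: float) -> float:
    """Allocate part of the pooled budget to one member's sub-task."""
    if swarm["spent"] + credits > swarm["pool"]:
        raise ValueError("swarm budget exhausted; escalate to a human instead")
    swarm["spent"] += credits
    return credits


swarm = form_swarm({"doc_analyst": 20, "temporal": 15, "entity_dedup": 15, "source_authority": 10})
delegate(swarm, "search for merger evidence", credits=25)
delegate(swarm, "check timeline consistency", credits=20)
# Pool = 60 credits, 45 spent; the remaining 15 stay available for the consensus vote.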

What Makes This Different

Single-Model AI | GOLAG Multi-Agent
One confidence score | Population consensus
Unlimited confidence | Finite budgets force honesty
No learning from mistakes | Evolutionary improvement
Black box | Transparent voting records
No calibration | ECE-calibrated confidence
Same accuracy forever | Gets smarter over time

Real-World Example

Scenario: Contradictory revenue claims from two documents

Old Report: "Q3 revenue: $1.2M"
New Report: "Q3 revenue: $1.5M"

Agent Population Votes:

Verifier Agent #1:
- Confidence: 0.85
- Votes allocated: 3 (cost: 9 credits)
- Reasoning: "New report more detailed, higher source authority"
- Decision: Trust new report

Temporal Agent #2:
- Confidence: 0.92
- Votes allocated: 4 (cost: 16 credits)
- Reasoning: "New report dated later, supersedes old data"
- Decision: Trust new report

Sentinel Agent #3:
- Confidence: 0.60
- Votes allocated: 1 (cost: 1 credit)
- Reasoning: "Significant difference, low confidence, escalate"
- Decision: Request human review

Weighted Consensus: Trust new report (7 total votes weighted by confidence)
Escalation: Yes (one agent flagged the claim for human review due to the size of the discrepancy)
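
Assuming each agent's votes are weighted by multiplying them by its confidence (one plausible reading of "weighted by confidence"), the tally works out as:

Trust new report:      3 × 0.85 + 4 × 0.92 = 2.55 + 3.68 = 6.23
Request human review:  1 × 0.60 = 0.60

The "trust new report" side carries the 7 raw votes and roughly ten times the weighted support, but the Sentinel's flag still triggers the escalation path.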

Enterprise Benefits

1. Auditable Decisions

Every agent decision is logged with:

  • Vote allocation
  • Reasoning
  • Confidence score
  • Outcome verification

2. Continuous Improvement

  • Agents learn from every decision
  • Accuracy improves without retraining
  • New patterns incorporated automatically

3. Calibrated Confidence

  • No false certainty
  • "Don't know" is a valid answer
  • Escalation to humans when appropriate

4. Transparent Reasoning

  • See which agents voted
  • Understand why they decided
  • Trace decision chains

5. Domain Expertise

  • Expert agents emerge naturally
  • High-stakes decisions use best performers
  • Knowledge accumulates over time

Tier Gating

Tier | GOLAG Domains | Features
Free/Starter | None | No agent-based verification
Pro | 3 basic | Document classification, sensitivity detection, field mapping
Team | 7 standard | + Entity dedup, claim verification, contradiction resolution, source authority
Enterprise | All 13 | + QV voting, relationship inference, hallucination detection, federation trust, orchestration

The Result

When Archivus makes a decision, you don't just get an answer. You get:

  • Confidence score (calibrated via evolutionary learning)
  • Agent consensus (multiple specialized agents agreed)
  • Provenance chain (here's the evidence)
  • Contradiction warnings (here's what disagrees)
  • Escalation triggers (when humans should review)

Not "trust us"—verify yourself.


Verification through evolution, not declaration.