Evolutionary Verification¶
Most AI systems treat confidence as a single number from a single model. Archivus uses evolutionary populations of verification agents that compete to be right.
The Problem¶
Enterprise AI has a trust problem:

- LLMs sound confident even when wrong
- No mechanism to verify AI decisions
- Black box reasoning—can't explain "why"
- No learning from mistakes
The result: 85% of enterprise AI projects fail (Gartner). Root cause: lack of trust.
The Solution: GOLAG¶
GOLAG = Game-Oriented Lagrangian Agent Governance
A population of specialized agents that:

- Have finite confidence budgets (not unlimited optimism)
- Vote on contested decisions using quadratic voting (cost = votes²)
- Evolve over time based on accuracy
- Know what they don't know (epistemic humility)
Core Concept¶
Finite Confidence Budgets¶
Every agent starts with a confidence budget (default: 100 credits).
Quadratic Voting Cost:

- 1 vote → costs 1 credit
- 2 votes → costs 4 credits
- 3 votes → costs 9 credits
- 10 votes → costs 100 credits (entire budget)
The mechanism: Overconfident agents exhaust their budgets quickly. Well-calibrated agents accumulate influence over time.
The system learns epistemic honesty.
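A minimal sketch of the budget mechanics described above (the `ConfidenceBudget` class, its method names, and the default credit value are illustrative, not the actual Archivus API):

```python
# Illustrative sketch of quadratic voting against a finite budget.
# Class and method names are hypothetical, not the Archivus API.

class ConfidenceBudget:
    def __init__(self, credits: int = 100):
        self.credits = credits

    def cast_votes(self, votes: int) -> bool:
        """Spend votes**2 credits; refuse the allocation if the budget can't cover it."""
        cost = votes ** 2  # quadratic cost: 1 -> 1, 2 -> 4, 3 -> 9, 10 -> 100
        if cost > self.credits:
            return False   # overconfident allocations quickly become unaffordable
        self.credits -= cost
        return True

budget = ConfidenceBudget()
budget.cast_votes(3)   # costs 9 credits, 91 remain
budget.cast_votes(10)  # would cost 100 credits, rejected: only 91 remain
```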
The Lagrangian¶
Every decision is optimized via a Lagrangian built from three components:

- Confidence: Agent's assessed certainty (calibrated via Expected Calibration Error)
- ContextMatch: How well the situation matches learned patterns
- Risk: Potential impact of an incorrect decision

Low Lagrangian → Escalate to human. High Lagrangian → Act autonomously.
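The exact functional form isn't spelled out here, so the sketch below is only one plausible reading: a confidence-and-context score penalized by risk, with a threshold deciding escalation. The `lagrangian` function, the linear combination, the weight, and the threshold are all illustrative assumptions, not the documented GOLAG formula.

```python
# Hypothetical illustration only; the real GOLAG Lagrangian may differ.
def lagrangian(confidence: float, context_match: float, risk: float,
               risk_weight: float = 1.0) -> float:
    """Higher values favor autonomous action; lower values favor escalation."""
    return confidence * context_match - risk_weight * risk

ESCALATION_THRESHOLD = 0.5  # illustrative cut-off, not a documented value

score = lagrangian(confidence=0.85, context_match=0.90, risk=0.20)  # ~0.57
action = "act autonomously" if score >= ESCALATION_THRESHOLD else "escalate to human"
```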
Decision Domains¶
GOLAG governs decisions across 13 domains:
| Domain | What It Decides | Example |
|---|---|---|
| entity_dedup | Are these two entities the same? | "John Smith" vs "J. Smith" |
| claim_verification | Is this claim supported by evidence? | "Revenue grew 20%" |
| contradiction_resolution | Which conflicting claim is correct? | Old vs new data |
| field_mapping | How should this source field map to KG? | "total" → total_amount |
| qv_voting | Multi-agent consensus on contested claims | High-stakes decisions |
| document_classification | What type of document is this? | Invoice vs receipt |
| sensitivity_detection | Does this contain sensitive information? | PII scanning |
| source_authority | How trustworthy is this source? | Wikipedia vs unverified email |
| relationship_inference | What relationships exist between entities? | Employment, authorship |
| hallucination_detection | Is this LLM output grounded? | Fact-checking |
| federation_trust | Should we trust this federated claim? | Cross-org verification |
Each domain has a population of agents competing to be the best.
Evolutionary Dynamics¶
Replicator Dynamics¶
Agents improve over time. What this means:

- Agents with high accuracy survive and propagate their strategies
- Agents with low accuracy die and are replaced
- Wisdom patterns transfer to next generation
- System gets better without human intervention
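A compact sketch of a replicator-style update consistent with these dynamics (the discrete update rule and the strategy names are illustrative assumptions, not the actual GOLAG implementation):

```python
# Illustrative replicator-style update: each strategy's population share grows
# in proportion to its fitness (accuracy) relative to the population average.

def replicator_step(shares: dict[str, float], accuracy: dict[str, float]) -> dict[str, float]:
    """One generation: above-average strategies gain share, below-average ones lose it."""
    avg_fitness = sum(shares[s] * accuracy[s] for s in shares)
    updated = {s: shares[s] * accuracy[s] / avg_fitness for s in shares}
    total = sum(updated.values())
    return {s: v / total for s, v in updated.items()}  # renormalize to a distribution

shares = {"strict_matcher": 0.5, "fuzzy_matcher": 0.5}
accuracy = {"strict_matcher": 0.9, "fuzzy_matcher": 0.7}
for _ in range(5):
    shares = replicator_step(shares, accuracy)
# After a few generations the more accurate strategy dominates the population.
```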
Expert Agents¶
Agents that sustain 95%+ accuracy over 20+ decisions:

- Become experts (+50 budget bonus)
- Gain influence in contested decisions
- Contribute to the "wisdom library"
Dying Agents¶
When an agent's budget drops below 5 credits, it:

- Transfers wisdom patterns to successors
- Records failure modes for future agents
- Gracefully retires from the population
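A small sketch of these lifecycle rules, using the thresholds stated above (the `Agent` dataclass and function names are hypothetical, not the Archivus implementation):

```python
# Illustrative lifecycle rules built from the thresholds above; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Agent:
    budget: float = 100.0
    decisions: int = 0
    correct: int = 0
    wisdom: list[str] = field(default_factory=list)
    expert: bool = False

    @property
    def accuracy(self) -> float:
        return self.correct / self.decisions if self.decisions else 0.0

def maybe_promote(agent: Agent) -> None:
    """95%+ accuracy over 20+ decisions earns expert status and a +50 budget bonus."""
    if not agent.expert and agent.decisions >= 20 and agent.accuracy >= 0.95:
        agent.expert = True
        agent.budget += 50

def maybe_retire(agent: Agent, successors: list[Agent]) -> bool:
    """Below 5 credits: hand wisdom patterns to successors and leave the population."""
    if agent.budget < 5:
        for successor in successors:
            successor.wisdom.extend(agent.wisdom)
        return True
    return False
```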
Multi-Agent Adjudication¶
For contested claims, multiple agents vote:
```mermaid
graph TD
    CLAIM[Contested Claim]
    CLAIM --> V1[Verifier Agent]
    CLAIM --> T1[Temporal Agent]
    CLAIM --> R1[Reasoner Agent]
    CLAIM --> S1[Sentinel Agent]
    V1 --> VOTE[Quadratic Voting]
    T1 --> VOTE
    R1 --> VOTE
    S1 --> VOTE
    VOTE --> DECISION[Consensus Decision]
```

The process:

1. Each agent evaluates the claim
2. Agents allocate votes from their budgets
3. Quadratic cost enforces honesty
4. Consensus emerges from weighted votes
5. Outcome recorded for future learning
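A self-contained sketch of this process (the weighting scheme, votes multiplied by confidence, and all names are illustrative assumptions):

```python
# Illustrative adjudication loop; names and the vote-weighting scheme are assumptions.
from collections import defaultdict

def adjudicate(evaluations: list[dict], budgets: dict[str, float]) -> str:
    """Each evaluation is {'agent', 'verdict', 'votes', 'confidence'}.
    Votes cost votes**2 credits; verdicts are weighted by votes * confidence."""
    tally = defaultdict(float)
    for ev in evaluations:
        cost = ev["votes"] ** 2              # quadratic cost enforces honesty
        if cost <= budgets[ev["agent"]]:
            budgets[ev["agent"]] -= cost
            tally[ev["verdict"]] += ev["votes"] * ev["confidence"]
    return max(tally, key=tally.get)         # consensus = highest weighted support

budgets = {"verifier_1": 100.0, "reasoner_1": 100.0, "sentinel_1": 100.0}
evaluations = [
    {"agent": "verifier_1", "verdict": "same_entity",      "votes": 2, "confidence": 0.80},
    {"agent": "reasoner_1", "verdict": "same_entity",      "votes": 3, "confidence": 0.90},
    {"agent": "sentinel_1", "verdict": "different_entity", "votes": 1, "confidence": 0.55},
]
print(adjudicate(evaluations, budgets))  # -> "same_entity"
```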
Calibration Scoring¶
Agents are evaluated using:
Expected Calibration Error (ECE):

- Measures how well confidence matches actual accuracy
- Agent says "90% confident" → should be right 90% of the time
- Poor calibration → budget penalties

Brier Score:

- Measures prediction accuracy
- Lower is better (0 = perfect prediction)

Weighted Accuracy:

- Recent decisions weighted more heavily
- Accounts for changing environments
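Sketches of these metrics in code (standard definitions of ECE and the Brier score; the bin count, decay factor, and function names are illustrative choices, not Archivus's implementation):

```python
# Standard calibration metrics, sketched with numpy; parameters are illustrative.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, bins: int = 10) -> float:
    """Average |accuracy - mean confidence| over confidence bins, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

def brier_score(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome; 0 is perfect."""
    return float(np.mean((confidences - correct) ** 2))

def weighted_accuracy(correct: np.ndarray, decay: float = 0.9) -> float:
    """Recent decisions count more: weights decay going back in time."""
    weights = decay ** np.arange(len(correct))[::-1]
    return float(np.sum(weights * correct) / np.sum(weights))
```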
Wisdom Accumulation¶
Agents don't just learn from success—they accumulate wisdom patterns:
| Pattern Type | Example | Usage |
|---|---|---|
| exact_match | "This exact situation before" | High confidence |
| semantic_similar | "Similar context" | Moderate confidence |
| heuristic | "This rule usually works" | Pattern matching |
| failure_avoidance | "This failed before" | Negative learning |
| success_heuristic | "This worked well" | Positive reinforcement |
| domain_rule | "Domain-specific logic" | Expert knowledge |
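A minimal sketch of how such patterns might be represented and applied (the field names and the additive confidence adjustment are hypothetical, not the Archivus schema):

```python
# Hypothetical representation of a wisdom pattern; not the actual Archivus schema.
from dataclasses import dataclass

@dataclass
class WisdomPattern:
    pattern_type: str        # e.g. "exact_match", "failure_avoidance", "domain_rule"
    description: str         # human-readable summary of the learned situation
    confidence_delta: float  # how much matching this pattern shifts an agent's confidence

def apply_patterns(base_confidence: float, matched: list[WisdomPattern]) -> float:
    """Shift confidence by the matched patterns, clamped to [0, 1]."""
    adjusted = base_confidence + sum(p.confidence_delta for p in matched)
    return max(0.0, min(1.0, adjusted))

matched = [
    WisdomPattern("exact_match", "seen this exact field mapping before", +0.15),
    WisdomPattern("failure_avoidance", "abbreviated names caused a bad merge before", -0.10),
]
print(apply_patterns(0.70, matched))  # ~0.75
```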
Swarm Coordination¶
For complex tasks, agents form swarms:
Task: Verify company merger claim
Swarm Members:
├─ Document Analyst Agent (searches for evidence)
├─ Temporal Reasoning Agent (checks timeline consistency)
├─ Entity Dedup Agent (verifies company identities)
└─ Source Authority Agent (evaluates document credibility)
Swarm Budget: Pooled from members
Coordination: Delegated budget allocation
Outcome: Consensus verdict with provenance chain
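A small sketch of pooled-budget delegation consistent with this description (the equal-split allocation rule and the names are illustrative assumptions):

```python
# Illustrative pooled budget with delegated allocation; names are assumptions.
def form_swarm(member_budgets: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Pool member budgets, then delegate an equal share back to each member
    to spend on its sub-task toward the swarm's consensus verdict."""
    pool = sum(member_budgets.values())
    share = pool / len(member_budgets)
    return pool, {member: share for member in member_budgets}

pool, allocations = form_swarm({
    "document_analyst": 80.0,
    "temporal_reasoning": 120.0,
    "entity_dedup": 60.0,
    "source_authority": 100.0,
})
# pool = 360.0; each member is delegated 90.0 credits for its sub-task
```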
What Makes This Different¶
| Single-Model AI | GOLAG Multi-Agent |
|---|---|
| One confidence score | Population consensus |
| Unlimited confidence | Finite budgets force honesty |
| No learning from mistakes | Evolutionary improvement |
| Black box | Transparent voting records |
| No calibration | ECE-calibrated confidence |
| Same accuracy forever | Gets smarter over time |
Real-World Example¶
Scenario: Contradictory revenue claims from two documents
Old Report: "Q3 revenue: $1.2M"
New Report: "Q3 revenue: $1.5M"
Agent Population Votes:
Verifier Agent #1:
- Confidence: 0.85
- Votes allocated: 3 (cost: 9 credits)
- Reasoning: "New report more detailed, higher source authority"
- Decision: Trust new report
Temporal Agent #2:
- Confidence: 0.92
- Votes allocated: 4 (cost: 16 credits)
- Reasoning: "New report dated later, supersedes old data"
- Decision: Trust new report
Sentinel Agent #3:
- Confidence: 0.60
- Votes allocated: 1 (cost: 1 credit)
- Reasoning: "Significant difference, low confidence, escalate"
- Decision: Request human review
Weighted Consensus: Trust new report (7 total votes weighted by confidence)
Escalation: Yes (one agent flagged for human review due to magnitude)
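One plausible reading of the weighted consensus above, shown only to make the tallies concrete (the votes multiplied by confidence weighting is an assumption, not a documented formula):

```python
# Hypothetical weighting of the example above: votes * confidence per verdict.
trust_new_report = 3 * 0.85 + 4 * 0.92   # Verifier + Temporal = 2.55 + 3.68 = 6.23
request_review   = 1 * 0.60              # Sentinel = 0.60
# 7 raw votes back "trust the new report"; escalation is still triggered because
# the Sentinel agent flagged the magnitude of the discrepancy for human review.
```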
Enterprise Benefits¶
1. Auditable Decisions¶
Every agent decision is logged with:

- Vote allocation
- Reasoning
- Confidence score
- Outcome verification
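For illustration, a logged decision record might look like the hypothetical structure below (the field names are not Archivus's actual log schema):

```python
# Hypothetical audit record; field names are illustrative, not the real schema.
decision_record = {
    "domain": "contradiction_resolution",
    "agent": "temporal_agent_2",
    "votes": 4,                      # vote allocation (cost: 16 credits)
    "confidence": 0.92,              # calibrated confidence at decision time
    "reasoning": "New report dated later, supersedes old data",
    "verdict": "trust_new_report",
    "outcome_verified": True,        # filled in once ground truth is known
}
```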
2. Continuous Improvement¶
- Agents learn from every decision
- Accuracy improves without retraining
- New patterns incorporated automatically
3. Calibrated Confidence¶
- No false certainty
- "Don't know" is a valid answer
- Escalation to humans when appropriate
4. Transparent Reasoning¶
- See which agents voted
- Understand why they decided
- Trace decision chains
5. Domain Expertise¶
- Expert agents emerge naturally
- High-stakes decisions use best performers
- Knowledge accumulates over time
Tier Gating¶
| Tier | GOLAG Domains | Features |
|---|---|---|
| Free/Starter | None | No agent-based verification |
| Pro | 3 basic | Document classification, sensitivity detection, field mapping |
| Team | 7 standard | + Entity dedup, claim verification, contradiction resolution, source authority |
| Enterprise | All 13 | + QV voting, relationship inference, hallucination detection, federation trust, orchestration |
The Result¶
When Archivus makes a decision, you don't just get an answer—you get:

- Confidence score (calibrated via evolutionary learning)
- Agent consensus (multiple specialized agents agreed)
- Provenance chain (here's the evidence)
- Contradiction warnings (here's what disagrees)
- Escalation triggers (when humans should review)
Not "trust us"—verify yourself.
Verification through evolution, not declaration.