Evolutionary Verification¶
Most AI systems treat confidence as a single number from a single model. Archivus uses evolutionary populations of verification agents that compete to be right.
The Problem¶
Enterprise AI has a trust problem:

- LLMs sound confident even when wrong
- No mechanism to verify AI decisions
- Black box reasoning—can't explain "why"
- No learning from mistakes
The result: 85% of enterprise AI projects fail (Gartner). Root cause: lack of trust.
The Solution: GOLAG¶
GOLAG = Game-Oriented Lagrangian Agent Governance
A population of specialized agents that:

- Have finite confidence budgets (not unlimited optimism)
- Vote on contested decisions using quadratic voting (cost = votes²)
- Evolve over time based on accuracy
- Know what they don't know (epistemic humility)
Core Concept¶
Finite Confidence Budgets¶
Every agent starts with a confidence budget (default: 100 credits).
Quadratic Voting Cost:

- 1 vote → costs 1 credit
- 2 votes → costs 4 credits
- 3 votes → costs 9 credits
- 10 votes → costs 100 credits (entire budget)
The mechanism: Overconfident agents exhaust their budgets quickly. Well-calibrated agents accumulate influence over time.
The system learns epistemic honesty.
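A minimal sketch of the budget mechanics described above (the `ConfidenceBudget` class, its method names, and the default credit value are illustrative, not the actual Archivus API):

```python
# Illustrative sketch of quadratic voting against a finite budget.
# Class and method names are hypothetical, not the Archivus API.

class ConfidenceBudget:
    def __init__(self, credits: int = 100):
        self.credits = credits

    def cast_votes(self, votes: int) -> bool:
        """Spend votes**2 credits; refuse the allocation if the budget can't cover it."""
        cost = votes ** 2  # quadratic cost: 1 -> 1, 2 -> 4, 3 -> 9, 10 -> 100
        if cost > self.credits:
            return False   # overconfident allocations quickly become unaffordable
        self.credits -= cost
        return True

budget = ConfidenceBudget()
budget.cast_votes(3)   # costs 9 credits, 91 remain
budget.cast_votes(10)  # would cost 100 credits, rejected: only 91 remain
```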
The Lagrangian¶
Every decision is optimized via a Lagrangian built from three components:

- Confidence: Agent's assessed certainty (calibrated via Expected Calibration Error)
- ContextMatch: How well the situation matches learned patterns
- Risk: Potential impact of an incorrect decision

Low Lagrangian → Escalate to human. High Lagrangian → Act autonomously.
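The exact functional form isn't spelled out here, so the sketch below is only one plausible reading: a confidence-and-context score penalized by risk, with a threshold deciding escalation. The `lagrangian` function, the linear combination, the weight, and the threshold are all illustrative assumptions, not the documented GOLAG formula.

```python
# Hypothetical illustration only; the real GOLAG Lagrangian may differ.
def lagrangian(confidence: float, context_match: float, risk: float,
               risk_weight: float = 1.0) -> float:
    """Higher values favor autonomous action; lower values favor escalation."""
    return confidence * context_match - risk_weight * risk

ESCALATION_THRESHOLD = 0.5  # illustrative cut-off, not a documented value

score = lagrangian(confidence=0.85, context_match=0.90, risk=0.20)  # ~0.57
action = "act autonomously" if score >= ESCALATION_THRESHOLD else "escalate to human"
```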
Decision Domains¶
GOLAG governs decisions across 13 domains:
| Domain | What It Decides | Example |
|---|---|---|
| entity_dedup | Are these two entities the same? | "John Smith" vs "J. Smith" |
| claim_verification | Is this claim supported by evidence? | "Revenue grew 20%" |
| contradiction_resolution | Which conflicting claim is correct? | Old vs new data |
| field_mapping | How should this source field map to KG? | "total" → total_amount |
| qv_voting | Multi-agent consensus on contested claims | High-stakes decisions |
| document_classification | What type of document is this? | Invoice vs receipt |
| sensitivity_detection | Does this contain sensitive information? | PII scanning |
| source_authority | How trustworthy is this source? | Wikipedia vs unverified email |
| relationship_inference | What relationships exist between entities? | Employment, authorship |
| hallucination_detection | Is this LLM output grounded? | Fact-checking |
| federation_trust | Should we trust this federated claim? | Cross-org verification |
Each domain has a population of agents competing to be the best.
Evolutionary Dynamics¶
Replicator Dynamics¶
Agents improve over time. What this means:

- Agents with high accuracy survive and propagate their strategies
- Agents with low accuracy die and are replaced
- Wisdom patterns transfer to next generation
- System gets better without human intervention
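A compact sketch of a replicator-style update consistent with these dynamics (the discrete update rule and the strategy names are illustrative assumptions, not the actual GOLAG implementation):

```python
# Illustrative replicator-style update: each strategy's population share grows
# in proportion to its fitness (accuracy) relative to the population average.

def replicator_step(shares: dict[str, float], accuracy: dict[str, float]) -> dict[str, float]:
    """One generation: above-average strategies gain share, below-average ones lose it."""
    avg_fitness = sum(shares[s] * accuracy[s] for s in shares)
    updated = {s: shares[s] * accuracy[s] / avg_fitness for s in shares}
    total = sum(updated.values())
    return {s: v / total for s, v in updated.items()}  # renormalize to a distribution

shares = {"strict_matcher": 0.5, "fuzzy_matcher": 0.5}
accuracy = {"strict_matcher": 0.9, "fuzzy_matcher": 0.7}
for _ in range(5):
    shares = replicator_step(shares, accuracy)
# After a few generations the more accurate strategy dominates the population.
```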
Expert Agents¶
Agents that sustain 95%+ accuracy over 20+ decisions:

- Become experts (+50 budget bonus)
- Gain influence in contested decisions
- Contribute to the "wisdom library"
Dying Agents¶
When an agent's budget drops below 5 credits, it:

- Transfers wisdom patterns to successors
- Records failure modes for future agents
- Gracefully retires from the population
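A small sketch of these lifecycle rules, using the thresholds stated above (the `Agent` dataclass and function names are hypothetical, not the Archivus implementation):

```python
# Illustrative lifecycle rules built from the thresholds above; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Agent:
    budget: float = 100.0
    decisions: int = 0
    correct: int = 0
    wisdom: list[str] = field(default_factory=list)
    expert: bool = False

    @property
    def accuracy(self) -> float:
        return self.correct / self.decisions if self.decisions else 0.0

def maybe_promote(agent: Agent) -> None:
    """95%+ accuracy over 20+ decisions earns expert status and a +50 budget bonus."""
    if not agent.expert and agent.decisions >= 20 and agent.accuracy >= 0.95:
        agent.expert = True
        agent.budget += 50

def maybe_retire(agent: Agent, successors: list[Agent]) -> bool:
    """Below 5 credits: hand wisdom patterns to successors and leave the population."""
    if agent.budget < 5:
        for successor in successors:
            successor.wisdom.extend(agent.wisdom)
        return True
    return False
```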
Multi-Agent Adjudication¶
For contested claims, multiple agents vote:
```mermaid
graph TD
    CLAIM[Contested Claim]
    CLAIM --> V1[Verifier Agent]
    CLAIM --> T1[Temporal Agent]
    CLAIM --> R1[Reasoner Agent]
    CLAIM --> S1[Sentinel Agent]
    V1 --> VOTE[Quadratic Voting]
    T1 --> VOTE
    R1 --> VOTE
    S1 --> VOTE
    VOTE --> DECISION[Consensus Decision]
```

The process:

1. Each agent evaluates the claim
2. Agents allocate votes from their budgets
3. Quadratic cost enforces honesty
4. Consensus emerges from weighted votes
5. Outcome recorded for future learning
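A self-contained sketch of this process (the weighting scheme, votes multiplied by confidence, and all names are illustrative assumptions):

```python
# Illustrative adjudication loop; names and the vote-weighting scheme are assumptions.
from collections import defaultdict

def adjudicate(evaluations: list[dict], budgets: dict[str, float]) -> str:
    """Each evaluation is {'agent', 'verdict', 'votes', 'confidence'}.
    Votes cost votes**2 credits; verdicts are weighted by votes * confidence."""
    tally = defaultdict(float)
    for ev in evaluations:
        cost = ev["votes"] ** 2              # quadratic cost enforces honesty
        if cost <= budgets[ev["agent"]]:
            budgets[ev["agent"]] -= cost
            tally[ev["verdict"]] += ev["votes"] * ev["confidence"]
    return max(tally, key=tally.get)         # consensus = highest weighted support

budgets = {"verifier_1": 100.0, "reasoner_1": 100.0, "sentinel_1": 100.0}
evaluations = [
    {"agent": "verifier_1", "verdict": "same_entity",      "votes": 2, "confidence": 0.80},
    {"agent": "reasoner_1", "verdict": "same_entity",      "votes": 3, "confidence": 0.90},
    {"agent": "sentinel_1", "verdict": "different_entity", "votes": 1, "confidence": 0.55},
]
print(adjudicate(evaluations, budgets))  # -> "same_entity"
```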
Calibration Scoring¶
Agents are evaluated using:
Expected Calibration Error (ECE):

- Measures how well confidence matches actual accuracy
- Agent says "90% confident" → should be right 90% of the time
- Poor calibration → budget penalties

Brier Score:

- Measures prediction accuracy
- Lower is better (0 = perfect prediction)

Weighted Accuracy:

- Recent decisions weighted more heavily
- Accounts for changing environments
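Sketches of these metrics in code (standard definitions of ECE and the Brier score; the bin count, decay factor, and function names are illustrative choices, not Archivus's implementation):

```python
# Standard calibration metrics, sketched with numpy; parameters are illustrative.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, bins: int = 10) -> float:
    """Average |accuracy - mean confidence| over confidence bins, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

def brier_score(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome; 0 is perfect."""
    return float(np.mean((confidences - correct) ** 2))

def weighted_accuracy(correct: np.ndarray, decay: float = 0.9) -> float:
    """Recent decisions count more: weights decay going back in time."""
    weights = decay ** np.arange(len(correct))[::-1]
    return float(np.sum(weights * correct) / np.sum(weights))
```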
Wisdom Accumulation¶
Agents don't just learn from success—they accumulate wisdom patterns:
| Pattern Type | Example | Usage |
|---|---|---|
| exact_match | "This exact situation before" | High confidence |
| semantic_similar | "Similar context" | Moderate confidence |
| heuristic | "This rule usually works" | Pattern matching |
| failure_avoidance | "This failed before" | Negative learning |
| success_heuristic | "This worked well" | Positive reinforcement |
| domain_rule | "Domain-specific logic" | Expert knowledge |
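A minimal sketch of how such patterns might be represented and applied (the field names and the additive confidence adjustment are hypothetical, not the Archivus schema):

```python
# Hypothetical representation of a wisdom pattern; not the actual Archivus schema.
from dataclasses import dataclass

@dataclass
class WisdomPattern:
    pattern_type: str        # e.g. "exact_match", "failure_avoidance", "domain_rule"
    description: str         # human-readable summary of the learned situation
    confidence_delta: float  # how much matching this pattern shifts an agent's confidence

def apply_patterns(base_confidence: float, matched: list[WisdomPattern]) -> float:
    """Shift confidence by the matched patterns, clamped to [0, 1]."""
    adjusted = base_confidence + sum(p.confidence_delta for p in matched)
    return max(0.0, min(1.0, adjusted))

matched = [
    WisdomPattern("exact_match", "seen this exact field mapping before", +0.15),
    WisdomPattern("failure_avoidance", "abbreviated names caused a bad merge before", -0.10),
]
print(apply_patterns(0.70, matched))  # ~0.75
```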
Swarm Coordination¶
For complex tasks, agents form swarms:
Task: Verify company merger claim
Swarm Members:
├─ Document Analyst Agent (searches for evidence)
├─ Temporal Reasoning Agent (checks timeline consistency)
├─ Entity Dedup Agent (verifies company identities)
└─ Source Authority Agent (evaluates document credibility)
Swarm Budget: Pooled from members
Coordination: Delegated budget allocation
Outcome: Consensus verdict with provenance chain
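A small sketch of pooled-budget delegation consistent with this description (the equal-split allocation rule and the names are illustrative assumptions):

```python
# Illustrative pooled budget with delegated allocation; names are assumptions.
def form_swarm(member_budgets: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Pool member budgets, then delegate an equal share back to each member
    to spend on its sub-task toward the swarm's consensus verdict."""
    pool = sum(member_budgets.values())
    share = pool / len(member_budgets)
    return pool, {member: share for member in member_budgets}

pool, allocations = form_swarm({
    "document_analyst": 80.0,
    "temporal_reasoning": 120.0,
    "entity_dedup": 60.0,
    "source_authority": 100.0,
})
# pool = 360.0; each member is delegated 90.0 credits for its sub-task
```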
What Makes This Different¶
| Single-Model AI | GOLAG Multi-Agent |
|---|---|
| One confidence score | Population consensus |
| Unlimited confidence | Finite budgets force honesty |
| No learning from mistakes | Evolutionary improvement |
| Black box | Transparent voting records |
| No calibration | ECE-calibrated confidence |
| Same accuracy forever | Gets smarter over time |
Real-World Example¶
Scenario: Contradictory revenue claims from two documents
Old Report: "Q3 revenue: $1.2M"
New Report: "Q3 revenue: $1.5M"
Agent Population Votes:
Verifier Agent #1:
- Confidence: 0.85
- Votes allocated: 3 (cost: 9 credits)
- Reasoning: "New report more detailed, higher source authority"
- Decision: Trust new report
Temporal Agent #2:
- Confidence: 0.92
- Votes allocated: 4 (cost: 16 credits)
- Reasoning: "New report dated later, supersedes old data"
- Decision: Trust new report
Sentinel Agent #3:
- Confidence: 0.60
- Votes allocated: 1 (cost: 1 credit)
- Reasoning: "Significant difference, low confidence, escalate"
- Decision: Request human review
Weighted Consensus: Trust new report (7 total votes weighted by confidence)
Escalation: Yes (one agent flagged for human review due to magnitude)
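One plausible reading of the weighted consensus above, shown only to make the tallies concrete (the votes multiplied by confidence weighting is an assumption, not a documented formula):

```python
# Hypothetical weighting of the example above: votes * confidence per verdict.
trust_new_report = 3 * 0.85 + 4 * 0.92   # Verifier + Temporal = 2.55 + 3.68 = 6.23
request_review   = 1 * 0.60              # Sentinel = 0.60
# 7 raw votes back "trust the new report"; escalation is still triggered because
# the Sentinel agent flagged the magnitude of the discrepancy for human review.
```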
Enterprise Benefits¶
1. Auditable Decisions¶
Every agent decision is logged with:

- Vote allocation
- Reasoning
- Confidence score
- Outcome verification
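For illustration, a logged decision record might look like the hypothetical structure below (the field names are not Archivus's actual log schema):

```python
# Hypothetical audit record; field names are illustrative, not the real schema.
decision_record = {
    "domain": "contradiction_resolution",
    "agent": "temporal_agent_2",
    "votes": 4,                      # vote allocation (cost: 16 credits)
    "confidence": 0.92,              # calibrated confidence at decision time
    "reasoning": "New report dated later, supersedes old data",
    "verdict": "trust_new_report",
    "outcome_verified": True,        # filled in once ground truth is known
}
```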
2. Continuous Improvement¶
- Agents learn from every decision
- Accuracy improves without retraining
- New patterns incorporated automatically
3. Calibrated Confidence¶
- No false certainty
- "Don't know" is a valid answer
- Escalation to humans when appropriate
4. Transparent Reasoning¶
- See which agents voted
- Understand why they decided
- Trace decision chains
5. Domain Expertise¶
- Expert agents emerge naturally
- High-stakes decisions use best performers
- Knowledge accumulates over time
Tier Gating¶
| Tier | GOLAG Domains | Features |
|---|---|---|
| Free/Starter | None | No agent-based verification |
| Pro | 3 basic | Document classification, sensitivity detection, field mapping |
| Team | 7 standard | + Entity dedup, claim verification, contradiction resolution, source authority |
| Enterprise | All 13 | + QV voting, relationship inference, hallucination detection, federation trust, orchestration |
The Result¶
When Archivus makes a decision, you don't just get an answer—you get:

- Confidence score (calibrated via evolutionary learning)
- Agent consensus (multiple specialized agents agreed)
- Provenance chain (here's the evidence)
- Contradiction warnings (here's what disagrees)
- Escalation triggers (when humans should review)
Not "trust us"—verify yourself.
Verification through evolution, not declaration.