
Why We Built Evolutionary Verification

Large language models have a confidence problem. They answer every question with equal certainty—whether they're summarizing a contract they just read or inventing citations for cases that don't exist.

The solution isn't better models. It's better verification infrastructure.

The Confidence Problem

Ask an LLM: "Are Entity A and Entity B the same person?"

The model returns a probability: 0.87.

What does that mean?

  • Is it 87% confident based on training data patterns?
  • Is it hedging because the entities have similar names?
  • Would it say 0.87 for every marginal case?
  • If you asked the same question twice, would you get the same number?

You don't know. The model doesn't know. The number is ungrounded.

Now scale this to enterprise decisions:

  • Are these two vendor records duplicates? (Merge them or keep separate?)
  • Does this claim contradict existing knowledge? (Accept it or flag for review?)
  • Which document type is this? (Route it correctly or misfile it?)

Every decision has consequences. Every confidence score is a guess. And critically—the model has no incentive to be honest about uncertainty.

The Calibration Gap

In machine learning, we talk about "calibrated confidence"—the idea that when a model says 70%, it should be right 70% of the time.

Most models are terribly calibrated. They're overconfident on easy cases and underconfident on hard ones. They've never been forced to learn what their confidence scores actually mean.

Why? Because confidence is free.

The model can say 0.99 on every answer with no cost. It never runs out of confidence. It never has to choose which claims deserve high certainty.

Finite Budgets Change Everything

What if confidence had a cost?

What if agents started with a budget of 100 points, and expressing confidence consumed points?

What if the cost scaled quadratically—so high confidence was expensive?

1 vote  costs 1 point
2 votes cost 4 points
3 votes cost 9 points
...
10 votes cost 100 points (your entire budget)

Suddenly, agents can't be maximally confident about everything. They have to choose. They have to calibrate.
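
A few lines of Python make the mechanic concrete (an illustrative sketch, not the production GOLAG code; the class and method names here are ours):

class ConfidenceBudget:
    """Tracks an agent's remaining points under quadratic vote pricing."""

    def __init__(self, points: int = 100):
        self.points = points

    def cost(self, votes: int) -> int:
        # N votes cost N² points, so high certainty gets expensive fast.
        return votes ** 2

    def spend(self, votes: int) -> bool:
        """Deduct the quadratic cost; refuse the allocation if the agent can't afford it."""
        price = self.cost(votes)
        if price > self.points:
            return False
        self.points -= price
        return True

budget = ConfidenceBudget()
budget.spend(3)    # costs 9 points, 91 remain
budget.spend(10)   # would cost 100 points, rejected: only 91 remain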

This is the core insight behind GOLAG (Game-Oriented Lagrangian Agent Governance).

How It Works

Every GOLAG agent:

  1. Starts with a finite budget (100 points)
  2. Makes decisions by allocating votes from that budget
  3. Pays a quadratic cost for the votes it allocates (cost = votes²)
  4. Gets outcomes (was the decision correct or incorrect?)
  5. Dies when the budget hits zero

When an agent dies:

  • Its working memory is destroyed (task state, temporary variables)
  • Its wisdom is extracted (patterns it learned, calibration data)
  • A new agent is born with inherited wisdom and a fresh budget

The system evolves. Agents that waste budget on overconfident wrong answers die quickly. Agents that calibrate well—knowing when to be certain and when to escalate—accumulate influence.
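
In code, the lifecycle might look something like this (a simplified sketch under our own assumptions; the field names and the reincarnation helper are illustrative, not the Archivus implementation):

from dataclasses import dataclass, field

@dataclass
class Agent:
    budget: int = 100
    working_memory: dict = field(default_factory=dict)  # destroyed at death
    wisdom: dict = field(default_factory=dict)          # extracted and inherited

    def decide(self, votes: int, correct: bool) -> None:
        # Pay the quadratic price and record the outcome as calibration data.
        self.budget -= votes ** 2
        self.wisdom.setdefault("history", []).append((votes, correct))

    @property
    def alive(self) -> bool:
        return self.budget > 0

def reincarnate(dead: Agent) -> Agent:
    """New agent: fresh budget, inherited wisdom, empty working memory."""
    return Agent(budget=100, wisdom=dict(dead.wisdom))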

The Lagrangian

How does an agent decide how many votes to allocate?

We use a Lagrangian—a physics-inspired optimization function:

L = (Confidence × ContextMatch) / Risk

Where:

  • Confidence: How certain is the AI about this decision? (0.0 - 1.0)
  • ContextMatch: How well does this situation match learned patterns? (0.0 - 1.0)
  • Risk: What's the potential impact of being wrong? (0.1 - 1.0)

High Lagrangian → Strong decision → Allocate more votes

Low Lagrangian → Weak decision → Escalate to human or swarm

The same math that governs constrained optimization in classical mechanics governs agent decision-making: maximize decision quality subject to a hard budget constraint. The name isn't decoration; each allocation is a small constrained-optimization problem, trading expected decision quality against the points the agent has left.
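
In code, the allocation step might look roughly like this (a sketch; the 0.5 escalation threshold and the vote-scaling rule are our assumptions, not documented GOLAG constants):

ESCALATE = 0  # sentinel: hand the decision to a human or the swarm

def lagrangian(confidence: float, context_match: float, risk: float) -> float:
    # L = (Confidence × ContextMatch) / Risk, with Risk bounded below at 0.1
    return (confidence * context_match) / max(risk, 0.1)

def allocate_votes(confidence: float, context_match: float, risk: float,
                   budget: int, max_votes: int = 10) -> int:
    """Map the Lagrangian to a vote count, or escalate when the signal is weak."""
    L = lagrangian(confidence, context_match, risk)
    if L < 0.5:                       # weak decision
        return ESCALATE
    votes = min(max_votes, round(L * max_votes))
    while votes > 0 and votes ** 2 > budget:
        votes -= 1                    # never spend more than the remaining budget
    return votes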

Quadratic Voting Forces Honesty

Why quadratic costs instead of linear?

Because quadratic costs make each additional vote more expensive than the last, which forces diminishing returns on certainty.

Linear costs:

10 votes @ 1 point each = 10 points total
You can express maximum confidence 10 times

Quadratic costs:

10 votes @ vote² cost = 100 points total
You can only be maximally confident ONCE

With linear costs, agents can spam high confidence. With quadratic costs, they must reserve it for cases that truly warrant it.
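
The arithmetic is easy to check (illustrative Python, assuming the 100-point budget above):

def decisions_at(votes: int, budget: int = 100) -> int:
    """How many decisions a fresh budget affords at a fixed confidence level."""
    return budget // votes ** 2

decisions_at(1)    # 100 cautious decisions
decisions_at(3)    # 11 moderately confident decisions
decisions_at(10)   # exactly 1 maximally confident decision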

This is inspired by quadratic voting in governance systems—a mechanism that prevents wealthy voters from dominating elections by making bulk vote purchases prohibitively expensive. In our case, it prevents overconfident agents from dominating decisions.

Evolution, Not Tuning

Most AI systems require human tuning:

  • Adjust confidence thresholds manually
  • Retrain models when they drift
  • Monitor performance and intervene when accuracy drops

GOLAG agents improve automatically:

  • Overconfident agents die faster
  • Well-calibrated agents survive longer
  • Each generation inherits the best patterns from the previous generation
  • No human intervention required

We implement this with an update rule modeled on the replicator equation from evolutionary game theory:

AC(next_gen) = AC(current_gen) + α(AC_best - AC_average)

Where:

  • AC (Actionability Confidence) is the average Lagrangian value
  • α (alpha) is the learning rate (0.15 in our system)
  • AC_best is the Lagrangian of the best-performing agent
  • AC_average is the average across all agents

Each generation gets smarter. The system learns what it doesn't know. And critically—it learns to admit uncertainty.
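
Transcribed into Python (a sketch; per the definitions above, AC(current_gen) is the generation's average Lagrangian):

def next_generation_ac(agent_lagrangians: list[float], alpha: float = 0.15) -> float:
    """One replicator-style update of Actionability Confidence (AC)."""
    ac_average = sum(agent_lagrangians) / len(agent_lagrangians)  # AC of the current generation
    ac_best = max(agent_lagrangians)
    # AC(next_gen) = AC(current_gen) + α(AC_best - AC_average)
    return ac_average + alpha * (ac_best - ac_average)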

Expert Agents

Agents that achieve 95%+ accuracy over 20+ decisions are promoted to expert status.

Experts get:

  • +50 bonus budget points
  • Recognition in the agent registry
  • Trusted status for high-stakes decisions

This creates a meritocracy. Agents that prove their calibration earn influence. Agents that fail to calibrate die and get replaced.
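
The promotion rule fits in a few lines (a sketch; the history representation is hypothetical):

EXPERT_ACCURACY = 0.95     # 95%+ accuracy
EXPERT_MIN_DECISIONS = 20  # over 20+ decisions
EXPERT_BONUS = 50          # bonus budget points

def expert_bonus(history: list[bool]) -> int:
    """history holds one True/False per past decision; returns the bonus budget earned."""
    if len(history) < EXPERT_MIN_DECISIONS:
        return 0
    accuracy = sum(history) / len(history)
    return EXPERT_BONUS if accuracy >= EXPERT_ACCURACY else 0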

Why This Matters for Enterprises

Traditional AI: "Here's an answer. Trust us."

Archivus: "Here's an answer. Here's the Lagrangian that justified it. Here's which agent made the call. Here's that agent's calibration history. Here's the provenance chain. Verify it yourself."

When an enterprise needs to audit a decision:

  • Which agent made this call?
  • What was its accuracy history?
  • What was the Lagrangian value?
  • What wisdom patterns did it match?
  • What was its confidence budget at the time?

Every decision is traceable. Every agent is accountable. Every confidence score is grounded in actual performance history.
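
Concretely, that audit trail can travel with each decision as a single record (a hypothetical schema for illustration, not the Archivus API):

from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRecord:
    agent_id: str                  # which agent made the call
    lagrangian: float              # the L value that justified it
    votes: int                     # confidence spent (votes² points)
    budget_remaining: int          # the agent's budget at decision time
    accuracy_history: float        # the agent's calibration record to date
    matched_patterns: tuple        # wisdom patterns the decision matched
    outcome: str                   # e.g. "merged", "flagged", "escalated"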

The Road Ahead

Phase 1 (Complete): Single-agent decisions with evolutionary improvement

Phase 2 (Complete): Multi-agent swarms for complex decisions—leader agents delegate budget to specialist workers

Phase 3 (In Progress): Cross-domain wisdom sharing, federated agent networks

This is the infrastructure for AI you can trust. Not because the model is smarter. Because the system is honest about what it knows and what it doesn't.


GOLAG is live in production. Every entity deduplication, every contradiction resolution, every claim verification in Archivus goes through evolutionary agents. Learn more at archivus.app.