Criterica Intelligence — 24,439 production models trained on 3.52B+ real court records
POSITION PAPER · JUNE 2026

Why calibrated outcome models cannot be replicated by language models.

There are two kinds of AI in legal technology, and they are not on the same road. The gap between them is not a capability gap that the next model generation will close. It is architectural, and it shows more clearly as language models improve, because the one thing they cannot do stands out against everything they now can.

The Structural Difference
Language Model
Generates language about risk.

Trained to predict the next token. When it says a case has a 72% chance of settling, that number is not a measurement. It is a plausible token, produced because similar sentences appeared in training text. There is no distribution of observed outcomes behind it, no error bound, and no way to ask: of all the times it said 72%, how often did the event happen?

Fluent · Stochastic · Unauditable
Deterministic Statistical Model
Measures risk.

Trained on labeled outcomes: real cases, venues, judges, and resolutions with the actual result recorded as ground truth. When it outputs 72%, the figure traces to a distribution of comparable matters. Same inputs, same output, every time, with a versioned model and a documented training set behind it.

Calibrated · Reproducible · Auditable
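
The question posed above (of all the times a model said 72%, how often did the event happen?) can only be answered when each forecast is stored next to its recorded outcome. The following is a minimal sketch of such a check, not a description of Criterica's pipeline: the function name `reliability_table`, the equal-width binning, and the example data are assumptions made for illustration.

```python
from collections import defaultdict


def reliability_table(predictions, outcomes, n_bins=10):
    """Compare stated probabilities with observed frequencies, bin by bin.

    predictions: floats in [0, 1] recorded when each forecast was made.
    outcomes: 1 if the event happened, 0 if it did not.
    """
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")

    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # Equal-width bins; a probability of exactly 1.0 goes in the top bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    table = []
    for index in sorted(bins):
        rows = bins[index]
        table.append({
            "bin": index,
            "count": len(rows),
            "mean_predicted": sum(p for p, _ in rows) / len(rows),
            "observed_frequency": sum(y for _, y in rows) / len(rows),
        })
    return table


# Synthetic data for illustration only: 25 forecasts of 72%, 18 events.
example = reliability_table([0.72] * 25, [1] * 18 + [0] * 7)
```

For a well-calibrated model, `observed_frequency` stays close to `mean_predicted` in every bin that holds enough cases. Without recorded outcomes, the table cannot be built at all.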
Three Structural Facts
01
The data does not exist in a language model's training set.

Calibration requires structured outcome labels: this matter, this jurisdiction, this resolution, this duration, this value. The open internet contains articles about cases, not labeled, normalized outcome records. No amount of scale substitutes for ground truth the model never saw. Building that corpus takes years of acquisition, normalization, and human labeling. It is the asset, and it cannot be prompted into existence.

02
Stochastic systems cannot be audited.

Ask a language model the same question twice and you can get two different answers. A capital allocator, an insurer, or a regulator cannot underwrite against an instrument that changes its mind. Determinism is the minimum requirement for underwriting, reserving, audit, and regulatory scrutiny. It is the difference between a rating and a vibe.
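
One way an auditor might test the "same inputs, same output" property is to fingerprint each scored request and re-run it later. This is a hedged sketch under stated assumptions: `audit_fingerprint`, `verify_reproducible`, and the `score_fn` callable are hypothetical names introduced here, and nothing in this paper specifies how such records are actually kept.

```python
import hashlib
import json


def audit_fingerprint(model_version, features):
    """Return a SHA-256 hex digest of the model version plus canonical inputs.

    Keys are sorted so that the same inputs always hash to the same value,
    regardless of the order in which they were supplied.
    """
    payload = json.dumps(
        {"model_version": model_version, "features": features},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_reproducible(score_fn, features, recorded_output, tolerance=0.0):
    """Re-run a scoring function and compare with the output on record.

    score_fn is a hypothetical callable standing in for a versioned model.
    """
    return abs(score_fn(features) - recorded_output) <= tolerance
```

A deterministic model passes `verify_reproducible` on every re-run with a tolerance of zero. A sampled language-model response offers no such guarantee.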

03
Calibration requires a closed feedback loop.

A statistical model improves because every resolved matter is scored against its original prediction and fed back as a label. A language model has no mechanism for this. It cannot report its historical error rate on case duration because its statements about duration have never been scored against recorded outcomes.
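
The scoring step of such a loop can be sketched with the Brier score, a standard measure of probabilistic forecast error (the mean squared difference between the predicted probability and the 0/1 outcome). The paper does not say which metric Criterica uses, so the metric choice, the `ResolvedMatter` record, and the function names below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedMatter:
    matter_id: str
    predicted_probability: float  # recorded when the prediction was made
    event_occurred: bool          # recorded when the matter resolved


def brier_score(matters):
    """Mean squared error between predicted probability and actual outcome.

    0.0 is a perfect score; always predicting 50% scores 0.25.
    """
    if not matters:
        raise ValueError("at least one resolved matter is required")
    total = sum(
        (m.predicted_probability - float(m.event_occurred)) ** 2
        for m in matters
    )
    return total / len(matters)


def to_training_labels(matters):
    """Turn resolved matters into (matter_id, label) pairs for retraining."""
    return [(m.matter_id, int(m.event_occurred)) for m in matters]
```

Tracking this score per model version over time is what makes a historical error rate reportable.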

Language models are excellent at what they were built for: drafting, summarizing, extracting. But generating language about risk and measuring risk are different problems. One writes. The other measures.

Know the outcome before capital moves.
24,439 Production Models · 3.52B+ Real Court Records · 89 Jurisdictions · Figures as of June 2026
© 2026 Criterica Intelligence · The Data and Analytics Subsidiary of Criterica Group · critericaintelligence.com