POSITION PAPER · JUNE 2026

Why calibrated outcome models cannot be replicated by language models.

There are two kinds of AI in legal technology, and they are not on the same road. The gap between them is not a capability gap that the next model generation closes. It is architectural, and it widens as language models improve, because the thing they cannot do becomes more visible.

The Structural Difference

Language Model

Generates language about risk.

Trained to predict the next token. When it says a case has a 72% chance of settling, that number is not a measurement. It is a plausible continuation, produced because similar sentences appeared in its training text. There is no distribution behind it, no error bound, and no track record against which to ask: of all the times it said 72%, how often did the event happen?

Fluent · Stochastic · Unauditable
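
The question "of all the times it said 72%, how often did the event happen?" is a calibration check, and it can be stated in a few lines. The following is a minimal sketch, assuming a hypothetical list of stated probabilities paired with recorded outcomes (1 if the event happened, 0 if not). The function names and the choice of ten equal-width bins are illustrative assumptions, not a description of any particular system.

```python
from collections import defaultdict


def reliability_table(predictions, outcomes, n_bins=10):
    """Group predictions into equal-width probability bins and compare the
    mean stated probability with the observed event frequency in each bin."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have equal length")
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        # A stated probability of exactly 1.0 is placed in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        table.append((idx, len(pairs), mean_p, observed))
    return table


def expected_calibration_error(predictions, outcomes, n_bins=10):
    """Count-weighted average gap between stated and observed frequency."""
    total = len(predictions)
    if total == 0:
        raise ValueError("at least one prediction is required")
    rows = reliability_table(predictions, outcomes, n_bins)
    return sum(count / total * abs(mean_p - observed)
               for _, count, mean_p, observed in rows)
```

A source that says 72% and is right 72% of the time scores near zero on this measure. The check is only possible when every stated probability is stored alongside the eventual outcome.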

Deterministic Statistical Model

Measures risk.

Trained on labeled outcomes: real cases, venues, judges, and resolutions with the actual result recorded as ground truth. When it outputs 72%, the figure traces to a distribution of comparable matters. Same inputs, same output, every time, with a versioned model and a documented training set behind it.

Calibrated · Reproducible · Auditable
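
One way a figure such as 72% can trace to a distribution of comparable matters is to count outcomes in that set and attach an error bound. The sketch below uses the Wilson score interval as an example. The counts are invented for illustration, and nothing here implies that this is the interval a given production model uses.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Observed rate plus a Wilson score interval for a binomial proportion.
    z=1.96 corresponds to roughly 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return p_hat, center - half_width, center + half_width


# Hypothetical counts: 360 of 500 comparable matters settled.
rate, low, high = wilson_interval(360, 500)
```

The interval narrows as the number of comparable matters grows, which is why the size of the labeled record matters as much as the point estimate.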

Three Structural Facts

The data does not exist in a language model's training set.

Calibration requires structured outcome labels: this matter, this jurisdiction, this resolution, this duration, this value. The open internet contains articles about cases, not the labeled, normalized outcome record. No amount of scale substitutes for ground truth the model never saw. Building that corpus is years of acquisition, normalization, and human labeling. It is the asset, and it cannot be prompted into existence.
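
A structured outcome label of the kind described here (matter, jurisdiction, resolution, duration, value) might look like the record below. This is a sketch only: the field names, the resolution categories, and the currency are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Resolution(Enum):
    # Illustrative categories; a real taxonomy would be far more detailed.
    SETTLED = "settled"
    DISMISSED = "dismissed"
    PLAINTIFF_VERDICT = "plaintiff_verdict"
    DEFENSE_VERDICT = "defense_verdict"


@dataclass(frozen=True)
class OutcomeRecord:
    matter_id: str
    jurisdiction: str
    resolution: Resolution
    filed_on: date
    resolved_on: date
    value_usd: Optional[float]  # None when the amount is undisclosed

    def __post_init__(self):
        # Reject records whose dates cannot describe a real matter.
        if self.resolved_on < self.filed_on:
            raise ValueError("resolved_on cannot precede filed_on")

    @property
    def duration_days(self) -> int:
        """Days from filing to resolution."""
        return (self.resolved_on - self.filed_on).days
```

Every field is a normalized value, not free text, which is the difference between an article about a case and a label a model can be trained and scored against.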

Stochastic systems cannot be audited.

Ask a language model the same question twice and you can get two different answers. A capital allocator, an insurer, or a regulator cannot underwrite against an instrument that changes its mind. Determinism is the minimum requirement for underwriting, reserving, audit, and regulatory scrutiny. It is the difference between a rating and a vibe.
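
Determinism of this kind can be checked mechanically. The sketch below binds each prediction to a fingerprint of its exact inputs and a model version, so an auditor can re-run the same inputs and confirm the same output. The model is a stand-in: a logistic function with hard-coded, hypothetical coefficients and feature names.

```python
import hashlib
import json
import math

MODEL_VERSION = "example-1.0.0"  # hypothetical version tag

# Hypothetical fixed coefficients; a real model would load versioned weights.
COEFFICIENTS = {"intercept": -0.4, "prior_settlements": 0.3, "claim_size_log": 0.1}


def predict(features):
    """Deterministic logistic score: no sampling and no randomness."""
    # Iterate in sorted key order so the result does not depend on dict order.
    z = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * features[name] for name in sorted(features)
    )
    return 1.0 / (1.0 + math.exp(-z))


def audit_record(features):
    """Tie the prediction to a hash of its inputs and the model version."""
    payload = json.dumps({"model": MODEL_VERSION, "features": features},
                         sort_keys=True)
    return {
        "model_version": MODEL_VERSION,
        "input_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "probability": predict(features),
    }
```

Calling `audit_record` twice on the same features returns identical records, and any change to the inputs or the model version changes the fingerprint.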

Calibration requires a closed feedback loop.

A statistical model improves because every resolved matter is scored against its original prediction and fed back as a label. A language model has no mechanism for this. It cannot report its historical error rate on case duration because its outputs have never been scored against resolved outcomes.
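
The feedback loop can be sketched as a ledger that stores each prediction when it is issued and scores it when the matter resolves. The class and method names are hypothetical, and the Brier score is used here as one common scoring rule for probability forecasts, not as the metric any specific system is known to use.

```python
from dataclasses import dataclass, field


@dataclass
class PredictionLedger:
    """Stores each prediction at issue time and scores it at resolution."""
    open_predictions: dict = field(default_factory=dict)  # matter_id -> probability
    scored: list = field(default_factory=list)  # (matter_id, probability, outcome)

    def record(self, matter_id, probability):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")
        # Refuse to overwrite a prediction that is still awaiting resolution.
        if matter_id in self.open_predictions:
            raise ValueError("a prediction is already open for this matter")
        self.open_predictions[matter_id] = probability

    def resolve(self, matter_id, outcome):
        """Score the original prediction against the outcome (1 = event occurred)."""
        probability = self.open_predictions.pop(matter_id)
        self.scored.append((matter_id, probability, int(outcome)))

    def brier_score(self):
        """Mean squared gap between stated probability and outcome; lower is better."""
        if not self.scored:
            raise ValueError("no resolved matters to score")
        return sum((p - y) ** 2 for _, p, y in self.scored) / len(self.scored)

    def training_labels(self):
        """Resolved matters, ready to be fed back as labeled examples."""
        return [(matter_id, y) for matter_id, _, y in self.scored]
```

The historical error rate is simply `brier_score()` over everything resolved so far, a number that exists only because the original predictions were kept and scored.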

Language models are excellent at what they were built for: drafting, summarizing, extracting. But generating language about risk and measuring risk are different problems. One writes. The other predicts. Know the outcome before capital moves.

24,439 Production Models

3.52B+ Real Court Records

Jurisdictions

Figures as of June 2026