A language model is trained to predict the next word. When it says a case has a 72% chance of settling, that number is not a measurement. It is a plausible token, produced because similar sentences appeared in training text. There is no distribution behind it, no error bound, and no way to ask: of all the times it said 72%, how often did the event happen?
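That closing question is a calibration check, and it fits in a few lines once predictions are stored next to what actually happened. The sketch below is illustrative only: the function name `observed_frequency`, the tolerance, and the sample history are assumptions made up for this example, not data or code from any real system.

```python
from typing import List, Tuple


def observed_frequency(
    history: List[Tuple[float, bool]], stated: float, tolerance: float = 0.025
) -> Tuple[int, float]:
    """Return (count, hit rate) for predictions within `tolerance` of `stated`.

    `history` pairs each predicted probability with whether the event happened.
    """
    matched = [happened for p, happened in history if abs(p - stated) <= tolerance]
    if not matched:
        raise ValueError("no predictions near the stated probability")
    return len(matched), sum(matched) / len(matched)


# Hypothetical history: (predicted probability of settlement, did it settle?)
history = [
    (0.72, True), (0.71, True), (0.73, False), (0.72, True),
    (0.30, False), (0.90, True),
]

count, rate = observed_frequency(history, stated=0.72)
print(f"{count} predictions near 72%, {rate:.0%} settled")
```

A well-calibrated model would show a hit rate close to the stated probability once the bucket holds enough cases. Four records, as here, are far too few to conclude anything.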
A statistical model is trained on labeled outcomes: real cases, venues, judges, and resolutions with the actual result recorded as ground truth. When it outputs 72%, the figure traces to a distribution of comparable matters. Same inputs, same output, every time, with a versioned model and a documented training set behind it.
Calibration requires structured outcome labels: this matter, this jurisdiction, this resolution, this duration, this value. The open internet contains articles about cases, not the labeled, normalized outcome record. No amount of scale substitutes for ground truth the model never saw. Building that corpus is years of acquisition, normalization, and human labeling. It is the asset, and it cannot be prompted into existence.
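To make "structured outcome label" concrete, here is a minimal sketch of what one normalized record might look like. Every field name, the `Resolution` categories, and the sample values are hypothetical. The source names the dimensions (matter, jurisdiction, resolution, duration, value) but not a schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Resolution(Enum):
    # Hypothetical category set; a real taxonomy would be more detailed.
    SETTLED = "settled"
    DISMISSED = "dismissed"
    PLAINTIFF_VERDICT = "plaintiff_verdict"
    DEFENSE_VERDICT = "defense_verdict"


@dataclass(frozen=True)
class OutcomeRecord:
    """One labeled matter: the ground truth a model is trained and scored on."""

    matter_id: str
    jurisdiction: str
    resolution: Resolution
    filed: date
    resolved: date
    value_usd: float

    @property
    def duration_days(self) -> int:
        # Duration is derived from the dates so it cannot drift out of sync.
        return (self.resolved - self.filed).days


# Invented example record, not a real matter.
record = OutcomeRecord(
    matter_id="example-0001",
    jurisdiction="example-jurisdiction",
    resolution=Resolution.SETTLED,
    filed=date(2021, 3, 1),
    resolved=date(2022, 9, 15),
    value_usd=1_250_000.0,
)
print(record.resolution.value, record.duration_days)
```

An article about the same case might mention all of these facts in prose. The label is useful only once each one sits in a fixed field with a fixed vocabulary.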
Ask a language model the same question twice and you can get two different answers. A capital allocator, an insurer, or a regulator cannot underwrite against an instrument that changes its mind. Determinism is the minimum requirement for underwriting, reserving, audit, and regulatory scrutiny. It is the difference between a rating and a vibe.
A statistical model improves because every resolved matter is scored against its original prediction and fed back as a label. A language model has no mechanism for this. It cannot report its historical error rate on case duration because its predictions have never been scored against actual durations.
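Scoring resolved matters against their original predictions can be sketched as follows. The Brier score and mean absolute error are common choices, used here as assumptions; the source does not say which metrics are applied. The sample numbers are invented.

```python
from typing import List, Tuple


def brier_score(scored: List[Tuple[float, bool]]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome.

    Lower is better; always predicting 50% scores 0.25.
    """
    if not scored:
        raise ValueError("nothing to score")
    return sum((p - float(happened)) ** 2 for p, happened in scored) / len(scored)


def mean_absolute_error(pairs: List[Tuple[float, float]]) -> float:
    """Average absolute gap between predicted and actual values."""
    if not pairs:
        raise ValueError("nothing to score")
    return sum(abs(predicted - actual) for predicted, actual in pairs) / len(pairs)


# Hypothetical resolved matters: (predicted settlement probability, settled?)
settlement = [(0.8, True), (0.6, False), (0.2, False), (0.9, True)]

# Hypothetical durations in days: (predicted, actual)
durations = [(400.0, 450.0), (300.0, 280.0), (700.0, 760.0)]

brier = brier_score(settlement)
duration_error = mean_absolute_error(durations)
print(f"Brier score: {brier:.4f}")
print(f"Mean duration error: {duration_error:.1f} days")
```

The historical error rate is simply these numbers tracked over time. Each newly resolved matter adds one more scored pair and one more training label.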
Language models are excellent at what they were built for: drafting, summarizing, extracting. But generating language about risk and measuring risk are different problems. One writes. The other predicts. Know the outcome before capital moves.