Lodlina measures how well language models do real U.S. public-sector work — benefits adjudication, records redaction, classification review, deployment readiness, policy Q&A — using defensible automated graders: every score traces to a labeled gold value or an exact string operation. Synthetic data, real use-case shapes, open methodology.
| # | Model | Lodlina Score |
|---|---|---|
| 1 | claude-opus-4-8 | 3,576 / 3,700 |
| 2 | claude-opus-4-7 | 3,457 / 3,700 |
| 3 | claude-haiku-4-5 | 3,003 / 3,700 |

The score is difficulty-weighted and open-ended: the maximum grows as harder packs are added, so the scale never saturates and is never reset. Scores are comparable only within a suite version.
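One way to read "difficulty-weighted and open-ended" is as a weighted sum over packs, where adding a harder pack raises the maximum without rescaling points already earned. The sketch below assumes that reading. The pack names, weights, and the `PackResult` type are hypothetical, and this is not Lodlina's published formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackResult:
    name: str         # hypothetical pack identifier
    weight: int       # assumed difficulty weight (maximum points for the pack)
    pass_rate: float  # fraction of items graded correct, in [0, 1]


def suite_score(results: list[PackResult]) -> tuple[int, int]:
    """Return (earned, maximum) points for one suite version."""
    earned = sum(round(r.weight * r.pass_rate) for r in results)
    maximum = sum(r.weight for r in results)
    return earned, maximum


# Suite v1 has two packs; v2 adds a harder, more heavily weighted pack.
v1 = [PackResult("benefits-basic", 100, 0.9), PackResult("foia-redaction", 200, 0.5)]
v2 = v1 + [PackResult("releasability-hard", 400, 0.25)]

print(suite_score(v1))  # (190, 300)
print(suite_score(v2))  # (290, 700)
```

Under this assumption the points earned on existing packs are unchanged between versions, but the denominator differs, which is why totals from different suite versions should not be compared directly.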
Headline metrics are deterministic: leak rates against labeled spans, determinations against rule-derived gold, citations checked verbatim. Model-graded metrics use a cross-family jury, deterministically gated, with per-juror votes published.
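The deterministic headline metrics reduce to exact string operations. Here is a minimal sketch, assuming a leak means a labeled sensitive span survives verbatim in the model's output, and a citation passes only if it is an exact substring of the source. The function names and example strings are hypothetical.

```python
def leak_rate(output: str, labeled_spans: list[str]) -> float:
    """Fraction of labeled sensitive spans that still appear verbatim in the output."""
    if not labeled_spans:
        return 0.0
    leaked = sum(1 for span in labeled_spans if span in output)
    return leaked / len(labeled_spans)


def citation_is_verbatim(citation: str, source: str) -> bool:
    """A citation counts only if it is a non-empty exact substring of the source."""
    return bool(citation) and citation in source


source = "Applicant Jane Roe, SSN 123-45-6789, requested records on 2 May."
redacted = "Applicant [REDACTED], SSN 123-45-6789, requested records on 2 May."
spans = ["Jane Roe", "123-45-6789"]  # labeled gold spans that must not survive

print(leak_rate(redacted, spans))                                   # 0.5
print(citation_is_verbatim("requested records on 2 May", source))  # True
print(citation_is_verbatim("requested the records", source))       # False
```

Because both checks are plain substring tests, the same inputs always produce the same score, and any disputed result can be traced back to a specific labeled span or quoted string.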
Not trivia — the work itself: FOIA redaction with over-redaction traps, SNAP-like eligibility math, NOFORN/REL-TO releasability calls, deployment-readiness determinations, abstention when the source can't answer.
Counterfactual name-flips on binding determinations. False answers to unanswerable questions. Spillage of classified facts restated in (U)-marked text. Generational abstention gaps invisible to accuracy-only boards.
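A counterfactual name-flip check can be sketched as follows: the same case is rendered twice with only the applicant's name changed, and any difference in the determination counts as a flip. The `determine` callable stands in for a model call. The template, names, and stub models are hypothetical.

```python
from typing import Callable


def name_flip_rate(
    determine: Callable[[str], str],
    template: str,
    name_pairs: list[tuple[str, str]],
) -> float:
    """Fraction of name pairs for which the determination changes."""
    if not name_pairs:
        return 0.0
    flips = sum(
        1
        for a, b in name_pairs
        if determine(template.format(name=a)) != determine(template.format(name=b))
    )
    return flips / len(name_pairs)


TEMPLATE = "{name} has a household of 3 and monthly income of 1800. Eligible?"


def biased_stub(prompt: str) -> str:
    # Toy stand-in for a model that (wrongly) keys on the name.
    return "deny" if prompt.startswith("Alex") else "approve"


def fair_stub(prompt: str) -> str:
    # Toy stand-in for a model that ignores the name entirely.
    return "approve"


pairs = [("Alex Kim", "Sam Lee"), ("Sam Lee", "Pat Doe")]
print(name_flip_rate(biased_stub, TEMPLATE, pairs))  # 0.5
print(name_flip_rate(fair_stub, TEMPLATE, pairs))    # 0.0
```

A model can score well on accuracy alone and still show a non-zero flip rate, which is why this kind of check is reported separately.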
Every pack is procedurally generated and reseedable. Official boards run on fresh, never-published holdout items, following the psychometric-testing model: methods public, item bank private.
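Reseedable procedural generation means a pack is a pure function of its seed: the same seed reproduces the same items and gold labels, while a fresh private seed yields a new holdout set. The sketch below assumes a toy income-threshold rule. The threshold, field names, and `generate_pack` are hypothetical and do not reflect Lodlina's actual generators.

```python
import random

INCOME_LIMIT_PER_PERSON = 1000  # toy threshold, not a real program rule


def generate_pack(seed: int, n_items: int = 5) -> list[dict]:
    """Generate synthetic eligibility items whose gold label is rule-derived."""
    rng = random.Random(seed)  # local RNG, so output depends only on the seed
    items = []
    for i in range(n_items):
        household = rng.randint(1, 6)
        income = rng.randrange(500, 8000, 50)
        limit = household * INCOME_LIMIT_PER_PERSON
        gold = "eligible" if income <= limit else "ineligible"
        items.append({"id": i, "household": household, "income": income, "gold": gold})
    return items


public_dev = generate_pack(seed=1)
assert public_dev == generate_pack(seed=1)  # same seed reproduces the same pack

# A privately held seed produces a separate item set for official runs.
holdout = generate_pack(seed=20240517)
```

Since the gold label is computed from the same rule that generated the item, every score on such a pack traces back to a rule-derived value, and the generator code can be published without exposing the holdout items.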
Eight mission buckets from the public content taxonomy; packs ship across Benefits, Records & Privacy, Citizen Services, Authoritative Q&A, National-Security Information Protection, and Defense Mission Support.