Lodlina: A Plumb Line for Government AI
Methodology — synthetic public-sector task evaluation for large language models
Lodlina contributors — June 2026 — paper v2.2 (revision history in the repository)
How to read this document. Every claim in this paper is labeled by status: [Shipped] — exists in the released software and is reproducible today; [Measured] — an empirical result from a dated, documented run, reported with its sample size and uncertainty; [Roadmap] — designed and committed to, not yet built. We hold ourselves to not blurring these.
Executive summary
Lodlina is an open-source evaluation suite that measures how well large language models perform U.S. public-sector work — redacting records, determining benefits eligibility, answering questions from policy documents, handling controlled information — using graders designed to be auditable: most scores are computed from labeled ground truth and exact string operations, so an evaluation practitioner or an inspector general can verify every number against archived run artifacts and re-run the pipeline end to end (grading is exactly re-derivable: the same outputs always produce the same scores) from the open repository. [Shipped]
What exists today is Suite 2026.1 — explicitly a pilot-scale public suite: seven evaluation packs, 223 synthetic records expanding to 603 scored samples, deterministic headline metrics named after real-world harm proxies (a leak rate, a hallucinated-citation rate, a decision-flip rate), a cross-family grading jury for the two metrics that require model judgment, a regenerable private holdout for official scoring, and a versioned composite score. [Shipped] This paper reports a full-suite measurement of nine named models across six model families, served through in-boundary provider endpoints, with denominators and confidence intervals, including a same-name control run that quantifies how much "unfairness" signal is actually sampling noise. [Measured]
The paper also describes the program for keeping the instrument useful as models improve — procedurally-generated difficulty ladders and, if and only if calibration criteria are met, a psychometric ability index with no fixed maximum. [Roadmap] We state plainly: the shipped suite is bounded and parts of it are already saturated for frontier models (we name which parts, with numbers); the long-run program is the designed response, not a present capability.
All data is synthetic. No real personally identifiable information, controlled unclassified information, or classified material is used anywhere. Lodlina is an evaluation instrument that can supply evidence to an agency's risk-management process; it is not a compliance artifact and satisfies no authorization, impact assessment, or civil-rights review.
1. Why government AI needs its own measurement
Agencies are adopting AI for records processing, eligibility screening, public question-answering, and plain-language communication. These tasks share a property general-purpose benchmarks do not measure: the cost of being wrong is asymmetric and concrete. Leaking a Social Security number is not a rounding error; a benefits decision that changes when only the applicant's name changes is not a style preference; an invented citation in a determination letter is not a minor hallucination.
The federal context adds a second requirement: numbers used to justify model selection must withstand audit. Under OMB Memorandum M-25-21 (April 2025, which rescinded and replaced M-24-10), agencies must apply minimum risk-management practices to high-impact AI — uses with significant consequences for rights, safety, or critical missions — and OMB M-25-22 governs AI acquisition [15]. The NIST AI Risk Management Framework (AI 100-1) names the trustworthiness characteristics agencies are expected to assess: valid and reliable; safe; secure and resilient; accountable and transparent; explainable; privacy-enhanced; and fair with harmful bias managed [14]. Section 8 maps what evidence Lodlina supplies toward which of these — and what it does not.
A score produced by an opaque judge — another model's unexamined opinion, a crowd's preference — cannot anchor an accountability-grade decision. Lodlina's position is that a few deeply auditable tasks beat many shallow ones.
2. The suite at a glance [Shipped]
2.1 Suite 2026.1 — the version measured in §5
| Pack | Bucket | Items | Scored samples | Headline metric | Grading | Difficulty tier | Frontier status (June 2026) |
|---|---|---|---|---|---|---|---|
| records-redaction | B2 Records & Privacy | 60 docs / 240 labeled must-redact spans | 60 | leak_rate |
deterministic | 2 | partially discriminating |
| b2-records-adversarial | B2 | 30 docs / 90 spans + traps | 30 | leak_rate, over_redaction_rate |
deterministic | 2 | saturated (all measured leak rates ≤0.08, statistically indistinguishable) |
| eligibility-fairness | B1 Benefits | 36 cases × 6 names | 216 | accuracy, flip_rate |
deterministic (rule-derived gold) | 1 | saturated (frontier at 1.00/0.00) |
| b1-benefits-hfap | B1 | 40 cases × 6 names | 240 | accuracy · (1−flip_rate) |
deterministic (rule-derived gold) | 4 | discriminating |
| grounded-qa | B4 Authoritative Q&A | 15 questions / 5 docs | 15 | hallucinated_citation_rate |
deterministic + jury | 2 | saturated on current items |
| plain-language | B3 Citizen Services | 12 paragraphs | 12 | readability + meaning preservation | deterministic + jury | 3 | discriminating (wide CIs at n=12) |
| b7-classified-spillage | B7 NatSec Info Protection | 30 docs / 100 marked spans | 30 | leak_rate (spillage) |
deterministic | 2 | saturated (frontier at 0.00) |
Totals: 7 packs, 223 records, 603 scored samples per model; difficulty-tier sum 16 → composite maximum 1,600 for this suite version (§6). Mission buckets follow the public content taxonomy; the taxonomy defines eight buckets, of which five currently have shipped packs — the others (Adjudication, Acquisition & Grants, Defense Mission Support) are content roadmap, not coverage.
2.2 The suite today (2026.3-dev)
Since the §5 run the suite has grown to 13 packs / composite maximum 3,700:
the generated packs were scaled (records-redaction 150 docs, HFAP 200 cases,
b2/b7 100 each, plain-language 50 generated paragraphs — every one reseedable
for holdouts), and six packs were added — b7-foreign-disclosure (coalition
releasability: NOFORN/REL-TO logic, the third-agency rule),
b8-deployment-readiness (individual readiness determination),
b8-intel-grounded-qa (grounded Q&A with abstention: half the questions
are guaranteed-unanswerable, making the false-answer rate an exact labeled
measurement), and three difficulty-ladder rungs (§7.2, now shipped):
b1-benefits-hfap-l2 (derived quantities, conditional deductions),
b4-policy-multihop (two-fact joins with broken chains), and
b7-derived-spillage (classification by derivation: a (U)-marked paragraph
restating a classified fact). Results on 2026.3 will be reported when the
suite is next measured end-to-end; 2026.1 and 2026.3 scores are not
comparable.
Saturation is disclosed, not hidden: where the table says saturated, current frontier models tie at or near the ceiling and the pack functions as a regression baseline (it still separates weaker or older models and will catch capability regressions), not as a frontier discriminator. This role is no longer hypothetical: on the nine-model board, the frontier-saturated packs caught one model failing the easy eligibility pack and disclosing marked content on the spillage pack (§5, finding 2).
3. What the graders measure, and how
Deterministic first. Where a question has ground truth, grading is labels and exact string operations. A redaction is matched against labeled spans under a stated normalization rule; a citation either appears verbatim in the source document (after whitespace/case folding) or it does not; an eligibility determination either matches the rule-derived answer or it does not. This stance has independent support: LiveBench grounds its scoring in objective ground truth because crowdsourced and LLM-judge alternatives "can introduce significant biases, and break down when scoring hard questions" [11].
Scope honesty for the records packs. The redaction packs measure
privacy-span identification — catching Exemption-6-style personal
identifiers while leaving releasable content alone. Real FOIA review is wider
and judgment-laden: exemption selection and balancing, foreseeable-harm
analysis, segregability, discretionary release. Lodlina's packs measure the
mechanical substrate of that work, not the legal judgment; we name the packs
and metrics accordingly and treat exemption-reasoning as a separate, harder
construct on the content roadmap. The same honesty applies to the
classified-handling pack: it measures whether explicitly marked synthetic
content is withheld on release review (with fabricated markings on fictional
operations — no real classified material, no real CUI, and no tradecraft).
Derivative/contextual classification — protecting what is sensitive by
restatement rather than marking — has since shipped as the
b7-derived-spillage ladder rung: a paragraph marked (U) that paraphrases a
classified fact is itself a labeled verbatim must-withhold span, so the
grading stays an exact string operation while the recognition requires
reviewing substance over markings.
Counterfactual name-sensitivity (not "fairness" writ large). Each
eligibility case is re-run with only the applicant's name changed across six
names spanning gender and demographic association; the flip_rate counts cases
whose binding determination is not constant. We previously described this as a
fairness metric; reviewers correctly pushed back, and we now state it
precisely: a name is not a protected attribute but a proxy, a flip is
counterfactual decision instability under a legally-irrelevant edit, and a
flip rate of zero establishes consistency under this one probe — necessary
for fairness, never sufficient. Group-fairness metrics (error-rate parity,
calibration across groups) require different designs and are not claimed.
Critically, flips can also arise from sampling nondeterminism alone, so the
protocol now includes a same-name control (§5): flip rates are only
interpretable against a measured noise floor.
Grading-semantics hardening (June 2026). Three grader rules were tightened after adversarial review and live shake-down runs, each replacing a heuristic with a label- or scale-bounded rule: redaction containment credit is label-bounded (withholding a whole sentence of sensitive items earns credit; a prediction that engulfs a labeled must-NOT-redact span — e.g. the whole document — earns nothing); citations must be meaningful passages (minimum length, and at most 80% of the source — a "citation" of the entire document defeats the purpose and exploits judge verbosity bias); and plain-language readability is graded on a continuous band (full credit at grade ≤9, declining to zero at 15) rather than a binary cliff that scored a grade-9.1 rewrite identically to a grade-25 one. Every such rule carries standing adversarial tests (a 24-attack grader-gaming suite) encoding the exploits it forecloses.
Model judgment, jury-graded and gated. Two metrics require judging natural language: does a (verbatim-verified) citation actually support the claim; does a plain-language rewrite preserve meaning. These use a strict binary rubric and a jury with three structural rules. First, the panel spans model families — the standing mitigation for documented same-family judge inflation. Second — added after external review caught our own violation of it — no juror may be a model under test: a juror voting on its own outputs (or solely on a direct competitor's) corrupts the metric, so the designated jurors are previous-generation models excluded from the candidate line-up, and the software emits a conflict warning whenever a grader is also a candidate. Third, the panel is odd-sized (currently three jurors from three families — Claude, GPT, and Nova lineages — each individually calibrated at κ = 1.00 on the gold set before being seated), so the strict-majority rule (2 of 3) is not the "any disagreement fails" harshness of an even panel, and no single family can decide a verdict alone. Every published model-graded number records its panel, and the per-juror vote matrix is machine-readable in the run logs and published with results. These rules were not cosmetic refinements: in our own testing, re-running a board after removing candidate-jurors changed which model ranked first (§5) — which is precisely why the rules exist.
Synthetic only, with precision about "fake." Personal identifiers are
constructed to be non-real: Social Security numbers use the 900–999 area range
(never issued as SSNs; we note these areas overlap the IRS ITIN format, which
is why we say "never issued as SSNs" rather than "meaningless numbers"), phone
numbers use the reserved 555-01xx block, and emails use example.com. Generators
embed every labeled gold span verbatim in its document, and an independent
validator (lodlina validate) re-checks span integrity and the synthetic flag
on every pack — generation and validation are separate code paths.
4. Evaluation protocol [Shipped]
- Harness: Inspect (UK AI Security Institute), version pinned in the repository lockfile (0.3.239 for the run reported below). Tasks are dataset → system prompt → single generation → scorer. No tool use: models cannot run code, browse, or call functions in any Lodlina task. (This matters for the computational packs: a model must do the multi-step computation without external tools in text, and any future change to tool policy would be a suite-version change.)
- Prompts: fixed per task, published in the repository, identical for every model.
- Sampling: provider-default decoding; temperature is not pinned. This is a deliberate (and revisable) choice to measure models as agencies would consume them, and it makes nondeterminism a real part of the measurement — which is why the same-name control exists and why repeated-run variance reporting is on the roadmap (§9).
- Generation budgets (protocol upgrade, post-run): every model runs at a
generous per-model output-token budget that must never truncate a good-faith
answer (16,384 default; a model with a lower provider ceiling gets a
documented per-model entry) — never a lowest-common cap, because a fixed
cap clips reasoning models disproportionately on exactly the hard items and
reorders rankings [arXiv:2504.08120, 2504.14350]. Every board records an
incomplete_rateper model×pack (generations that hit the cap or returned empty) and warns loudly at ≥5%: a truncated generation is an observable harness event, never a silent "incorrect" (the practice Inspect codifies as NOANSWER). This upgrade exists because the §5 run predates it — see the corrections notice in §5. - Provider load (protocol upgrade, post-run): per-model concurrency is
bounded with backoff (Bedrock throttles far tighter than direct APIs), and a
throttled juror call retries before its vote is recorded; a juror that
fails all retries is recorded as an auditable
X(abstain) vote that never counts as a substantive verdict — a dropped vote must not flip a majority. - Answer parsing: structured-output parsing is final-answer-wins (a model that reasons, self-corrects, and answers is graded on its final answer, regardless of markdown fencing), with unparseable output never earning abstention or consistency credit. Parsing rules carry standing adversarial tests; answer-extraction error is a documented score-distortion axis of its own [xFinder, arXiv:2405.11874].
- Models under test (this paper's run): nine models across six U.S.-hosted
model families, all on in-boundary Bedrock routes:
claude-opus-4-8,claude-sonnet-4-6,claude-haiku-4-5(Anthropic);gpt-5.5(OpenAI, via the Bedrock Mantle endpoint, us-east-2);nova-premier,nova-2-lite(Amazon);llama4-maverick(Meta);deepseek-r1(DeepSeek — included deliberately as a non-U.S.-company comparison point);nemotron-super(NVIDIA). Exact provider identifiers in the archived run notes. One run per model over the full committed suite, June 10, 2026. - Grading jury:
claude-sonnet-4-5+gpt-5.4+nova-pro— three families, odd-sized, all non-candidates, each calibrated κ = 1.00 before seating (§3). Jury identity is stamped into every output artifact. - Outputs: ranked table plus per-task rates (Markdown/JSON/HTML), each carrying provenance: Lodlina version, suite version, pack list, jury, timestamp, and the exact command to reproduce.
5. Measured results, June 2026 [Measured]
Corrections notice (June 11, 2026) — two findings under re-measurement. Post-run hardening of the harness (§4 protocol upgrades) found two defects in the run reported below that contaminate specific findings: (1) Reasoning-model truncation. This run used a low default output-token cap. DeepSeek-R1 spends its output budget on chain-of-thought; in subsequent instrumented runs it hit the cap (returning empty final answers, scored unparseable = wrong) on a majority of samples for the eligibility and readiness task shapes, and scored 1.00 on the same items once given an adequate budget. Finding 2's claims that deepseek-r1 failed easy eligibility (0.704 accuracy, "0.444 flips") are therefore substantially harness artifact: under the then-current flip logic, truncated (unparseable) variants also counted toward flips, so the flip figure is largely truncation noise, not name sensitivity. The spillage observation (0.211) had low truncation exposure and likely stands, but will be re-measured. (2) Juror-verdict parsing. This run's verdict parser took the first "GRADE:" match in a juror's output, so a juror echoing the rubric text could be recorded as C regardless of its actual verdict. The nova-pro leniency observation (105/108 C) below may be partly parsing artifact; the raw juror logs for this run were not archived, so it cannot be re-derived and will be re-measured. All affected results will be replaced by a fresh board on the current suite under the upgraded protocol; the run below otherwise stands as the dated 2026.1 record. We publish this notice rather than silently revising — the same standard as the draw-count correction below.
Full-suite run, nine named models, denominators and 95% intervals as stated. (Wilson intervals where the metric is a proportion of independent units; span-level leak rates are per-document means reported ± 1.96 × cluster standard error, document as the cluster; full artifacts in the archived run.)
Composite (suite 2026.1, max 1,600) — presented as bands, with and without the smallest-evidence metric (§6 explains why both):
| Model | Full suite /1,600 | Without plain-language /1,300 |
|---|---|---|
| claude-opus-4-8 | 1,364 | 1,214 |
| claude-sonnet-4-6 | 1,343 | 1,249 |
| gpt-5.5 | 1,312 | 1,287 |
| nova-2-lite | 1,252 | 1,150 |
| deepseek-r1 | 1,177 | 1,084 |
| claude-haiku-4-5 | 1,176 | 1,101 |
| llama4-maverick | 1,083 | 1,052 |
| nemotron-super | 1,055 | 1,021 |
| nova-premier | 982 | 865 |
The leading band {opus-4-8, sonnet-4-6, gpt-5.5} is robust — the #1 position did not change under any ±1 perturbation of any difficulty tier (14 perturbations) — but ordering within the band is not statistically supported (single run, provider-default decoding, no propagated interval), and it inverts when the n=12 plain-language metric is excluded (opus leads with it; gpt-5.5 without it). Read the bands, not the ranks.
Selected per-task rates (the audit layer; all nine in the archived run):
| Metric (n) | opus-4-8 | sonnet-4-6 | gpt-5.5 | nova-2-lite | deepseek-r1 | haiku-4-5 | llama4-mav | nemotron | nova-premier |
|---|---|---|---|---|---|---|---|---|---|
redaction leak_rate ±CI (60 docs) |
.135±.026 | .167±.023 | .037±.018 | .167±.023 | .167±.023 | .164±.024 | .167±.023 | .222±.058 | .167±.023 |
| HFAP accuracy [CI] (240) | .975 [.95,.99] | 1.00 [.98,1] | 1.00 [.98,1] | .812 [.76,.86] | .963 [.93,.98] | .738 [.68,.79] | .704 [.64,.76] | .729 [.67,.78] | .487 [.43,.55] |
HFAP flip_rate (40 cases) |
.100 | .000 | .000 | .075 | .125 | .150 | .150 | .375 | .775 |
| easy eligibility accuracy (216) | 1.00 | 1.00 | 1.00 | 1.00 | .704 | 1.00 | 1.00 | 1.00 | 1.00 |
b7 spillage leak_rate (30 docs) |
.000 | .000 | .000 | .000 | .211 | .000 | .028 | .006 | .000 |
| meaning preserved [CI] (12) | .667 [.39,.86] | .750 [.47,.91] | 1.00 [.76,1] | .583 [.32,.81] | .750 [.47,.91] | .500 [.25,.75] | .417 [.19,.68] | .667 [.39,.86] | .667 [.39,.86] |
Minimum detectable effects at small n: a 0.000 at n=15 questions (hallucinated citations: all nine models) has a 95% upper bound near 0.20, and a 0.000 at n=30 documents near 0.11 — small-pack zeros mean "below detection," not "never."
Flip rates and the noise floor — including a correction to our own first
analysis. Same-name controls (each case run twice with the identical name;
flips = pure sampling noise) measured floors of 0.000–0.075 for the four
frontier models and, strikingly, 0.200 (nemotron-super) and 0.275
(nova-premier) — some models are simply unstable on borderline computations
under provider-default decoding. Comparing those floors to cross-name flip
rates requires a draw-count adjustment our first analysis lacked: the control
uses 2 draws per case while the cross-name metric uses 6, giving noise more
chances to flip. Converting each model's 2-draw floor to a 6-draw expectation
under a per-draw instability model, no model's cross-name flip rate clearly
exceeds its noise expectation in this run — nova-premier's eye-catching 0.775
sits near its ~0.65 noise expectation, and nemotron's 0.375 sits below its
~0.50. We therefore claim no name-sensitivity for any model, and we record the
protocol upgrade this forced: future controls will be draw-count-matched
(k identical-name variants, not 2). What flip_rate does demonstrably
measure at this scale is decision instability — itself procurement-relevant
(a benefits system that decides borderline cases differently on re-run is a
due-process problem regardless of names).
The per-juror vote matrix, and what it exposed. (See the corrections
notice: this run's verdict parsing may inflate the leniency figure; the
observation is under re-measurement.) Publishing per-juror votes
(machine-readable in every run log) immediately earned its keep: on the
plain-language metric, juror nova-pro voted correct on 105 of 108 samples —
near-ceiling leniency that effectively delegates verdicts to the other two
jurors (who agree with each other closely, e.g. 7/12 vs 7/12 on opus). Its
clear-case calibration (κ = 1.00) did not predict this, which is precisely the
"calibration set probes easy cases only" limitation §11 states — hardening the
calibration set with contested cases is the tracked fix, and the matrix is why
the issue is visible at all. With 2-of-3 majority, the practical effect is that
meaning scores track the stricter juror pair; the column's wide n=12 intervals
remain the binding caveat.
Findings of the nine-model board. 1. Computation discriminates across the whole field, not just the frontier: HFAP accuracy spans 0.487 → 1.000 with non-overlapping interval tiers (nova-premier ≪ {llama4, nemotron, haiku} < nova-2-lite < deepseek-r1 < {opus} ≤ {sonnet, gpt-5.5}). 2. The frontier-saturated packs separate the wider board, as designed — but see the corrections notice: the deepseek-r1 portion of this finding is substantially harness artifact (truncation), not capability. The marked-spillage observation (0.211 vs ≤0.028 for U.S.-family models) had low truncation exposure and likely stands pending re-measurement. The corrected lesson is sharper than the original: a "weak model" signal can be a harness-configuration signal, which is why the upgraded protocol (§4) treats truncation as a first-class observable rather than a wrong answer. 3. gpt-5.5 is the only model whose redaction leak rate separates from the pack (0.037 ± 0.018 vs a 0.135–0.222 cluster).
6. The composite, and when not to use it [Shipped]
The Lodlina Score is a difficulty-weighted sum: 100 × Σ (difficulty_t × q_t),
where each pack's quality q_t ∈ [0,1] is computed only from that pack's
defensible rates (e.g., redaction: (1−leak)·(1−over_redaction); eligibility:
accuracy·(1−flip_rate)). Weights are the published difficulty tiers — there
is no hidden weighting; tiers are author-assigned from observed discrimination
(saturated packs tiered low), an editorial judgment we expose rather than
disguise — and a model with incomplete coverage gets no composite rather than a
misleading one.
The composite is explicitly secondary, and §5 demonstrates why with this run's own numbers: the top-of-board ordering is decided by the smallest-evidence metric (the n=12 plain-language pack — opus leads with it included, gpt-5.5 leads with it excluded), the ordering among the top three is not statistically supported (single run, unpinned decoding, no propagated interval), and we therefore present the composite with and without the under-powered metric and as bands, not ranks. A rank-stability check (every tier perturbed ±1, 14 perturbations) did not change the #1 position, but band membership, not ordinal rank, is the supported reading. The composite should never be used when the decision concerns a specific harm (use that harm's rate directly), nor across suite versions, nor as a compliance threshold. Different harms are not commensurable in any deep sense; the composite is a transparent convention, not a theory of value.
Suite 2026.1's composite has a fixed maximum (1,600). We regard that as acceptable for a versioned interim and inadequate for the long run — §7.
7. Scoring for the long run [Roadmap]
7.1 The saturation evidence, briefly
Fixed benchmarks lose resolution as models improve: frontier models exceed 90% on MMLU, the stated motivation for both MMLU-Pro [1] and Humanity's Last Exam [2]; an IRT analysis of 41,871 items across 11 static benchmarks found top-model ability (~3.0 on the logit scale) far above the hardest items (~1.0) [12]; and even HLE's authors predicted high scores within roughly a year of release and pre-disclaimed what that would mean [2]. The transferable lesson: no scoring scale rescues an exhausted item pool — when every model clears the hardest item, bounded scores, unbounded scores, and pairwise ratings all stop separating them. Durable measurement needs both a supply of harder items and a scaling model that converts harder items into measurable headroom.
We also note why we do not adopt preference ratings, the other common "unbounded" scale: pairwise human-preference systems are documented to partly measure presentation style until style is statistically controlled [5], to be sensitive to vote pollution (in simulation, ~10% low-quality votes moved a model up to five ranks [7]), and to be exposed to operator-policy gaming (documented, and partially contested by the operator [6]). Lodlina's deterministic graders avoid the preference-vote failure class entirely — though not all gaming: regex edge cases, generator artifacts, and format exploits remain our attack surface, which is why grader code is public and adversarially reviewed.
7.2 Difficulty ladders — the content engine [now Shipped]
Update (June 2026): the first three ladder rungs shipped —
b1-benefits-hfap-l2 (difficulty 5: every tested quantity derived, capped and
conditional deductions, computation-weighted sampling), b4-policy-multihop,
and b7-derived-spillage. The L2 rung's first live spread check separates a
tier that ties at 1.00 on tier-1 packs: 1.00 / 0.83 / 0.63 accuracy across
three models, with name-flip inconsistency appearing exactly where the
computation gets hard.
Lodlina's data is generated, so difficulty is a parameter. The HFAP pack is the existence proof at one rung: where four linear rules saturated (every model at 1.000), a program with size-indexed tables, layered deductions, a categorical-eligibility shortcut, and exemptions that change which tests apply produced a 26-point spread (§5). The generator can produce deeper rule graphs and longer deduction chains, with gold labels derived by the same rule engine the policy text describes — mechanically correct as long as the generator is correct, which is exactly why generator code is public, span integrity is independently validated, and item audits are part of the program (§7.4).
The procedural-generation literature supports the mechanism: GSM-Infinite, generating reasoning problems of parametrically increasing complexity, observes a consistent sigmoid decline in model performance as complexity rises — sustained discrimination across the frontier [9]; Reasoning Gym ships over 100 generators with adjustable complexity [10]; LiveBench commits to releasing harder task versions over time to keep separating models [11].
Construct validity is the real risk, and we own it. Sustained discrimination is not sustained validity: at some ladder depth, a "benefits eligibility" item stops being a proxy for caseworker-relevant work and becomes a symbolic-execution puzzle in a benefits costume (the same critique applies to GSM-Infinite as a measure of "math"). Our commitments: ladders are scoped to the computational task families where depth has a domain interpretation (multi-program benefits interactions; derivative classification chains) — we do not pretend redaction or plain language ladder the same way, so the ladder program extends part of the suite, stated plainly; each ladder publishes a realism ceiling — the rung beyond which we no longer claim domain validity, with practitioner review of sampled items (former caseworkers / records officers) as the check; and ablations must show higher rungs increase reasoning load rather than token count or formatting burden. Generator- shortcut exploitation — models learning the generator rather than the task — is a documented failure mode of procedural benchmarks [13]; mitigations are generator variety, private holdout seeds, shortcut-probing ablations, and empirical saturation monitoring.
7.3 The index — only if it earns it
Over a ladder, a composite need not have a maximum. The closest deployed analogue — an inspiration rather than a turnkey template, since it operates at ecosystem scale across many benchmarks and models — is Epoch's Capabilities Index: item/benchmark difficulty inferred statistically (2PL IRT) from overlapping results, ability reported on an anchored scale with no maximum, meaningful relatively rather than absolutely [8]. We intend a Lodlina analogue over ladder rungs — and we publish, in advance, the conditions under which we will not ship it: a minimum model×item response matrix (a floor, not a sufficiency guarantee: ≥25 models × ≥500 calibrated items with cross-rung overlap), published item-parameter standard errors and fit statistics, a stated anchor-and-linking design across suite versions, and no interval-scale claims beyond what the calibration supports. Until those criteria are met, the transparent weighted sum of §6 stands, always with its suite version and maximum attached.
7.4 An operated instrument
The posture for all of the above: versioned suites (CalVer — calendar-based versions like 2026.1); official scores run on regenerated private-seed holdouts (the committed datasets are the public practice set; note honestly that since generator code is public, a holdout defeats memorization of test items, not knowledge of test structure); a published statistical criterion for declaring a pack saturated; grader recalibration on a hardening gold set; and item-audit invitations (§9). Frontier benchmarks have converged on maintenance-program operation — HLE pairs its public set with a private holdout and ran a post-release bug-bounty that removed flawed items [2] — and the execution risk is real: an independent audit later found 29 ± 3.7% of HLE's text-only chemistry/biology answers contradicted by peer-reviewed literature (the HLE team's own follow-up put the problematic fraction of a Bio/Chem subset near 18%) [3]. We take the lesson as: design-level verifiability does not guarantee execution-level correctness — which is why our gold labels are mechanically derived where possible and why we want outside eyes on the rest.
8. What this supplies a federal risk file — and what it does not
| Lodlina evidence | NIST AI RMF characteristic [14] | Notes |
|---|---|---|
leak_rate / spillage rates |
Privacy-enhanced | span-identification proxy, not legal FOIA judgment (§3) |
| rule-derived accuracy | Valid & reliable | on synthetic, controlled-proxy tasks |
hallucinated_citation_rate + verbatim gating |
Valid & reliable; Accountable & transparent | deterministic headline; support judgment is jury-graded |
counterfactual flip_rate + noise-floor control |
Fair — harmful bias managed | one narrow probe; necessary, not sufficient (§3) |
| open graders, provenance-stamped runs | Accountable & transparent | every number re-derivable from the repository |
| abstention, prompt-injection resistance | (Safe; Secure & resilient) | not yet measured — content roadmap |
For OMB purposes: Lodlina results are supporting evidence an agency may cite inside its M-25-21 risk-management practices for high-impact AI and its M-25-22 acquisition diligence [15]. They are not an authorization, a privacy impact assessment, a civil-rights analysis, or a security assessment, and we decline in advance any use of a Lodlina Score as a stand-alone compliance threshold.
9. Governance, provenance, and the audit invitation
Lodlina is currently maintained pseudonymously ("Lodlina contributors"). We state this as a limitation rather than disguise it: for a measurement instrument aimed at government, maintainer identity, funding, and conflict-of-interest disclosure are part of trustworthiness, and this paper is weaker for their absence. What we can state today: the project takes no funding from, and has no commercial relationship with, any model vendor; private holdout seeds are held by the maintainers and never committed; the trust boundary for community content is data and configuration only — contributed eval packs are schema-validated datasets and manifests; no contributed code executes; and adding or changing grading logic is a reviewed change to the core repository.
Two closed loops are acknowledged: the grader-calibration gold set is authored by the same project it validates, and the maintainers audit their own generators. We also acknowledge that the no-vendor-funding assertion above is, under pseudonymity, not externally verifiable — it becomes checkable if and when maintainers are identified or an independent auditor is engaged. The mitigation we can offer immediately is total transparency — every dataset, generator, prompt, scorer, gold label, and run command is public under the MIT license, so anyone can perform the audit we would otherwise commission. The mitigation we seek: an independent item audit (an outside reviewer examines a random sample of generated items and gold labels; we publish the error rate, SWE-bench-Verified-style). We do not currently have such an auditor and will not pretend otherwise; this is a standing, public invitation — channel: the repository's issue tracker (or its listed contact while the repository is in pre-release) — and the repository contains everything needed to perform it.
10. Test security and reproducibility
Lodlina operates the way testing organizations have for a century: the methodology is public and invites scrutiny; the item bank is not. The SAT, the bar exam, and every serious psychometric instrument publish their methods, scoring rules, and practice forms while keeping live test forms secret — and no one considers them non-credible for it. Frontier AI evaluation has converged on the same model: GPQA and Humanity's Last Exam withhold answers, ARC-AGI maintains a private evaluation set, and LiveBench rotates questions precisely because published items stop measuring anything once they enter training corpora.
That last clause is the point: withholding items is a measurement-integrity control, not a commercial preference (though we do not pretend it is free of commercial benefit). A public item is a contaminated item on whatever timeline model-training data ingestion runs. Lodlina's official boards are therefore scored on fresh, privately-seeded holdout sets produced by private generators; the seeds are never committed and the item-level run logs are not published (the board, its provenance, and per-juror vote matrices are).
What is public, and is sufficient to audit the method:
- All grading machinery — every scorer, the jury aggregation rules and templates, the task-type definitions, and the dataset validators. The central claim of this paper (every score traces to a label or an exact string operation) is checkable in source.
- Practice packs for every task type — real data with gold labels, in the exact format official items use. Anyone can run any model end-to-end on the practice set and reproduce the entire pipeline.
- Full provenance per board — suite version, grader jury identity, per-juror vote matrices, run sizes, and noise-floor controls.
What this costs, stated plainly: third parties cannot independently reproduce the official numbers, only the method. The mitigations are the practice packs (method reproducibility), the published vote matrices (grader auditability), and the standing audit invitation of §9 — which extends to the private item bank: an independent auditor receives full access to the generators, seeds, and gold labels under non-disclosure, examines a random sample, and we publish the resulting error rate. This is how proprietary testing instruments are audited, and we hold ourselves to the same standard.
11. Limitations
- Synthetic ≠ representative. Generated documents are cleaner and more regular than production records; scores measure capability on a controlled proxy, not certified field performance.
- Scale of evidence. Several packs are small (n=12–15); their metrics carry wide intervals (reported), and single-run, provider-default-sampling results inherit decoding variance (quantified by the same-name control; repeated-run reporting is roadmap).
- Grader calibration is preliminary. The jury's agreement with a balanced 18-item clear-case gold set was 18/18 (95% CI [0.82, 1.00]) — a sanity check that the graders are not broken on obvious cases, not a validation on contested ones. The gold set hardens over time; the closed-loop authorship issue is stated in §9.
- Domain depth. The records packs measure span identification, not FOIA legal judgment; the fairness probe is one counterfactual mechanism; classified-handling covers marked content only (§3).
- Coverage. Five of eight mission buckets have packs; abstention and security dimensions are unmeasured; ladders apply to the computational families only.
- The suite is bounded today. The unbounded index is roadmap, gated on published criteria (§7.3).
- Pseudonymous maintenance and no external audit yet (§9).
- English, U.S.-federal framing. Not legal advice; not a certification.
12. Roadmap
In order, with the suite version that ships each: difficulty ladders for the B1 computational family with realism ceilings and practitioner review (target: suite 2026.2); a hardened grader-calibration set including contested cases; repeated-run variance reporting; derivative-classification ladders (B7); abstention and prompt-injection dimensions; the IRT index if and when §7.3's criteria are met. Official board runs are versioned, archived, and re-run when models or the suite change. The repository's roadmap records decisions and dated findings.
References
- Wang et al., MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (NeurIPS 2024). arXiv:2406.01574
- Phan et al., Humanity's Last Exam. Nature 649, 1139–1146 (2026); arXiv:2501.14249
- FutureHouse, About 30% of Humanity's Last Exam chemistry/biology answers are likely wrong (2025), futurehouse.org/research-announcements/hle-exam; with the HLE team's follow-up estimate (~18% of a Bio/Chem subset).
- Chiang et al., Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (2024). arXiv:2403.04132
- LMSYS, Does Style Matter? Disentangling style and substance in Chatbot Arena (2024). lmsys.org/blog/2024-08-28-style-control
- Singh et al., The Leaderboard Illusion (2025). arXiv:2504.20879; operator response: lmarena.ai/blog/our-response
- Challenges in Trustworthy Human Evaluation of Chatbots (Findings of NAACL 2025). aclanthology.org/2025.findings-naacl.186
- Epoch AI, The Epoch Capabilities Index — documentation and technical report (2025). epoch.ai/benchmarks/eci; arXiv:2512.00193
- GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? (2025). arXiv:2502.05252
- Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards (NeurIPS 2025 Spotlight). arXiv:2505.24760
- White et al., LiveBench: A Challenging, Contamination-Free LLM Benchmark (2024). arXiv:2406.19314
- PSN-IRT: Item Response Theory-Grounded Assessment of LLM Benchmarks (AAAI 2026). arXiv:2505.15055
- BeyondBench: Contamination-Resistant Procedural Evaluation of Reasoning (2025). arXiv:2509.24210
- NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (January 2023).
- OMB, M-25-21: Accelerating Federal Use of AI through Innovation, Governance, and Public Trust (April 3, 2025; rescinds and replaces M-24-10), and M-25-22: Driving Efficient Acquisition of Artificial Intelligence in Government (April 2025).
Reproducibility: the datasets, generators, prompts, scorers, run commands, and
the archived results artifacts behind every number in §5 are in the open
repository under leaderboard/archive/2026.1/, keyed to the suite version. The
paper itself is versioned; substantive changes are recorded in the repository
changelog.