Lodlina

A plumb line for government AI — realistic U.S. public-sector tasks, scored by defensible automated graders.

Suite 2026.3-dev (jury packs re-scored post nova-pro-ceiling + rubric fix), generated 2026-06-11 02:21 UTC

Ranking — Lodlina Score

Models ranked by Lodlina Score (difficulty-weighted, open-ended scale — the suite max grows as packs are added; scores are comparable only within a suite version).
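| # | Model | Lodlina Score ↑ / 3,700 | Gap to leader |
|---|-------|-------------------------|---------------|
| 1 | claude-opus-4-8 | 3,576 / 3,700 | leader |
| 2 | claude-opus-4-7 | 3,457 / 3,700 | −119 |
| 3 | claude-haiku-4-5 | 3,003 / 3,700 | −573 |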
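The Gap to leader column is each model's score minus the top score. A minimal sketch using the three scores from the ranking table; the function and variable names are illustrative and are not part of Lodlina:

```python
# Scores copied from the ranking table; names here are illustrative only.
scores = {
    "claude-opus-4-8": 3576,
    "claude-opus-4-7": 3457,
    "claude-haiku-4-5": 3003,
}

def gaps_to_leader(scores):
    """Return (model, score, gap) rows sorted by score, highest first."""
    leader = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(model, score, score - leader) for model, score in ranked]

for rank, (model, score, gap) in enumerate(gaps_to_leader(scores), start=1):
    label = "leader" if gap == 0 else str(gap)
    print(rank, model, f"{score:,} / 3,700", label)
```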

Adjacent models separated by less than the run's measured noise are an ordering of presentation, not a resolved ranking — see the methodology white paper for confidence intervals and noise floors.
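One way to read the noise caveat is to group adjacent models into tiers whenever their gap is smaller than the noise floor. The sketch below assumes a single scalar noise floor; this page does not publish that value, so the number used in the example is made up purely to show the mechanics, and the function name is hypothetical.

```python
def resolved_tiers(ranked_scores, noise_floor):
    """Group a descending list of (model, score) into tiers.

    Adjacent models closer than noise_floor share a tier, meaning their
    order is not resolved. noise_floor is a hypothetical input here; the
    real value would come from the methodology white paper.
    """
    tiers = []
    for model, score in ranked_scores:
        if tiers and tiers[-1][-1][1] - score < noise_floor:
            tiers[-1].append((model, score))
        else:
            tiers.append([(model, score)])
    return tiers

ranked = [
    ("claude-opus-4-8", 3576),
    ("claude-opus-4-7", 3457),
    ("claude-haiku-4-5", 3003),
]
# 150 is an invented noise floor, chosen only to demonstrate a shared tier.
print(resolved_tiers(ranked, noise_floor=150))
```

With a smaller assumed floor, such as 50, every model lands in its own tier and the presentation order is also the resolved order.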

By mission bucket

Lodlina Score earned within each mission bucket (difficulty-weighted), out of the attainable for that bucket.
Buckets:

B1, Benefits & Eligibility: Eligibility determinations and benefit computations against program rules: SNAP/LIHEAP-style adjudication with income tables, layered deductions, categorical eligibility, and exemptions. Measured: rule-derived accuracy and name-swap consistency.

B2, Records, Disclosure & Privacy: FOIA and Privacy Act processing: redacting personal information from releasable records without over-redacting official content, including adversarial documents dense with look-alike traps. Measured: exact leak and over-redaction rates on labeled spans.

B3, Citizen Services & Correspondence: Public-facing correspondence: rewriting dense program text into plain language (Plain Writing Act) without changing what it legally says. Measured: readability gain and meaning preservation.

B4, Authoritative-Source Q&A: Answering from a designated authoritative document with verbatim, faithful citations. Measured: hallucinated-citation rate, answer correctness, citation support.

B7, National Security Info Protection: National-security information protection across the spectrum: classification-spillage review (withholding marked content before release without over-classifying) and foreign-disclosure releasability decisions (NOFORN/REL-TO logic, sharing arrangements, the third-agency rule, sanitization); both are shipped. OPSEC pre-publication review, declassification, CUI marking, and export-control screening are on the roadmap. All documents and markings are fabricated; no real classified material is ever used.

B8, Defense & Military Mission Support: Department of War staff work across the functional spectrum: personnel and deployment readiness, logistics and requisition priorities, operations-order extraction, intelligence-analysis support with citation and abstention, doctrine and entitlements, clearance adjudication, defense acquisition. Covered now: deployment-readiness determinations and intel-product grounded QA with abstention (declining when the product can't support an answer); the rest is on the roadmap. Synthetic data, real use-case shapes.

| Model | B1 | B2 | B3 | B4 | B7 | B8 |
|-------|----|----|----|----|----|----|
| claude-opus-4-8 | 997/1,000 | 360/400 | 247/300 | 600/600 | 800/800 | 572/600 |
| claude-opus-4-7 | 986/1,000 | 368/400 | 163/300 | 600/600 | 798/800 | 542/600 |
| claude-haiku-4-5 | 612/1,000 | 350/400 | 167/300 | 585/600 | 723/800 | 566/600 |
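The per-bucket scores add up to each model's overall Lodlina Score, and the per-bucket attainables add up to the suite maximum. A short consistency check over the published numbers; the variable names are illustrative:

```python
# Per-bucket earned scores and attainables copied from the bucket table.
buckets = ["B1", "B2", "B3", "B4", "B7", "B8"]
attainable = [1000, 400, 300, 600, 800, 600]
earned = {
    "claude-opus-4-8": [997, 360, 247, 600, 800, 572],
    "claude-opus-4-7": [986, 368, 163, 600, 798, 542],
    "claude-haiku-4-5": [612, 350, 167, 585, 723, 566],
}

# Sum across buckets to recover the headline score for each model.
totals = {model: sum(row) for model, row in earned.items()}
suite_max = sum(attainable)

for model, total in totals.items():
    print(f"{model}: {total:,} / {suite_max:,}")
```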

Cell shading: ≥90% of attainable, ≥70%, <50%. The numbers carry the signal; shading is a reading aid. Each bucket marker describes the use cases the bucket covers, in program-office terms, and what is measured; filter to the bucket that matches your mission.
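The shading legend can be expressed as a threshold lookup on the share of attainable points. This is a sketch only: the legend does not describe the band between 50% and 70%, so the code reports that band as "unlabeled" rather than guessing, and the function name and tier labels are hypothetical.

```python
def shading(earned, attainable):
    """Map a cell to a legend tier from its share of attainable points.

    The 50-70% band is not described in the legend, so this sketch
    returns "unlabeled" for it instead of inventing a tier.
    """
    pct = 100 * earned / attainable
    if pct >= 90:
        return ">=90%"
    if pct >= 70:
        return ">=70%"
    if pct < 50:
        return "<50%"
    return "unlabeled"

# B3 cells from the bucket table: 247/300 and 163/300.
print(shading(247, 300), shading(163, 300))
```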

Per-task rates (drill-down)

B2 Records redaction (FOIA Exemption 6), difficulty 2 of 5
Per-model rates for Records redaction (FOIA Exemption 6); the underlined column is the pack's headline metric.

| Model | leak_rate (exact) | over_redaction (exact) |
|-------|-------------------|------------------------|
| claude-opus-4-8 | 0.13 | 0.00 |
| claude-opus-4-7 | 0.10 | 0.00 |
| claude-haiku-4-5 | 0.17 | 0.00 |

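The page says leak and over-redaction rates are computed exactly on labeled spans but does not give the formulas. The sketch below shows one plausible reading, stated as an assumption: a personal-information span counts as leaked unless a single redaction fully covers it, and a must-keep span counts as over-redacted if any redaction touches it. Spans are hypothetical (start, end) character offsets, and all names are illustrative.

```python
def covered(span, redactions):
    """True if span (start, end) lies fully inside one redaction."""
    return any(r0 <= span[0] and span[1] <= r1 for r0, r1 in redactions)

def overlaps(span, redactions):
    """True if span shares at least one character with any redaction."""
    return any(r0 < span[1] and span[0] < r1 for r0, r1 in redactions)

def redaction_rates(pii_spans, keep_spans, redactions):
    """Return (leak_rate, over_redaction) under the assumed definitions."""
    leaked = sum(not covered(s, redactions) for s in pii_spans)
    over = sum(overlaps(s, redactions) for s in keep_spans)
    leak_rate = leaked / len(pii_spans) if pii_spans else 0.0
    over_rate = over / len(keep_spans) if keep_spans else 0.0
    return leak_rate, over_rate

# Toy record: two PII spans, two official-content spans, one redaction.
print(redaction_rates([(0, 10), (40, 52)], [(12, 30), (60, 80)], [(0, 10)]))
```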
B1 Eligibility fairness (metamorphic name-swap), difficulty 1 of 5
Per-model rates for Eligibility fairness (metamorphic name-swap); the underlined column is the pack's headline metric.

| Model | accuracy (exact) | flip_rate (exact) |
|-------|------------------|-------------------|
| claude-opus-4-8 | 1.00 | 0.00 |
| claude-opus-4-7 | 1.00 | 0.00 |
| claude-haiku-4-5 | 1.00 | 0.00 |

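A metamorphic name-swap test reruns the same case with only the applicant's name changed and checks that the decision stays the same. The sketch below assumes flip_rate is the fraction of swapped pairs whose decision differs; the page does not spell out the formula, and the function names and toy decisions are illustrative.

```python
def flip_rate(pairs):
    """pairs: list of (decision_original, decision_name_swapped)."""
    if not pairs:
        return 0.0
    flips = sum(a != b for a, b in pairs)
    return flips / len(pairs)

def accuracy(decisions, gold):
    """Fraction of decisions that match the rule-derived gold label."""
    return sum(d == g for d, g in zip(decisions, gold)) / len(gold)

# Four toy cases; only the third changes when the name is swapped.
pairs = [
    ("eligible", "eligible"),
    ("ineligible", "ineligible"),
    ("eligible", "ineligible"),
    ("eligible", "eligible"),
]
print(flip_rate(pairs))
```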
B4 Grounded QA (citation faithfulness), difficulty 2 of 5
Per-model rates for Grounded QA (citation faithfulness); the underlined column is the pack's headline metric.

| Model | hallucinated_citation (exact) | answer_correctness (jury) | citation_support (jury) |
|-------|-------------------------------|---------------------------|-------------------------|
| claude-opus-4-8 | 0.00 | 1.00 | 1.00 |
| claude-opus-4-7 | 0.00 | 1.00 | 1.00 |
| claude-haiku-4-5 | 0.00 | 1.00 | 1.00 |

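Because citations are required to be verbatim, a hallucinated citation can in principle be detected with an exact string operation. The sketch below assumes a quote is hallucinated when it does not appear in the source document after whitespace is collapsed; both that normalization step and the function names are assumptions, not Lodlina's documented grader.

```python
def normalize(text):
    """Collapse runs of whitespace so line wrapping does not matter."""
    return " ".join(text.split())

def hallucinated_citation_rate(quotes, source):
    """Fraction of quoted citations not found verbatim in the source."""
    if not quotes:
        return 0.0
    src = normalize(source)
    missing = sum(normalize(q) not in src for q in quotes)
    return missing / len(quotes)

# Toy source document and three quotes; the last one is not in the text.
source = "Applications must be filed within 30 days.\nLate filings require a waiver."
quotes = ["filed within 30 days", "require  a waiver", "filed within 60 days"]
print(hallucinated_citation_rate(quotes, source))
```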
B3 Plain language (readability + meaning preservation), difficulty 3 of 5
Per-model rates for Plain language (readability + meaning preservation); the underlined column is the pack's headline metric.

| Model | readability_quality (exact) | fk_grade_reduction (exact) | meaning_preserved (jury) |
|-------|-----------------------------|----------------------------|--------------------------|
| claude-opus-4-8 | 0.88 | 20.85 | 0.94 |
| claude-opus-4-7 | 0.66 | 18.93 | 0.82 |
| claude-haiku-4-5 | 0.75 | 19.73 | 0.74 |

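The column name fk_grade_reduction presumably refers to the Flesch-Kincaid grade level, whose standard formula is 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below assumes the reduction is the source grade minus the rewrite grade. It takes word, sentence, and syllable counts as inputs rather than tokenizing text, since the page does not say how Lodlina counts them, and the counts in the example are invented.

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts (standard formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fk_grade_reduction(before_counts, after_counts):
    """Assumed definition: source grade minus rewritten grade."""
    return fk_grade(*before_counts) - fk_grade(*after_counts)

# Invented counts as (words, sentences, syllables): the rewrite uses
# shorter sentences and shorter words than the source passage.
before = (100, 5, 150)
after = (100, 10, 130)
print(round(fk_grade_reduction(before, after), 2))
```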
B1 Benefits eligibility: Household Food Assistance (hard, computational), difficulty 4 of 5
Per-model rates for Benefits eligibility: Household Food Assistance (hard, computational); the underlined column is the pack's headline metric.

| Model | accuracy (exact) | flip_rate (exact) |
|-------|------------------|-------------------|
| claude-opus-4-8 | 1.00 | 0.01 |
| claude-opus-4-7 | 0.99 | 0.01 |
| claude-haiku-4-5 | 0.78 | 0.22 |

B1 Benefits eligibility: HFAP complex cases (ladder tier 2), difficulty 5 of 5
Per-model rates for Benefits eligibility: HFAP complex cases (ladder tier 2); the underlined column is the pack's headline metric.

| Model | accuracy (exact) | flip_rate (exact) |
|-------|------------------|-------------------|
| claude-opus-4-8 | 1.00 | 0.00 |
| claude-opus-4-7 | 1.00 | 0.01 |
| claude-haiku-4-5 | 0.86 | 0.38 |

B2 Records redaction: adversarial (over-redaction traps), difficulty 2 of 5
Per-model rates for Records redaction: adversarial (over-redaction traps); the underlined column is the pack's headline metric.

| Model | leak_rate (exact) | over_redaction (exact) |
|-------|-------------------|------------------------|
| claude-opus-4-8 | 0.07 | 0.00 |
| claude-opus-4-7 | 0.06 | 0.00 |
| claude-haiku-4-5 | 0.08 | 0.00 |

B4 Policy Q&A: multi-hop with abstention (ladder rung), difficulty 4 of 5
Per-model rates for Policy Q&A: multi-hop with abstention (ladder rung); the underlined column is the pack's headline metric.

| Model | false_answer (exact) | over_abstention (exact) | hallucinated_citation (exact) | answer_correctness (jury) |
|-------|----------------------|-------------------------|-------------------------------|---------------------------|
| claude-opus-4-8 | 0.00 | 0.00 | 0.00 | 1.00 |
| claude-opus-4-7 | 0.00 | 0.00 | 0.00 | 1.00 |
| claude-haiku-4-5 | 0.00 | 0.05 | 0.01 | 0.97 |

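The two abstention columns pull in opposite directions: one penalizes answering when the source cannot support an answer, the other penalizes declining when it can. The sketch below assumes false_answer is the share of unanswerable questions that were answered anyway, and over_abstention is the share of answerable questions that were declined. The page does not define either formula, so treat these definitions and the function name as hypothetical.

```python
def abstention_rates(items):
    """items: list of (answerable, abstained) booleans, one per question.

    Returns (false_answer, over_abstention) under the assumed definitions.
    """
    unanswerable = [abst for ans, abst in items if not ans]
    answerable = [abst for ans, abst in items if ans]
    false_answer = (
        sum(not a for a in unanswerable) / len(unanswerable)
        if unanswerable else 0.0
    )
    over_abstention = (
        sum(answerable) / len(answerable) if answerable else 0.0
    )
    return false_answer, over_abstention

# Four answerable questions (one wrongly declined) and two unanswerable
# questions (one wrongly answered).
items = [
    (True, False), (True, True), (True, False), (True, False),
    (False, True), (False, False),
]
print(abstention_rates(items))
```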
B7 National security info protection: classification spillage (synthetic), difficulty 2 of 5
Per-model rates for National security info protection: classification spillage (synthetic); the underlined column is the pack's headline metric.

| Model | leak_rate (exact) | over_redaction (exact) |
|-------|-------------------|------------------------|
| claude-opus-4-8 | 0.00 | 0.00 |
| claude-opus-4-7 | 0.00 | 0.00 |
| claude-haiku-4-5 | 0.00 | 0.00 |

B7 National security info protection: derived-fact spillage (ladder rung), difficulty 4 of 5
Per-model rates for National security info protection: derived-fact spillage (ladder rung); the underlined column is the pack's headline metric.

| Model | leak_rate (exact) | over_redaction (exact) |
|-------|-------------------|------------------------|
| claude-opus-4-8 | 0.00 | 0.00 |
| claude-opus-4-7 | 0.00 | 0.00 |
| claude-haiku-4-5 | 0.11 | 0.00 |

B7 Foreign disclosure: releasability to a coalition partner (FDRG), difficulty 2 of 5
Per-model rates for Foreign disclosure: releasability to a coalition partner (FDRG); the underlined column is the pack's headline metric.

| Model | accuracy (exact) | flip_rate (exact) |
|-------|------------------|-------------------|
| claude-opus-4-8 | 1.00 | 0.00 |
| claude-opus-4-7 | 1.00 | 0.00 |
| claude-haiku-4-5 | 0.86 | 0.02 |

B8 Deployment readiness: individual readiness determination (IDRS), difficulty 3 of 5
Per-model rates for Deployment readiness: individual readiness determination (IDRS); the underlined column is the pack's headline metric.

| Model | accuracy (exact) | flip_rate (exact) |
|-------|------------------|-------------------|
| claude-opus-4-8 | 0.98 | 0.06 |
| claude-opus-4-7 | 0.92 | 0.03 |
| claude-haiku-4-5 | 0.93 | 0.05 |

B8 Intel analysis: grounded QA with abstention (synthetic products), difficulty 3 of 5
Per-model rates for Intel analysis: grounded QA with abstention (synthetic products); the underlined column is the pack's headline metric.

| Model | false_answer (exact) | over_abstention (exact) | hallucinated_citation (exact) | answer_correctness (jury) |
|-------|----------------------|-------------------------|-------------------------------|---------------------------|
| claude-opus-4-8 | 0.03 | 0.00 | 0.00 | 0.99 |
| claude-opus-4-7 | 0.17 | 0.00 | 0.00 | 0.91 |
| claude-haiku-4-5 | 0.00 | 0.00 | 0.00 | 1.00 |

exact metrics are deterministic — every value traces to a labeled gold span or an exact string operation. jury metrics are graded by a cross-family model jury (strict majority; no juror may be a candidate on the board) and are always backed by a deterministic gate.
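The jury rule described here combines three conditions: jurors must not be candidates, a verdict needs a strict majority, and a deterministic gate backs the result. The sketch below is one way those conditions could fit together. It assumes a failed gate overrides the jury, which the page does not state explicitly, and the function names and placeholder model names are hypothetical.

```python
def eligible_jurors(jurors, candidates):
    """Drop any juror that is also a candidate on the board."""
    return [j for j in jurors if j not in candidates]

def jury_verdict(votes, gate_passed):
    """votes: list of booleans, one per juror.

    Passes only if the deterministic gate passed AND strictly more than
    half of the jurors voted yes (a tie is not a majority).
    """
    if not gate_passed:
        return False
    return sum(votes) * 2 > len(votes)

# Placeholder names: "model-b" is on the board, so it cannot serve as juror.
jurors = eligible_jurors(["model-a", "model-b", "model-c"], ["model-b"])
print(jurors, jury_verdict([True, True, False], gate_passed=True))
```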

Provenance & reproducibility

Lodlina version: 0.5.1
Suite: 2026.3-dev (jury packs re-scored post nova-pro-ceiling + rubric fix); scores comparable only within a suite version; max attainable 3,700
Grader jury: bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0, bedrock/us.amazon.nova-pro-v1:0, bedrock/nvidia.nemotron-super-3-120b (model-graded metrics only)
Packs (13): records-redaction, eligibility-fairness, grounded-qa, plain-language, b1-benefits-hfap, b1-benefits-hfap-l2, b2-records-adversarial, b4-policy-multihop, b7-classified-spillage, b7-derived-spillage, b7-foreign-disclosure, b8-deployment-readiness, b8-intel-grounded-qa
Generated: 2026-06-11 02:21 UTC