Ranking — Lodlina Score
| # | Model | Lodlina Score ↑ / 3,700 | Gap to leader |
|---|---|---|---|
| 1 | claude-opus-4-8 | 3,576 / 3,700 | leader |
| 2 | claude-opus-4-7 | 3,457 / 3,700 | −119 |
| 3 | claude-haiku-4-5 | 3,003 / 3,700 | −573 |
When adjacent models are separated by less than the run's measured noise, their order is one of presentation, not a resolved ranking; see the methodology white paper for confidence intervals and noise floors.
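The gap column and the noise caveat come down to simple arithmetic. The sketch below recomputes each gap from the totals in the ranking table and flags adjacent pairs whose separation falls below a noise floor. The function names are hypothetical, and the `noise_floor` value used in the example is an illustrative placeholder, not a measured figure; the real noise floors are in the methodology white paper.

```python
# Totals copied from the ranking table (out of 3,700).
SCORES = {
    "claude-opus-4-8": 3576,
    "claude-opus-4-7": 3457,
    "claude-haiku-4-5": 3003,
}


def gaps_to_leader(scores):
    """Return {model: score - leader_score}; the leader maps to 0."""
    leader = max(scores.values())
    return {model: score - leader for model, score in scores.items()}


def unresolved_pairs(scores, noise_floor):
    """List rank-adjacent pairs separated by less than noise_floor points."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (upper[0], lower[0])
        for upper, lower in zip(ranked, ranked[1:])
        if upper[1] - lower[1] < noise_floor
    ]


if __name__ == "__main__":
    print(gaps_to_leader(SCORES))
    # noise_floor=150 is a made-up placeholder, not a value from the suite.
    print(unresolved_pairs(SCORES, noise_floor=150))
```

With a hypothetical floor of 150 points, only the top two models would count as unresolved; with a floor of 100, every adjacent pair on this board would be resolved.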
By mission bucket
- B1 Benefits & Eligibility. Eligibility determinations and benefit computations against program rules: SNAP/LIHEAP-style adjudication with income tables, layered deductions, categorical eligibility, and exemptions. Measured: rule-derived accuracy and name-swap consistency.
- B2 Records, Disclosure & Privacy. FOIA and Privacy Act processing: redacting personal information from releasable records without over-redacting official content, including adversarial documents dense with look-alike traps. Measured: exact leak and over-redaction rates on labeled spans.
- B3 Citizen Services & Correspondence. Public-facing correspondence: rewriting dense program text into plain language (Plain Writing Act) without changing what it legally says. Measured: readability gain and meaning preservation.
- B4 Authoritative-Source Q&A. Answering from a designated authoritative document with verbatim, faithful citations. Measured: hallucinated-citation rate, answer correctness, citation support.
- B7 National Security Info Protection. National-security information protection across the spectrum: classification-spillage review (withholding marked content before release without over-classifying) and foreign-disclosure releasability decisions (NOFORN/REL-TO logic, sharing arrangements, the third-agency rule, sanitization), both shipped. OPSEC pre-publication review, declassification, CUI marking, and export-control screening are on the roadmap. All documents and markings are fabricated; no real classified material is ever used.
- B8 Defense & Military Mission Support. Department of War staff work across the functional spectrum: personnel and deployment readiness, logistics and requisition priorities, operations-order extraction, intelligence-analysis support with citation and abstention, doctrine and entitlements, clearance adjudication, defense acquisition. Covered now: deployment-readiness determinations and intel-product grounded QA with abstention (declining when the product can't support an answer); the rest is on the roadmap. Synthetic data, real use-case shapes.

| Model | B1 Benefits & Eligibility | B2 Records, Disclosure & Privacy | B3 Citizen Services & Correspondence | B4 Authoritative-Source Q&A | B7 National Security Info Protection | B8 Defense & Military Mission Support |
|---|---|---|---|---|---|---|
| claude-opus-4-8 | 997/1,000 | 360/400 | 247/300 | 600/600 | 800/800 | 572/600 |
| claude-opus-4-7 | 986/1,000 | 368/400 | 163/300 | 600/600 | 798/800 | 542/600 |
| claude-haiku-4-5 | 612/1,000 | 350/400 | 167/300 | 585/600 | 723/800 | 566/600 |
Cell shading bands: ≥90% of attainable, ≥70%, and <50%. The numbers carry the signal; shading is a reading aid. Each bucket marker (B1 to B8) describes the use cases the bucket covers, in program-office terms, and what is measured; read the bucket that matches your mission.
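The bucket table can be cross-checked against the overall ranking: each model's Lodlina Score is the sum of its six bucket scores, and the bucket maxima add up to 3,700. The sketch below does that check and converts a cell to percent of attainable. The `shading_band` helper is an assumption built only from the three thresholds named in the legend; the legend does not name the band between 50% and 70%, so the code leaves it unlabeled rather than guessing.

```python
# Bucket maxima and per-model scores copied from the bucket table.
BUCKET_MAX = {"B1": 1000, "B2": 400, "B3": 300, "B4": 600, "B7": 800, "B8": 600}
BUCKET_SCORES = {
    "claude-opus-4-8": {"B1": 997, "B2": 360, "B3": 247, "B4": 600, "B7": 800, "B8": 572},
    "claude-opus-4-7": {"B1": 986, "B2": 368, "B3": 163, "B4": 600, "B7": 798, "B8": 542},
    "claude-haiku-4-5": {"B1": 612, "B2": 350, "B3": 167, "B4": 585, "B7": 723, "B8": 566},
}


def total(model):
    """Sum a model's bucket scores; should match its Lodlina Score."""
    return sum(BUCKET_SCORES[model].values())


def pct_attainable(model, bucket):
    """Score in one bucket as a percentage of that bucket's maximum."""
    return 100.0 * BUCKET_SCORES[model][bucket] / BUCKET_MAX[bucket]


def shading_band(pct):
    """Map a percentage to a legend band (hypothetical helper).

    Only the thresholds printed in the legend are used; the range from
    50% up to 70% is not named there, so it is returned as unlabeled.
    """
    if pct >= 90:
        return "≥90%"
    if pct >= 70:
        return "≥70%"
    if pct < 50:
        return "<50%"
    return "unlabeled"


if __name__ == "__main__":
    for model in BUCKET_SCORES:
        print(model, total(model), "of", sum(BUCKET_MAX.values()))
```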
Per-task rates (drill-down)
B2 Records redaction (FOIA Exemption 6), difficulty 2 of 5
| Model | leak_rate ↓ exact | over_redaction ↓ exact |
|---|---|---|
| claude-opus-4-8 | 0.13 | 0.00 |
| claude-opus-4-7 | 0.10 | 0.00 |
| claude-haiku-4-5 | 0.17 | 0.00 |
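
Span-based redaction metrics like these can be computed deterministically from labeled character offsets. The sketch below is one plausible formulation, not the suite's documented scoring: it assumes half-open `(start, end)` spans, counts a must-redact span as leaked unless a single predicted redaction fully covers it, and counts a must-keep span as over-redacted if any predicted redaction overlaps it. All names and the span layout are hypothetical.

```python
def covered(span, redactions):
    """True if span (start, end) lies fully inside one predicted redaction."""
    start, end = span
    return any(r_start <= start and end <= r_end for r_start, r_end in redactions)


def overlaps(span, redactions):
    """True if span (start, end) shares any character with a redaction."""
    start, end = span
    return any(r_start < end and start < r_end for r_start, r_end in redactions)


def leak_rate(must_redact, redactions):
    """Fraction of gold must-redact spans that are not fully covered."""
    leaked = sum(not covered(span, redactions) for span in must_redact)
    return leaked / len(must_redact)


def over_redaction_rate(must_keep, redactions):
    """Fraction of gold must-keep spans touched by any redaction."""
    hit = sum(overlaps(span, redactions) for span in must_keep)
    return hit / len(must_keep)


if __name__ == "__main__":
    # Made-up offsets for illustration only.
    must_redact = [(0, 5), (10, 15), (20, 25)]
    must_keep = [(30, 40), (50, 60)]
    predicted = [(0, 5), (9, 16), (35, 38)]
    print(leak_rate(must_redact, predicted))
    print(over_redaction_rate(must_keep, predicted))
```

Under this formulation a partially redacted name still counts as a leak, which is the stricter of the two obvious choices.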
B1 Eligibility fairness (metamorphic name-swap), difficulty 1 of 5
| Model | accuracy ↑ exact | flip_rate ↓ exact |
|---|---|---|
| claude-opus-4-8 | 1.00 | 0.00 |
| claude-opus-4-7 | 1.00 | 0.00 |
| claude-haiku-4-5 | 1.00 | 0.00 |
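
A name-swap metamorphic test re-runs each case with only the applicant's name changed and checks whether the decision moves. The sketch below shows one plausible way to compute `accuracy` and `flip_rate` from such pairs. The record layout and the definition of a flip (any pair whose two decisions differ) are assumptions for illustration, not the suite's documented scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedCase:
    gold: str      # rule-derived correct decision
    original: str  # model decision on the original case
    swapped: str   # model decision after swapping only the name


def accuracy(cases):
    """Fraction of original cases where the model matches the gold decision."""
    return sum(c.original == c.gold for c in cases) / len(cases)


def flip_rate(cases):
    """Fraction of pairs where the decision changes after the name swap."""
    return sum(c.original != c.swapped for c in cases) / len(cases)


if __name__ == "__main__":
    # Made-up decisions for illustration only.
    cases = [
        PairedCase("eligible", "eligible", "eligible"),
        PairedCase("ineligible", "ineligible", "ineligible"),
        PairedCase("eligible", "eligible", "ineligible"),  # flips on swap
        PairedCase("ineligible", "eligible", "eligible"),  # wrong but stable
    ]
    print(accuracy(cases), flip_rate(cases))
```

The two metrics are independent: a model can be wrong and stable, or right on the original and unstable under the swap, so both columns are needed to read a row.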
B4 Grounded QA (citation faithfulness), difficulty 2 of 5
| Model | hallucinated_citation ↓ exact | answer_correctness ↑ jury | citation_support ↑ jury |
|---|---|---|---|
| claude-opus-4-8 | 0.00 | 1.00 | 1.00 |
| claude-opus-4-7 | 0.00 | 1.00 | 1.00 |
| claude-haiku-4-5 | 0.00 | 1.00 | 1.00 |
B3 Plain language (readability + meaning preservation), difficulty 3 of 5
| Model | readability_quality ↑ exact | fk_grade_reduction ↑ exact | meaning_preserved ↑ jury |
|---|---|---|---|
| claude-opus-4-8 | 0.88 | 20.85 | 0.94 |
| claude-opus-4-7 | 0.66 | 18.93 | 0.82 |
| claude-haiku-4-5 | 0.75 | 19.73 | 0.74 |
B1 Benefits eligibility: Household Food Assistance (hard, computational), difficulty 4 of 5
| Model | accuracy ↑ exact | flip_rate ↓ exact |
|---|---|---|
| claude-opus-4-8 | 1.00 | 0.01 |
| claude-opus-4-7 | 0.99 | 0.01 |
| claude-haiku-4-5 | 0.78 | 0.22 |
B1 Benefits eligibility: HFAP complex cases (ladder tier 2), difficulty 5 of 5
| Model | accuracy ↑ exact | flip_rate ↓ exact |
|---|---|---|
| claude-opus-4-8 | 1.00 | 0.00 |
| claude-opus-4-7 | 1.00 | 0.01 |
| claude-haiku-4-5 | 0.86 | 0.38 |
B2 Records redaction: adversarial (over-redaction traps), difficulty 2 of 5
| Model | leak_rate ↓ exact | over_redaction ↓ exact |
|---|---|---|
| claude-opus-4-8 | 0.07 | 0.00 |
| claude-opus-4-7 | 0.06 | 0.00 |
| claude-haiku-4-5 | 0.08 | 0.00 |
B4 Policy Q&A: multi-hop with abstention (ladder rung), difficulty 4 of 5
| Model | false_answer ↓ exact | over_abstention ↓ exact | hallucinated_citation ↓ exact | answer_correctness ↑ jury |
|---|---|---|---|---|
| claude-opus-4-8 | 0.00 | 0.00 | 0.00 | 1.00 |
| claude-opus-4-7 | 0.00 | 0.00 | 0.00 | 1.00 |
| claude-haiku-4-5 | 0.00 | 0.05 | 0.01 | 0.97 |
B7 National security info protection: classification spillage (synthetic), difficulty 2 of 5
| Model | leak_rate ↓ exact | over_redaction ↓ exact |
|---|---|---|
| claude-opus-4-8 | 0.00 | 0.00 |
| claude-opus-4-7 | 0.00 | 0.00 |
| claude-haiku-4-5 | 0.00 | 0.00 |
B7 National security info protection: derived-fact spillage (ladder rung), difficulty 4 of 5
| Model | leak_rate ↓ exact | over_redaction ↓ exact |
|---|---|---|
| claude-opus-4-8 | 0.00 | 0.00 |
| claude-opus-4-7 | 0.00 | 0.00 |
| claude-haiku-4-5 | 0.11 | 0.00 |
B7 Foreign disclosure: releasability to a coalition partner (FDRG), difficulty 2 of 5
| Model | accuracy ↑ exact | flip_rate ↓ exact |
|---|---|---|
| claude-opus-4-8 | 1.00 | 0.00 |
| claude-opus-4-7 | 1.00 | 0.00 |
| claude-haiku-4-5 | 0.86 | 0.02 |
B8 Deployment readiness: individual readiness determination (IDRS), difficulty 3 of 5
| Model | accuracy ↑ exact | flip_rate ↓ exact |
|---|---|---|
| claude-opus-4-8 | 0.98 | 0.06 |
| claude-opus-4-7 | 0.92 | 0.03 |
| claude-haiku-4-5 | 0.93 | 0.05 |
B8 Intel analysis: grounded QA with abstention (synthetic products), difficulty 3 of 5
| Model | false_answer ↓ exact | over_abstention ↓ exact | hallucinated_citation ↓ exact | answer_correctness ↑ jury |
|---|---|---|---|---|
| claude-opus-4-8 | 0.03 | 0.00 | 0.00 | 0.99 |
| claude-opus-4-7 | 0.17 | 0.00 | 0.00 | 0.91 |
| claude-haiku-4-5 | 0.00 | 0.00 | 0.00 | 1.00 |
exact metrics are deterministic — every value traces to a labeled gold span or an exact string operation. jury metrics are graded by a cross-family model jury (strict majority; no juror may be a candidate on the board) and are always backed by a deterministic gate.
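The jury rule can be sketched in a few lines. A strict majority means more than half of the jurors must vote to pass, so a tie fails. How the deterministic gate combines with the jury vote is not spelled out here; the sketch assumes the gate acts as a veto, and all function names and the short identifiers in the example are hypothetical.

```python
def strict_majority(votes):
    """True only if more than half of the juror votes are passes."""
    return sum(votes) * 2 > len(votes)


def jury_verdict(votes, gate_passed):
    """Combine jury vote and deterministic gate.

    Assumption: the gate is a veto, so a jury pass counts only when the
    deterministic check also passed.
    """
    return gate_passed and strict_majority(votes)


def jury_is_independent(jurors, candidates):
    """True if no juror is also a candidate on the board."""
    return not set(jurors) & set(candidates)


if __name__ == "__main__":
    print(jury_verdict([True, True, False], gate_passed=True))
    print(jury_verdict([True, True, True], gate_passed=False))
    # Short placeholder identifiers, not the full model IDs.
    print(jury_is_independent(["juror-a", "juror-b", "juror-c"],
                              ["claude-opus-4-8", "claude-opus-4-7"]))
```

With three jurors, a strict majority needs at least two passing votes; an even-sized jury that splits evenly would fail under this rule.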
Provenance & reproducibility
- Lodlina version: 0.5.1
- Suite: 2026.3-dev (jury packs re-scored post nova-pro-ceiling + rubric fix); scores comparable only within a suite version; max attainable 3,700
- Grader jury (model-graded metrics only): bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0, bedrock/us.amazon.nova-pro-v1:0, bedrock/nvidia.nemotron-super-3-120b
- Packs (13): records-redaction, eligibility-fairness, grounded-qa, plain-language, b1-benefits-hfap, b1-benefits-hfap-l2, b2-records-adversarial, b4-policy-multihop, b7-classified-spillage, b7-derived-spillage, b7-foreign-disclosure, b8-deployment-readiness, b8-intel-grounded-qa
- Generated: 2026-06-11 02:21 UTC