Lodlina board — suite 2026.3-dev (jury packs re-scored post nova-pro-ceiling + rubric fix)

Ranking — Lodlina Score

Models ranked by Lodlina Score (difficulty-weighted, open-ended scale — the suite max grows as packs are added; scores are comparable only within a suite version).
#	Model	Lodlina Score ↑ / 3,700	Gap to leader
1	claude-opus-4-8	3,576 / 3,700	leader
2	claude-opus-4-7	3,457 / 3,700	−119
3	claude-haiku-4-5	3,003 / 3,700	−573
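
The "Gap to leader" column is plain subtraction from the top score. A minimal sketch that reproduces it from the table values; the function name `gaps_to_leader` is illustrative and not part of Lodlina:

```python
# Scores copied from the ranking table above.
scores = {
    "claude-opus-4-8": 3576,
    "claude-opus-4-7": 3457,
    "claude-haiku-4-5": 3003,
}

def gaps_to_leader(scores):
    """Return (model, score, gap) rows sorted by score, highest first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    leader_score = ranked[0][1]
    return [(model, score, score - leader_score) for model, score in ranked]

rows = gaps_to_leader(scores)
for rank, (model, score, gap) in enumerate(rows, start=1):
    label = "leader" if gap == 0 else f"{gap:+,}"
    print(rank, model, f"{score:,} / 3,700", label)
```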

Adjacent models separated by less than the run's measured noise are an ordering of presentation, not a resolved ranking — see the methodology white paper for confidence intervals and noise floors.
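
One way to apply the noise rule above is to compare each adjacent gap with a noise floor. The sketch below is illustrative only: the board does not publish the noise value here, so `noise_floor` is a hypothetical parameter and the numbers passed to it are placeholders, not measured figures.

```python
def resolved_pairs(ranked_scores, noise_floor):
    """Mark each adjacent pair as resolved when its gap is at least the noise floor.

    ranked_scores: list of (model, score), highest score first.
    noise_floor:   hypothetical noise value in score points (not from the board).
    """
    out = []
    for (upper, s_hi), (lower, s_lo) in zip(ranked_scores, ranked_scores[1:]):
        out.append((upper, lower, (s_hi - s_lo) >= noise_floor))
    return out

ranking = [
    ("claude-opus-4-8", 3576),
    ("claude-opus-4-7", 3457),
    ("claude-haiku-4-5", 3003),
]

# Placeholder noise floors, chosen only to show both outcomes.
print(resolved_pairs(ranking, noise_floor=50))
print(resolved_pairs(ranking, noise_floor=150))
```

With the larger placeholder, the top two models would count as an ordering of presentation rather than a resolved ranking.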

By mission bucket

Lodlina Score earned within each mission bucket (difficulty-weighted), out of the attainable for that bucket.
B1 Benefits & Eligibility: Eligibility determinations and benefit computations against program rules: SNAP/LIHEAP-style adjudication with income tables, layered deductions, categorical eligibility, and exemptions. Measured: rule-derived accuracy and name-swap consistency.
B2 Records, Disclosure & Privacy: FOIA and Privacy Act processing: redacting personal information from releasable records without over-redacting official content, including adversarial documents dense with look-alike traps. Measured: exact leak and over-redaction rates on labeled spans.
B3 Citizen Services & Correspondence: Public-facing correspondence: rewriting dense program text into plain language (Plain Writing Act) without changing what it legally says. Measured: readability gain and meaning preservation.
B4 Authoritative-Source Q&A: Answering from a designated authoritative document with verbatim, faithful citations. Measured: hallucinated-citation rate, answer correctness, citation support.
B7 National Security Info Protection: National-security information protection across the spectrum: classification-spillage review (withholding marked content before release without over-classifying) and foreign-disclosure releasability decisions (NOFORN/REL-TO logic, sharing arrangements, the third-agency rule, sanitization); both are shipped. OPSEC pre-publication review, declassification, CUI marking, and export-control screening are on the roadmap. All documents and markings are fabricated; no real classified material is ever used.
B8 Defense & Military Mission Support: Department of War staff work across the functional spectrum: personnel and deployment readiness, logistics and requisition priorities, operations-order extraction, intelligence-analysis support with citation and abstention, doctrine and entitlements, clearance adjudication, defense acquisition. Covered now: deployment-readiness determinations and intel-product grounded QA with abstention (declining when the product can't support an answer); the rest is on the roadmap. Synthetic data, real use-case shapes.

Model	B1 Benefits & Eligibility	B2 Records, Disclosure & Privacy	B3 Citizen Services & Correspondence	B4 Authoritative-Source Q&A	B7 National Security Info Protection	B8 Defense & Military Mission Support
claude-opus-4-8	997/1,000	360/400	247/300	600/600	800/800	572/600
claude-opus-4-7	986/1,000	368/400	163/300	600/600	798/800	542/600
claude-haiku-4-5	612/1,000	350/400	167/300	585/600	723/800	566/600
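
In this table the bucket scores add up to each model's overall Lodlina Score, and the bucket maxima add up to 3,700. A short check using the figures above; the dictionary layout is just one convenient representation, not Lodlina's data format:

```python
# Attainable score per bucket, read from the denominators in the table.
bucket_max = {"B1": 1000, "B2": 400, "B3": 300, "B4": 600, "B7": 800, "B8": 600}

# Earned score per bucket, copied from the table rows.
earned = {
    "claude-opus-4-8": {"B1": 997, "B2": 360, "B3": 247, "B4": 600, "B7": 800, "B8": 572},
    "claude-opus-4-7": {"B1": 986, "B2": 368, "B3": 163, "B4": 600, "B7": 798, "B8": 542},
    "claude-haiku-4-5": {"B1": 612, "B2": 350, "B3": 167, "B4": 585, "B7": 723, "B8": 566},
}

# Sum each model's buckets and compare with the ranking table.
totals = {model: sum(buckets.values()) for model, buckets in earned.items()}
suite_max = sum(bucket_max.values())

for model, total in totals.items():
    share = 100 * total / suite_max
    print(f"{model}: {total:,} / {suite_max:,} ({share:.1f}%)")
```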

Cell shading: ≥90% of attainable, ≥70%, <50%. The numbers carry the signal; shading is a reading aid. Each bucket marker describes the use cases the bucket covers, in program-office terms, and what is measured; filter to the bucket that matches your mission.
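
The shading legend can be read as a simple banding of earned over attainable. The sketch below is an assumption about how the bands apply: the legend names only three bands, so cells between 50% and 70% are treated here as unshaded, and the band labels (`"high"`, `"mid"`, `"low"`, `"none"`) are invented for illustration.

```python
def shade(earned, attainable):
    """Map a bucket cell to a shading band using the legend thresholds.

    Integer comparisons avoid floating-point edge cases at the boundaries.
    The 50-70% range is not in the legend; returning "none" is an assumption.
    """
    if earned * 100 >= 90 * attainable:
        return "high"   # >= 90% of attainable
    if earned * 100 >= 70 * attainable:
        return "mid"    # >= 70% of attainable
    if earned * 100 < 50 * attainable:
        return "low"    # < 50% of attainable
    return "none"       # 50% to <70%: assumed unshaded

print(shade(997, 1000))  # B1, claude-opus-4-8
print(shade(247, 300))   # B3, claude-opus-4-8
print(shade(163, 300))   # B3, claude-opus-4-7
```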

Per-task rates (drill-down)

B2 Records redaction (FOIA Exemption 6)

Per-model rates for Records redaction (FOIA Exemption 6); the underlined column is the pack's headline metric.
Model	leak_rate ↓ exact	over_redaction ↓ exact
claude-opus-4-8	0.13	0.00
claude-opus-4-7	0.10	0.00
claude-haiku-4-5	0.17	0.00
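
The board describes these as exact rates on labeled spans but does not give formulas. One plausible reading, sketched below as an assumption: `leak_rate` is the share of must-redact gold spans the output fails to cover fully, and `over_redaction` is the share of must-keep gold spans that any redaction touches. Spans are hypothetical `(start, end)` character offsets.

```python
def covered(span, redactions):
    """True if the span lies entirely inside one redacted range."""
    start, end = span
    return any(r_start <= start and end <= r_end for r_start, r_end in redactions)

def overlaps(span, redactions):
    """True if the span shares at least one character with any redacted range."""
    start, end = span
    return any(r_start < end and start < r_end for r_start, r_end in redactions)

def redaction_rates(must_redact, must_keep, redactions):
    """Return (leak_rate, over_redaction) under the assumed definitions."""
    leaks = sum(1 for span in must_redact if not covered(span, redactions))
    overs = sum(1 for span in must_keep if overlaps(span, redactions))
    leak_rate = leaks / len(must_redact) if must_redact else 0.0
    over_rate = overs / len(must_keep) if must_keep else 0.0
    return leak_rate, over_rate

# Made-up offsets: the second personal-data span is only partly redacted.
leak, over = redaction_rates(
    must_redact=[(0, 5), (10, 20), (30, 40)],
    must_keep=[(50, 60), (70, 80)],
    redactions=[(0, 5), (10, 18), (30, 40)],
)
print(round(leak, 2), round(over, 2))
```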

B1 Eligibility fairness (metamorphic name-swap)

Per-model rates for Eligibility fairness (metamorphic name-swap); the underlined column is the pack's headline metric.
Model	accuracy ↑ exact	flip_rate ↓ exact
claude-opus-4-8	1.00	0.00
claude-opus-4-7	1.00	0.00
claude-haiku-4-5	1.00	0.00
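
A metamorphic name-swap test reruns the same case with only the applicant's name changed and checks whether the decision changes. A minimal sketch of a flip rate under that reading; the pair format and the sample decisions are illustrative assumptions, not the pack's actual data:

```python
def flip_rate(decision_pairs):
    """Share of cases whose decision changes after the name swap.

    decision_pairs: list of (original_decision, swapped_decision).
    """
    if not decision_pairs:
        return 0.0
    flips = sum(1 for original, swapped in decision_pairs if original != swapped)
    return flips / len(decision_pairs)

# Invented decisions: one of four cases flips when the name is swapped.
pairs = [
    ("eligible", "eligible"),
    ("eligible", "ineligible"),
    ("ineligible", "ineligible"),
    ("eligible", "eligible"),
]
print(flip_rate(pairs))
```

A flip rate of 0.00, as in the table above, would mean no decision changed under the swap.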

B4 Grounded QA (citation faithfulness)

Per-model rates for Grounded QA (citation faithfulness); the underlined column is the pack's headline metric.
Model	answer_correctness ↑ jury	citation_support ↑ jury
claude-opus-4-8	1.00	1.00
claude-opus-4-7	1.00	1.00
claude-haiku-4-5	1.00	1.00

B3 Plain language (readability + meaning preservation)

Per-model rates for Plain language (readability + meaning preservation); the underlined column is the pack's headline metric.
Model	readability_quality ↑ exact	fk_grade_reduction ↑ exact	meaning_preserved ↑ jury
claude-opus-4-8	0.88	20.85	0.94
claude-opus-4-7	0.66	18.93	0.82
claude-haiku-4-5	0.75	19.73	0.74
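
The `fk_grade_reduction` column presumably reports the drop in Flesch-Kincaid grade level from the original text to the rewrite; that reading is an assumption, as is the choice to pass pre-computed counts rather than tokenize text. The grade formula itself is the standard one. The counts below are invented for illustration.

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts (standard formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fk_grade_reduction(original_counts, rewrite_counts):
    """Grade level of the original minus grade level of the rewrite."""
    return fk_grade(*original_counts) - fk_grade(*rewrite_counts)

# Invented (words, sentences, syllables) counts for a dense notice and its rewrite.
original = (120, 3, 240)
rewrite = (100, 10, 140)

reduction = fk_grade_reduction(original, rewrite)
print(round(fk_grade(*original), 2), round(fk_grade(*rewrite), 2), round(reduction, 2))
```

Long sentences and polysyllabic words both raise the grade, so a rewrite that shortens either one produces a positive reduction.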

B1 Benefits eligibility — Household Food Assistance (hard, computational)

Per-model rates for Benefits eligibility — Household Food Assistance (hard, computational); the underlined column is the pack's headline metric.
Model	accuracy ↑ exact	flip_rate ↓ exact
claude-opus-4-8	1.00	0.01
claude-opus-4-7	0.99	0.01
claude-haiku-4-5	0.78	0.22

B1 Benefits eligibility — HFAP complex cases (ladder tier 2)

Per-model rates for Benefits eligibility — HFAP complex cases (ladder tier 2); the underlined column is the pack's headline metric.
Model	accuracy ↑ exact	flip_rate ↓ exact
claude-opus-4-8	1.00	0.00
claude-opus-4-7	1.00	0.01
claude-haiku-4-5	0.86	0.38

B2 Records redaction — adversarial (over-redaction traps)

Per-model rates for Records redaction — adversarial (over-redaction traps); the underlined column is the pack's headline metric.
Model	leak_rate ↓ exact	over_redaction ↓ exact
claude-opus-4-8	0.07	0.00
claude-opus-4-7	0.06	0.00
claude-haiku-4-5	0.08	0.00

B4 Policy Q&A — multi-hop with abstention (ladder rung)

Per-model rates for Policy Q&A — multi-hop with abstention (ladder rung); the underlined column is the pack's headline metric.
Model	over_abstention ↓ exact	hallucinated_citation ↓ exact	answer_correctness ↑ jury
claude-opus-4-8	0.00	0.00	1.00
claude-opus-4-7	0.00	0.00	1.00
claude-haiku-4-5	0.05	0.01	0.97
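
Because B4 asks for verbatim citations, a deterministic hallucinated-citation check can be as simple as a substring test against the authoritative document. The sketch below assumes that definition and assumes whitespace is normalized before matching; neither detail is confirmed by the board. The source text and citations are made up.

```python
def normalize(text):
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())

def hallucinated_citation_rate(citations, source_text):
    """Share of quoted citations that do not appear verbatim in the source."""
    if not citations:
        return 0.0
    source = normalize(source_text)
    missing = sum(1 for quote in citations if normalize(quote) not in source)
    return missing / len(citations)

# Invented policy text and two citations, one of which misquotes the deadline.
source_text = (
    "Households with gross income at or below the limit qualify.\n"
    "Applicants must report changes within 10 days."
)
citations = [
    "gross income at or below the limit",
    "report changes within 30 days",
]
print(hallucinated_citation_rate(citations, source_text))
```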

B7 National security info protection — classification spillage (synthetic)

Per-model rates for National security info protection — classification spillage (synthetic); the underlined column is the pack's headline metric.
Model	leak_rate ↓ exact	over_redaction ↓ exact
claude-opus-4-8	0.00	0.00
claude-opus-4-7	0.00	0.00
claude-haiku-4-5	0.00	0.00

B7 National security info protection — derived-fact spillage (ladder rung)

Per-model rates for National security info protection — derived-fact spillage (ladder rung); the underlined column is the pack's headline metric.
Model	leak_rate ↓ exact	over_redaction ↓ exact
claude-opus-4-8	0.00	0.00
claude-opus-4-7	0.00	0.00
claude-haiku-4-5	0.11	0.00

B7 Foreign disclosure — releasability to a coalition partner (FDRG)

Per-model rates for Foreign disclosure — releasability to a coalition partner (FDRG); the underlined column is the pack's headline metric.
Model	accuracy ↑ exact	flip_rate ↓ exact
claude-opus-4-8	1.00	0.00
claude-opus-4-7	1.00	0.00
claude-haiku-4-5	0.86	0.02

B8 Deployment readiness — individual readiness determination (IDRS)

Per-model rates for Deployment readiness — individual readiness determination (IDRS); the underlined column is the pack's headline metric.
Model	accuracy ↑ exact	flip_rate ↓ exact
claude-opus-4-8	0.98	0.06
claude-opus-4-7	0.92	0.03
claude-haiku-4-5	0.93	0.05

B8 Intel analysis — grounded QA with abstention (synthetic products)

Per-model rates for Intel analysis — grounded QA with abstention (synthetic products); the underlined column is the pack's headline metric.
Model	false_answer ↓ exact	answer_correctness ↑ jury
claude-opus-4-8	0.03	0.99
claude-opus-4-7	0.17	0.91
claude-haiku-4-5	0.00	1.00
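
The `false_answer` column plausibly counts questions the product cannot support where the model answered anyway instead of abstaining. That definition is an assumption, and the item format below is hypothetical.

```python
def false_answer_rate(items):
    """Share of unanswerable questions that the model answered anyway.

    items: list of (answerable, abstained) booleans, one pair per question.
    """
    unanswerable = [abstained for answerable, abstained in items if not answerable]
    if not unanswerable:
        return 0.0
    answered_anyway = sum(1 for abstained in unanswerable if not abstained)
    return answered_anyway / len(unanswerable)

# Invented run: four unanswerable questions, one of which was answered anyway.
items = [
    (True, False),   # answerable, answered
    (False, True),   # unanswerable, abstained
    (False, False),  # unanswerable, answered anyway (a false answer)
    (False, True),
    (False, True),
]
print(false_answer_rate(items))
```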

exact metrics are deterministic — every value traces to a labeled gold span or an exact string operation. jury metrics are graded by a cross-family model jury (strict majority; no juror may be a candidate on the board) and are always backed by a deterministic gate.
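
The jury rule above can be sketched as a strict-majority vote that only counts when a deterministic gate has passed. How the gate and the vote combine is not spelled out, so the "gate first, then majority" order below is an assumption, and the function names are illustrative.

```python
def eligible_jurors(jurors, candidates):
    """Drop any juror that is also a candidate on the board."""
    return [juror for juror in jurors if juror not in candidates]

def jury_verdict(votes, gate_passed):
    """Pass only if the gate passed and strictly more than half the jurors voted pass.

    votes: list of booleans, one per juror. A tie is not a strict majority.
    """
    if not gate_passed:
        return False
    return sum(votes) * 2 > len(votes)

print(jury_verdict([True, True, False], gate_passed=True))   # 2 of 3 pass
print(jury_verdict([True, False], gate_passed=True))          # tie
print(jury_verdict([True, True, True], gate_passed=False))    # gate failed
```

With the three-juror panel listed under provenance, a strict majority means at least two jurors agree.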

Provenance & reproducibility

Lodlina version: 0.5.1
Suite: 2026.3-dev (jury packs re-scored post nova-pro-ceiling + rubric fix) — scores comparable only within a suite version; max attainable 3,700
Grader jury: bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0, bedrock/us.amazon.nova-pro-v1:0, bedrock/nvidia.nemotron-super-3-120b (model-graded metrics only)
Packs (13): records-redaction, eligibility-fairness, grounded-qa, plain-language, b1-benefits-hfap, b1-benefits-hfap-l2, b2-records-adversarial, b4-policy-multihop, b7-classified-spillage, b7-derived-spillage, b7-foreign-disclosure, b8-deployment-readiness, b8-intel-grounded-qa
Generated: 2026-06-11 02:21 UTC