
LUNGUAGE: A Benchmark for Structured and Longitudinal Chest X-ray Interpretation

We present the first unified pipeline for evaluating radiology report generation in both single and sequential settings. It integrates three core components — a fine-grained benchmark dataset (LUNGUAGE), a structuring framework that links findings across time, and a clinically grounded evaluation metric — enabling faithful, patient-level assessment of AI-generated reports. This unified approach captures nuanced diagnostic reasoning, structural organization, and temporal progression, establishing a new standard for clinical AI evaluation.

Figure 1. Overview of the LUNGUAGE framework: dataset → structuring framework → longitudinal grouping → LunguageScore evaluation

Evaluation pipeline for radiology report generation. We introduce the first evaluation framework that enables both detailed single-report assessment and comprehensive patient-level trajectory evaluation. On the left, we release LUNGUAGE, a radiologist-annotated benchmark of structured single and sequential chest X-ray reports. On the right, we develop a two-stage structuring framework that converts free-text reports into schema-aligned structures at both single and sequential levels. At the bottom, we present LunguageScore, a clinically validated metric that jointly measures semantic accuracy, structural fidelity, and temporal alignment, providing clinically faithful evaluation.

① Benchmark Dataset

1,473 single reports (230 patients) annotated with 17,949 entities and 23,307 relation–attribute pairs, plus 80 sequential reports capturing 41,122 longitudinal observations.

② Structuring Framework

Transforms free‑text reports into structured entity–relation–attribute triplets, aligning with the LUNGUAGE schema and linking findings across time for patient‑level interpretation.

③ LunguageScore Metric

Evaluates semantic accuracy, structural fidelity, and temporal coherence between generated and reference reports — a clinically interpretable standard for AI evaluation.

Abstract

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 expert-reviewed annotated chest X-ray reports, 80 of which additionally carry expert-reviewed longitudinal annotations capturing disease progression and inter-study intervals. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. Code: https://github.com/SuperSupermoon/Lunguage

The LUNGUAGE Benchmark Dataset

LUNGUAGE defines a fine‑grained, clinically validated dataset for structured and longitudinal report evaluation. It provides reliable reference annotations that enable both single‑study interpretation and temporal analysis across patient timelines.

Schema

The schema defines entities, relations, and attributes at multiple levels of granularity. It extends beyond single reports by introducing ENTITYGROUP and TEMPORALGROUP, which link identical findings across studies and segment diagnostic episodes, supporting patient‑level clinical reasoning.
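
To make the schema concrete, the sketch below models its building blocks as Python dataclasses. This is an illustrative assumption about shape only; the class and field names here are stand-ins, not the dataset's released format.

    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        # One clinical finding or anatomical term from a single report.
        text: str                                       # surface form, e.g. "pleural effusion"
        label: str                                      # entity type defined by the schema
        attributes: dict = field(default_factory=dict)  # e.g. {"location": "left", "severity": "small"}

    @dataclass
    class Relation:
        # Cross-sentence reasoning link between two entities.
        head: Entity
        tail: Entity
        kind: str  # "ASSOCIATE" or "EVIDENCE"

    @dataclass
    class SequentialLink:
        # Cross-study grouping over a patient timeline.
        entity: Entity
        entity_group: int    # same clinical finding tracked across studies
        temporal_group: int  # diagnostic episode the finding belongs to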

Annotation Process

The annotation follows a three‑stage workflow: (1) Initial Structuring to design the schema and collect candidate vocabulary, (2) Schema & Vocabulary Curation through blinded expert review and UMLS mapping, and (3) Report‑level Expert Validation ensuring temporal and contextual coherence, including cross‑sentence links (ASSOCIATE, EVIDENCE).

Figure 2. Annotation workflow

Three stages: Initial Structuring → Schema & Vocabulary Curation → Report Annotation.

Structuring Framework

The structuring framework converts free‑text reports into schema‑aligned triplets and performs longitudinal linking across studies following the LUNGUAGE schema. It produces structured single and sequential representations that support scalable, patient‑level evaluation.
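
A minimal interface sketch of the two stages, assuming a Python implementation; the function names and dict layout are hypothetical, and the extraction step itself (e.g., an LLM-based parser or a trained extractor) is deliberately elided.

    def structure_report(report_text: str) -> list[dict]:
        # Stage 1: parse one free-text report into schema-aligned
        # entity-relation-attribute triplets.
        raise NotImplementedError  # e.g., prompt an LLM or run a trained extractor

    def link_longitudinally(structured: list[list[dict]]) -> list[list[dict]]:
        # Stage 2: assign ENTITYGROUP ids (same finding over time) and
        # TEMPORALGROUP ids (same diagnostic episode) across a patient's
        # chronologically ordered, already-structured studies.
        raise NotImplementedError

    def structure_patient(reports_in_order: list[str]) -> list[list[dict]]:
        # Patient-level pipeline: structure each study, then link across time.
        singles = [structure_report(r) for r in reports_in_order]
        return link_longitudinally(singles)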

Figure 3. Single and sequential schema

The single schema captures intra-report structure; the sequential schema links findings across reports. The figure shows two reports from the same patient at day 10 and day 90. Within each report (single schema), gray solid lines connect entities to attributes, while pink and blue solid lines mark inter-entity reasoning relations (ASSOCIATE, EVIDENCE). Across reports (sequential schema), black solid lines join entities that share both an ENTITYGROUP (the same clinical finding over time) and a TEMPORALGROUP (the same diagnostic episode), while black dashed lines join entities in the same ENTITYGROUP but different TEMPORALGROUPs (different diagnostic episodes).
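
For instance, the dashed-line case could be serialized as below; the field names and clinical values are hypothetical, chosen only to mirror the figure.

    # Same finding (ENTITYGROUP 1) seen in both studies, but the day-90
    # recurrence opens a new diagnostic episode (TEMPORALGROUP 2).
    timeline = [
        {"study": "day 10", "entity": "consolidation",
         "attributes": {"location": "right lower lobe"},
         "entity_group": 1, "temporal_group": 1},
        {"study": "day 90", "entity": "consolidation",
         "attributes": {"location": "right lower lobe"},
         "entity_group": 1, "temporal_group": 2},
    ]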

Evaluation Metric: LunguageScore

LunguageScore integrates semantic, structural, and temporal fidelity into a single interpretable metric. It aligns with radiologist judgment and generalizes across report generation models.

MatchScore = Semantic × Temporal × Structural

Semantic: cosine(embedding_pred, embedding_ref)
Temporal: weight(same_study, same_TEMPORALGROUP)
Structural: Σ_attr w_attr × sim(attr_pred, attr_ref)
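
A runnable sketch of how these three terms combine, under stated assumptions: the temporal weights below are placeholders rather than the paper's calibrated values, and attribute similarity sim() is reduced to exact match with a normalization of our own choosing.

    import numpy as np

    def semantic(e_pred: np.ndarray, e_ref: np.ndarray) -> float:
        # Cosine similarity between predicted and reference entity embeddings.
        return float(e_pred @ e_ref / (np.linalg.norm(e_pred) * np.linalg.norm(e_ref)))

    def temporal(same_study: bool, same_temporal_group: bool) -> float:
        # Placeholder weights: penalize matches in the wrong study or episode.
        if same_study and same_temporal_group:
            return 1.0
        return 0.5 if same_temporal_group else 0.25

    def structural(attrs_pred: dict, attrs_ref: dict, weights: dict) -> float:
        # Weighted attribute agreement (sim() = exact match in this sketch),
        # normalized by total weight so the term stays in [0, 1].
        total = sum(weights.values()) or 1.0
        hits = sum(w * float(attrs_pred.get(a) == attrs_ref.get(a))
                   for a, w in weights.items())
        return hits / total

    def match_score(e_pred, e_ref, same_study, same_tg, attrs_pred, attrs_ref, weights):
        return (semantic(e_pred, e_ref)
                * temporal(same_study, same_tg)
                * structural(attrs_pred, attrs_ref, weights))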

Resources