Identifying Undiagnosed Rare Disease Patients Using Multi-Model Large Language Frameworks on Real-World Physician Notes

Speakers:

Shantanu Seth

Date:

Wednesday, May 6, 2026

Time:

4:10 pm

Summary:

Timely identification of rare disease patients remains a significant challenge, as early phenotypic clues are often embedded in unstructured physician documentation across multiple specialties. Shantanu Seth presents a multi-model large language model (LLM) framework that draws on a large-scale corpus of de-identified physician notes, representing over 300 million patient lives, to uncover potentially undiagnosed rare disease patients through layered text analytics and AI-driven reasoning.

Axtria’s approach combines deterministic information retrieval with contextual inference across three complementary LLMs — GPT-5, GPT-5 Mini, and Sonnet 4 — orchestrated in a sequential discovery pipeline:

1. Targeted String Search: The pipeline first runs high-precision string and semantic pattern searches to isolate note fragments containing disease-relevant terms, symptom mentions, and phenotype descriptors.

2. Patient Linking: Each relevant note is mapped back to the originating patient to establish a cohort of individuals with preliminary indicators of the target disease.

3. Comprehensive Contextual Review: For these patients of interest, all available clinical notes—spanning primary, specialty, and ancillary care—are aggregated to capture longitudinal context and diagnostic evolution.

4. Cross-Domain Triangulation: Using GPT-5 for deep contextual reasoning and Sonnet 4 for meta-analysis, the framework triangulates laboratory results, medication histories, symptom progressions, and physician impressions to identify patients exhibiting consistent disease signatures despite lacking explicit diagnostic codes.
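
The four steps above can be sketched in code as follows. This is a minimal illustration and not Axtria's implementation: the `Note` record, the function names, the search patterns, and the sample notes are all hypothetical, and step 4 takes a pluggable `reasoner` callable as a stand-in for whatever LLM calls the real framework makes.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Note:
    """Hypothetical shape of one de-identified physician note."""
    note_id: str
    patient_id: str
    specialty: str
    date: str  # ISO 8601, so string order equals chronological order
    text: str


def search_notes(notes: Iterable[Note], patterns: list[str]) -> list[Note]:
    """Step 1: keep notes whose text matches any disease-relevant pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [n for n in notes if any(c.search(n.text) for c in compiled)]


def link_patients(hits: Iterable[Note]) -> set[str]:
    """Step 2: map matching notes back to their originating patients."""
    return {n.patient_id for n in hits}


def aggregate_notes(notes: Iterable[Note], cohort: set[str]) -> dict[str, list[Note]]:
    """Step 3: gather every note for each cohort patient, oldest first."""
    by_patient: dict[str, list[Note]] = defaultdict(list)
    for n in notes:
        if n.patient_id in cohort:
            by_patient[n.patient_id].append(n)
    return {pid: sorted(ns, key=lambda n: n.date) for pid, ns in by_patient.items()}


def triangulate(history: dict[str, list[Note]],
                reasoner: Callable[[str], str]) -> dict[str, str]:
    """Step 4: hand each longitudinal record to a reasoning function.

    `reasoner` is an assumption: any callable that takes the assembled
    record and returns an assessment with its rationale.
    """
    results = {}
    for pid, ns in history.items():
        record = "\n".join(f"[{n.date} | {n.specialty}] {n.text}" for n in ns)
        results[pid] = reasoner(record)
    return results


# Invented sample data and patterns, for illustration only.
notes = [
    Note("n1", "p1", "primary care", "2023-01-10", "Patient reports fatigue."),
    Note("n2", "p1", "nephrology", "2023-06-02", "Persistent proteinuria of unclear cause."),
    Note("n3", "p2", "primary care", "2023-03-15", "Routine visit, no complaints."),
]
PATTERNS = [r"\bproteinuria\b", r"\bunexplained neuropathy\b"]

hits = search_notes(notes, PATTERNS)
cohort = link_patients(hits)
history = aggregate_notes(notes, cohort)
# Stub reasoner: a real pipeline would call a language model here.
results = triangulate(history, lambda record: f"{len(record.splitlines())} notes reviewed")
print(results)
```

Note that the string search touches only the matching note, while the aggregation step pulls in the same patient's earlier primary-care note, which is what gives the later reasoning stage its longitudinal context.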

This multi-model ensemble produces interpretable outputs with traceable rationale chains, enhancing clinician trust and enabling focused chart review. Across pilot rare-disease cohorts, the framework achieved a 4–5× improvement in patient identification rates versus ICD-based baselines, with physician validation confirming >80% contextual accuracy.
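
For readers who want to see how figures of this kind are typically derived, here is a small sketch. It assumes that the identification improvement is the ratio of patients surfaced by the framework to patients surfaced by the ICD-based baseline, and that "contextual accuracy" is the fraction of physician-reviewed cases that were confirmed; the abstract does not define either metric, so both definitions, the function names, and the counts below are illustrative assumptions rather than study data.

```python
def identification_uplift(framework_count: int, baseline_count: int) -> float:
    """Ratio of patients found by the framework to patients found by the baseline."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return framework_count / baseline_count


def confirmed_fraction(confirmed: int, reviewed: int) -> float:
    """Share of physician-reviewed cases that the reviewers confirmed."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    if not 0 <= confirmed <= reviewed:
        raise ValueError("confirmed must be between 0 and reviewed")
    return confirmed / reviewed


# Made-up counts chosen only to land in the reported ranges.
print(identification_uplift(450, 100))  # 4.5
print(confirmed_fraction(82, 100))      # 0.82
```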

By fusing deterministic filtering with hierarchical LLM reasoning, this framework demonstrates a scalable, privacy-preserving approach for surfacing undiagnosed patients, accelerating early intervention, and informing precision outreach strategies. Future extensions will integrate claims, genomic, and laboratory data to further strengthen predictive fidelity and clinical utility.
