Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Basic information

Key figures

Figure 1
Figure 2
Figure 3

Abstract

English

Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi.

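Chinese summary (translated)

Evidence from large-scale real-world data increasingly informs regulatory evaluation and healthcare decision-making. Administrative claims provide population-level longitudinal records of healthcare utilization and expenditure, with detailed coding of diagnoses, procedures, and medications, but their potential as a substrate for healthcare foundation models has not been fully explored. This paper presents ReClaim, a generative Transformer trained from scratch on MarketScan claims data (2008-2022, more than 200 million enrollees, 43.8 billion medical events). ReClaim models longitudinal trajectories of diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. On 1,000+ disease-onset prediction tasks, ReClaim reaches a mean AUC of 75.6%, well above disease-specific LightGBM (66.3%) and the Transformer-based Delphi model (69.4%), with the largest gains on rare diseases. These advantages hold in retrospective and prospective evaluation and in external validation on two independent datasets. Performance improves monotonically with scale, and post-training adds 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim also captures financial outcomes: in healthcare expenditure forecasting it raises explained variance from 0.28 to 0.37 (relative to LightGBM), and in target trial emulation it reduces systematic bias by 72% on average (relative to Delphi).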
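To put the reported numbers side by side, the short Python sketch below derives the AUC gaps in percentage points and the relative gain in explained variance. It uses only figures quoted in the abstract; the function and variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract: mean AUC in percent,
# explained variance as a fraction.
auc = {"ReClaim": 75.6, "LightGBM": 66.3, "Delphi": 69.4}
explained_variance = {"LightGBM": 0.28, "ReClaim": 0.37}


def auc_gain_points(model, baseline):
    """Absolute difference in mean AUC, in percentage points."""
    return round(auc[model] - auc[baseline], 1)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a percentage."""
    return round(100.0 * (new - old) / old, 1)


gain_vs_lightgbm = auc_gain_points("ReClaim", "LightGBM")  # 9.3 points
gain_vs_delphi = auc_gain_points("ReClaim", "Delphi")      # 6.2 points
ev_relative_gain = relative_gain(
    explained_variance["ReClaim"], explained_variance["LightGBM"]
)  # about 32.1 percent

print(gain_vs_lightgbm, gain_vs_delphi, ev_relative_gain)
```

The 0.28 to 0.37 change in explained variance is thus an absolute gain of 0.09 and a relative gain of roughly 32%, which is worth distinguishing from the AUC gaps, which are stated in percentage points.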

Core contributions

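  1. ReClaim, a foundation model for medical claims: presented as the first large-scale application of a generative Transformer to nationwide medical claims data (43.8 billion events), showing that administrative claims are a feasible training substrate for foundation models.
  2. Scaling behavior: performance improved monotonically across three model sizes (140M → 700M → 1.7B parameters), indicating that medical foundation models benefit from scale.
  3. Broad clinical prediction: a mean AUC of 75.6% across 1,000+ disease-onset prediction tasks, ahead of both disease-specific LightGBM (66.3%) and Delphi (69.4%), with the largest gains on rare diseases, which LightGBM handles poorly.
  4. Better real-world evidence: practical value in healthcare expenditure forecasting (explained variance 0.28 → 0.37 relative to LightGBM) and in target trial emulation (systematic bias reduced by 72% on average relative to Delphi).
  5. Generalization across time and data sources: the gains held in retrospective and prospective evaluations and on two independent external datasets.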
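The headline metric in item 3 is a mean AUC over many disease-onset tasks. A minimal sketch of how such a figure could be computed is shown below, assuming an unweighted (macro) average of per-task AUCs; the abstract does not say how the paper aggregates tasks, and the toy labels, scores, and task names are invented for illustration.

```python
def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Uses the pairwise (Mann-Whitney) definition; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(tasks):
    """Unweighted mean of per-task AUCs.

    `tasks` maps a task name to a (labels, scores) pair.
    """
    per_task = {name: auc_score(y, s) for name, (y, s) in tasks.items()}
    return sum(per_task.values()) / len(per_task), per_task


# Toy data, invented for illustration only.
toy_tasks = {
    "disease_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "disease_b": ([0, 1, 0, 1, 0], [0.2, 0.9, 0.3, 0.7, 0.1]),
}
macro_auc, per_task_auc = mean_auc(toy_tasks)
print(per_task_auc, macro_auc)
```

With a macro average every task counts equally, so rare diseases weigh as much as common ones in the summary figure. The pairwise loop is quadratic in the number of patients and is only meant for small examples.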

Method overview

ReClaim is built on a GPT-style decoder-only Transformer architecture and is trained from scratch on MarketScan medical claims data. Key design choices:
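As an illustration only, and not a description of ReClaim's actual tokenizer or training setup, the sketch below shows one way a claims trajectory could be serialized into tokens and turned into next-token prediction pairs for a decoder-only model. The `ClaimEvent` type, the token format, and the log-scale cost buckets are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimEvent:
    """Hypothetical record for one claims event."""
    day: int    # days since the start of enrollment (assumed time axis)
    kind: str   # "dx" (diagnosis), "px" (procedure), "rx" (medication), "cost"
    value: str  # a code, or a dollar amount as text when kind == "cost"


def cost_bucket(amount):
    """Map a dollar amount to a coarse log10 bucket token (assumed scheme)."""
    if amount <= 0:
        return "COST_0"
    return f"COST_1e{int(math.floor(math.log10(amount)))}"


def serialize(events):
    """Order events by day and emit one token per event."""
    tokens = []
    # sorted() is stable, so same-day events keep their input order.
    for ev in sorted(events, key=lambda e: e.day):
        if ev.kind == "cost":
            tokens.append(cost_bucket(float(ev.value)))
        else:
            tokens.append(f"{ev.kind.upper()}:{ev.value}")
    return tokens


def next_token_pairs(tokens):
    """Build (context, target) pairs for autoregressive training."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Illustrative trajectory; codes and amounts are made up for the example.
events = [
    ClaimEvent(30, "rx", "metformin"),
    ClaimEvent(0, "dx", "E11.9"),
    ClaimEvent(30, "cost", "125.50"),
]
tokens = serialize(events)
pairs = next_token_pairs(tokens)
print(tokens)
print(pairs)
```

Putting diagnoses, procedures, medications, and expenditure in a single ordered token stream is what lets one autoregressive model cover both clinical and financial outcomes; how the paper actually encodes time gaps and dollar amounts is not stated in this note.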

Experimental results

Limitations and caveats

Related concepts (detailed)

Related concepts


Imported: 2026-05-05 06:01 Source: arXiv Daily Wiki Update 2026-05-05