πŸš€Shippingscore 99.0May 15, 2026Β·2605.16215cs.AIcs.CL

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

Narrative

Fully Open Meditron is a complete, auditable training pipeline for clinical LLMs β€” releasing not just model weights but the full data provenance, curation code, and evaluation protocol. The training corpus unifies eight medical QA datasets plus three synthetic extensions (46,469 clinical guidelines, exam-style QA, clinical vignettes), with decontamination and validation by a four-physician panel. Applied to five base models, the best variant (Apertus-70B-MeditronFO) improves +6.6 points on aggregate medical benchmarks over its base, and Gemma-3-27B-MeditronFO beats MedGemma on HealthBench (58% vs 55.9%) and in 58.6% of head-to-head judge comparisons.

No production traction yet β€” zero citations and the GitHub references are all automated arxiv digest trackers, not downstream builders. The Meditron brand has prior academic recognition from EPFL's earlier work, so this will likely attract attention from health system AI teams navigating regulatory scrutiny over opaque training pipelines, but nothing is shipping against it yet.

Abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

Citation timeline
Not enough citation snapshots yet to plot a timeline. Come back after a few cron runs.