Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi
An LLM-guided tree search system autonomously writes, runs, and refines epidemiological forecasting code, iterating on model candidates without human intervention. In a live, real-time evaluation during the 2025-2026 US respiratory season, the ensemble of machine-generated models matched or beat the CDC's human-curated hub ensembles for influenza, COVID-19, and RSV out-of-sample. Key engineering details: log-scale reward metrics prevent reward hacking, and an automated judge enforces that generated code adheres to epidemiological theory rather than just fitting data patterns.
No production traction yet: zero citations, and the GitHub references are all AI news aggregators, not implementations or forks. The work comes out of a team with ties to Harvard (Brenner) and UMass (Reich, who runs FluSight), which gives it credibility in the CDC forecasting ecosystem, and the prospective real-world evaluation is stronger evidence than most ML-for-epidemiology papers offer. Worth watching for whether it integrates into CDC FluSight infrastructure or spawns an open toolkit, but nothing is shipping today.
Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
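To make the log-scale point concrete, the sketch below compares a linear-scale error with a log-scale error on made-up numbers. It is an illustration only: the abstract does not specify the exact metric, so the log(1 + x) transform, the function names, and every count here are assumptions, not the authors' implementation. The idea shown is that on a linear scale one high-count location can dominate the average, so a candidate can score well while forecasting small locations poorly, whereas a log-scale distance weights relative error more evenly.

```python
import math


def linear_error(predicted, observed):
    """Absolute error on the raw count scale."""
    return abs(predicted - observed)


def log_scale_error(predicted, observed):
    """Absolute error after a log(1 + x) transform (assumed form, not from the paper)."""
    return abs(math.log1p(predicted) - math.log1p(observed))


def mean_score(pairs, metric):
    """Average a metric over (predicted, observed) pairs; lower is better."""
    return sum(metric(p, o) for p, o in pairs) / len(pairs)


# Hypothetical weekly counts for one large and two small locations.
observed = [10000.0, 20.0, 10.0]

# Candidate A: exact on the large location, poor on the small ones.
candidate_a = list(zip([10000.0, 1.0, 1.0], observed))

# Candidate B: roughly 5-10% off everywhere.
candidate_b = list(zip([9500.0, 18.0, 9.0], observed))

for name, pairs in [("A", candidate_a), ("B", candidate_b)]:
    print(
        name,
        "linear:", round(mean_score(pairs, linear_error), 3),
        "log:", round(mean_score(pairs, log_scale_error), 3),
    )
```

With these invented numbers, the linear metric ranks candidate A first even though it misses the small locations by an order of magnitude, while the log-scale metric ranks candidate B first. Whether this is the specific failure mode the authors' ablations address is not stated in the abstract.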