FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman
FORGE is a training-free memory evolution protocol for LLM agents. It runs a population of instances in parallel, has a reflection agent convert failed trajectories into reusable text artifacts (rules, few-shot examples, or both), then broadcasts the best-performing instance's memory to all others between stages. Tested on CybORG CAGE-2, a stochastic, partially observable network defense benchmark where all four tested LLM families start with strongly negative zero-shot rewards, FORGE delivers a 1.7–7.7× improvement over zero-shot and 29–72% over single-stream Reflexion across all 12 model-representation combinations, with major-failure rates dropping to as low as ~1%. The key mechanism is the population broadcast itself; graduation mainly saves compute.
No production traction yet. The GitHub repos referencing it are all arXiv aggregators and RSS scrapers, none implementing or extending the method. Zero citations on Semantic Scholar. The evaluation is also single-environment (CAGE-2 B-line only), so generalization claims are explicitly directional. Builders should treat this as a promising pattern for adversarial, long-horizon agent tasks rather than a validated framework ready to drop into production.
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, in which a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (returns below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
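The two-level structure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Instance`, `forge`, `run_episode`, `reflect`, and `evaluate` are hypothetical, as are both thresholds, the order of graduation and broadcast within a stage, and the toy stand-ins for the environment and the reflection agent. The abstract does not specify the graduation criterion or what counts as a failed trajectory; simple score cutoffs are assumed here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Memory = List[str]  # prompt-injected text artifacts (rules and/or examples)


@dataclass
class Instance:
    memory: Memory = field(default_factory=list)
    graduated: bool = False
    score: float = float("-inf")


def forge(
    population: List[Instance],
    run_episode: Callable[[Memory], Tuple[float, str]],  # -> (return, trajectory)
    reflect: Callable[[str], str],                       # trajectory -> artifact
    evaluate: Callable[[Memory], float],
    stages: int,
    episodes_per_stage: int,
    failure_threshold: float,     # assumption: "failed" means return below this
    graduation_threshold: float,  # assumption: freeze once score reaches this
) -> Instance:
    for _ in range(stages):
        active = [inst for inst in population if not inst.graduated]
        if not active:
            break
        # Inner loop (Reflexion-style): reflect on failed trajectories only.
        for inst in active:
            for _ in range(episodes_per_stage):
                ret, trajectory = run_episode(inst.memory)
                if ret < failure_threshold:
                    inst.memory.append(reflect(trajectory))
            inst.score = evaluate(inst.memory)
        # Graduation (assumed criterion): freeze instances that are good enough.
        for inst in active:
            if inst.score >= graduation_threshold:
                inst.graduated = True
        # Outer loop: broadcast the best memory to every non-frozen instance.
        best = max(population, key=lambda i: i.score)
        for inst in population:
            if not inst.graduated and inst is not best:
                inst.memory = list(best.memory)  # copy; no weights are touched
    return max(population, key=lambda i: i.score)


# Toy stand-ins (hypothetical): return improves with each stored artifact.
def toy_return(memory: Memory) -> float:
    return min(-10.0, -130.0 + 40.0 * len(memory))


def toy_episode(memory: Memory) -> Tuple[float, str]:
    return toy_return(memory), f"trajectory with {len(memory)} artifacts"


def toy_reflect(trajectory: str) -> str:
    return "rule derived from: " + trajectory


pop = [Instance(), Instance(memory=["seed rule"])]
winner = forge(pop, toy_episode, toy_reflect, toy_return,
               stages=3, episodes_per_stage=2,
               failure_threshold=-30.0, graduation_threshold=-20.0)
print(winner.score, [len(i.memory) for i in pop])
```

In this toy run the seeded instance graduates after the first stage and its memory is copied to the other instance, which then graduates in the second stage without further reflection. That mirrors, in a very simplified way, the reported finding that broadcast drives the gains while graduation mainly avoids spending compute on converged instances.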