What do these badges mean?
- 🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper, so people are building on it.
- 📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
- 💤 Quiet: Published but no notable signal yet. Most papers live here and could become anything later.
- 🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal to Shipping, deferred until Twitter/X data is wired up.
- 2605.18703 · May 18, 2026 · ~11 min · cs.CL, cs.LG
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, +11
ELI5: A system that automatically creates realistic practice environments and training scenarios for AI agents to learn how to use tools and APIs. Instead of manually building fake environments or relying on expensive real APIs, it explores actual software systems and generates natural multi-turn conversations that teach agents to reason like humans.
Problem solved: Training tool-use agents is expensive and data-scarce: real APIs cost money, LLM simulators hallucinate, and existing synthetic data is either single-turn or too instruction-like. EnvFactory automates both environment discovery and realistic trajectory generation, cutting the number of required environments by 5× while improving agent performance.
- 2605.18592 · May 18, 2026 · ~14 min · cs.LG, cs.AI, cs.CL
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, +3
ELI5: A system that remembers what went wrong during AI training and uses those memories to improve how it grades future attempts. Instead of starting fresh each time, it builds up a library of evaluation insights that help it catch recurring problems and guide the model better.
Problem solved: RL fine-tuning with rubrics keeps rediscovering the same evaluation principles and missing recurring failure patterns because it doesn't retain diagnostic information between training steps. AMARIS fixes this by maintaining persistent memory of what's been learned about model behavior, avoiding wasted recomputation and enabling curriculum-like progression.
- 2605.18591 · May 18, 2026 · ~7 min · cs.LG, cs.AI
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation
Mingfei Sun
ELI5: A cheaper way to compute natural policy gradients (which help RL agents learn faster) by transforming the reward signal instead of explicitly building and inverting a huge matrix. Think of it as solving the problem backward through your neural network rather than doing expensive linear algebra.
Problem solved: Natural policy gradients are theoretically better for RL but prohibitively expensive in practice because they require computing, storing, and inverting the Fisher matrix. RAT makes them practical by avoiding that matrix altogether, letting practitioners use better optimization without massive computational overhead.
- 2605.18580 · May 18, 2026 · ~8 min · cs.AI, cs.LG
When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State
Peiying Zhu, Sidi Chang
ELI5: When training AI agents to make decisions (like pricing), just checking if they hit the business goal isn't enough: they might break important rules along the way. This paper shows how to evaluate whether an agent actually behaves as it should by checking its full sequence of actions, not just the final outcome.
Problem solved: Companies deploying RL agents discover too late that policies hit revenue targets while violating compliance rules or competitive norms. Current evaluation misses these behavioral failures because it only checks outcomes, not whether the agent preserves the discipline and patterns of the system it's replacing.
- 2605.18529 · May 18, 2026 · ~9 min · cs.AI
AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
Zhenlin Wei, Pu Jian, Yingzhuo Deng, Xiaohan Wang, +5
ELI5: When training AI models to solve hard problems with rewards, the model doesn't know which tokens (words) actually helped get the right answer. This paper fixes that by having the model reflect on its mistakes and generate hints to itself, then use those hints to credit the right tokens instead of all of them equally.
Problem solved: Training LLMs with reinforcement learning hits a wall: reward signals come at the sequence level, so every token gets equal credit even though only some mattered. This causes late-stage training collapse and wasted learning. AMR-SD pinpoints which tokens actually contributed to success.
- 2605.18508 · May 18, 2026 · ~10 min · cs.LG, cs.AI
DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization
Chengpeng Hu, Yingqian Zhang, Hendrik Baier
ELI5: Instead of black-box neural networks, this method learns policies as readable programs (like code), and keeps them naturally discrete during training so you don't lose performance when converting from continuous math to actual code.
Problem solved: Programs are interpretable and editable, but existing methods train them as soft continuous versions and then convert them to discrete code, causing performance drops and requiring extra fine-tuning. DiPRL stays discrete throughout training.
- 2605.18500 · May 18, 2026 · ~9 min · cs.CL
Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning
Li Wang, Xiaohan Wang, Xiaodong Lu, Zipeng Zhang, +4
ELI5: Instead of using tools immediately when an LLM thinks it needs one, this method lets the model plan out all its tool requests first, then execute them in the right order. It's like writing a shopping list before going to the store rather than buying items one at a time.
Problem solved: LLMs lose their reasoning flow when they stop mid-thought to execute a tool, hurting their ability to solve complex math problems. This approach keeps the model thinking clearly by separating the decision to use a tool from actually running it.
- 2605.18449 · May 18, 2026 · ~11 min · cs.LG, cs.AI
Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights
Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai
ELI5: Instead of assuming customers take the shortest path through a store (which they don't), this method uses AI that learns realistic customer behavior patterns to predict where people actually walk and what they'll buy, helping retailers optimize shelf placement without needing expensive tracking cameras.
Problem solved: Retailers need to know how customers move through stores to place products profitably, but collecting real trajectory data is expensive. Simple math-based shortcuts like shortest-path algorithms miss the messy reality of how people actually shop, leading to poor layout decisions.
- 2605.18437 · May 18, 2026 · ~8 min · cs.LG, cs.DC
Heterogeneous Tasks Offloading in Vehicular Edge Computing: A Federated Meta Deep Reinforcement Learning Approach
Yaorong Huang, Jingtao Luo, Xuechao Wang
ELI5: When cars need to offload computing tasks to nearby servers, this system learns how to route those tasks efficiently while keeping data private. It uses graph neural networks to understand task dependencies and federated learning so each server learns without sharing raw data.
Problem solved: Vehicles generate complex, interdependent computational tasks that need fast offloading decisions, but sharing data across distributed edge servers for training raises privacy concerns and is slow. This framework trains collaboratively without exposing data while handling complex task structures.
- 2605.18374 · May 18, 2026 · ~14 min · cs.LG, cs.AI
Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers
Soheyl Massoudi, Gabriel Apaza, Milad Habibi, Mark Fuge
ELI5: Instead of asking an AI to solve each hard puzzle from scratch, this paper trains it to learn and write reusable solver code that works for an entire family of similar problems, like teaching someone to write a working algorithm rather than just guessing answers.
Problem solved: Solving combinatorial puzzles with LLMs is expensive because you need many tries per problem. This learns solvers that work reliably across problem families, cutting inference costs by 91× while keeping quality high.
- 2605.18299 · May 18, 2026 · ~13 min · cs.AI, cs.CL, cs.IR
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, +5
ELI5: A search-augmented AI agent learns to write better search queries by comparing itself to a smarter version of itself that knows how previous attempts turned out. Instead of just getting one reward at the end, the agent gets feedback on each individual search decision.
Problem solved: Search-augmented reasoning agents struggle to learn which queries are worth making because they only get a single reward signal at the end of a rollout, not credit for individual search decisions. Previous fixes required expensive teacher models or manual annotations.
- 2605.18261 · May 18, 2026 · ~8 min · cs.CL
Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains
Zhonghang Yuan, Zhefan Wang, Fang Hu, Zihong Chen, +6
ELI5: A method that helps AI models get better at answering questions in knowledge-heavy fields (like history or science) by automatically creating practice problems with checkable answers, then training the model to reason through them step-by-step.
Problem solved: LLMs struggle with knowledge-intensive domains because there aren't enough verified training examples to learn from, and current training methods only check if final answers are right, missing flawed reasoning along the way. This fixes both problems.
- 2605.18246 · May 18, 2026 · ~7 min · cs.LG, cs.AI
Privacy Preserving Reinforcement Learning with One-Sided Feedback
Lin William Cong, Guangyan Gan, Hanzhang Qin, Zhenzhen Yan
ELI5: A robot learns to make decisions in complex environments while keeping its observations private: it only sees partial information and gets reward feedback for some moves, not all. The researchers show you can learn effectively even with these privacy restrictions.
Problem solved: Real-world RL agents often can't reveal what they observe or how they're being trained (medical data, personal preferences, proprietary sensors). This paper proves you don't have to sacrifice learning speed to protect privacy in high-dimensional problems.
- 2605.18191 · May 18, 2026 · ~11 min · cs.AI
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
Guining Cao, Jiaxin Peng, Chu Zeng, Yu Zhao, +2
ELI5: A new training method that teaches AI models to generate more diverse and creative responses by comparing pairs of outputs instead of scoring them individually, while explicitly encouraging the model to produce varied answers rather than repeating the same thing.
Problem solved: Current RL methods either require expensive human scoring for open-ended tasks like creative writing, or they collapse into repetitive outputs. This method works with pairwise comparisons (easier to collect) and actively pushes the model to generate diverse responses instead of defaulting to safe, stereotypical answers.
- 🚀 Shipping · 2605.16143 · May 15, 2026 · ~9 min · cs.AI, cs.CL
Look Before You Leap: Autonomous Exploration for LLM Agents
Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, +5
⭐ 159 stars / 10 repos · 📚 0 cites
ELI5: LLM agents jump to conclusions too fast in new environments instead of poking around first. This paper teaches them to systematically explore and map out what's possible before trying to solve tasks, like learning the layout before cooking dinner in an unfamiliar kitchen.
Problem solved: LLM-based agents fail in novel environments because they rely on pre-training rather than gathering real info about what's actually possible. Teams need agents that can adapt to new situations instead of confidently doing the wrong thing.
- 🚀 Shipping · 2605.16103 · May 15, 2026 · ~7 min · cs.AI
Sign-Separated Finite-Time Error Analysis of Q-Learning
Donghwan Lee
⭐ 208 stars / 6 repos · 📚 0 cites
ELI5: Researchers figured out why Q-learning (a way to teach AI agents) makes mistakes in a lopsided way: it overestimates some values but underestimates others. They split the error into positive and negative parts and showed the negative part shrinks faster, explaining where the asymmetry comes from.
Problem solved: Q-learning's convergence guarantees were incomplete: practitioners didn't understand why it systematically overestimates certain values or how fast errors actually shrink. This analysis reveals the asymmetry and provides tighter, more predictive bounds for finite-time behavior.
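
The overestimation mentioned in the Q-learning entry above can be illustrated with a small simulation. This is a minimal sketch of the well-known maximization bias only, not the paper's sign-separated analysis: it assumes every action's true value is 0 and that each estimate carries independent Gaussian noise. The function name `max_of_noisy_estimates` and all parameter values are hypothetical, chosen for illustration.

```python
import random


def max_of_noisy_estimates(num_actions, noise_std, trials, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is 0 (assumed setup).

    Q-learning bootstraps from max_a Q_hat(s', a). If the estimates are
    noisy, the max tends to pick out positive noise, so the target is
    biased upward even though each individual estimate is unbiased.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs match
    total = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


# With a single action there is nothing to maximize over, so no bias appears.
bias_one = max_of_noisy_estimates(num_actions=1, noise_std=1.0, trials=20000)
# With ten actions the max of the noisy estimates sits well above the true 0.
bias_ten = max_of_noisy_estimates(num_actions=10, noise_std=1.0, trials=20000)

print(f"1 action:   average max estimate = {bias_one:.3f}")
print(f"10 actions: average max estimate = {bias_ten:.3f}")
```

The single-action average stays near zero while the ten-action average is clearly positive, which is the upward half of the asymmetry the entry describes. The underestimation side and the finite-time rates are specific to the paper and are not reproduced here.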