
All 50 · 🚀 Shipping 4 · 📈 Climbing 0 · 💤 Quiet 46 · Unscored 0

What do these badges mean?

🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper; people are building on it.
📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
💤 Quiet: Published but no notable signal yet. Most papers live here and could become anything later.
🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal; it is deferred until Twitter/X data is wired up.

💤 Quiet · 2607.09641 · Jul 10, 2026 · ~8 min · cs.LG, cs.AI
Semantic Pareto-DQN: A Multi-Objective Reinforcement Learning Framework for Financial Anomaly Detection
Cláudio Lúcio do Val Lopes, Lucca Machado da Silva
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A fraud detection system that uses AI to write short stories about transactions, then learns to catch suspicious ones without annoying legitimate customers, balancing two conflicting goals instead of picking just one.
Problem solved: Fraud detection systems often miss fraud because legitimate transactions vastly outnumber fraudulent ones. This forces a painful trade-off: catch more fraud and block real customers, or avoid blocks and miss fraud. This approach lets you adjust that trade-off dynamically.
💤 Quiet · 2607.09590 · Jul 10, 2026 · ~9 min · cs.RO, cs.AI
PAC-ACT: Post-training Actor-Critic for Action Chunking Transformers
Yujie Pang, Zudong Li
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A method to improve robot policies that predict multiple action steps at once by using reinforcement learning instead of just copying human demonstrations, while keeping the model fast and memory-efficient for real factory work.
Problem solved: Industrial robots trained on human examples alone fail when conditions change slightly or when they need to apply precise force (like assembly tasks); this method lets them learn safer, more reliable behaviors through trial-and-error while staying practical for real-time control.
💤 Quiet · 2607.08724 · Jul 9, 2026 · ~9 min · cs.LG, cs.RO
Latent Memory Palace: Reasoning for Control as Autoregressive Variational Inference
Chuning Zhu, Eva Xu, Jose Barreiros, Krishnan Srinivasan, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A robot learns to solve tasks by thinking through a series of steps in a hidden 'thought space' rather than speaking out loud. It can spend more time thinking on hard problems and less on easy ones, similar to how humans deliberate differently for different decisions.
Problem solved: Robot policies struggle with tasks needing multi-step reasoning and precise spatial control. Language-based reasoning is too coarse for continuous movements, and existing control methods don't adapt their computation based on problem difficulty.
💤 Quiet · 2607.08703 · Jul 9, 2026 · ~8 min · cs.LG
MPFlow: Learning Budgeted Max-Flow Optimization on the Lightning Network with Deep Graph Reinforcement Learning
Harrison Rush, Vincent Davis, Simone Antonelli, Vikash Singh, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A reinforcement learning agent learns where to open new payment channels in Bitcoin's Lightning Network to maximize transaction capacity, like deciding which roads to build in a city to handle the most traffic flow.
Problem solved: Lightning Network nodes waste money opening channels to popular hubs that don't improve routing capacity. This tool recommends smarter channel placements that maximize payment throughput for a fixed budget, helping operators use capital more efficiently.
💤 Quiet · 2607.08444 · Jul 9, 2026 · ~13 min · stat.ML, cs.LG
Statistical Efficiency and Inference of Quantile Distributional Reinforcement Learning
Zijie Cheng, Yang Peng, Zhihua Zhang
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: This paper figures out how many samples you need to accurately learn what distribution of rewards an AI policy will get, and proves that using quantiles (percentiles) is a mathematically optimal way to do it.
Problem solved: When training RL agents, you want to know not just the average reward but the full distribution of possible outcomes. This paper proves quantile-based methods can learn this distribution efficiently and enables valid statistical inference on the results.
💤 Quiet · 2607.08443 · Jul 9, 2026 · ~7 min · cs.NI, cs.AI
ADORN: Adaptive Drift handling for Open RAN using Reinforcement Learning
Ashit Kumar Subudhi, Bhargav Chirumamilla, Shubham Vaishnav, Mduduzi C. Hlophe, +4
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system that learns when to retrain AI models for predicting network traffic in wireless networks, balancing accuracy against the cost of retraining, like knowing when to update your weather forecast model without doing it every hour.
Problem solved: AI models for network forecasting degrade as traffic patterns change, but retraining them constantly is expensive and slows down the system. This system learns the optimal retraining schedule to keep accuracy high while minimizing computational waste.
💤 Quiet · 2607.08409 · Jul 9, 2026 · ~8 min · cs.CL, cs.AI
When Synthetic Speech Is All You Have: Better Call GRPO
Shashi Kumar, Yanis Labrak, Hasindri Watawana, Sergio Burdisso, +4
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When training speech recognition for banks, you can't use real customer calls for privacy reasons. This paper shows that using fake AI-generated speech with a reinforcement learning technique (GRPO) works 40% better than traditional training methods, because it teaches the model to be more careful about when to stop listening and how to match words to audio.
Problem solved: Banks and regulated industries need speech recognition but can't collect real customer audio due to privacy laws. Synthetic speech is the only option, but models trained on it perform poorly on real speech. This work shows RL can fix the gap without needing any real data.
💤 Quiet · 2607.07690 · Jul 8, 2026 · ~12 min · cs.LG, cs.AI, cs.CL
Agon: Competitive Cross-Model RL with Implicit Rival Grading of Reasoning
Vladislav Beliaev
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Two AI models compete by solving math problems together: one drafts a solution, the other reads it and tries to solve the problem too. Whoever gets the right answer 'wins,' which forces both models to think better because they're competing against each other.
Problem solved: Current reasoning training only grades final answers, so models learn to write longer solutions rather than think better. Agon grades the thinking process implicitly by having models compete, improving reasoning on hard math and code problems without needing labeled reasoning steps.
💤 Quiet · 2607.07674 · Jul 8, 2026 · ~9 min · cs.LG, cs.CL
Max Out GRPO Signal: Adaptive Trace Prefix Control for Hard Reasoning Problems
Vladislav Beliaev
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When training AI models on hard math problems, most attempts fail and teach nothing. This method gives the model a partial correct answer to start with, gradually making the help harder to remove, so it learns from problems that would otherwise be too difficult.
Problem solved: GRPO training wastes compute on problems the model cannot yet solve: when all rollouts fail, the model learns nothing. As a result, hard problems contribute nothing to training, even though those are where models need improvement most.
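Why an all-fail group carries no signal can be seen from the group-normalized advantage commonly used in GRPO-style training. The sketch below is a minimal illustration under that assumption; the function name `group_advantages` and the `eps` value are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every rollout fails: all rewards are equal, so every advantage is zero
# and the policy gradient for this problem vanishes.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: the group now has a non-zero spread to learn from.
mixed = group_advantages([0.0, 0.0, 1.0, 0.0])

print(all_fail)
print(mixed)
```

With at least one success in the group, the successful rollout gets a positive advantage and the failures get negative ones, which is the situation a prefix hint is meant to create.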
💤 Quiet · 2607.07646 · Jul 8, 2026 · ~13 min · cs.AI, cs.CL
RL Post-Training Builds Compositional Reasoning Strategies
Azwar Abdulsalam, Nishil Patel, Andrew Saxe
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Researchers trained an AI model on simple rewriting tasks, then used reinforcement learning to solve harder problems. They found RL doesn't just do more of the same thing: it combines simple rewrite rules into new, more complex strategies that work reliably on unseen problems.
Problem solved: It was unclear whether RL actually teaches models to think in new ways or just amplifies existing skills. This matters for understanding whether RL can create genuine reasoning abilities versus just brute-force searching. The paper shows RL can genuinely compose simpler skills into reusable complex strategies.
💤 Quiet · 2607.06514 · Jul 7, 2026 · ~5 min · cs.AI, cs.GT
FootsiesGym: A Fighting Game Benchmark for Two-Player Zero-Sum Imperfect-Information Games
Chase McDonald, Nathan Tsang, Wesley N. Kerr
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A training ground for AI to learn fighting game strategy where two players can't see each other's full moves ahead of time. It's simple enough to run quickly on regular computers, but complex enough to require real strategic thinking.
Problem solved: Most game benchmarks are either too simple (checkers) or too complex (full games requiring massive compute). This gives researchers an efficient way to study imperfect-information competitive AI without needing specialized hardware.
💤 Quiet · 2607.06489 · Jul 7, 2026 · ~8 min · cs.AI
Multi-Agent Deep Reinforcement Learning for Multi Objective Battery Management in Dairy Farms
Marcos Eduardo Cruz Victorio, Karl Mason
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system that uses AI agents to manage batteries on dairy farms, deciding when to store renewable energy and when to use it based on electricity prices, like having a smart coordinator that maximizes profits from buying and selling power.
Problem solved: Dairy farms struggle to integrate solar/wind renewable energy profitably while staying within grid rules. This system automatically decides when to charge/discharge batteries to capture price differences and use more clean energy without destabilizing the local grid.
💤 Quiet · 2607.05394 · Jul 6, 2026 · ~14 min · cs.LG, cs.AI, cs.CL
Weak-to-Strong Generalization via Direct On-Policy Distillation
Shiyuan Feng, Huan-ang Gao, Haohan Chi, Hanlin Wu, +6
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of running expensive reasoning training on every new large model, train a smaller model cheaply, then figure out which actions the training made that small model prefer, and teach the large model to make those same preference shifts on its own data.
Problem solved: Running reinforcement learning on large language models is prohibitively expensive: you need thousands of rollouts per training step. This work lets you do the expensive RL once on a cheap small model, then transfer those gains to a larger model without repeating the costly RL process.
💤 Quiet · 2607.05391 · Jul 6, 2026 · ~14 min · cs.AI, cs.CL, cs.LG
LLM-as-a-Verifier: A General-Purpose Verification Framework
Jacky Kwok, Shulu Li, Pranav Atreya, Yuejiang Liu, +5
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of asking an LLM 'is this answer right or wrong?', this framework lets it output a probability distribution over correctness, giving you a precise confidence score. You can then use these scores to pick the best solution from multiple attempts, or feed them into AI training loops.
Problem solved: Current LLM judges give you yes/no answers, making it hard to pick between mediocre solutions or train agents effectively. This gives you granular confidence scores so you can rank solutions accurately and provide rich feedback signals for AI systems to learn from.
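One way to picture turning a distribution over correctness into a ranking score is to take the probability-weighted mean over a small score scale and pick the top candidate. The sketch below is only an illustration of that idea: the three-level scale, the logits, and the names `expected_score` and `pick_best` are assumptions, not the paper's interface.

```python
import math

def expected_score(logits, scores):
    """Softmax over score-token logits, then the probability-weighted mean score."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, scores))

def pick_best(candidates, scores=(0.0, 0.5, 1.0)):
    """Return the name of the candidate with the highest expected score."""
    return max(candidates, key=lambda name: expected_score(candidates[name], scores))

# Hypothetical verifier logits over the scale (wrong, partly right, right).
candidates = {
    "attempt_a": [2.0, 0.5, 0.1],
    "attempt_b": [0.1, 0.5, 2.0],
    "attempt_c": [0.3, 0.3, 0.3],
}
best = pick_best(candidates)
print(best)
```

A continuous score like this separates candidates that a yes/no judge would lump together, which is what makes best-of-n selection and reward feedback finer-grained.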
💤 Quiet · 2607.05378 · Jul 6, 2026 · ~10 min · cs.LG
CompactionRL: Reinforcement Learning with Context Compaction for Long-Horizon Agents
Yujiang Li, Zhenyu Hou, Yi Jing, Jie Tang, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A technique that teaches AI agents to solve long tasks by automatically summarizing their past actions when they run out of context space, letting them continue working with a compressed memory instead of getting stuck.
Problem solved: Long-horizon agents hit context limits mid-task and can't proceed. This method trains them to summarize their work on-the-fly and continue, enabling agents to complete tasks that were previously impossible due to token constraints.
💤 Quiet · 2607.05339 · Jul 6, 2026 · ~14 min · cs.LG, cs.AI, stat.ML
TREK: Distill to Explore, Reinforce to Refine
Yuanda Xu, Zhengze Zhou, Kayhan Behdin, Jelena Markovic-Voronov, +9
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A training method that uses a teacher model to show a student model good solutions to hard problems, then pulls those solutions into the student's own thinking process before fine-tuning. Like showing someone a worked example before asking them to solve similar problems themselves.
Problem solved: Current reinforcement learning for reasoning gets stuck on hard problems because the student model never tries the right approach on its own. TREK fixes this by having a teacher demonstrate solutions first, so the student learns to try those approaches before optimizing further.
💤 Quiet · 2607.02490 · Jul 2, 2026 · ~9 min · cs.CL, cs.CV
Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
Liyan Tang, Fangcong Yin, Greg Durrett
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Vision-language models can get better at fixing their own mistakes by looking at images while they think through problems. This work teaches them to do this by showing them messed-up situations they have to recover from, making them actually use visual information instead of just talking about it.
Problem solved: Vision-language models fail when images look different from training data because they don't properly use visual clues when correcting mistakes. Teams need models that can genuinely reference and learn from what they see, not just generate text about corrections.
💤 Quiet · 2607.02440 · Jul 2, 2026 · ~8 min · cs.AI, cs.CL
EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments
Zhilin Wang, Han Song, Runzhe Zhan, Jusen Du, +12
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A benchmark that tests how well AI agents can iteratively improve robot/game-playing policies by editing code and learning from feedback, rather than just solving tasks once. It measures how agents allocate their limited attempts to get better results.
Problem solved: We lacked a standardized way to measure whether AI systems can actually improve their own policies over time through feedback. Most benchmarks just score final performance or reward, missing the realistic challenge of iterative refinement under real-world constraints.
💤 Quiet · 2607.02431 · Jul 2, 2026 · ~11 min · cs.RO, cs.AI
WorldSample: Closed-loop Real-robot RL with World Modelling
Yuquan Xue, Le Xu, Zeyi Liu, Zhenyu Wu, +4
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A robot learns tasks faster by mixing real-world practice with simulated experiences from a learned world model, using smart filtering to avoid learning from the model's mistakes.
Problem solved: Real robots are expensive to train because each physical trial costs time and money. This method reduces the number of costly real interactions needed by ~60% while actually improving performance, making robot RL practical.
💤 Quiet · 2607.02390 · Jul 2, 2026 · ~10 min · cs.LG
DecompRL: Solving Harder Problems by Learning Modular Code Generation
Juliette Decugis, Fabian Gloeckle, Francis Bach, Taco Cohen, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of trying harder to generate correct code in one shot, this method teaches AI to break hard problems into smaller pieces, write solutions for each piece separately, then mix-and-match them. This creates many more possible answers while using way less computing power.
Problem solved: LLMs hit a wall on hard coding problems: sampling more solutions wastes GPU money, and training with reinforcement learning doesn't help if the model has almost no chance of getting it right. DecompRL solves this by shifting work from expensive inference to cheap recombination, cutting compute costs while solving problems that standard approaches can't reach.
💤 Quiet · 2606.30520 · Jun 29, 2026 · ~10 min · quant-ph, cs.LG
Staged Hybridisation for Visual Quantum Reinforcement Learning via Knowledge Distillation
Javier Lazaro, Juan-Ignacio Vazquez, Pablo Garcia-Bringas
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Researchers use a teaching trick called knowledge distillation to train quantum computers to play visual games. Instead of teaching a quantum agent directly from raw pixels (which is hard), they first train a regular AI teacher to see and play the game, then transfer just the decision-making part to a tiny quantum circuit that uses the teacher's pre-learned vision.
Problem solved: Quantum computers struggle with visual reinforcement learning because they can't handle high-dimensional pixel data and are finicky to train. This approach makes it practical by breaking the problem into stages: classical vision first, then quantum decision-making, rather than forcing quantum circuits to learn everything at once.
💤 Quiet · 2606.30420 · Jun 29, 2026 · ~9 min · cs.LG
Experience Augmented Policy Optimization for LLM Reasoning
Jinda Lu, Kexin Huang, Junkang Wu, Shuo Yang, +6
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of training an AI to solve math problems from scratch each time, this method reuses hints from a previously trained AI, but adjusts those hints based on what the current AI is doing, so they stay useful as the AI improves.
Problem solved: Training LLMs to reason better via reinforcement learning is expensive and wastes past experience. Old methods either train from scratch (costly) or reuse old solutions that no longer match how the improved model behaves (misaligned).
💤 Quiet · 2606.30414 · Jun 29, 2026 · ~9 min · cs.LG
Diffusion Fine-tuning with Rewarded Moment Matching Distillation
Alexis Jacq, Guillaume Couairon, Valentin De Bortoli, Quentin Berthet, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A new method that teaches diffusion models (image generators) to be both faster and better at what humans want, by combining two training techniques that usually work separately: making them smaller/faster while tuning them toward reward signals.
Problem solved: Diffusion models are slow and hard to optimize for specific goals. Previous methods either make them fast but sacrifice quality, or optimize rewards but lose the clean outputs that distillation provides. This combines both without sacrificing either.
💤 Quiet · 2606.30406 · Jun 29, 2026 · ~8 min · cs.CL, cs.LG
MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training
Wenhan Ma, Jianyu Wei, Liang Zhao, Hailin Zhang, +9
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When you want an AI model to be good at multiple things (like math, coding, and writing), training it all at once makes it worse at each. This paper trains separate specialist models for each skill, then teaches a single model to copy all of them at the same time using their own practice attempts.
Problem solved: Companies need LLMs that excel at multiple tasks, but combining different training methods for different skills causes them to interfere with each other and lose performance. This method lets teams develop specialists independently and merge them cleanly without degradation.
💤 Quiet · 2606.30376 · Jun 29, 2026 · ~9 min · cs.LG, cs.CV
FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification
Zheming Fu, Ruizhe He, Wei Shang, Xiaoxiao Ma, +3
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A new method to improve image generation by training flow models with rewards (like human preferences) without the messy math that usually gets in the way. It reformulates the problem as learning to predict better 'directions' for the generation process, making training faster and simpler.
Problem solved: Fine-tuning generative models on rewards is slow and technically complicated: existing methods need workarounds like Classifier-Free Guidance and introduce inconsistencies between training and generation. FlowAWR eliminates these friction points and trains 2–5× faster while maintaining quality.
💤 Quiet · 2606.30345 · Jun 29, 2026 · ~13 min · cs.LG, cs.AI
DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training
Haisen Luo, Yiwei Liu, Haoning Wang, Dan Liu, +12
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A training method that helps AI models improve themselves by smartly routing problems based on difficulty (easy ones get less attention, hard ones get special handling) and focusing learning energy on the most critical reasoning steps.
Problem solved: Self-training methods waste effort on easy problems while struggling with hard ones, and they update the model uniformly everywhere instead of concentrating on crucial decision points. DRIFT fixes this by identifying which problems need what kind of help and focusing updates where they matter most.
💤 Quiet · 2606.30316 · Jun 29, 2026 · ~10 min · cs.LG
Toward an Energy-Optimized Operation of Data Centers Located in Wind Farms Using Reinforcement Learning
Jan Stenner, Alexander Kilian, Sebastian Peitz, Hermann de Meer
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system learns to automatically shift computing work in data centers based on wind availability and electricity prices, using AI agents that practice making decisions repeatedly to maximize use of free wind energy and minimize costs.
Problem solved: Data centers near wind farms waste renewable energy or pay high prices by not timing their workloads smartly. This work automates that scheduling to cut energy costs and emissions without needing perfect advance weather forecasts.
💤 Quiet · 2606.27369 · Jun 25, 2026 · ~12 min · cs.LG
Reinforcement Learning without Ground-Truth Solutions can Improve LLMs
Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang, +5
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A new method trains AI coding assistants using continuous score feedback (like "your solution got 60/100 points") instead of requiring correct/incorrect answers. It automatically adjusts reward signals to prevent some scores from dominating others, making the training work better on both scored and exact-match coding tasks.
Problem solved: Most RL training for code generation requires knowing the perfect answer upfront, but many real tasks (competitive programming, optimization) only have partial scores. RiVER eliminates this bottleneck by learning from incomplete feedback, which is cheaper to obtain and more widely available.
💤 Quiet · 2606.27180 · Jun 25, 2026 · ~14 min · cs.LG, cs.AI, cs.RO
Automating Potential-based Reward Shaping with Vision Language Model Guidance
Henrik Müller, Daniel Kudenko
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of manually designing reward signals for robot learning, this system uses a lightweight vision-language model to look at images and say which state looks closer to success, then builds a 'potential function' that safely guides the robot without tricking it into exploiting the reward system.
Problem solved: Robots trained with sparse rewards (only getting feedback at the end) explore inefficiently, and hand-crafted reward shaping often causes them to game the system instead of actually solving tasks. This automates reward design using AI feedback while mathematically guaranteeing the robot still learns the real solution.
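The guarantee mentioned here is the standard property of potential-based shaping: adding F = γ·Φ(s') − Φ(s) to each reward changes every trajectory's return from a given start state by the same constant, so the ranking of behaviors is unchanged. The sketch below checks that numerically; the potentials are made-up numbers standing in for whatever the vision-language model would produce, and the function names are hypothetical.

```python
def plain_return(rewards, gamma):
    """Discounted return of the original (sparse) rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def shaped_return(rewards, potentials, gamma):
    """Discounted return with the shaping term gamma * phi(s') - phi(s) added per step.

    potentials[t] is phi of the state at step t; it has one more entry than
    rewards, and the terminal state's potential is taken to be 0.
    """
    total = 0.0
    for t, r in enumerate(rewards):
        shaping = gamma * potentials[t + 1] - potentials[t]
        total += gamma ** t * (r + shaping)
    return total

gamma = 0.9

# Two different trajectories from the same start state (phi = 0.2).
success = ([0.0, 0.0, 1.0], [0.2, 0.5, 0.8, 0.0])
failure = ([0.0, 0.0], [0.2, 0.1, 0.0])

gap_success = shaped_return(*success, gamma) - plain_return(success[0], gamma)
gap_failure = shaped_return(*failure, gamma) - plain_return(failure[0], gamma)
print(gap_success, gap_failure)  # both equal -phi(start state), up to rounding
```

Because both gaps equal −Φ(s₀), shaping only adds dense guidance along the way and cannot make a worse trajectory look better than a good one.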
💤 Quiet · 2606.27163 · Jun 25, 2026 · ~8 min · cs.RO, cs.AI, cs.LG
Learning to Fold: prizewinning solution at LeHome Challenge 2026 (1st place online, 2nd offline)
Ilia Larchenko
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A robot learns to fold clothes by watching itself try and fail, using a single AI network to both decide what to do next and judge how well it's doing. The system combines existing learning techniques with engineering tricks like simulation-to-reality transfer and online optimization to achieve competitive performance at a robotics competition.
Problem solved: Bimanual garment folding is one of the hardest manipulation tasks for robots: it requires precise two-arm coordination, real-time adaptation, and understanding of fabric dynamics. This work shows how to train a practical system that works both in simulation and on real hardware by combining reinforcement learning with pragmatic engineering.
💤 Quiet · 2606.27112 · Jun 25, 2026 · ~6 min · cs.LG, cs.AI
Heavy-Ball Q-Learning with Residual Weighting Correction
Donghwan Lee
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: This paper speeds up Q-learning (a common reinforcement learning algorithm) by adding momentum, like a heavy ball rolling downhill that keeps its speed even on flat ground. The researchers prove when this actually works and extend it to settings where the agent uses function approximation.
Problem solved: Standard Q-learning converges slowly, which wastes computation in real RL applications. This work provides a theoretically-grounded way to accelerate convergence and tells you exactly when the speedup kicks in.
💤 Quiet · 2606.26080 · Jun 24, 2026 · ~10 min · cs.LG, cs.AI
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Changdae Oh, Wendi Li, Seongheon Park, Samuel Yeh, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of training a separate system to judge each step an AI agent takes, this paper shows you can reuse the policy the agent was already trained with: its log-probability difference tells you which actions were good or bad, acting as a free step-level scorecard.
Problem solved: Building reward models that evaluate long-horizon agent behavior step-by-step is expensive (requires annotation or simulation). This method extracts step-level signals from standard RL training at no extra cost, enabling better test-time scaling and error detection without additional labeled data.
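A log-probability difference can be read as a per-step score in a few lines. The sketch below assumes the difference is taken between the RL-trained policy and its pre-RL reference, which the summary does not spell out; the log-probabilities, `beta`, and the function names are all hypothetical.

```python
def step_scores(logp_trained, logp_reference, beta=1.0):
    """Per-step score: beta * (log pi_trained(a|s) - log pi_reference(a|s))."""
    return [beta * (lt - lr) for lt, lr in zip(logp_trained, logp_reference)]

def first_suspect_step(scores, threshold=0.0):
    """Index of the first step scoring below the threshold, or None if there is none."""
    for i, s in enumerate(scores):
        if s < threshold:
            return i
    return None

# Made-up log-probabilities of the actions an agent actually took at three steps.
logp_trained = [-0.2, -1.5, -0.3]
logp_reference = [-0.9, -0.7, -0.4]

scores = step_scores(logp_trained, logp_reference)
suspect = first_suspect_step(scores)
print(scores, suspect)
```

Steps the trained policy favors more than the reference score positive, and the first negative step is a natural candidate for error detection or for pruning during test-time search.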
💤 Quiet · 2606.26027 · Jun 24, 2026 · ~10 min · cs.CL, cs.LG
Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When AI models learn to use tools through reinforcement learning, they sometimes suddenly forget how to do it properly because they get confused about which tokens control the tool-calling process. Adding human guidance signals during training (like showing correct examples alongside the learning) fixes this collapse and keeps the model stable.
Problem solved: RL training for tool-using agents fails catastrophically in production, causing models to lose structured behavior mid-task. This makes deployed agentic systems unreliable. The paper shows why this happens and proves that mixing supervised examples into RL training prevents these breakdowns.
💤 Quiet · 2606.26006 · Jun 24, 2026 · ~10 min · cs.RO, cs.AI
FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation
Shuyi Zhang, Yunfan Lou, Hongyang Cheng, Yichen Guo, +7
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A method to teach robots to do better than their training data by using reinforcement learning, but without the usual instability and sample waste. It warms up the robot's value estimates first, then filters exploration to only learn from good actions.
Problem solved: RL fine-tuning of robot vision-language models is sample-inefficient and unstable: robots forget what they learned and explore badly. FORCE cuts training time by 32% and removes the need for human babysitting during RL fine-tuning.
💤 Quiet · 2606.23680 · Jun 22, 2026 · ~11 min · cs.RO, cs.AI, cs.LG
CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation
Sikai Li, Shuning Li, Zhenyu Wei, Yunchao Yao, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A robot learns to walk and do complex hand manipulation at the same time by breaking the problem into two coordinated pieces: one for body movement, one for detailed finger control. Instead of stopping to pick something up, it manipulates objects while walking.
Problem solved: Humanoid robots typically stop walking to manipulate objects with simple grippers. This work enables continuous dexterous manipulation (complex multi-finger hand tasks) while locomoting, making robots useful for realistic picking, carrying, and interaction tasks without constant stopping.
💤 Quiet · 2606.23678 · Jun 22, 2026 · ~10 min · cs.CV, cs.AI
AIR: Adaptive Interleaved Reasoning with Code in MLLMs
Cong Han, Xiaohan Lan, Haibo Qiu, Yujie Zhong
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: This paper teaches multimodal AI models to think step-by-step while writing and running code to solve math and numerical problems. It trains the model to decide when to use code tools adaptively, like a student choosing when to grab a calculator vs. working through logic.
Problem solved: Existing multimodal models struggle with numerical computation and rely on fixed rules for tool-use. This approach enables models to flexibly reason and compute answers to complex problems involving numbers and visual data simultaneously.
💤 Quiet · 2606.23640 · Jun 22, 2026 · ~11 min · cs.LG, cs.AI, cs.RO
Learning Process Rewards via Success Visitation Matching for Efficient RL
Raymond Tsao, Andrew Wagenmaker, Sergey Levine
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of waiting until a robot completes a task to give it a reward, this method learns what successful attempts *look like* and rewards the robot for acting similarly to those successes, turning a single end-goal reward into constant feedback on the entire path.
Problem solved: Sparse rewards in robotics make learning inefficient because agents get almost no feedback until they randomly succeed. This method creates dense guidance by learning from successful trajectories, letting robots train much faster without changing what the optimal behavior actually is.
💤 Quiet · 2606.20411 · Jun 18, 2026 · ~6 min · cs.LG
Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning
Hsiao-Ru Pan, Bernhard Schölkopf
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A technique that estimates how good an action is by modeling how the world changes, instead of just guessing from past rewards. This paper makes it work with incomplete information and faster computation, so it can train AI agents using fewer real-world attempts.
Problem solved: Deep RL agents waste lots of real-world interactions to learn. Previous advantage estimation needed full game state visibility and expensive world models. This makes the technique practical for partial observability and scales to complex visual environments.
💤 Quiet · 2606.20357 · Jun 18, 2026 · ~7 min · cs.LG
On the Variance of Temporal Difference Learning and its Reduction Using Control Variates
Hsiao-Ru Pan, Bernhard Schölkopf
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: This paper explains why temporal difference learning (a core RL technique) is good at reducing noise in estimates, and shows how to make it even better using a statistical trick called control variates, similar to how averaging multiple measurements reduces measurement error.
Problem solved: RL practitioners don't have clear guidance on variance-bias tradeoffs between different estimation methods, making it hard to choose between TD, Monte Carlo, and advantage estimation approaches. This work provides theoretical guarantees and practical insights to pick the right method.
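The control-variate trick itself fits in a few lines: subtract a correlated quantity whose mean is known, and the estimate keeps the same expectation with less variance. The sketch below is the generic textbook construction on synthetic numbers, not the paper's TD-specific estimator; every variable and value in it is an illustrative assumption.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# y: control variate with known mean 0. x: the noisy quantity we want to
# average, strongly correlated with y, with true mean 2.0.
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [2.0 + yi + rng.gauss(0.0, 0.1) for yi in y]

c = 1.0  # control-variate coefficient (chosen by hand here)
known_mean_y = 0.0
x_cv = [xi - c * (yi - known_mean_y) for xi, yi in zip(x, y)]

print(mean(x), pvariance(x))        # unbiased but high-variance samples
print(mean(x_cv), pvariance(x_cv))  # same target, much lower variance
```

In an RL setting the role of `y` is typically played by a baseline or value estimate, which is why the choice between TD, Monte Carlo, and advantage-based targets is largely a question of how much variance each one removes.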
💤 Quiet · 2606.20324 · Jun 18, 2026 · ~10 min · cs.SE, cs.LG
A Model-Driven Approach for Developing Families of Reinforcement Learning Environments
Xiaoran Liu, Istvan David
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A tool that automatically generates families of similar training environments for AI agents by using genetic algorithms and model transformations, rather than manually coding each variant, like having a template that can spawn hundreds of tweaked versions of a game level.
Problem solved: Creating diverse but related RL training environments is currently manual and error-prone, but agents need many environment variants to learn well. This automates that tedious process, saving time and reducing mistakes when building curriculum learning scenarios or domain-randomized simulators.
💤 Quiet · 2606.20280 · Jun 18, 2026 · ~13 min · cs.IR, cs.AI
ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval
Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, +7
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system that helps AI models better understand complex search queries by learning to rank similar results differently rather than treating all non-matching results the same. It uses reinforcement learning rules instead of traditional reward models to improve multimodal search (text + images).
Problem solved: Current multimodal search systems treat all wrong answers equally, missing subtle differences in what a complex query is actually asking for. This causes them to miss relevant results when queries have multiple layers of meaning or specific details.
💤 Quiet · 2606.19236 · Jun 17, 2026 · ~10 min · cs.LG, cs.AI, cs.CL
STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, +3
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When training AI models to solve hard problems using reinforcement learning, they often stop exploring different solutions and collapse into repetitive patterns. This paper identifies which specific tokens (words) cause this collapse and selectively adjusts how much they influence training to keep the model exploring.
Problem solved: RL training of LLMs for reasoning tasks currently suffers from policy entropy collapse: the model stops exploring and repeats the same outputs, hurting performance. Existing methods don't address which tokens drive this collapse, making it hard to stabilize training without sacrificing reward.
💤 Quiet · 2606.19222 · Jun 17, 2026 · ~6 min · cs.LG cs.AI
Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning
Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A method to selectively erase specific reasoning patterns a language model learned during reinforcement learning training, without accidentally breaking other skills. It works by identifying and updating only the most important neural network components tied to the unwanted pattern.
Problem solved: When you want to remove a specific learned behavior from an LLM (like a reasoning shortcut it picked up during RL training), standard unlearning damages lots of other capabilities. This method removes the target behavior while keeping everything else intact.
💤 Quiet · 2606.19199 · Jun 17, 2026 · ~12 min · cs.LG cs.AI
Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times
Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, Chris Develder
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of predicting when an EV will leave and then using that prediction separately, this method trains the prediction and charging control together so the prediction focuses only on what matters for making good charging decisions.
Problem solved: EV charging systems need to know when cars will leave, but that info isn't always available. Standard forecasting optimizes for prediction accuracy, but errors still break the charging control—this method cuts those errors' impact by 55% by training forecaster and controller as one unit.
💤 Quiet · 2606.18216 · Jun 16, 2026 · ~13 min · cs.CL
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, +7
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of forcing a small AI model to copy a large teacher's answers directly, this method keeps the teacher's help inside the question prompts themselves—showing the student model contrasting right and wrong answers, and highlighting patterns in its own failures, only on questions it struggles with.
Problem solved: Small language models trained by imitating large teachers either copy the teacher's quirks and fail on new problems, or learn from their own mistakes but get stuck when every attempt fails. This method lets the teacher guide without breaking the learning process, especially when the student has a zero success rate.
💤 Quiet · 2606.18183 · Jun 16, 2026 · ~5 min · stat.ML cs.LG math.PR
A Diffusion Approximation for Temporal-Difference Learning with Linear Features under Markovian Noise
M. Forzo, E. Monzio Compagnoni, A. Russo, A. Pacchiano
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: This paper explains why TD learning algorithms don't converge perfectly and instead get stuck at an error floor. It does this by modeling the algorithm as a diffusion, a noisy process that tracks the randomness of the updates, rather than assuming perfect averaging, and shows how the specific way samples arrive affects final accuracy.
Problem solved: TD learning is widely used in reinforcement learning, but practitioners don't fully understand why it settles into a constant error rather than reaching zero. This work explains the mechanisms causing that floor, helping developers predict and potentially reduce remaining error in their agents.
🚀 Shipping · 2606.17053 · Jun 15, 2026 · ~11 min · cs.CL cs.CV
Context-Aware RL for Agentic and Multimodal LLMs
Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, +3
⭐ 373 stars / 10 repos · 📚 0 cites
ELI5: This method trains AI models to better spot the exact pieces of evidence they need in long documents or images by making them practice picking the right supporting context from two similar options — like learning to find the one detail that actually matters.
Problem solved: LLMs struggle to locate key evidence buried in long tool traces or subtle image details, causing reasoning failures. This trains models to ground their answers in specific, relevant context rather than guessing.
🚀 Shipping · 2606.17043 · Jun 15, 2026 · ~12 min · cs.RO cs.LG
Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
Tongyan Fang, Siyuan Huang, Naiyu Fang, Ganlong Zhao, +5
⭐ 551 stars / 15 repos · 📚 0 cites
ELI5: When a robot learns to do tasks through trial-and-error, each attempt only tells you if it succeeded or failed. This paper teaches the robot to separate two learning goals—first get good at completing the task, then get fast at completing it—and smartly switches between them as it improves.
Problem solved: Robot fine-tuning from sparse outcomes conflates success with efficiency, wasting learning signal once basic success happens. Mixing autonomous and intervention segments causes wrong credit assignment. HABC separates viability and efficiency learning, doubling success rates on real contact-heavy manipulation tasks.
🚀 Shipping · 2606.17029 · Jun 15, 2026 · ~12 min · cs.CL
DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, +2
⭐ 502 stars / 14 repos · 📚 0 cites
ELI5: Instead of asking an AI to guess what criteria should evaluate a research report, this method builds a tree of questions and evidence first, then creates matching evaluation rubrics—like writing the answer key before the test. It trains research agents 13x faster this way.
Problem solved: Training research agents with reinforcement learning is slow and unreliable when rubrics don't match what the query actually needs. This method ensures rubrics align perfectly with the information requested, cutting training time dramatically while keeping quality high.
🚀 Shipping · 2606.17024 · Jun 15, 2026 · ~12 min · cs.LG
ExpRL: Exploratory RL for LLM Mid-Training
Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, +1
⭐ 458 stars / 13 repos · 📚 0 cites
ELI5: Instead of manually teaching language models intermediate reasoning skills before doing reinforcement learning, this paper uses reference answers as a grading rubric to reward partial progress and good reasoning steps—letting the model learn useful strategies automatically from question-answer pairs.
Problem solved: Current RL for LLMs requires expensive manual curation of reasoning traces to teach primitive skills, and it's unclear if these skills are enough for hard problems. This automates that prep stage by extracting signal from existing Q&A data to better prime models before sparse-reward RL.