What do these badges mean?

🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper; people are building on it.
📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
💤 Quiet: Published but no notable signal yet. Most papers live here; any of them could become something later.
🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal, deferred until Twitter/X data is wired up.

💤 Quiet · 2607.09623 · Jul 10, 2026 · ~11 min · cs.CL, cs.AI
Task-Specific Multimodal Question Answering Agents via Confidence Calibration and Incremental Reasoning for QANTA 2026
Nirjhar Das, Md. Al-Mamun Provath
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system that answers trivia questions from partial clues (text + images) by using two specialized AI agents: one decides when to buzz in on tossup questions, and the other carefully selects answers on bonus questions. Both rely on confidence scoring and reasoning rules instead of brute-force retrieval.
Problem solved: Multimodal trivia systems need to work fast with limited compute while handling two question types with opposite constraints: tossups require risk-aware timing (buzz too soon and the answer is likely wrong; buzz too late and someone else wins), while bonuses require accuracy. This system wins the QANTA competition by building task-specific strategies rather than one generic approach.
💤 Quiet · 2607.09600 · Jul 10, 2026 · ~8 min · cs.AI, cs.CL
Agora: Enhancing LLM Agent Reasoning Via Auction-Based Task Allocation
Kaiji Zhou, Ales Leonardis, Yue Feng
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: This paper builds a smarter task dispatcher for AI agents that works like an auction. Instead of just picking the first tool that matches a job, it has multiple expert models bid on each reasoning step, and the most genuinely capable one wins the work. This prevents overconfident models from taking on tasks they'll bungle.
Problem solved: LLM agents often waste time and money by routing tasks to the first available tool that sounds relevant, or by picking overconfident models that fail. They need a way to dynamically match each reasoning step to whichever expert is actually best at it, accounting for both performance and cost.
💤 Quiet · 2607.09560 · Jul 10, 2026 · ~14 min · cs.AI, cs.LG
Beyond Fixed Representations: The Vocabulary and Verifier Gaps in Open-Ended AI
Yuan Cao, Haiqian Yang
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Today's AI systems are stuck working within a fixed rulebook: they can reason and solve problems really well, but can't invent new concepts or tools that would let them tackle fundamentally different kinds of problems. This paper says true innovation requires AI to create and stabilize new building blocks that change the game itself.
Problem solved: Current AI hits a wall on open-ended tasks because it can only remix existing ideas, not invent new ones that unlock whole classes of solutions. Without the ability to create and trust new conceptual primitives, AI systems can't do the kind of foundational innovation humans do.
💤 Quiet · 2607.08763 · Jul 9, 2026 · ~11 min · cs.CV, cs.AI
OpenCoF: Learning to Reason Through Video Generation
Xinyan Chen, Ziyu Guo, Renrui Zhang, Dongzhi Jiang, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of writing out step-by-step reasoning like ChatGPT does, this system learns to reason by generating videos frame by frame. Each frame shows the next logical step, like watching a solution unfold visually rather than reading it.
Problem solved: Video models today generate realistic videos but can't reason through complex problems. This work shows that by training on reasoning-focused videos and giving models special tokens to track logical steps, they can use video generation as a reasoning tool for math, planning, and logic tasks.
💤 Quiet · 2607.08758 · Jul 9, 2026 · ~12 min · cs.AI
Ideas Have Genomes: Benchmarking Scientific Lineage Reasoning and Lineage-Grounded Idea Generation
Yifan Zhou, Qihao Yang, Yan Li, Donggang Li, +13
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A new benchmark that tests whether AI can understand how scientific ideas evolve from previous work (tracking what researchers inherit, fix, combine, or newly invent) and whether AI can generate ideas that fit logically into a scientific lineage.
Problem solved: We don't know if AI systems understand scientific progress as building on the past. Current benchmarks don't measure whether AI can trace idea evolution, spot gaps in reasoning chains, or propose genuinely novel work that still coheres with prior research.
💤 Quiet · 2607.08724 · Jul 9, 2026 · ~9 min · cs.LG, cs.RO
Latent Memory Palace: Reasoning for Control as Autoregressive Variational Inference
Chuning Zhu, Eva Xu, Jose Barreiros, Krishnan Srinivasan, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A robot learns to solve tasks by thinking through a series of steps in a hidden 'thought space' rather than speaking out loud. It can spend more time thinking on hard problems and less on easy ones, similar to how humans deliberate differently for different decisions.
Problem solved: Robot policies struggle with tasks needing multi-step reasoning and precise spatial control. Language-based reasoning is too coarse for continuous movements, and existing control methods don't adapt their computation based on problem difficulty.
💤 Quiet · 2607.08716 · Jul 9, 2026 · ~10 min · cs.AI, cs.CL
Remember When It Matters: Proactive Memory Agent for Long-Horizon Agents
Yifan Wu, Lizhu Zhang, Yuhang Zhou, Mingyi Wang, +4
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: An AI agent that needs to complete long tasks often forgets important details buried in its history. This paper adds a separate 'memory coach' that watches what's happening, decides what's worth remembering, and proactively reminds the main agent exactly when it matters, like a teammate tapping you on the shoulder to say 'hey, remember you tried that before'.
Problem solved: Long-horizon tasks fail because relevant information gets lost in huge context windows or pushed out entirely, so the agent can't maintain focus on scattered facts, prior attempts, and open goals. This causes mistakes that could be avoided if the right detail resurfaced at the right time.
💤 Quiet · 2607.08662 · Jul 9, 2026 · ~12 min · cs.CL, cs.AI, cs.MA
WebSwarm: Recursive Multi-Agent Orchestration for Deep-and-Wide Web Search
Xiaoshuai Song, Liancheng Zhang, Kangzhi Zhao, Yutao Zhu, +7
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: WebSwarm is a system that breaks down complex research questions into subtasks, then spawns multiple AI agents that collaborate recursively (some solving their part directly, others delegating to child agents) to search the web both deeply and widely, and gradually build up answers from the bottom up.
Problem solved: Single AI search agents get stuck trying to answer complex research questions because they can't hold enough context or explore both depth and breadth at once. Existing multi-agent systems run tasks in parallel but don't collaborate well or dig deeper when needed. WebSwarm fixes this by letting agents spawn and coordinate child agents dynamically.
💤 Quiet · 2607.08456 · Jul 9, 2026 · ~14 min · cs.CL, cs.AI
Two Axes of LLM Abstention: Answer Correctness and Question Answerability
Benedikt J. Wagner
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: LLMs need to refuse in two different ways: saying 'I don't know' when they'd get it wrong, and saying 'I can't answer that' when the question itself is broken (unanswerable or based on false facts). This paper shows these are two separate signals hiding in the model, and you can pull them out separately to refuse correctly.
Problem solved: Today's LLMs use one confidence score to refuse everything, but they can't distinguish between 'my answer might be wrong' and 'this question doesn't make sense.' This causes them to either answer broken questions confidently or refuse good ones. The paper fixes this by extracting two separate refusal signals from model internals.
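To make the two-axis idea in the entry above concrete, here is a minimal sketch of how two separate signals could drive two different refusals. It is not the paper's method: the function name, the probability inputs, and both thresholds are hypothetical, and the sketch assumes the two signals have already been extracted somehow.

```python
def abstention_decision(p_answerable, p_correct,
                        answerable_min=0.5, correct_min=0.5):
    """Map two independent signals to one of three actions.

    p_answerable: estimated probability that the question is well-posed.
    p_correct: estimated probability that the model's answer is right.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Check the question first: a broken question should be rejected
    # no matter how confident the model is in its answer.
    if p_answerable < answerable_min:
        return "cannot_answer"
    # The question is fine, but the model would likely get it wrong.
    if p_correct < correct_min:
        return "dont_know"
    return "answer"
```

A single confidence score cannot separate the first two branches, which is the failure the summary describes.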
💤 Quiet · 2607.07702 · Jul 8, 2026 · ~11 min · cs.CL
From Noisy Traces to Root Causes: Structural Trajectory Analysis and Causal Extraction for Agent Optimization
Ying Chang, Jiahang Xu, Xuan Feng, Chenyuan Yang, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When AI agents fail at tasks, you get messy logs full of irrelevant steps. This method automatically finds the actual root causes by filtering out noise and tracing what actually caused each failure, so the agent can learn from the real problem instead of random junk.
Problem solved: LLM-based agents get stuck on tasks but their failure logs are huge, redundant, and full of irrelevant details, making it hard to figure out what actually went wrong and fix it. Naive log cleanup loses important clues. This makes learning from failures slow and unreliable.
💤 Quiet · 2607.07690 · Jul 8, 2026 · ~12 min · cs.LG, cs.AI, cs.CL
Agon: Competitive Cross-Model RL with Implicit Rival Grading of Reasoning
Vladislav Beliaev
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Two AI models compete by solving math problems together: one drafts a solution, the other reads it and tries to solve the problem too. Whoever gets the right answer 'wins,' which forces both models to think better because they're competing against each other.
Problem solved: Current reasoning training only grades final answers, so models learn to write longer solutions rather than think better. Agon grades the thinking process implicitly by having models compete, improving reasoning on hard math and code problems without needing labeled reasoning steps.
💤 Quiet · 2607.07674 · Jul 8, 2026 · ~9 min · cs.LG, cs.CL
Max Out GRPO Signal: Adaptive Trace Prefix Control for Hard Reasoning Problems
Vladislav Beliaev
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When training AI models on hard math problems, most attempts fail and teach nothing. This method gives the model part of a correct solution to start from, then gradually scales back that help, so it learns from problems that would otherwise be too difficult.
Problem solved: GRPO training wastes compute on unsolvable problems: when all rollouts fail, the model learns nothing. As a result, hard problems contribute nothing to training, even though those are where models need improvement most.
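The "all rollouts fail, the model learns nothing" point in the entry above follows from how group-relative advantages are computed. The sketch below uses a common formulation (reward minus group mean, divided by group standard deviation); the population standard deviation and the `eps` value are assumptions, and GRPO variants differ on these details. It does not implement the paper's prefix control.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group.

    rewards: one scalar reward per rollout of the same prompt.
    eps: small constant (assumed value) that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout fails: all rewards equal the mean, so every advantage
# is zero and the policy gradient carries no signal for this prompt.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One success is enough to produce non-zero advantages.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])
```

Under this formulation, anything that turns an all-fail group into a mixed group (such as starting rollouts from part of a correct trace) restores a training signal for that problem.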
💤 Quiet · 2607.07646 · Jul 8, 2026 · ~13 min · cs.AI, cs.CL
RL Post-Training Builds Compositional Reasoning Strategies
Azwar Abdulsalam, Nishil Patel, Andrew Saxe
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Researchers trained an AI model on simple rewriting tasks, then used reinforcement learning to solve harder problems. They found RL doesn't just do more of the same thing; it combines simple rewrite rules into new, more complex strategies that work reliably on unseen problems.
Problem solved: It was unclear whether RL actually teaches models to think in new ways or just amplifies existing skills. This matters for understanding whether RL can create genuine reasoning abilities versus just brute-force searching. The paper shows RL can genuinely compose simpler skills into reusable complex strategies.
💤 Quiet · 2607.06527 · Jul 7, 2026 · ~7 min · cs.CL, cs.AI
RSF-GLLM: Bridging the Semantic Gap in Multi-Hop Knowledge Graph QA via Recurrent Soft-Flow and Decoupled LLM Generation
Sambaran Bandyopadhyay, Ananth Muppidi
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When answering questions that require jumping between multiple facts in a knowledge graph, this system learns which path to take by using soft probability scores that gradually sharpen into concrete steps, then hands those steps to an LLM to generate the final answer.
Problem solved: Multi-hop QA systems usually can't learn good retrieval paths when there's no word overlap between the question and intermediate facts. This paper fixes that by making the path-finding differentiable and decoupling it from answer generation so each part can be optimized separately.
💤 Quiet · 2607.06522 · Jul 7, 2026 · ~8 min · cs.AI, cs.CV
Bridging Physical Reasoning and Task Generalization via Visual Action Outcome Reasoning Alignment
Han-Jun Ko, Jr-Jen Chen, Haobo Yuan, Hsin-Ying Lee, +3
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: This paper fixes a problem where AI models that can see and reason about physics often make up false explanations and don't actually follow through on what they claim to do. The solution is a scoring system that rewards the model for reasoning that matches what it actually sees, and for explanations that match what the model's actions actually cause to happen.
Problem solved: Vision-language models fail when asked to reason about physics in new situations: they hallucinate explanations that contradict reality, and their stated reasoning doesn't match their actual behavior. This breaks them when deployed on unseen tasks or environments where physical reasoning is critical (robotics, interactive agents).
💤 Quiet · 2607.06507 · Jul 7, 2026 · ~11 min · cs.CL, cs.IR
DynaKRAG: A Unified Framework for Learnable Evidence Control in Multi-Hop Retrieval-Augmented Generation
Yaqi Wu, Xiaolei Guo, Chenyu Zhou, Jiaqi Huang, +6
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system that learns when to retrieve new documents, fix questions, or stop searching while answering multi-step questions. Instead of following a fixed recipe, it decides dynamically what action to take based on what evidence it already has.
Problem solved: Multi-hop question answering requires multiple retrieval steps, but existing systems either rigidly follow preset pipelines or waste time retrieving irrelevant documents. This framework learns the optimal sequence of retrieval, query fixing, and stopping decisions for each question.
💤 Quiet · 2607.05394 · Jul 6, 2026 · ~14 min · cs.LG, cs.AI, cs.CL
Weak-to-Strong Generalization via Direct On-Policy Distillation
Shiyuan Feng, Huan-ang Gao, Haohan Chi, Hanlin Wu, +6
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of running expensive reasoning training on every new large model, train a smaller model cheaply, then figure out which actions that training made the small model prefer, and teach the large model to make those same preference shifts on its own data.
Problem solved: Running reinforcement learning on large language models is prohibitively expensive: you need thousands of rollouts per training step. This work lets you do the expensive RL once on a cheap small model, then transfer those gains to a larger model without repeating the costly RL process.
💤 Quiet · 2607.05391 · Jul 6, 2026 · ~14 min · cs.AI, cs.CL, cs.LG
LLM-as-a-Verifier: A General-Purpose Verification Framework
Jacky Kwok, Shulu Li, Pranav Atreya, Yuejiang Liu, +5
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of asking an LLM 'is this answer right or wrong?', this framework lets it output a probability distribution over correctness, giving you a precise confidence score. You can then use these scores to pick the best solution from multiple attempts, or feed them into AI training loops.
Problem solved: Current LLM judges give you yes/no answers, making it hard to pick between mediocre solutions or train agents effectively. This gives you granular confidence scores so you can rank solutions accurately and provide rich feedback signals for AI systems to learn from.
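As a rough illustration of the entry above, the sketch below turns a distribution over correctness into a scalar score and uses it to pick among candidates. The probabilities are made-up inputs; how a real verifier would produce them (for example, from token probabilities) is not shown, and the function names are hypothetical rather than the framework's API.

```python
def expected_score(score_probs):
    """Collapse a distribution over discrete scores into one scalar.

    score_probs: maps a numeric score (here 0 = wrong, 1 = right)
    to its probability. Values are renormalized in case they do not
    sum exactly to 1.
    """
    total = sum(score_probs.values())
    return sum(score * p for score, p in score_probs.items()) / total


def pick_best(candidates):
    """Return the name of the candidate with the highest expected score."""
    return max(candidates, key=lambda name: expected_score(candidates[name]))


# Hypothetical verifier outputs for two candidate solutions.
candidates = {
    "solution_a": {0: 0.10, 1: 0.90},
    "solution_b": {0: 0.40, 1: 0.60},
}
best = pick_best(candidates)
```

A hard yes/no judge would label both candidates "right" and could not rank them; the expected score separates 0.90 from 0.60.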
💤 Quiet · 2607.05346 · Jul 6, 2026 · ~7 min · cs.AI, cs.MA
OptiAgent: End-to-End Optimization Modeling via Multi-Agent Iterative Refinement
Adriana Laurindo Monteiro, Nayse Fagundes, Gabriel Mattos Langeloh, Gustavo de Oliveira Kanno, +3
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system that reads English descriptions of optimization problems (like supply chain or scheduling puzzles) and automatically generates both the mathematical equations and working code to solve them, using multiple AI agents that check each other's work.
Problem solved: Translating real-world optimization problems into solver-ready mathematical models is slow, error-prone, and requires expertise. This automates the entire pipeline from natural language to executable code, reducing expert time and catching mistakes through built-in validation.
💤 Quiet · 2607.05339 · Jul 6, 2026 · ~14 min · cs.LG, cs.AI, stat.ML
TREK: Distill to Explore, Reinforce to Refine
Yuanda Xu, Zhengze Zhou, Kayhan Behdin, Jelena Markovic-Voronov, +9
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A training method that uses a teacher model to show a student model good solutions to hard problems, then pulls those solutions into the student's own thinking process before fine-tuning. Like showing someone a worked example before asking them to solve similar problems themselves.
Problem solved: Current reinforcement learning for reasoning gets stuck on hard problems because the student model never tries the right approach on its own. TREK fixes this by having a teacher demonstrate solutions first, so the student learns to try those approaches before optimizing further.
💤 Quiet · 2607.05316 · Jul 6, 2026 · ~11 min · cs.CL, cs.LG
How Much is Left? LLMs Linearly Encode Their Remaining Output Length
Mohamed Amine Merzouk, Dmitri Carpov, Mirko Bronzi, Damiano Fornasiere, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: LLMs seem to have an internal sense of how long their answer will be, which you can read out from their hidden states even before they start writing. Think of it like a writer mentally knowing their essay will be 5 pages before putting pen to paper.
Problem solved: Understanding what's happening inside LLMs is hard. This reveals that models track output length internally, which could help debug why they ramble, stop too early, or behave inconsistently, and suggests they're doing some form of planning.
💤 Quiet · 2607.02509 · Jul 2, 2026 · ~10 min · cs.AI
ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning
Yanjun Zhao, Ruizhong Qiu, Tianxin Wei, Yuanchen Bei, +5
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When LLMs read long documents, they often miss relevant information that's actually in there. This method helps by replaying the most relevant parts of the context before generating answers, like reminding the model what matters most, with no retraining needed.
Problem solved: LLMs with long context windows still fail to use relevant evidence effectively, hurting performance on tasks requiring reasoning over 100K+ tokens. This method fixes that without retraining or external tools.
💤 Quiet · 2607.02504 · Jul 2, 2026 · ~8 min · cs.CL, cs.AI, cs.CV
Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas
Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, +5
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When watching TV dramas, it's hard to figure out who's speaking in complex scenes. This paper trains an AI model that reasons through audio, text, and video clues together to accurately match voices to characters, like a detective piecing together clues from what it hears, sees, and knows about the story.
Problem solved: Current video understanding systems struggle to identify speakers in long TV shows, especially for short lines where voice recognition alone fails. This makes it hard to automatically caption, analyze, or index dramatic content accurately.
💤 Quiet · 2607.02502 · Jul 2, 2026 · ~12 min · cs.LG, cs.AI
DemoPSD: Disagreement-Modulated Policy Self-Distillation
Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan, +4
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A technique for training AI models to reason better by having them learn from their own teacher version, but only copying advice where they actually disagree, avoiding the trap of blindly memorizing shortcuts that won't work at test time.
Problem solved: Self-distillation methods train weaker students from stronger teachers using privileged information, but students end up memorizing answer shortcuts they won't have access to later, and stop exploring alternatives. This hurts generalization to new domains.
💤 Quiet · 2607.02491 · Jul 2, 2026 · ~13 min · cs.AI
G-RRM: Guiding Symbolic Solvers with Recurrent Reasoning Models
Timo Bertram, Sidhant Bhavnani, Richard Freinschlag, Erich Kobler, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A neural network learns to suggest good guesses for solving puzzles like Sudoku, then passes those hints to traditional solvers (like backtracking or SAT solvers) that verify and fix any mistakes. The neural part gets better at larger problems, and the combination solves puzzles much faster when the solver can override bad hints.
Problem solved: Symbolic solvers can get stuck exploring huge search spaces and become very slow. This hybrid approach uses a neural network to intelligently prune the search space with hints, accelerating solving by 30x+ on hard puzzles, but only when the solver can adaptively override incorrect neural suggestions.
💤 Quiet · 2607.02490 · Jul 2, 2026 · ~9 min · cs.CL, cs.CV
Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
Liyan Tang, Fangcong Yin, Greg Durrett
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Vision-language models can get better at fixing their own mistakes by looking at images while they think through problems. This work teaches them to do this by showing them messed-up situations they have to recover from, making them actually use visual information instead of just talking about it.
Problem solved: Vision-language models fail when images look different from training data because they don't properly use visual clues when correcting mistakes. Teams need models that can genuinely reference and learn from what they see, not just generate text about corrections.
💤 Quiet · 2607.02390 · Jul 2, 2026 · ~10 min · cs.LG
DecompRL: Solving Harder Problems by Learning Modular Code Generation
Juliette Decugis, Fabian Gloeckle, Francis Bach, Taco Cohen, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of trying harder to generate correct code in one shot, this method teaches AI to break hard problems into smaller pieces, write solutions for each piece separately, then mix-and-match them. This creates many more possible answers while using way less computing power.
Problem solved: LLMs hit a wall on hard coding problems: sampling more solutions wastes GPU money, and training with reinforcement learning doesn't help if the model has almost no chance of getting it right. DecompRL solves this by shifting work from expensive inference to cheap recombination, cutting compute costs while solving problems that standard approaches can't reach.
💤 Quiet · 2607.02374 · Jul 2, 2026 · ~10 min · cs.AI
DRIFTLENS: Measuring Memory-Induced Reasoning Drift in Personalized Language Models
Xi Fang, Weijie Xu, Yingqiang Ge, Yuhui Xu, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When AI systems remember user details and use them to personalize responses, they don't just change what they say; they change how they reason to reach that answer. This paper measures how much this happens and whether we can reduce it.
Problem solved: Personalized AI assistants may silently alter their reasoning based on stored user attributes, potentially embedding biases or inconsistencies into explanations without users noticing. This matters for trust and fairness when a model justifies the same answer differently to different users.
💤 Quiet · 2606.30616 · Jun 29, 2026 · ~11 min · cs.CL
Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent
Lei Bai, Zongsheng Cao, Yang Chen, Zhiyao Cui, +46
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A 35-billion-parameter AI agent matches the performance of trillion-parameter models on complex tasks by making much longer chains of thoughts and actions (up to 45K tokens) rather than being bigger. Think of it like a smaller person solving harder problems by being more methodical and thoughtful.
Problem solved: Building capable AI agents currently requires massive models (1 trillion+ parameters), which are expensive to run and deploy. This shows you can get similar task performance with a 35B model by focusing on longer reasoning horizons and better training, making agents much more practical and affordable.
💤 Quiet · 2606.30578 · Jun 29, 2026 · ~10 min · cs.CL, cs.LG
Uncertainty-Aware Generation and Decision-Making Under Ambiguity
Nico Daheim, Iryna Gurevych
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When LLMs tackle subjective tasks like tutoring or peer review, they're often uncertain about the right answer. This paper teaches models to acknowledge that uncertainty and make better decisions by considering multiple possible correct answers rather than just picking one.
Problem solved: LLM outputs in subjective tasks are unreliable because models pick one answer without recognizing they might be wrong. Users need to know when the model is uncertain and get outputs that account for that ambiguity, especially in high-stakes tasks like grading or teaching.
💤 Quiet · 2606.30573 · Jun 29, 2026 · ~13 min · cs.LG
SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions
Mohit Raghavendra, Anisha Gunjal, Aakash Sabharwal, Yunzhong He
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of giving coding agents a complete task description once, this benchmark simulates a real developer's workflow where a user starts with vague instructions, gradually reveals requirements, and gives feedback until the task is done. It tests whether AI can figure out what a human actually wants and adapt as things change.
Problem solved: Current coding benchmarks measure single-shot task completion, but real developers work iteratively with unclear requirements that shift over time. This tests the actual experience: agents that seem good at isolated tasks often fail when they have to negotiate ambiguous goals, accept feedback, and refine work across multiple turns.
💤 Quiet · 2606.30481 · Jun 29, 2026 · ~9 min · cs.CY, cs.AI, cs.CL
Situation Perception: A Necessary Primitive to Artificial Superintelligence
Ziqin Yuan, Jaymari Chua
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Today's AI language models are just very good at pattern-matching in text. To reach true superintelligence, they need to build internal simulations of the world, imagining 'what if' scenarios over long periods and learning from them to achieve their own goals.
Problem solved: LLMs can generate coherent text but can't actually understand cause-and-effect, predict future consequences of actions, or pursue long-term goals the way humans do. This gap means they'll never become truly intelligent without fundamentally new capabilities.
💤 Quiet · 2606.30420 · Jun 29, 2026 · ~9 min · cs.LG
Experience Augmented Policy Optimization for LLM Reasoning
Jinda Lu, Kexin Huang, Junkang Wu, Shuo Yang, +6
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of training an AI to solve math problems from scratch each time, this method reuses hints from a previously trained AI, but adjusts those hints based on what the current AI is doing, so they stay useful as the AI improves.
Problem solved: Training LLMs to reason better via reinforcement learning is expensive and wastes past experience. Old methods either train from scratch (costly) or reuse old solutions that no longer match how the improved model behaves (misaligned).
💤 Quiet · 2606.30345 · Jun 29, 2026 · ~13 min · cs.LG, cs.AI
DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training
Haisen Luo, Yiwei Liu, Haoning Wang, Dan Liu, +12
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A training method that helps AI models improve themselves by smartly routing problems based on difficulty (easy ones get less attention, hard ones get special handling) and focusing learning energy on the most critical reasoning steps.
Problem solved: Self-training methods waste effort on easy problems while struggling with hard ones, and they update the model uniformly everywhere instead of concentrating on crucial decision points. DRIFT fixes this by identifying which problems need what kind of help and focusing updates where they matter most.
💤 Quiet · 2606.30335 · Jun 29, 2026 · ~8 min · cs.AI
BayesEvolve: Explicit Belief States for Autonomous Scientific Discovery
Xuening Wu, Shan Yu, Qianya Xu, Shenqin Yin
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system that helps AI discover new scientific ideas by maintaining a statistical model of what it's learned so far, rather than just remembering past experiments. It works like a scientist who keeps updating their intuition about what works instead of just keeping a notebook of things they tried.
Problem solved: LLM-based discovery systems waste experiments by relying on simple memory or archives instead of reasoning about uncertainty. BayesEvolve uses probabilistic beliefs to make smarter bets on which hypotheses to test next, saving evaluations in expensive experimental domains.
💤 Quiet · 2606.30247 · Jun 29, 2026 · ~9 min · cs.CL
Grounding LLM Reasoning under Incomplete Graph Evidence
Jiaqi Li, Fanghui Song
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When an LLM reasons using a knowledge graph, that graph is always incomplete: some facts are missing. This paper figures out how to adjust the LLM's answers to match what the graph says while still allowing reasoning about things not in the graph, without pretending the graph is the complete truth.
Problem solved: Knowledge graph systems can't just follow what's in the graph (it's incomplete) or ignore the graph entirely (then why have it?). This work provides a principled way to blend the LLM's knowledge with partial graph evidence, which is crucial for RAG and graph-based QA systems that need to avoid hallucinations while staying useful.
💤 Quiet · 2606.28294 · Jun 26, 2026 · ~9 min · cs.LG, cs.MA
Democratic ICAI: Debating Our Way to Steering Principles from Preferences
Kevin Kingslin, Anish Natekar, Ashutosh Ranjan, Vivek Srivastava, +2
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of just asking an AI why it prefers one answer over another, have multiple AI personas debate the decision from different angles. This captures the hidden reasons behind preferences way better than a single explanation, letting you steer AI behavior based on richer, more balanced principles.
Problem solved: Current alignment methods ask AI to explain preferences in one pass, missing the real trade-offs and nuance in complex decisions. This leaves you with shallow steering principles that don't actually capture what matters, making it hard to reliably guide AI behavior on subjective tasks.
💤 Quiet · 2606.28277 · Jun 26, 2026 · ~12 min · cs.LG, cs.AI, cs.CL
Towards Automating Scientific Review with Google's Paper Assistant Tool
Rajesh Jayaram, Drew Tyler, David Woodruff, Corinna Cortes, +3
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Google built an AI tool that reads entire research papers and spots errors in math, experiments, and reasoning, like having a super-careful colleague review your work before you submit it. It uses multiple passes of thinking to catch problems a single quick check would miss.
Problem solved: Scientific papers are being produced faster than human reviewers can keep up, especially with AI-generated research. Early error detection saves reviewer time and helps authors fix problems before submission, reducing the review bottleneck.
💤 Quiet · 2606.27359 · Jun 25, 2026 · ~10 min · stat.ML, cs.LG
When are likely answers right? On Sequence Probability and Correctness in LLMs
Johannes Zenn, Jonas Geiping
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: The paper checks whether answers to which an LLM assigns higher probability (more likely according to the model's math) are actually more likely to be correct. The authors find this holds when comparing different answers to the same question, but not when trying to improve answers by tweaking how the model generates text.
Problem solved: Decoding methods try to find better answers by pushing LLMs toward higher-probability outputs, but there's been no clear picture of when this actually works. This research shows these methods help in some settings but fail in others, letting teams avoid wasting effort on approaches that won't improve accuracy.
💤 Quiet · 2606.27306 · Jun 25, 2026 · ~9 min · cs.CL
Multilingual Reasoning Cascades Need More Context
Arnav Mazumder, Dengjia Zhang, Shuyue Stella Li, Yulia Tsvetkov, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When AI answers questions in other languages, it usually translates to English, reasons there, then translates back, but this throws away useful clues. This paper shows that just keeping the original question visible through the whole process fixes many mistakes, especially cultural and phrasing ones.
Problem solved: Translation-based multilingual reasoning loses information at each step, causing worse answers in non-English languages. A cheap fix (adding context) makes answers better across 285 languages without retraining, solving a real bottleneck for global AI systems.
💤Quiet2606.27288·Jun 25, 2026·~13 mincs.AIcs.LG
When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models
Josef Chen
⭐ 0 stars / 0 repos📚 0 cites
ELI5When you combine multiple AI models (like having them vote or route requests), their accuracy ceiling is determined by how often all models fail on the same question. The paper shows this hidden limit across 67 frontier models—and it's often much lower than expected, meaning combining models helps less than people think.
Problem solvedTeams build expensive multi-model systems expecting big accuracy gains, but don't know upfront how much improvement is actually possible. This paper reveals the hard ceiling before you invest in routing or ensemble logic—showing that on many tasks, combining models barely beats just using the single best one.
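A rough way to see the ceiling described above: if every model fails a question, no router or vote can recover it, so the best possible combined accuracy is one minus the co-failure rate. The sketch below assumes a per-question correctness table is available; the function name and the toy data are hypothetical, and the paper's exact definition may differ.

```python
from typing import Dict, Sequence


def co_failure_ceiling(correct: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """correct[m][q] is True if model m answered question q correctly."""
    n_questions = len(correct[0])
    # A question is a co-failure when every model gets it wrong.
    co_failures = sum(
        1
        for q in range(n_questions)
        if not any(model_row[q] for model_row in correct)
    )
    best_single = max(sum(model_row) for model_row in correct) / n_questions
    ceiling = 1 - co_failures / n_questions
    return {
        "best_single": best_single,
        "ceiling": ceiling,
        "headroom": ceiling - best_single,
    }


# Toy data (assumption): 3 models, 5 questions; question index 3 stumps everyone.
results = [
    [True, True, False, False, True],
    [True, False, True, False, False],
    [False, True, False, False, True],
]

stats = co_failure_ceiling(results)
print(stats)
```

The `headroom` value is the most any routing, voting, or mixture scheme could add over the single best model on this data, which is the quantity worth checking before building the combined system.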
💤Quiet2606.27268·Jun 25, 2026·~13 mincs.ROcs.AI
E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation
Wen Ye, Peiyan Li, Tingyu Yuan, Yuan Xu, +6
⭐ 0 stars / 0 repos📚 0 cites
ELI5A robot uses past experience and reasoning to improve its actions in real-time during a task. Instead of committing to its first plan, it iteratively refines both its reasoning and movements by checking them against what actually happened before.
Problem solvedRobots fail at long-horizon tasks because they only look at the current moment and commit to plans without adapting. This method lets robots reason through problems and adjust their actions mid-task using memory of what's happened so far, without needing retraining.
💤Quiet2606.27237·Jun 25, 2026·~9 mincs.CL
LMs as Task-Specific Knowledge Bases: An Interpretability Analysis
Amit Elhelo, Amir Globerson, Mor Geva
⭐ 0 stars / 0 repos📚 0 cites
ELI5Language models don't store facts like a traditional database with one 'true' answer. Instead, they encode the same fact differently depending on the task, like having separate filing cabinets for the same information—which means they can give inconsistent answers depending on context.
Problem solvedWe don't really understand how language models store and retrieve facts, making it risky to rely on them as knowledge sources. This work shows why the same fact can produce different outputs in different tasks, helping explain reliability issues and why LMs can seem to 'know' something in one context but fail in another.
💤Quiet2606.27199·Jun 25, 2026·~7 mincs.CLcs.LG
Forecasting With LLMs: Improved Generalization Through Feature Steering
Humzah Merchant, Bradford Levy
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers used a tool to peek inside LLMs' brains and found specific switches that control whether the model reasons about time realistically or accidentally cheats by looking at future information. By flipping these switches, they made LLMs better at forecasting tasks across different domains.
Problem solvedLLMs tend to 'look ahead' when forecasting—using future information they shouldn't have access to. This makes them appear better at prediction than they actually are. By identifying and controlling the internal features causing this, forecasts become genuinely more reliable.
💤Quiet2606.27187·Jun 25, 2026·~11 mincs.CVcs.CL
HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models
Jiajun Wu, Haoyu Kang, Yining Sun, Jiacheng Hou, +12
⭐ 0 stars / 0 repos📚 0 cites
ELI5A test for AI models that watch videos and decide if they're harmful. Instead of just yes/no, it asks deeper questions to see whether models understand *why* a video is harmful and whether they need to look at surrounding context to judge it.
Problem solvedContent moderation AI often flags harmful videos for shallow reasons (one bad frame) rather than understanding real harm. This benchmark forces models to explain their reasoning and handle multi-layered context, catching shortcuts that fail in real moderation.
💤Quiet2606.27154·Jun 25, 2026·~11 mincs.AI
OpenRCA 2.0: From Outcome Labels to Causal Process Supervision
Aoyang Fang, Yifan Yang, Jin'ao Shang, Qisheng Lu, +6
⭐ 0 stars / 0 repos📚 0 cites
ELI5When a system breaks, AI needs to find what caused the failure and trace how the problem spread. This paper creates a test where AI must show its work: not just guess the root cause, but prove the chain of events that led to the symptom.
Problem solvedCurrent root cause analysis (RCA) benchmarks only check whether an AI guesses the right root cause, hiding whether it actually understands how failures propagate. This lets AI get lucky with pattern matching rather than true reasoning, making it unsafe for production debugging.
💤Quiet2606.27136·Jun 25, 2026·~11 mincs.AI
Joint Learning of Experiential Rules and Policies for Large Language Model Agents
Shicheng Ye, Chao Yu
⭐ 0 stars / 0 repos📚 0 cites
ELI5An AI agent learns two things together from its experience: short rules it can look up when making decisions (like 'check inventory first'), and permanent improvements to its core decision-making. This keeps the rules useful as the agent gets smarter.
Problem solvedLLM agents struggle to learn from trial and error in multi-step tasks. Either they rely on rules that become outdated, or they fine-tune the model, which is slow and doesn't fix specific mistakes. JERP learns the rules and the policy together, so the rules stay useful while the model improves.
💤Quiet2606.27103·Jun 25, 2026·~15 mincs.CL
The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans
Bella Fascendini, Kathryn McGregor, Max D. Gupta, Thomas L. Griffiths
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers created tricky word problems that look like riddles but have literal answers to test whether LLMs actually reason flexibly or just pattern-match. LLMs did the opposite of humans: they excelled at real riddles (which need creative thinking) but flopped on literal ones, suggesting they're memorizing rather than reasoning.
Problem solvedIt's hard to tell if LLMs are actually reasoning or just retrieving patterns from training data. This test reveals that LLMs may be fooling us with outputs that look thoughtful but are really just surface-level pattern matching, which matters for trusting them on novel problems.
💤Quiet2606.27068·Jun 25, 2026·~8 mincs.GTcs.AIcs.LG
Parametric Open Source Games
Aleksandar Todorov, Jesse ten Napel, Alexander Müller
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of agents picking fixed strategies, they pick numbers (parameters) that feed into a shared decoder, which converts those numbers into game moves. This lets agents influence each other's behavior through their parameter choices, and the paper shows when this leads to cooperation even in selfish settings.
Problem solvedStandard game theory assumes agents pick moves independently; real AI systems can see and learn from each other's weights or code. This framework models that transparency, showing how parameter sharing can flip outcomes from mutual betrayal to cooperation without explicit coordination.
💤Quiet2606.27047·Jun 25, 2026·~10 mincs.CLcs.AI
NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models
Henry Shaowu Yuchi, Michal Kucer, Benjamin H. Sims, Selma Peterson, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5A test suite with 1,240 nuclear engineering questions that checks whether AI models can recall facts, do math, and explain concepts correctly. It reveals that LLMs struggle more with calculations and deep understanding than with simple facts.
Problem solvedNuclear engineers need to know if AI systems are actually reliable for technical problem-solving, not just trivia. Current benchmarks miss domain-specific gaps in quantitative reasoning and conceptual depth that could matter in safety-critical work.
