What do these badges mean?
- 🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper, so people are building on it.
- 📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
- 💤 Quiet: Published but no notable signal yet. Most papers live here, and any of them could become anything later.
- 🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal; it is deferred until Twitter/X data is wired up.
- 🚀 Shipping · 2605.18747 · May 18, 2026 · ~13 min · cs.CL, cs.AI
Code as Agent Harness
Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, +38
⭐ 1.3k stars / 9 repos · 📚 0 cites
ELI5: Instead of treating code as just the output LLMs produce, this survey shows how code can be the central operating system for AI agents: the glue that lets them think, act, remember, and verify their work in a way humans can actually understand and check.
Problem solved: Current AI agents are hard to make reliable, debuggable, and controllable. Using code as the core infrastructure lets you write agent logic you can read, test, and fix, addressing the black-box nature of pure neural approaches and making agents deployable in real systems.
- 2605.18663 · May 18, 2026 · ~14 min · cs.AI, cs.CL, cs.LG
GIM: Evaluating models via tasks that integrate multiple cognitive domains
Rohit Patel, Alexandre Rezende, Steven McClain
ELI5: A new benchmark that tests whether AI models can handle realistic tasks requiring multiple reasoning steps at once, like juggling constraints, tracking state, and knowing when to be uncertain. Instead of just piling on hard facts to memorize or abstract puzzles with no context, it measures integration of different thinking skills on grounded problems.
Problem solved: Existing benchmarks either test memorization (unfair) or pure abstraction (unrealistic). This work creates a benchmark where difficulty comes from coordinating multiple reasoning types on practical tasks, and provides a statistical framework (item response theory, IRT) to fairly compare models even when they fail differently or use different compute budgets.
- 2605.18572 · May 18, 2026 · ~10 min · cs.CL
MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion
Dingyi Zhang, Ziqing Zhuang, Linhai Zhang, Ziyang Gao, +1
ELI5: A system that acts like a persuasive negotiator by reading between the lines of what someone is saying, figuring out what they really think and want, then adapting its persuasion strategy on the fly rather than giving canned responses.
Problem solved: Current AI persuasion systems generate generic responses and struggle to adapt across different domains (sales, counseling, negotiation). This framework infers hidden beliefs and emotions, then executes strategies targeted at them.
- 2605.18565 · May 18, 2026 · ~13 min · cs.CL, cs.AI
LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, +2
ELI5: This benchmark tests whether AI agents can remember and reason about information correctly when facts keep changing and interfering with each other over very long conversations or documents, like tracking a character's location across a 200-page novel where they move around repeatedly.
Problem solved: AI agents fail at real-world tasks where information gets updated frequently and facts overlap (like a chatbot managing a customer's changing order status). Existing tests only check single facts in isolation, not messy, interconnected memories where old and new information conflict.
- 2605.18549 · May 18, 2026 · ~12 min · cs.CL, cs.CR
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, +1
ELI5: Researchers found that by watching how a language model's internal thoughts evolve step-by-step during reasoning, you can predict what it will actually do better than by just looking at its final explanation. It's like noticing subtle shifts in someone's tone throughout a conversation to predict their real decision.
Problem solved: Chain-of-thought explanations from AI models aren't always honest about why the models made their decisions, making them unreliable for safety monitoring. This method uses the hidden patterns during reasoning to catch what the model will really do, even when its explanation is misleading.
- 2605.18548 · May 18, 2026 · ~10 min · cs.CL, cs.AI
STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics
Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, +4
ELI5: A benchmark that tests whether AI assistants can notice when the world changes mid-task and successfully adjust their plans, like a robot realizing a door locked while it was walking toward it and finding another route instead.
Problem solved: Real-world AI agents fail when their plans break mid-execution due to unexpected changes. Existing benchmarks only test *detecting* changes, not *fixing* broken plans, leaving a major gap in testing production-ready agents.
- 2605.18535 · May 18, 2026 · ~12 min · cs.LG, cs.MA
Beyond Scaling: Agents Are Heading to the Edge
Chunlin Tian, Dongqi Cai, Wanru Zhao, Nicholas D. Lane
ELI5: Instead of sending everything to the cloud, smart assistants should run directly on your device because they need instant access to your files, real-time sensor data, and immediate feedback to work well, much as a co-pilot needs to be in the cockpit, not calling a distant headquarters.
Problem solved: Cloud-based agents lose critical local context (files, sensor data, OS state) in transmission and suffer from latency, making them unreliable for real-time personal tasks. Edge deployment preserves ground-truth data and enables instant feedback loops that improve agent performance over time.
- 2605.18529 · May 18, 2026 · ~9 min · cs.AI
AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
Zhenlin Wei, Pu Jian, Yingzhuo Deng, Xiaohan Wang, +5
ELI5: When training AI models to solve hard problems with rewards, the model doesn't know which tokens (words) actually helped get the right answer. This paper fixes that by having the model reflect on its mistakes and generate hints to itself, then use those hints to credit the right tokens instead of crediting all of them equally.
Problem solved: Training LLMs with reinforcement learning hits a wall: reward signals come at the sequence level, so every token gets equal credit even though only some mattered. This causes late-stage training collapse and wasted learning. AMR-SD pinpoints which tokens actually contributed to success.
- 2605.18500 · May 18, 2026 · ~9 min · cs.CL
Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning
Li Wang, Xiaohan Wang, Xiaodong Lu, Zipeng Zhang, +4
ELI5: Instead of using tools immediately when an LLM thinks it needs one, this method lets the model plan out all its tool requests first, then execute them in the right order. It's like writing a shopping list before going to the store rather than buying items one at a time.
Problem solved: LLMs lose their reasoning flow when they stop mid-thought to execute a tool, hurting their ability to solve complex math problems. This approach keeps the model thinking clearly by separating the decision to use a tool from actually running it.
- 2605.18445 · May 18, 2026 · ~13 min · cs.CV, cs.AI, cs.CL
What is Holding Back Latent Visual Reasoning?
André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann
ELI5: Vision-language models that try to reason through imaginary visual steps actually ignore those steps and just use the original image. The researchers show this by replacing the imaginary steps with random gibberish: the model's answers don't change.
Problem solved: Vision-language models aren't actually using their intermediate reasoning steps as claimed. This blocks progress on complex visual reasoning tasks, where breaking problems into steps should help but currently doesn't because models bypass the intermediate representations.
- 2605.18407 · May 18, 2026 · ~10 min · cond-mat.mes-hall, cond-mat.mtrl-sci, cs.AI
Qumus: Realization of An Embodied AI Quantum Material Experimentalist
Lihan Shi, Zhaoyi Joy Zheng, Xinzhe Juan, Yimin Wang, +13
ELI5: A robot lab that uses AI to independently run physics experiments: it generates ideas, plans procedures, executes them with robotic hands, analyzes results, and learns from mistakes, successfully creating graphene and nanoscale devices on its own.
Problem solved: Scientists spend months on repetitive materials experiments with slow feedback loops. This system compresses that cycle by having an AI agent autonomously execute, monitor, and refine experiments in real time without human intervention between steps.
- 2605.18380 · May 18, 2026 · ~8 min · cs.AI
QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
Anthony G. Cohn, Robert E. Blackwell
ELI5: A benchmark that tests whether AI language models can reason about spatial and temporal relationships (like 'if A is left of B and B is left of C, what's the relationship between A and C?') across different mathematical systems designed for this type of logic.
Problem solved: We don't have a standard way to measure whether LLMs can actually do spatial and temporal reasoning, which matters for robotics, navigation, and planning tasks. This benchmark fills that gap and reveals that current models struggle with complex spatial calculi.
- 2605.18352 · May 18, 2026 · ~9 min · cs.CL
Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs
Tara Azin, Yongan Yu, Raj Singh, Olessia Jouravlev
ELI5: Researchers tested whether AI language models understand how assumptions work in sentences like 'If the king is bald, the queen is sad'; specifically, whether they grasp that we presume the king exists. They compared AI outputs to what 120 humans said and found that AIs often match human answers by accident, not by actually reasoning through the logic.
Problem solved: We don't know if language models truly understand pragmatics (the unspoken rules of language) or just memorize patterns. This matters for building trustworthy AI: a model that gives right answers for the wrong reasons will fail on novel language tasks.
- 2605.18327 · May 18, 2026 · ~10 min · cs.AI
Causely: A Causal Intelligence Layer for Enterprise AI. A Benchmark Study on SRE and Reliability Workflows
Dhairya Dalal, Endre Sara, Ben Yemini, Christine Miller, +1
ELI5: Instead of feeding AI agents raw monitoring data and making them figure out what's broken, this system pre-builds a map of how your systems are connected and what causes what, so when something fails, the AI agent can diagnose the problem 63% faster and use far fewer tokens.
Problem solved: SRE teams use AI agents to debug production issues, but without structured knowledge of system dependencies, agents waste time and tokens parsing raw telemetry. This causes slow diagnosis, high API costs, and unreliable answers. A causal model fixes this by giving agents pre-computed relationships.
- 2605.18313 · May 18, 2026 · ~10 min · cs.CV, cs.AI
Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering
Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, +1
ELI5: A technique that helps small medical AI models give more reliable answers to questions about images by having multiple candidate answers 'vote' on the best response using semantic similarity instead of exact word matching, improving accuracy without slowing things down.
Problem solved: Small vision-language models deployed in hospitals need to be fast and private, but they often produce plausible-sounding but wrong medical answers. This method makes them more trustworthy and efficient for clinical use without requiring larger, harder-to-deploy models.
- 2605.18303 · May 18, 2026 · ~10 min · cs.LG, cs.AI, cs.CV
PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics
Xueyu Luan, Chenwei Shi
ELI5: Instead of learning physics from scratch, this method teaches a world model to follow physics principles like energy conservation. It imagines future states more accurately by embedding real physical laws into its internal reasoning, like adding guardrails that keep simulations realistic.
Problem solved: AI agents learning through simulation often produce jerky, inefficient movements and unrealistic imagined futures because they don't respect physics principles. This causes poor real-world transfer and wasted energy. The method fixes this by baking in conservation laws, making agents plan smoother, more realistic behavior.
- 2605.18299 · May 18, 2026 · ~13 min · cs.AI, cs.CL, cs.IR
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, +5
ELI5: A search-augmented AI agent learns to write better search queries by comparing itself to a smarter version of itself that knows how previous attempts turned out. Instead of just getting one reward at the end, the agent gets feedback on each individual search decision.
Problem solved: Search-augmented reasoning agents struggle to learn which queries are worth making because they only get a single reward signal at the end of a rollout, not credit for individual search decisions. Previous fixes required expensive teacher models or manual annotations.
- 2605.18261 · May 18, 2026 · ~8 min · cs.CL
Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains
Zhonghang Yuan, Zhefan Wang, Fang Hu, Zihong Chen, +6
ELI5: A method that helps AI models get better at answering questions in knowledge-heavy fields (like history or science) by automatically creating practice problems with checkable answers, then training the model to reason through them step-by-step.
Problem solved: LLMs struggle with knowledge-intensive domains because there aren't enough verified training examples to learn from, and current training methods only check whether final answers are right, missing flawed reasoning along the way. This method addresses both problems.
- 2605.18176 · May 18, 2026 · ~11 min · cs.CV, cs.AI
MARS: Technical Report for the CASTLE Challenge at EgoVis 2026
Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, +3
ELI5: A system that answers questions about real-world activities by smartly choosing which of many data sources to look at (videos, transcripts, photos, eye gaze, heart rate) rather than trying to process everything at once.
Problem solved: Egocentric video understanding requires reasoning over massive amounts of multimodal data (4 days, 15 camera angles, multiple sensor streams). Models can't fit all of it in memory, so you need an agent that knows which evidence to pull and when to answer.
- 2605.18162 · May 18, 2026 · ~8 min · cs.CV, cs.AI
Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang, +3
ELI5: A method that teaches vision-language models to maintain consistent answers when images are transformed (rotated, flipped, etc.) in predictable ways, like ensuring that if 'A is left of B' is true, the answer after rotation still logically holds.
Problem solved: VLMs fail at spatial reasoning tasks when inputs are slightly transformed, even though the logical answer should follow predictable rules. This inconsistency makes models unreliable for real applications requiring spatial understanding.
- 🚀 Shipping · 2605.16238 · May 15, 2026 · ~9 min · cs.AI
Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, +2
⭐ 171 stars / 10 repos · 📚 0 cites
ELI5: An AI system uses a language model to automatically design and test disease forecast models by searching through combinations of mathematical approaches, then picks the best ones to predict flu, COVID, and RSV, matching expert predictions without needing humans to build the models.
Problem solved: Disease forecasting currently requires expert teams to manually build and tune models for each pathogen and location, which is slow and doesn't scale. This system automates that work so forecasts can be deployed quickly for new diseases or regions without waiting for scarce modeling expertise.
- 💤 Quiet · 2605.16233 · May 15, 2026 · ~13 min · cs.AI, cs.CL, cs.LG
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, +2
⭐ 75 stars / 10 repos · 📚 0 cites
ELI5: A system that lets AI agents learn from their mistakes by writing down lessons learned (rules or examples) without changing the model's weights. Multiple agents share the best tips discovered so far, improving their ability to make decisions in complex, uncertain situations.
Problem solved: LLM agents struggle with stochastic, long-horizon tasks and fail catastrophically without fine-tuning. This approach lets agents improve through natural-language memory sharing alone, cutting failure rates dramatically without gradient updates or access to stronger teacher models.
- 🚀 Shipping · 2605.16217 · May 15, 2026 · ~13 min · cs.CL, cs.AI, cs.IR
Argus: Evidence Assembly for Scalable Deep Research Agents
Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, +6
⭐ 123 stars / 23 repos · 📚 0 cites
ELI5: A research AI system where one agent searches for evidence pieces while another agent tracks what's been found, spots what's missing, and assembles everything into a final answer, like coordinating a team to complete a jigsaw puzzle instead of having everyone solve it separately.
Problem solved: Current AI research agents waste compute by running parallel searches that duplicate effort instead of finding new information, and they struggle to fit all the results into context windows. This system makes parallel searching efficient by tracking what's been gathered and targeting searches at gaps.
- 🚀 Shipping · 2605.16207 · May 15, 2026 · ~8 min · cs.AI, cs.CL
Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, +2
⭐ 439 stars / 22 repos · 📚 0 cites
ELI5: Researchers tested whether AI tutors can actually tell the difference between correct answers, partially correct answers, and wrong answers, and found they're surprisingly bad at catching subtle mistakes that real tutors should catch.
Problem solved: Schools and education platforms are replacing human tutors with AI, but we didn't know if these AI tutors could actually diagnose student mistakes well enough to give useful feedback. This matters because bad diagnosis leads to bad teaching.
- 🚀 Shipping · 2605.16205 · May 15, 2026 · ~14 min · cs.AI, cs.CL, cs.LG
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, +2
⭐ 348 stars / 28 repos · 📚 0 cites
ELI5: Researchers tested different ways to build AI agents that play a cyber defense game where they can't see the full situation. They compared three design choices: what information to show the agent, how much the agent should think things through, and whether to use one big agent or split it into smaller specialist agents. They found that clean data representation and task splitting work best, but adding too much internal reasoning actually makes things worse.
Problem solved: Teams building AI agents for complex, partial-information tasks don't know which design patterns actually improve performance versus just burning compute. This study quantifies the cost-benefit tradeoffs of context, reasoning depth, and hierarchical decomposition so builders can stop guessing and start optimizing.
- 💤 Quiet · 2605.16153 · May 15, 2026 · ~9 min · cs.AI
An Algebraic Exposition of the Theory of Dyadic Morality
Kush R. Varshney
⭐ 63 stars / 9 repos · 📚 0 cites
ELI5: The author formalizes how people judge right and wrong using algebra and causal diagrams. The paper shows that humans simplify moral questions into agent-versus-victim scenarios, then uses that insight to help AI systems make better policy decisions.
Problem solved: AI systems struggle to align with human moral reasoning because we lack a precise, tractable model of how people actually judge morality. This gives AI builders a mathematical framework to embed human moral cognition into their systems and predict where policies might cause conflicts.
- 🚀 Shipping · 2605.16142 · May 15, 2026 · ~15 min · cs.AI, cs.LG
Property-Guided LLM Program Synthesis for Planning
Augusto B. Corrêa, André G. Pereira, Jendrik Seipp
⭐ 156 stars / 10 repos · 📚 0 cites
ELI5: Instead of telling an AI program-writer 'your code got 3 out of 10 tests right, try again,' this method checks if the code breaks a specific rule and shows exactly where it fails. The AI learns faster because it gets concrete feedback on what's wrong, not just a score.
Problem solved: LLMs waste compute by generating and testing many program candidates blindly. This approach provides early stopping and targeted feedback: when a program violates a formal property, evaluation halts immediately and the LLM sees a concrete counterexample, cutting candidate generation 7x and evaluation cost by orders of magnitude.
- 🚀 Shipping · 2605.16117 · May 15, 2026 · ~9 min · cs.CL
SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation
Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, +1
⭐ 199 stars / 10 repos · 📚 0 cites
ELI5: A system that helps AI language models answer tricky questions by first building a small, focused map of relevant facts from a knowledge base, then walking through that map step-by-step to reach a reliable answer.
Problem solved: Language models often hallucinate or give inconsistent answers on complex reasoning tasks because they're working only from their training data. This framework grounds them in structured facts from a knowledge base and makes their reasoning process traceable and verifiable.