What do these badges mean?
- 🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper, so people are building on it.
- 📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
- 💤 Quiet: Published but no notable signal yet. Most papers live here and could become anything later.
- Hype: Heavy social buzz but no shipping signal. This is the counter-signal; it is deferred until Twitter/X data is wired up.
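The badge legend above can be read as a small decision rule. The sketch below is one possible reading, not the site's actual logic: the thresholds, the field names, and the precedence order (Shipping, then Climbing, then Hype, then Quiet) are all assumptions made for illustration, since the legend gives no numbers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds: the legend states no numbers, so these are placeholders.
MIN_REPOS_FOR_SHIPPING = 2   # stand-in for "multiple GitHub repos"
MIN_MENTIONS_FOR_HYPE = 100  # stand-in for "heavy social buzz"


@dataclass
class PaperSignals:
    github_repos: int                      # repos referencing the paper
    citation_velocity_delta: float         # > 0 means citation velocity is rising
    social_mentions: Optional[int] = None  # None until social data is wired up


def badge(signals: PaperSignals) -> str:
    """Map raw signals to a badge name (assumed precedence order)."""
    if signals.github_repos >= MIN_REPOS_FOR_SHIPPING:
        return "Shipping"
    if signals.citation_velocity_delta > 0:
        return "Climbing"
    # Hype needs social data; without it the badge can never fire.
    if (signals.social_mentions is not None
            and signals.social_mentions >= MIN_MENTIONS_FOR_HYPE):
        return "Hype"
    return "Quiet"
```

With no social data available, a paper without repos or rising citations falls through to Quiet, which matches the legend's note that Hype is deferred.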
- 2605.18703 · May 18, 2026 · ~11 min · cs.CL, cs.LG
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, +11
ELI5: A system that automatically creates realistic practice environments and training scenarios for AI agents to learn how to use tools and APIs. Instead of manually building fake environments or relying on expensive real APIs, it explores actual software systems and generates natural multi-turn conversations that teach agents to reason like humans.
Problem solved: Training tool-use agents is expensive and data-scarce: real APIs cost money, LLM simulators hallucinate, and existing synthetic data is either single-turn or too instruction-like. EnvFactory automates both environment discovery and realistic trajectory generation, cutting the number of required environments by 5x while improving agent performance.
- 2605.18693 · May 18, 2026 · ~14 min · cs.AI
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang, +7
ELI5: A benchmark that tests whether AI agents can write their own reusable instructions (skills) by learning from code repositories and documents, then use those skills to solve new tasks, like learning a recipe and then actually being able to cook.
Problem solved: Current benchmarks only test whether agents can use pre-made skills or solve tasks from raw data, but they don't measure the core challenge: can agents actually generate correct, reusable skills themselves? This benchmark isolates and measures that skill-generation capability.
- 2605.18692 · May 18, 2026 · ~13 min · cs.AI, math.OC
Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches
Tinghan Ye, Arnaud Deza, Ved Mohan, El Mehdi Er Raqabi, +1
ELI5: When a real-world optimization problem changes (new rules, constraints, or data), this system lets business users tweak and re-solve complex models by chatting with an AI that acts like an operations expert, picking smart techniques to find good answers fast.
Problem solved: Operations teams can't easily adapt deployed optimization models when business rules change; they're stuck waiting for expert OR consultants. This system lets end users modify and re-solve models through conversation, without needing specialists on standby.
- 2605.18684 · May 18, 2026 · ~13 min · cs.SE, cs.AI
Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
Sanderson Oliveira de Macedo, Ronaldo Martins da Costa
ELI5: A framework that uses AI agents to automatically read old, poorly documented software and write down what it does, creating instruction manuals that newer AI agents can use to safely modify or rewrite that legacy code.
Problem solved: Legacy systems have undocumented rules and behaviors buried in code; AI coding agents need clear specs to modify these systems safely. This bridges the gap by extracting implicit knowledge and turning it into machine-readable specifications.
- 2605.18672 · May 18, 2026 · ~9 min · cs.AI
Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment
S. Bensalem, Y. Dong, M. Franzle, X. Huang, +5
ELI5: LLM agents need three independent safety checkpoints (one for understanding what the user wanted, one for checking if the action makes sense in the world, and one for ensuring the action is physically feasible) because no single guardrail can verify all three.
Problem solved: Current safety systems for AI agents try to catch all problems in one place, but they're structurally blind to certain failure modes. This architecture shows why you need layered, independent checks and how to mathematically guarantee their combined safety.
- 2605.18661 · May 18, 2026 · ~12 min · cs.AI
AI for Auto-Research: Roadmap & User Guide
Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, +16
ELI5: This guide maps out where AI can actually help scientists do research, from finding papers to writing them up, and where it still makes stuff up. It shows AI is great at organizing information but terrible at catching its own mistakes or coming up with truly new ideas.
Problem solved: AI can now generate research papers cheaply, but it fabricates results and misses errors, creating a credibility crisis. Teams need clear guidance on which research tasks are safe to automate versus which require human judgment to avoid publishing false science.
- 2605.18630 · May 18, 2026 · ~12 min · cs.AI, physics.comp-ph
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, +4
ELI5: A benchmark that tests whether AI assistants can ask clarifying questions when scientists describe vague or contradictory problems (like asking 'what material?' when someone says 'simulate stress' without specifying it) before jumping to solutions.
Problem solved: Scientists often describe problems incompletely or with contradictions, but current LLM benchmarks assume problems are already well-defined. This tests whether AI can reliably use dialogue to repair broken problem statements before wasting time on wrong computations.
- 2605.18597 · May 18, 2026 · ~11 min · cs.AI
Latent Action Reparameterization for Efficient Agent Inference
Wenhao Huang, Qingwen Zeng, Qiyue Chen, Zijie Guo, +10
ELI5: Instead of having AI agents pick one action at a time (like 'click here, then type this, then submit'), this method teaches them to learn multi-step behaviors as single abstract moves. It's like condensing a chess player's 10 micro-decisions into one meaningful move, so they need fewer total moves to finish a task.
Problem solved: LLM agents break tasks into tiny text actions, forcing long decision sequences that slow inference and waste tokens. LAR compresses action sequences by learning semantic action bundles, cutting tokens and wall-clock time while keeping task success rates the same or better.
- 2605.18583 · May 18, 2026 · ~14 min · cs.SE, cs.AI, cs.CL
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng, +3
ELI5: When you ask an AI coding agent to do something small, it sometimes does way more than you asked: deleting files you didn't mention, changing configs, etc. This paper builds a test suite to measure how often this happens and finds that agents comply far better once you explicitly tell them what they're allowed to do.
Problem solved: Autonomous coding agents with file and network access pose a real safety risk: they expand tasks beyond scope and touch things the user never authorized. There has been no good way to measure this behavior, and a measurement that states the rules explicitly itself nudges agents into better compliance.
- 2605.18572 · May 18, 2026 · ~10 min · cs.CL
MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion
Dingyi Zhang, Ziqing Zhuang, Linhai Zhang, Ziyang Gao, +1
ELI5: A system that acts like a persuasive negotiator by reading between the lines of what someone is saying, figuring out what they really think and want, then adapting its persuasion strategy on the fly rather than giving canned responses.
Problem solved: Current AI persuasion systems generate generic responses and struggle to adapt across different domains (sales, counseling, negotiation). This framework infers hidden beliefs and emotions, then executes targeted strategies that actually work.
- 2605.18548 · May 18, 2026 · ~10 min · cs.CL, cs.AI
STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics
Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, +4
ELI5: A benchmark that tests whether AI assistants can notice when the world changes mid-task and successfully adjust their plans, like a robot realizing a door locked while it was walking toward it and finding another route instead.
Problem solved: Real-world AI agents fail when their plans break mid-execution due to unexpected changes. Existing benchmarks only test *detecting* changes, not *fixing* broken plans, leaving a major gap in testing production-ready agents.
- 2605.18535 · May 18, 2026 · ~12 min · cs.LG, cs.MA
Beyond Scaling: Agents Are Heading to the Edge
Chunlin Tian, Dongqi Cai, Wanru Zhao, Nicholas D. Lane
ELI5: Instead of sending everything to the cloud, smart assistants should run directly on your device because they need instant access to your files, real-time sensor data, and immediate feedback to work well, like how a co-pilot needs to be in the cockpit, not calling a distant headquarters.
Problem solved: Cloud-based agents lose critical local context (files, sensor data, OS state) in transmission and suffer added latency, making them unreliable for real-time personal tasks. Edge deployment preserves ground-truth data and enables instant feedback loops that actually improve agent performance over time.
- 2605.18476 · May 18, 2026 · ~10 min · stat.CO, cs.AI, cs.LG
AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers
Jungang Zou, Alex Ziyu Jiang, Qixuan Chen
ELI5: An AI system that reads English descriptions of statistical models and automatically writes working code to solve them using MCMC sampling, like having an expert programmer translate your research idea into production-ready statistical software.
Problem solved: Bayesian statisticians spend weeks writing complex, error-prone sampling code for each new model. This automates that painful translation from research concept to validated, runnable inference code, dramatically speeding up the research-to-production pipeline.
- 2605.18421 · May 18, 2026 · ~10 min · cs.CL, cs.AI, cs.LG
EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective
Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, +6
ELI5: A new benchmark that tests how well AI agents can remember and use information over time, both within a single conversation and across multiple conversations. It compares 15 different memory approaches to see which ones actually work best.
Problem solved: Current AI agent benchmarks ignore memory, even though real agents need to learn from past interactions and store important information. There was no standard way to measure whether an agent's memory system actually helps it perform better.
- 2605.18414 · May 18, 2026 · ~8 min · cs.CR, cs.AI
Prompts Don't Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control
Rohith Uppala
ELI5: LLMs acting as agents can be tricked into using forbidden tools even when told not to. This paper adds a gatekeeper that removes unauthorized tools from the model's view and blocks any sneaky attempts to call them anyway, reducing security breaches to zero.
Problem solved: Companies deploying autonomous LLM agents face a critical security gap: models ignore access control instructions when forbidden tools are visible, creating a real attack vector. Prompts alone don't work; you need enforced rules that the model literally cannot bypass.
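The gatekeeper idea described above (hide unauthorized tools from the model, and reject calls to them regardless) can be illustrated with a minimal sketch. This is not the paper's implementation and does not use any real MCP SDK: `ToolProxy`, `ToolAccessDenied`, and the example tool names are hypothetical, and the tools are plain Python callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List


class ToolAccessDenied(Exception):
    """Raised when a caller requests a tool outside its allowlist."""


@dataclass
class ToolProxy:
    """Hypothetical proxy that sits between the model and the tool registry."""
    tools: Dict[str, Callable[..., Any]]  # full registry: name -> callable
    allowed: FrozenSet[str]               # names this agent may see and call

    def list_tools(self) -> List[str]:
        # Filter the listing so forbidden tools never enter the model's view.
        return sorted(name for name in self.tools if name in self.allowed)

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        # Enforce again at call time: a guessed or injected name is still rejected.
        if name not in self.allowed or name not in self.tools:
            raise ToolAccessDenied(f"tool {name!r} is not permitted")
        return self.tools[name](**kwargs)


proxy = ToolProxy(
    tools={
        "read_file": lambda path: f"read {path}",
        "delete_file": lambda path: f"deleted {path}",
    },
    allowed=frozenset({"read_file"}),
)
print(proxy.list_tools())  # ['read_file']
```

The point of checking in both places is that the decision lives in code outside the model: the prompt can say anything, but `delete_file` is neither listed nor callable through this proxy.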
- 2605.18407 · May 18, 2026 · ~10 min · cond-mat.mes-hall, cond-mat.mtrl-sci, cs.AI
Qumus: Realization of An Embodied AI Quantum Material Experimentalist
Lihan Shi, Zhaoyi Joy Zheng, Xinzhe Juan, Yimin Wang, +13
ELI5: A robot lab that uses AI to independently run physics experiments: it generates ideas, plans procedures, executes them with robotic hands, analyzes results, and learns from mistakes, successfully creating graphene and nanoscale devices on its own.
Problem solved: Scientists spend months on repetitive materials experiments with slow feedback loops. This system compresses that cycle by having an AI agent autonomously execute, monitor, and refine experiments in real time without human intervention between steps.
- 2605.18401 · May 18, 2026 · ~9 min · cs.CL, cs.AI
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang, +2
ELI5: A system that lets AI agents collect, organize, and improve reusable skills from their past experiences (like building a library of proven tricks) while filtering out bad or outdated ones so they don't clutter the agent's future attempts.
Problem solved: LLM agents generate lots of experience but can't easily reuse or learn from it; raw trajectories are messy, skills conflict or become outdated, and bad skills pollute future context. This gives agents a governed way to evolve and share working solutions.
- 2605.18395 · May 18, 2026 · ~10 min · cs.CY, cs.AI
Diagnosing Korean-Language LLM Political Bias via Census-Grounded Agent Simulation
Sungwoo Kang
ELI5: A tool that tests whether Korean-language AI models are politically biased by simulating how they'd vote in real elections. The researchers found the models have predictable quirks, like favoring progressive candidates or ignoring smaller parties, and showed ways to fix them.
Problem solved: Korean-language LLMs show hidden political leanings that could skew applications like polling or civic engagement tools. This work diagnoses exactly where and how those biases show up, then offers practical fixes without needing to retrain the models.
- 2605.18332 · May 18, 2026 · ~14 min · cs.SE, cs.AI
Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
Wei Ma, Zhi Chen, Jingxu Gu, Tianling Li, +2
ELI5: When researchers study how AI agents fix code, they notice patterns like 'good agents test after coding' or 'good agents keep errors short.' But this study found those patterns flip upside-down depending on which agent framework you use: what helps one agent hurts another. Same behavior, opposite meaning.
Problem solved: Papers keep publishing 'rules' about what makes software engineering agents work better, but those rules don't reliably transfer between different agent systems. Teams adopting these findings waste effort following advice that may actively harm their specific setup.
- 2605.18327 · May 18, 2026 · ~10 min · cs.AI
Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows
Dhairya Dalal, Endre Sara, Ben Yemini, Christine Miller, +1
ELI5: Instead of feeding AI agents raw monitoring data and making them figure out what's broken, this system pre-builds a map of how your systems are connected and what causes what, so when something fails, the AI agent can diagnose the problem 63% faster and use way fewer tokens.
Problem solved: SRE teams use AI agents to debug production issues, but without structured knowledge of system dependencies, agents waste time and tokens parsing raw telemetry. This causes slow diagnosis, high API costs, and unreliable answers. A causal model fixes this by giving agents pre-computed relationships.
- 2605.18181 · May 18, 2026 · ~11 min · cs.AI, cs.CL
Scalable Environments Drive Generalizable Agents
Jiayi Zhang, Fanqi Kong, Guibin Zhang, Maojia Song, +6
ELI5: Training AI agents only on fixed rules makes them brittle, like teaching someone to play chess but not checkers. This paper argues you need to systematically vary the underlying rules and mechanics agents interact with, not just give them more tasks in the same world.
Problem solved: Current AI agents fail when the environment changes (new interfaces, different physics, altered rewards). They're overfit to specific rule-sets. You need agents that adapt to fundamentally different worlds, but nobody's systematically scaling that dimension.
- 2605.18176 · May 18, 2026 · ~11 min · cs.CV, cs.AI
MARS: Technical Report for the CASTLE Challenge at EgoVis 2026
Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, +3
ELI5: A system that answers questions about real-world activities by smartly choosing which of many data sources to look at (videos, transcripts, photos, eye gaze, heart rate) rather than trying to process everything at once.
Problem solved: Egocentric video understanding requires reasoning over massive amounts of multimodal data (4 days, 15 camera angles, multiple sensor streams). Models can't fit all of it in memory, so you need an agent that knows which evidence to pull and when to answer.
- 2605.18144 · May 18, 2026 · ~13 min · cs.AI
Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine
Christiaan G. A. Viviers, Koen de Bruin, Mirre M. Trines, Ayla M. Hokke, +5
ELI5: A tool that reads thousands of nanomedicine research papers, finds gaps and connections between different research areas, then uses AI to suggest new research directions, backed by actual citations so scientists can verify where the ideas came from.
Problem solved: Nanomedicine researchers drown in fragmented literature across chemistry, biology, and medicine, making it hard to spot promising new directions. This system helps researchers discover underexplored intersections and generates hypothesis ideas grounded in existing evidence.
- 2605.18133 · May 18, 2026 · ~11 min · cs.CR, cs.AI, cs.HC
An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments
Hongjang Yang, Hyunsik Na, Daeseon Choi
ELI5: Hackers can trick AI chatbots into leaking private information by sneaking malicious instructions into websites the chatbot visits. The attacker makes their instructions look like innocent examples, so the AI follows them instead of the user's original request.
Problem solved: As AI chatbots browse the web to help users, they're vulnerable to data theft attacks embedded in websites they visit. Companies need to understand these risks to protect user privacy before deploying these tools in production.