What do these badges mean?
- 🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper, so people are building on it.
- 📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
- 💤 Quiet: Published but no notable signal yet. Most papers live here and could become anything later.
- 🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal badge; it is deferred until Twitter/X data is wired up.
- 💤 Quiet · 2605.18738 · May 18, 2026 · ~11 min · cs.AI
What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models
Payal Chandak, Victoria Alkin, David Wu, Maya Dagan, +10
⭐ 1 star / 2 repos · 📚 0 cites
ELI5: Researchers tested whether AI language models used in medicine make ethical decisions the same way human doctors do, or if they have hidden value biases. They found that while AI systems discuss multiple viewpoints, they actually make the same decision over and over, and some consistently underweight patient choice—potentially replacing diverse doctor values with one AI's preferences at scale.
Problem solved: When hospitals deploy AI to advise on medical decisions, nobody knows what ethical values the AI actually prioritizes. Some AI systems might consistently favor treatment efficiency over patient autonomy, or other values, but this bias goes undetected—meaning one AI's hidden preferences could be applied to thousands of patients instead of respecting the natural variation in how good doctors make ethical trade-offs.
- 2605.18693 · May 18, 2026 · ~14 min · cs.AI
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang, +7
ELI5: A benchmark that tests whether AI agents can write their own reusable instructions (skills) by learning from code repositories and documents, then using those skills to solve new tasks—like learning a recipe and actually being able to cook.
Problem solved: Current benchmarks only test if agents can use pre-made skills or solve tasks from raw data, but don't measure the core challenge: can agents actually generate correct, reusable skills themselves? This benchmark isolates and measures that skill-generation capability.
- 2605.18681 · May 18, 2026 · ~10 min · cs.AI, cs.LG
Learning Quantifiable Visual Explanations Without Ground-Truth
Amritpal Singh, Andrey Barsky, Mohamed Ali Souibgui, Ernest Valveny, +1
ELI5: A new way to measure how good an AI explanation is, without needing humans to label what the right answer should be. The method also trains a small add-on module that can sit on top of any AI model to generate better explanations of what it's actually looking at.
Problem solved: Explainability methods for AI are hard to evaluate because we don't have ground truth for what a 'good' explanation actually is. This creates a chicken-and-egg problem: you can't improve explanations if you can't measure them fairly.
- 2605.18667 · May 18, 2026 · ~10 min · cs.CV, cs.LG
Better Together: Evaluating the Complementarity of Earth Embedding Models
Thijs L van der Plas, Jacob JW Bakermans, Vishal Nedungadi, Gabrielė Tijūnaitytė, +2
ELI5: When you combine satellite image embeddings from different models, you get better results than using any single one alone. This paper measures how well different Earth observation models complement each other and shows that mixing them beats relying on just the best individual model.
Problem solved: Earth embedding models are typically evaluated separately, making it seem like one model is universally better. But the real value comes from combining multiple models—a capability that standard evaluation methods completely miss.
- 2605.18663 · May 18, 2026 · ~14 min · cs.AI, cs.CL, cs.LG
GIM: Evaluating models via tasks that integrate multiple cognitive domains
Rohit Patel, Alexandre Rezende, Steven McClain
ELI5: A new benchmark that tests whether AI models can handle realistic tasks requiring multiple reasoning steps at once—like juggling constraints, tracking state, and knowing when to be uncertain. Instead of just piling on hard facts to memorize or abstract puzzles with no context, it measures integration of different thinking skills on grounded problems.
Problem solved: Existing benchmarks either test memorization (unfair) or pure abstraction (unrealistic). GIM is a benchmark where difficulty comes from coordinating multiple reasoning types on practical tasks, and it provides a statistical framework (IRT) to fairly compare models even when they fail differently or use different compute budgets.
- 2605.18661 · May 18, 2026 · ~12 min · cs.AI
AI for Auto-Research: Roadmap & User Guide
Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, +16
ELI5: This guide maps out where AI can actually help scientists do research—from finding papers to writing them up—and where it still makes stuff up. It shows AI is great at organizing information but terrible at catching its own mistakes or coming up with truly new ideas.
Problem solved: AI can now generate research papers cheaply, but it fabricates results and misses errors, creating a credibility crisis. Teams need clear guidance on which research tasks are safe to automate versus which require human judgment to avoid publishing false science.
- 2605.18630 · May 18, 2026 · ~12 min · cs.AI, physics.comp-ph
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, +4
ELI5: A benchmark that tests whether AI assistants can ask clarifying questions when scientists describe vague or contradictory problems—like asking 'what material?' when someone says 'simulate stress' without specifying it—before jumping to solutions.
Problem solved: Scientists often describe problems incompletely or with contradictions, but current LLM benchmarks assume problems are already well-defined. This benchmark tests whether AI can reliably use dialogue to repair broken problem statements before time is wasted on the wrong computations.
- 2605.18607 · May 18, 2026 · ~12 min · cs.CL, cs.LG
Forecasting Downstream Performance of LLMs With Proxy Metrics
Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau
ELI5: Instead of waiting for a language model to finish training to see if it's good, you can peek at how confidently it predicts expert-written solutions during training—using metrics like prediction certainty and accuracy on those examples—to forecast what it'll be able to do later.
Problem solved: Training large language models is expensive and slow. Developers need to pick architectures, training data, and methods before committing compute, but current signals (loss numbers or expensive full evaluations) are unreliable. This lets you forecast downstream performance cheaply during training to make better development decisions.
- 2605.18583 · May 18, 2026 · ~14 min · cs.SE, cs.AI, cs.CL
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng, +3
ELI5: When you ask an AI coding agent to do something small, it sometimes does way more than you asked—deleting files you didn't mention, changing configs, etc. This paper builds a test suite to measure how often this happens and discovers that agents respect boundaries better when you explicitly tell them what they're allowed to do.
Problem solved: Autonomous coding agents with file and network access pose a real safety risk: they expand tasks beyond scope and touch things the user never authorized. There has been no good way to measure this behavior, partly because a measurement that states the rules explicitly nudges agents into better compliance.
- 2605.18580 · May 18, 2026 · ~8 min · cs.AI, cs.LG
When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State
Peiying Zhu, Sidi Chang
ELI5: When training AI agents to make decisions (like pricing), just checking if they hit the business goal isn't enough—they might break important rules along the way. This paper shows how to evaluate whether an agent actually behaves like it should by checking its full sequence of actions, not just the final outcome.
Problem solved: Companies deploying RL agents discover too late that policies hit revenue targets while violating compliance rules or competitive norms. Current evaluation misses these behavioral failures because it only checks outcomes, not whether the agent preserves the discipline and patterns of the system it's replacing.
- 2605.18565 · May 18, 2026 · ~13 min · cs.CL, cs.AI
LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, +2
ELI5: This benchmark tests whether AI agents can remember and reason about information correctly when facts keep changing and interfering with each other over very long conversations or documents—like tracking a character's location across a 200-page novel where they move around repeatedly.
Problem solved: AI agents fail at real-world tasks where information gets updated frequently and facts overlap (like a chatbot managing a customer's changing order status). Existing tests only check single facts in isolation, not messy, interconnected memories where old and new information conflict.
- 2605.18562 · May 18, 2026 · ~14 min · stat.ME, cs.AI, cs.LG
Estimating Item Difficulty with Large Language Models as Experts
Diana Kolesnikova, Kirill Fedyanin, Abe D. Hofman, Matthieu J. S. Brinkhuis, +1
ELI5: LLMs can estimate how hard test questions are without needing students to actually take them. The study tests different ways of asking these AI models to rate difficulty—like comparing two questions or rating one directly—and finds some approaches work nearly as well as human experts.
Problem solved: Creating new test items requires expensive pretesting with real students or expert judges to know which questions are easy vs. hard. This approach lets you get difficulty estimates instantly using off-the-shelf LLMs, cutting time and cost for adaptive learning systems and assessments.
- 2605.18548 · May 18, 2026 · ~10 min · cs.CL, cs.AI
STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics
Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, +4
ELI5: A benchmark that tests whether AI assistants can notice when the world changes mid-task and successfully adjust their plans—like a robot realizing a door locked while it was walking toward it and finding another route instead.
Problem solved: Real-world AI agents fail when their plans break mid-execution due to unexpected changes. Existing benchmarks only test *detecting* changes, not *fixing* broken plans, leaving a major gap in testing production-ready agents.
- 2605.18498 · May 18, 2026 · ~9 min · cs.LG, cs.AI
DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs
Jing Wang, Hongxuan Lu, Jazze Young, Shu Wang, +1
ELI5: This paper creates a diagnostic toolkit to measure how specialized different experts are in Mixture-of-Experts models—checking if each expert handles specific domains well versus doing a bit of everything. It then shows you can use these measurements to train models more efficiently by focusing on the experts that work best for your task.
Problem solved: MoE models are hard to understand and optimize because you can't easily tell if experts are truly specialized or just load-balanced. This work lets you identify which experts are good at what, then reuse that knowledge to cut training time by 85% while improving performance on specific domains.
- 2605.18490 · May 18, 2026 · ~13 min · cs.CL, cs.IR
Vector RAG vs LLM-Compiled Wiki: A Preregistered Comparison on a Small Multi-Domain Research
Theodore O. Cochran
ELI5: Two ways to help AI answer questions about research papers go head-to-head: traditional search-and-retrieve versus an AI-compiled wiki. The wiki better connected ideas across papers, but cost more per question to run, and neither was clearly better overall.
Problem solved: Teams need to know which approach to use when building AI systems that answer questions over document collections—does it make sense to pre-compile a wiki or retrieve chunks dynamically? This test shows the tradeoff depends on what matters most: accuracy, cost, or citation quality.
- 2605.18430 · May 18, 2026 · ~8 min · cs.LG
Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation
Liang Wang, Heng Meng, Zekai Xiang, Jin Liu, +3
ELI5: A test suite for AI systems that turn written descriptions into 3D CAD design files. It includes 600 examples ranging from simple shapes to complex real-world parts, helping measure how well language models can actually create usable designs.
Problem solved: Current AI models can handle simple CAD generation but fail on complex designs with advanced features. Engineers and designers need a standardized way to measure whether AI can reliably convert natural language into production-ready parametric models across different industries.
- 2605.18421 · May 18, 2026 · ~10 min · cs.CL, cs.AI, cs.LG
EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective
Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, +6
ELI5: A new benchmark that tests how well AI agents can remember and use information over time—both within a single conversation and across multiple conversations. It compares 15 different memory approaches to see which ones actually work best.
Problem solved: Current AI agent benchmarks ignore memory, even though real agents need to learn from past interactions and store important information. There was no standard way to measure whether an agent's memory system actually helps it perform better.
- 2605.18401 · May 18, 2026 · ~9 min · cs.CL, cs.AI
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang, +2
ELI5: A system that lets AI agents collect, organize, and improve reusable skills from their past experiences—like building a library of proven tricks—while filtering out bad or outdated ones so they don't clutter the agent's future attempts.
Problem solved: LLM agents generate lots of experience but can't easily reuse or learn from it; raw trajectories are messy, skills conflict or become outdated, and bad skills pollute future context. This gives agents a governed way to evolve and share working solutions.
- 2605.18380 · May 18, 2026 · ~8 min · cs.AI
QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
Anthony G. Cohn, Robert E. Blackwell
ELI5: A benchmark that tests whether AI language models can reason about spatial and temporal relationships—like 'if A is left of B and B is left of C, what's the relationship between A and C?'—across different mathematical systems designed for this type of logic.
Problem solved: We don't have a standard way to measure whether LLMs can actually do spatial and temporal reasoning, which matters for robotics, navigation, and planning tasks. This benchmark fills that gap and reveals that current models struggle with complex spatial calculi.
- 2605.18352 · May 18, 2026 · ~9 min · cs.CL
Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs
Tara Azin, Yongan Yu, Raj Singh, Olessia Jouravlev
ELI5: Researchers tested whether AI language models understand how assumptions work in sentences like 'If the king is bald, the queen is sad'—specifically, whether they grasp that we presume the king exists. They compared AI outputs to what 120 humans said and found AIs often match human answers by accident, not by actually reasoning through the logic.
Problem solved: We don't know if language models truly understand pragmatics (the unspoken rules of language) or just memorize patterns. This matters for building trustworthy AI: a model might give right answers for wrong reasons, and that kind of success breaks down on novel language tasks.
- 2605.18332 · May 18, 2026 · ~14 min · cs.SE, cs.AI
Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
Wei Ma, Zhi Chen, Jingxu Gu, Tianling Li, +2
ELI5: When researchers study how AI agents fix code, they notice patterns like 'good agents test after coding' or 'good agents keep errors short.' But this study found those patterns flip upside-down depending on which agent framework you use—what helps one agent hurts another. Same behavior, opposite meaning.
Problem solved: Papers keep publishing 'rules' about what makes software engineering agents work better, but those rules don't reliably transfer between different agent systems. Teams adopting these findings waste effort following advice that may actively harm their specific setup.
- 2605.18229 · May 18, 2026 · ~8 min · cs.LG, cs.AI
Are Sparse Autoencoder Benchmarks Reliable?
David Chanin
ELI5: Researchers tested the quality metrics used to judge sparse autoencoders (tools that help explain what language models do internally), and found that two popular scoring methods are unreliable—like using a broken scale to weigh ingredients. The remaining metrics work better but still struggle to clearly rank different approaches.
Problem solved: The SAE field relies on benchmarks to measure progress, but if those benchmarks are broken or noisy, teams can't tell if they're actually improving their interpretability tools or just getting lucky. This audit reveals which metrics to trust and which to abandon, giving SAE research a more reliable basis for measuring progress.
- 2605.18143 · May 18, 2026 · ~9 min · cs.AI
Generative AI and the Productivity Divide: Human-AI Complementarities in Education
Lihi Idan, Bharat Anand
ELI5: When people use AI assistants to learn, some get huge benefits while others barely improve—not because they're smarter, but because they're better at asking the right questions and checking answers. Teaching people how to interact with AI well can level the playing field.
Problem solved: Companies rolling out AI tools see wildly uneven productivity gains across their workforce, with success depending on an invisible skill (AI interaction competence) rather than traditional markers like IQ or domain knowledge. Without training on how to use AI effectively, firms waste money on tools that only help their best prompt-engineers.
- 💤 Quiet · 2605.16234 · May 15, 2026 · ~9 min · cs.LG, cs.AI, cs.CL
Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find
Gabriel Garcia
⭐ 73 stars / 10 repos · 📚 0 cites
ELI5: When you test whether a layer in a transformer is redundant, different test methods give different answers about which layers are safe to remove. This paper shows that the gap between these tests is large and unpredictable, so you need to run both tests before deciding what to prune.
Problem solved: Model compression tools rely on identifying redundant layers to remove, but current equivalence tests disagree on which layers are actually safe to cut. This means compression pipelines built on one test method can fail when the layers chosen for removal don't actually compress well in practice.
- 💤 Quiet · 2605.16223 · May 15, 2026 · ~5 min · cs.GR, cs.AI, cs.CV
Evaluating Design Video Generation: Metrics for Compositional Fidelity
Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, +1
⭐ 78 stars / 10 repos · 📚 0 cites
ELI5: A new way to automatically grade how well AI video generators handle design animations—checking if objects move the right way, stay where they should, and follow the instructions given.
Problem solved: Design animation has strict rules (move this box left, keep that text still) but there was no automated way to measure if generated videos actually follow them. Teams had to manually watch and grade videos, slowing down development.
- 🚀 Shipping · 2605.16215 · May 15, 2026 · ~13 min · cs.AI, cs.CL
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, +4
⭐ 434 stars / 25 repos · 📚 0 cites
ELI5: Researchers built the first completely transparent medical AI model where you can see everything: what data it learned from, how it was cleaned, how it was trained, and how it works. They combined medical question datasets, added clinician-verified practice guidelines, and had doctors validate every step.
Problem solved: Medical AI systems need to be trustworthy and auditable for doctors to use them, but most 'open' models hide their training data and methods. This makes it impossible to validate they're safe or understand why they give certain answers—a critical problem in healthcare.
- 🚀 Shipping · 2605.16207 · May 15, 2026 · ~8 min · cs.AI, cs.CL
Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, +2
⭐ 439 stars / 22 repos · 📚 0 cites
ELI5: Researchers tested whether AI tutors can actually tell the difference between correct answers, partially correct answers, and wrong answers—and found they're surprisingly bad at catching subtle mistakes that real tutors should catch.
Problem solved: Schools and education platforms are replacing human tutors with AI, but we didn't know if these AI tutors could actually diagnose student mistakes well enough to give useful feedback. This matters because bad diagnosis leads to bad teaching.
- 💤 Quiet · 2605.16116 · May 15, 2026 · ~13 min · cs.AI
ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, +4
⭐ 72 stars / 9 repos · 📚 0 cites
ELI5: A framework that turns real online stores into controllable, reproducible test environments for AI shopping agents. It captures the real structure and complexity of e-commerce sites but lets researchers reset them, inspect them, and run consistent experiments.
Problem solved: Testing e-commerce agents on real websites is messy and irreproducible; testing on hand-built fake stores is too narrow and unrealistic. ShopGym bridges this by automatically converting real storefronts into stable, inspectable simulations that preserve actual shopping complexity.
- 🚀 Shipping · 2605.16107 · May 15, 2026 · ~11 min · cs.CL
Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
Chenwang Wu, Yiuming Cheung, Bo Han, Shuhai Zhang, +1
⭐ 180 stars / 8 repos · 📚 0 cites
ELI5: A new technique to catch AI-written text by looking at patterns in how tokens (words) relate to each other locally and globally, rather than just checking individual tokens in isolation.
Problem solved: Current detectors get fooled by the random variations in how AI generates text. This method fixes that by examining token relationships across context, making detection more robust across different AI models and domains.