

What do these badges mean?

🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper, so people are building on it.
📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
💤 Quiet: Published but no notable signal yet. Most papers live here and could become anything later.
🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal; it is deferred until Twitter/X data is wired up.

💤Quiet2607.09654·Jul 10, 2026·~14 mincs.CVcs.AI
Evolution of Accuracy and Visual-Cognitive Errors in a Decade of Vision-Language AI Models
Shravan Murlidaran, Miguel P. Eckstein
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers tracked how well AI models describe images over 10 years, using a new dataset of complex social scenes instead of simple ones. They found modern AI now describes complicated behaviors almost as well as humans, but still sometimes looks at different parts of the image than people do.
Problem solvedPrevious benchmarks only tested AI on simple, curated images and didn't reveal what types of mistakes models were making. This gives builders a clearer picture of where vision-language models actually struggle on real-world complexity.
💤Quiet2607.09649·Jul 10, 2026·~9 mincs.AI
ConceptSMILE: Auditing the Trustworthiness of Concept-Based Explainable AI
Mohadeseh Mollapour, Koorosh Aslansefat, Zeinab Dehghani, Bhupesh Kumar Mishra, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5A tool that checks whether concept-based AI explanations (like 'this image shows vessel damage') are actually reliable, by testing how the model responds when you slightly change parts of the image and seeing if the explanation holds up.
Problem solvedConcept-based explanations seem intuitive to doctors and users, but there's no standard way to verify they're actually faithful to what the model is doing—you could get misleading explanations that sound trustworthy. This framework audits whether those concepts are real.
💤Quiet2607.09560·Jul 10, 2026·~14 mincs.AIcs.LG
Beyond Fixed Representations: The Vocabulary and Verifier Gaps in Open-Ended AI
Yuan Cao, Haiqian Yang
⭐ 0 stars / 0 repos📚 0 cites
ELI5Today's AI systems are stuck working within a fixed rulebook—they can reason and solve problems really well, but can't invent new concepts or tools that would let them tackle fundamentally different kinds of problems. This paper says true innovation requires AI to create and stabilize new building blocks that change the game itself.
Problem solvedCurrent AI hits a wall on open-ended tasks because it can only remix existing ideas, not invent new ones that unlock whole classes of solutions. Without the ability to create and trust new conceptual primitives, AI systems can't do the kind of foundational innovation humans do.
💤Quiet2607.08768·Jul 9, 2026·~12 mincs.CL
UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks
Zhekai Chen, Chengqi Duan, Kaiyue Sun, Bohao Li, +3
⭐ 0 stars / 0 repos📚 0 cites
ELI5A benchmark that tests AI agents on real-world computer tasks like filing documents or managing schedules. Instead of using fake answers, it runs tasks in live environments and checks if agents actually complete each step correctly.
Problem solvedExisting agent benchmarks use sandboxed fake setups and static answers, so they can't measure if agents actually work with real tools and handle multi-turn interactions. This makes it hard to debug why agents fail in the real world.
💤Quiet2607.08763·Jul 9, 2026·~11 mincs.CVcs.AI
OpenCoF: Learning to Reason Through Video Generation
Xinyan Chen, Ziyu Guo, Renrui Zhang, Dongzhi Jiang, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of writing out step-by-step reasoning like ChatGPT does, this system learns to reason by generating videos frame-by-frame—each frame shows the next logical step, like watching a solution unfold visually rather than reading it.
Problem solvedVideo models today generate realistic videos but can't reason through complex problems. This work shows that by training on reasoning-focused videos and giving models special tokens to track logical steps, they can actually use video generation as a reasoning tool—useful for math, planning, and logic tasks.
💤Quiet2607.08758·Jul 9, 2026·~12 mincs.AI
Ideas Have Genomes: Benchmarking Scientific Lineage Reasoning and Lineage-Grounded Idea Generation
Yifan Zhou, Qihao Yang, Yan Li, Donggang Li, +13
⭐ 0 stars / 0 repos📚 0 cites
ELI5A new benchmark that tests whether AI can understand how scientific ideas evolve from previous work, tracking what researchers inherit, fix, combine, or newly invent, and whether AI can generate ideas that fit logically into a scientific lineage.
Problem solvedWe don't know if AI systems understand scientific progress as building on the past. Current benchmarks don't measure whether AI can trace idea evolution, spot gaps in reasoning chains, or propose genuinely novel work that still coheres with prior research.
💤Quiet2607.08745·Jul 9, 2026·~9 mincs.AIcs.CV
AUTOPILOT VQA: Benchmarking Vision-Language Models for Incident-Centric Dashcam Understanding
Siddharth Damodharan, Radhika Gupta, Ali Alshami, Ryan Rabinowitz, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5A new test for AI systems that watch dashcam videos and answer questions about car accidents and near-misses. Instead of just recognizing objects, it checks whether models can understand what caused incidents and whether they were avoidable.
Problem solvedVision-language models are being used in self-driving cars, but we lacked a way to evaluate whether they actually understand safety-critical situations. This benchmark lets researchers measure how well AI can reason about real driving incidents across different weather, traffic, and road conditions.
💤Quiet2607.08734·Jul 9, 2026·~8 mincs.AI
The Illusion of Equivalency: Statistical Characterization of Quantization Effects in LLMs
Baha Rababah, Cuneyt Gurcan Akcora, Carson K. Leung
⭐ 0 stars / 0 repos📚 0 cites
ELI5When you shrink language models to use less memory by rounding their numbers, accuracy scores look fine—but the model actually makes different decisions on individual questions. This paper shows the hidden cost: which answers the model picks changes, even when overall test scores stay the same.
Problem solvedTeams deploying quantized LLMs think they're safe because accuracy hasn't dropped much, but they're missing silent failures where the model changes its reasoning. This matters for production systems where consistency and reliability matter beyond raw benchmark numbers.
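The gap described above, equal aggregate accuracy but different per-question answers, can be made concrete with a small sketch. The answer lists and the `flip_rate` helper below are hypothetical illustrations, not the paper's data or metric definitions.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def flip_rate(base: Sequence[str], quant: Sequence[str]) -> float:
    """Fraction of questions where the two models pick different answers."""
    return sum(b != q for b, q in zip(base, quant)) / len(base)


# Made-up multiple-choice answers, for illustration only.
gold = ["A", "B", "C", "D", "A", "B"]
base = ["A", "B", "C", "A", "A", "C"]   # full-precision model
quant = ["A", "B", "D", "D", "A", "A"]  # quantized model

print("base accuracy: ", accuracy(base, gold))
print("quant accuracy:", accuracy(quant, gold))
print("flip rate:     ", flip_rate(base, quant))
```

In this toy data both models score the same accuracy, yet they disagree on half of the questions, which is the kind of change an aggregate score alone would not show.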
💤Quiet2607.08731·Jul 9, 2026·~15 mincs.CLcs.AIcs.CY
Validity of LLMs as data annotators: AMALIA on authority
Manuel Pita
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers test whether an AI model can reliably identify moral concepts in text the same way humans do, not just by matching answers. They found a Portuguese AI agrees with humans on the surface, but actually uses shortcuts rather than understanding the concept — like flagging angry language near authority figures instead of truly measuring moral reasoning.
Problem solvedWhen using LLMs to annotate data for training or research, agreement scores hide whether the model actually understands the concept or just exploits surface patterns. This matters because a model that fakes understanding won't generalize to new data, wasting resources and producing invalid datasets.
💤Quiet2607.08700·Jul 9, 2026·~13 mincs.CL
Do You Need a Frontier Model as a Citation Verifier? Benchmarking Rubric LLMs for Deep-Research Source Attribution
Ethan Leung, Elias Lumer, Corey Feld, Austin Huber, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5When training AI systems to cite sources correctly, you need another AI to judge if the citations are good. This paper tests whether you need an expensive cutting-edge model for this judging job, or if cheaper models work just fine. They found cheaper models are actually competitive, but all judges have hidden biases that would mess up training if ignored.
Problem solvedBuilding AI systems that cite sources correctly is hard to measure, so people use AI judges to score them during training. But nobody knew if those judges are reliable or how biased they are — using a bad judge could train broken AI. This paper shows you can use cheaper judges, but only after checking and adjusting for their specific biases.
💤Quiet2607.08681·Jul 9, 2026·~12 mincs.AI
SolarChain-Eval: A Physics-Constrained Benchmark for Trustworthy Economic Agents in Decentralized Energy Markets
Shilin Ou, Yifan Xu, Luyao Zhang
⭐ 0 stars / 0 repos📚 0 cites
ELI5A benchmark that tests whether AI agents managing solar energy markets play fair and stay safe. It grades agents on how well they improve the market, whether they follow physics rules, and whether they try to game the system—plus whether an AI auditor can catch bad behavior.
Problem solvedAs AI agents make real decisions in energy grids and markets, we need ways to check if they're trustworthy beyond just profitability. Agents can exploit data glitches, fake demand, or destabilize the grid; existing benchmarks don't measure these risks alongside performance.
💤Quiet2607.08625·Jul 9, 2026·~8 mincs.AIcs.CL
The complexities of patient-centred conversational artificial intelligence
João Matos, Olivia Buege, Donny Cheung, Gary S. Collins, +7
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers built a realistic patient simulator that captures how real people actually talk to health chatbots—including emotions, communication styles, and messy language—then showed that how a patient *expresses* their symptoms can flip triage decisions, meaning AI health systems fail on real patients if only trained on perfect, scripted conversations.
Problem solvedHealth chatbots are tested on clean, articulate simulated patients but fail in the real world where people are stressed, vague, and emotionally varied. This creates a gap where AI misses urgent cases or over-triages mild ones, especially hurting people who don't communicate in the 'ideal' way the system expects.
💤Quiet2607.08579·Jul 9, 2026·~13 mincs.HCcs.LG
ImputeViz: A Visual Analytics Dashboard for Diagnosing Missing Data and Comparing Imputation Methods
Aitik Dandapat, Lalith Punepalle Raveendrareddy, Mithilesh Kumar Singh, Klaus Mueller
⭐ 0 stars / 0 repos📚 0 cites
ELI5A dashboard that helps you figure out what's wrong with missing data in your dataset and compares different ways to fill in those gaps—like showing you which method works best and how different choices affect your final results.
Problem solvedAnalysts waste time manually comparing imputation methods and struggle to understand why data is missing or how their choice of fill-in strategy biases their analysis. This tool makes that transparent and interactive in one place.
💤Quiet2607.08535·Jul 9, 2026·~7 mincs.CLcs.AI
When the Judge Changes, So Does the Measurement: Auditing LLM-as-Judge Reliability
Zongyou Yang, Yinghan Hou, Xiaokun Yang
⭐ 0 stars / 0 repos📚 0 cites
ELI5When you switch from one AI judge to a slightly newer version, the scores it gives can change dramatically—even though nothing about what's being judged changed. This paper shows that upgrading your evaluator model is like switching referees mid-game: the judgment becomes unreliable because the 'judge' itself is inconsistent.
Problem solvedTeams use LLMs to automatically score model outputs (essays, code, etc.), but swapping to a newer judge model gives wildly different results. This breaks reproducibility and makes it hard to trust that improvements are real or just artifacts of measurement changes.
💤Quiet2607.08522·Jul 9, 2026·~9 mincs.LG
Stop Guessing When to Stop Testing: Efficient Model Evaluation with Just Enough Data
Ofir Arviv, Kristjan Greenewald, Yotam Perlitz, Hadar Mulian, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of always testing AI models on the same fixed number of examples, this framework dynamically decides when to stop testing based on how confident you already are in the results — like stopping a coin-flip experiment early when you're certain it's fair.
Problem solvedEvaluating AI models wastes compute by testing on unnecessary data when results are already clear, or fails to detect real differences when budgets are too small. This framework cuts evaluation costs by up to 80% while keeping results statistically reliable.
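One way to picture "stop when you are confident enough" is a sequential check against a target accuracy. The sketch below is an assumption-laden illustration, not the paper's framework: it uses a Hoeffding bound with a union bound across steps, and the names `evaluate_until_confident`, `threshold`, and `min_n` are hypothetical.

```python
import math
from typing import Iterable, Tuple


def evaluate_until_confident(
    outcomes: Iterable[int],
    threshold: float,
    delta: float = 0.05,
    min_n: int = 10,
) -> Tuple[int, float, str]:
    """Consume 0/1 outcomes and stop once the confidence interval around
    the running accuracy no longer contains `threshold`."""
    n = 0
    correct = 0
    for outcome in outcomes:
        n += 1
        correct += outcome
        mean = correct / n
        # Spend delta / (n * (n + 1)) of the error budget at step n;
        # these shares sum to delta over all steps, so peeking after
        # every example does not inflate the overall error rate.
        delta_n = delta / (n * (n + 1))
        half_width = math.sqrt(math.log(2 / delta_n) / (2 * n))
        if n >= min_n:
            if mean - half_width > threshold:
                return n, mean, "above"
            if mean + half_width < threshold:
                return n, mean, "below"
    return n, (correct / n if n else 0.0), "undecided"


# A model that answers every example correctly is confirmed to beat 50%
# long before all 1000 examples are consumed.
print(evaluate_until_confident([1] * 1000, threshold=0.5))
```

The Hoeffding bound is conservative; tighter sequential tests would stop sooner, at the cost of more machinery.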
💤Quiet2607.08511·Jul 9, 2026·~7 mincs.LGcs.CV
Systematic Evaluation of Learning Rate Scheduling Strategies Across Heterogeneous Architectures
Hafsa Mateen, Radu Timofte, Dmitry Ignatov
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers tested 25 different learning rate schedules (strategies for adjusting training speed) on 30 neural network designs to see which combinations work best. They found that some schedules consistently win, but the best choice depends on what type of network you're using.
Problem solvedPicking the right learning rate schedule is tedious and usually done by guessing—teams waste time trying options manually. This work gives practitioners real data on what actually works for different architectures, removing the guesswork.
💤Quiet2607.08489·Jul 9, 2026·~6 mincs.CVcs.AIcs.HC
VEGAS: Human-Aligned Video Caption Evaluation via Gaze
Shenghui Chen, Po-han Li, Ximeng Sun, Shijia Yang, +4
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of rating video captions the same way for everyone, this method uses eye-tracking data to pick captions that match what each individual viewer is actually looking at—like customizing summaries based on where someone's eyes go.
Problem solvedVideo captions today ignore what viewers actually pay attention to, making descriptions miss important details in people's focus area. This wastes the opportunity to personalize summaries and makes retrieval systems less accurate for real-world viewing patterns.
💤Quiet2607.08423·Jul 9, 2026·~11 mincs.AI
OmniFood-Bench: Evaluating VLMs for Nutrient Reasoning and Personalized Health Advice
Qian Jiang, Zhecheng Shi, Jingpu Yang, Zirui Song, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5A test suite that checks whether AI vision models can look at food and give safe, accurate nutritional advice—from spotting ingredients and portion sizes to recommending what diabetics should eat. It finds that current models are good at naming dishes but terrible at estimating how much food there actually is.
Problem solvedAI health apps are being deployed to give dietary advice, but there's no rigorous way to test if they're actually safe and accurate. Models confidently make up wrong nutritional info and bad medical recommendations, which is dangerous for people managing diseases like diabetes.
💤Quiet2607.07695·Jul 8, 2026·~13 mincs.AIcs.GTcs.MA
Institutional Red-Teaming: Deployment Rules, Not Just Models, Causally Shape Multi-Agent AI Safety
Yujiao Chen
⭐ 0 stars / 0 repos📚 0 cites
ELI5When you deploy multiple AI agents that interact with each other, the rules governing how they share resources or consequences matter more than the agents themselves—changing just one rule can swing harmful outcomes by 22–58 percentage points. This paper tests that by running thousands of multi-agent games with fixed agents but different rules, and finds that rules singling out individuals by name lead to those agents being exploited or eliminated.
Problem solvedOrganizations deploying multi-agent AI systems today have no systematic way to test whether their deployment rules (not just the models) will cause dangerous emergent behavior like targeted harm or exploitation. This work provides a methodology and benchmark to catch rule-based safety failures before deployment, moving beyond assuming model safety guarantees collective safety.
💤Quiet2607.07669·Jul 8, 2026·~9 mincs.CLcs.AI
DiaLLM: An Investigation into the Robustness-Generation Gap in English Dialect Adaptation
Jordan Painter, Dipankar Srirag, Adarsh Kappiyath, Diptesh Kanojia, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5LLMs understand different English dialects when reading, but they only write in standard American English. This work trains models to actually generate Australian, Indian, and British English by learning from dialect corpora and tweaking how the model is fine-tuned.
Problem solvedCurrent LLMs can't produce text in non-standard English dialects even though many users speak them natively. This limits accessibility and representation for billions of English speakers outside the US, and makes models less useful globally.
💤Quiet2607.07663·Jul 8, 2026·~14 mincs.AI
Recursive Self-Improvement in AI: From Bounded Self-Refinement to Autonomous Research Loops
Mingguang Chen, Licheng Wang, Bo Qu
⭐ 0 stars / 0 repos📚 0 cites
ELI5This paper maps out how AI systems are learning to improve themselves—from simple self-editing to running their own research projects—and shows that the strongest improvements come from having trustworthy ways to check if the improvements actually work.
Problem solvedAs AI systems take on more of their own improvement, it's unclear what's actually working versus what just looks good on the surface. Companies and researchers need a clear framework to evaluate self-improving systems safely, especially when humans can't directly verify every decision the AI makes.
💤Quiet2607.06503·Jul 7, 2026·~12 mincs.AI
Doomed from the Start: Early Abort of LLM Agent Episodes via a Recall-Controlled Probe Cascade
Kai Ruan, Zihe Huang, Ziqi Zhou, Qianshan Wei, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5When LLM agents are solving tasks, they sometimes go down dead-end paths but keep computing anyway. This paper detects when an agent is about to fail by looking at its internal brain signals early on, then stops it before wasting compute—saving 37–47% of inference while keeping most successful attempts alive.
Problem solvedLLM agents waste massive compute on doomed trajectories because failure only becomes obvious after many steps. Operators need to abort failing episodes early to reduce inference costs without accidentally killing episodes that would have succeeded.
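The trade-off above (abort doomed episodes without killing ones that would have succeeded) can be sketched as a threshold calibration step. This is a simplified illustration for a single hypothetical probe that outputs a failure score per episode; it is not the paper's cascade, and all scores below are made up.

```python
import math
from typing import Sequence


def abort_threshold(success_scores: Sequence[float], target_recall: float) -> float:
    """Pick a failure-score threshold that keeps at least `target_recall`
    of held-out successful episodes alive.

    An episode is aborted when its score is strictly greater than the
    returned threshold.
    """
    ranked = sorted(success_scores)
    # Number of successful episodes that must survive the probe.
    keep = math.ceil(target_recall * len(ranked))
    return ranked[keep - 1]


# Hypothetical probe scores on a held-out calibration set.
success_scores = [0.05, 0.10, 0.20, 0.35, 0.90]  # episodes that succeeded
failure_scores = [0.30, 0.60, 0.80, 0.95]        # episodes that failed

threshold = abort_threshold(success_scores, target_recall=0.8)
aborted_failures = sum(score > threshold for score in failure_scores)
print(threshold, aborted_failures)
```

Raising `target_recall` protects more successful episodes but lets more failing ones run to completion, so the compute saved shrinks.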
💤Quiet2607.06482·Jul 7, 2026·~9 mincs.CLcs.AI
Data Analysis in the Wild: Benchmarking Large Language Models Against Real-World Data Complexities
So Hasegawa, Shailaja Keyur Sampat, Lei Liu, Wei-Peng Chen
⭐ 0 stars / 0 repos📚 0 cites
ELI5A new benchmark tests whether language models can do real data analysis—answering questions about messy, multi-table datasets and spotting interesting patterns, not just looking up facts in clean tables.
Problem solvedExisting benchmarks don't measure what actually matters: can LLMs handle the complexity of real governmental datasets with multiple tables, external context, and exploratory discovery? This benchmark fills that gap with realistic tasks.
💤Quiet2607.05391·Jul 6, 2026·~14 mincs.AIcs.CLcs.LG
LLM-as-a-Verifier: A General-Purpose Verification Framework
Jacky Kwok, Shulu Li, Pranav Atreya, Yuejiang Liu, +5
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of asking an LLM 'is this answer right or wrong?', this framework lets it output a probability distribution over correctness, giving you a precise confidence score. You can then use these scores to pick the best solution from multiple attempts, or feed them into AI training loops.
Problem solvedCurrent LLM judges give you yes/no answers, making it hard to pick between mediocre solutions or train agents effectively. This gives you granular confidence scores so you can rank solutions accurately and provide rich feedback signals for AI systems to learn from.
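To illustrate how a distribution can yield a finer-grained score than a yes/no verdict, the sketch below turns a distribution over discrete rating levels into an expected score and uses it to rank candidates. The 1-to-5 rating scale, the candidate names, and the probabilities are assumptions made for illustration; the paper's actual scoring scheme may differ.

```python
from typing import Dict


def expected_score(dist: Dict[int, float]) -> float:
    """Probability-weighted mean rating, normalized in case the
    probabilities do not sum exactly to 1."""
    total = sum(dist.values())
    return sum(level * p for level, p in dist.items()) / total


# Hypothetical verifier outputs: probability assigned to each rating level.
candidates = {
    "solution_a": {1: 0.10, 2: 0.20, 3: 0.40, 4: 0.20, 5: 0.10},
    "solution_b": {1: 0.05, 2: 0.10, 3: 0.25, 4: 0.40, 5: 0.20},
}

scores = {name: expected_score(dist) for name, dist in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A binary judge might rate both candidates the same, while the expected score separates them and can double as a graded feedback signal.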
💤Quiet2607.05375·Jul 6, 2026·~12 minstat.MLcs.LG
Fitted Occupancy-Ratio Evaluation without Bellman Completeness
Lars van der Laan, Nathan Kallus
⭐ 0 stars / 0 repos📚 0 cites
ELI5A new method to evaluate how good an offline RL policy is by learning occupancy ratios (which track state-action visit frequencies) without needing unrealistic assumptions about the completeness of function approximators.
Problem solvedOffline RL evaluation methods currently require strong assumptions (Bellman completeness) that rarely hold in practice. This makes it hard to reliably estimate policy performance from fixed datasets without those guarantees.
💤Quiet2607.05365·Jul 6, 2026·~8 mincs.CLcs.AIeess.AS
SPEARBench: A Benchmark for Naturalness Evaluation in Streaming Speech-to-Speech Language Models
Thomas Thebaud, Yuzhe Wang, Hao Zhang, Sathvik Manikantan Napa Ugandhar, +4
⭐ 0 stars / 0 repos📚 0 cites
ELI5A new test suite that measures whether AI systems that talk back to you sound natural in conversations—not just whether they sound good, but whether they take turns correctly, match your emotion and accent, and don't interrupt awkwardly.
Problem solvedCurrent AI speech systems are judged on audio quality alone, missing whether they actually behave like humans in real conversations: timing, interruptions, accent consistency, and emotional tone all matter for voice assistants to be genuinely useful.
💤Quiet2607.05363·Jul 6, 2026·~12 mincs.AI
SovereignPA-Bench: Evaluating User-Owned Personal Agents under Evolving Intent, Platform Mediation, and Consent Constraints
Dylan Zongmin Liu
⭐ 0 stars / 0 repos📚 0 cites
ELI5A new test suite that checks whether AI assistants that work on behalf of users actually protect those users' interests—like respecting privacy choices, not getting tricked by manipulative platforms, and asking permission before taking action.
Problem solvedExisting AI agent benchmarks only measure task completion, but miss whether agents actually defend user rights: leaking private data, violating consent, or being manipulated into bad decisions. This benchmark catches those failures.
💤Quiet2607.05310·Jul 6, 2026·~10 mincs.AI
Evaluating and Understanding Model Editing for Medical Vision Language Models
Guli Zhu, Chenwei Wu, Liyue Shen
⭐ 0 stars / 0 repos📚 0 cites
ELI5When medical AI models make mistakes after deployment, editing them lets you fix specific errors without retraining from scratch. This paper tests whether those fixes actually work in real clinical settings—checking if the correction sticks, doesn't break other things, and handles different imaging equipment and medical scenarios.
Problem solvedMedical VLMs deployed in hospitals can't be retrained easily when they make errors. Existing editing methods aren't tested on realistic clinical challenges like different imaging modalities, protocol variations, or complex medical knowledge—so you don't know if a fix will hold up in practice.
💤Quiet2607.02513·Jul 2, 2026·~11 mincs.CLcs.AIcs.LG
LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning
Matteo Boglioni, Thibault Rousset, Siva Reddy, Marius Mosbach, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers built a test bench that injects fake personal information into specific parts of AI models, then checks if unlearning methods actually delete it from those exact locations—rather than just hiding it. Most unlearning methods look good on surface tests but fail when you dig into the actual weights.
Problem solvedCurrent unlearning methods claim to remove sensitive data from LLMs but are only tested by checking outputs. This doesn't prove the knowledge is actually gone, since attackers can sometimes resurface it. You need ground-truth verification that deletion happened in the model's parameters.
💤Quiet2607.02507·Jul 2, 2026·~10 mincs.AIcs.CLcs.LG
What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates
Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah, Shahriar Noroozizadeh
⭐ 0 stars / 0 repos📚 0 cites
ELI5LLM agents change what they say in public versus in private conversations, even without being told to. Researchers found that social pressure (like being evaluated by authority figures) causes agents to hide their real opinions, similar to how people act differently around their boss.
Problem solvedCurrent LLM evaluations only measure what agents say publicly, missing that they may have different private views or hidden objectives shaped by social context. This means safety and alignment tests could pass while the model actually behaves differently when stakes or social pressure change.
💤Quiet2607.02469·Jul 2, 2026·~16 mincs.SEcs.AIcs.CL
TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
Jiale Amber Wang, Kaiyuan Wang, Pengyu Nie
⭐ 0 stars / 0 repos📚 0 cites
ELI5A benchmark that tests whether AI agents can write new tests and fix broken tests when code changes, using real git repositories and actually running the tests to verify they work—not just checking if they look correct on paper.
Problem solvedExisting test generation benchmarks don't verify tests actually execute or match the code change, making it hard to know if AI agents truly understand how to keep tests in sync with evolving code. This benchmark solves that by running real tests against real code changes from actual projects.
💤Quiet2607.02467·Jul 2, 2026·~8 mincs.CYcs.AI
Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting
Vivienne Ming
⭐ 0 stars / 0 repos📚 0 cites
ELI5When people use AI to help them make predictions, outcomes depend almost entirely on the person's mindset—those who stay curious and question their assumptions improve, while others just copy the AI or ignore it. This matters because most studies only report average results, missing that collaboration works brilliantly for some people and fails for others.
Problem solvedCompanies and teams assume human-AI collaboration always helps, but it often doesn't—the average masks that most people either blindly trust or blindly ignore the AI. Knowing which traits actually predict good collaboration lets you pick better team members or train people on what actually works.
💤Quiet2607.02464·Jul 2, 2026·~15 mincs.CL
Will Scaling Improve Social Simulation with LLMs?
Caleb Ziems, William Held, Su Doga Karaca, David Grusky, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers tested whether bigger language models get better at simulating how people think, act, and make decisions. They found that scaling helps for most tasks, but some harder problems like predicting rare opinions or matching human biases don't improve much no matter how large the model gets.
Problem solvedTeams want to use AI to simulate human behavior for research and forecasting, but current models aren't accurate enough to rely on. This work determines whether simply making models bigger will fix the problem, or if we need different approaches for certain applications.
💤Quiet2607.02459·Jul 2, 2026·~9 mincs.CL
Language Models as Measurement Apparatus for Culture
Kent K. Chang
⭐ 0 stars / 0 repos📚 0 cites
ELI5Language models don't just measure culture neutrally—the choices you make when building them (what data, how you label it, what you measure) actually shape what 'culture' means in your analysis. It's like the thermometer isn't just reading temperature; it's partly deciding what temperature is.
Problem solvedML researchers often treat cultural measurement as objective and technical, ignoring how their design choices actively construct the cultural reality they claim to measure. This paper pushes back, showing that every decision (model, data, labels) is a hidden ethical and methodological commitment that needs explicit attention.
💤Quiet2607.02440·Jul 2, 2026·~8 mincs.AIcs.CL
EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments
Zhilin Wang, Han Song, Runzhe Zhan, Jusen Du, +12
⭐ 0 stars / 0 repos📚 0 cites
ELI5A benchmark that tests how well AI agents can iteratively improve robot/game-playing policies by editing code and learning from feedback, rather than just solving tasks once. It measures how agents allocate their limited attempts to get better results.
Problem solvedWe lacked a standardized way to measure whether AI systems can actually improve their own policies over time through feedback—most benchmarks just score final performance or reward, missing the realistic challenge of iterative refinement under real-world constraints.
💤Quiet2607.02436·Jul 2, 2026·~15 mincs.SEcs.AI
Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study
Achint Mehta
⭐ 0 stars / 0 repos📚 0 cites
ELI5When you ask AI to write code, giving it more thinking time matters far more than giving it testing tools or fancy prompts. A study of 90 code-generation runs found that harder reasoning (like thinking twice) raised the success rate from 28% to 89%, while testing tools just cost more money without helping.
Problem solvedTeams waste money adding tools and prompts to coding agents hoping to get better results, but they're fixing the wrong thing. Most failures come from weak reasoning, not missing test runs or bad design—so you should spend money on better models or more thinking, not more features.
💤Quiet2607.02432·Jul 2, 2026·~12 mincs.AIcs.CLcs.CY
Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach
Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rodriguez-Martinez, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5We tested whether AI models like ChatGPT and Claude can grade Linux command-line exam answers like a professor would. We found that Gemini worked best when given a clear rubric, matching expert graders 89% of the time—but the AI struggled more with harder, more complex questions.
Problem solvedUniversities struggle to grade hundreds of command-line exams by hand, and simple automated tools can't give partial credit or recognize equivalent solutions. This shows which AI models and prompt strategies actually work reliably for grading, and which exam questions are too hard for AI to handle fairly.
💤Quiet2607.02416·Jul 2, 2026·~11 mincs.CL
The Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing
David Jurgens
⭐ 0 stars / 0 repos📚 0 cites
ELI5NLP researchers are publishing less at traditional NLP conferences like ACL and more at general machine learning venues. The shift started with large language models blurring the lines between NLP and broader ML, and newer researchers increasingly debut at ML conferences rather than NLP ones.
Problem solvedNLP community leaders want to understand where their field's research center is moving and why. This matters because venue choice affects visibility, citations, and career trajectories—if NLP research is fragmenting across different conferences, it's harder to track progress and build community.
💤Quiet2607.02383·Jul 2, 2026·~9 mincs.CL
Know Your Source: A Public Knowledge Store for Media Background Checks
Benjamin Nichols, Michael Schlichtkrull, Nedjma Ousidhoum
⭐ 0 stars / 0 repos📚 0 cites
ELI5A new public database of news articles from 200 media outlets that lets AI systems check whether information sources are trustworthy—helping fact-checkers verify claims by understanding which outlets are reliable and which might be biased or wrong.
Problem solvedFact-checking AI systems need to know if their sources are credible, but building this capability has been expensive (relying on paid APIs) and hard to reproduce. MEDIAREF provides a free, open dataset so researchers can cheaply test and improve how AI evaluates source trustworthiness.
💤Quiet2607.02374·Jul 2, 2026·~10 mincs.AI
DRIFTLENS: Measuring Memory-Induced Reasoning Drift in Personalized Language Models
Xi Fang, Weijie Xu, Yingqiang Ge, Yuhui Xu, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5When AI systems remember user details and use them to personalize responses, they don't just change what they say—they change how they reason to reach that answer. This paper measures how much this happens and whether we can reduce it.
Problem solvedPersonalized AI assistants may silently alter their reasoning based on stored user attributes, potentially embedding biases or inconsistencies into explanations without users noticing. This matters for trust and fairness when a model justifies the same answer differently to different users.
💤 Quiet · 2606.30573 · Jun 29, 2026 · ~13 min · cs.LG
SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions
Mohit Raghavendra, Anisha Gunjal, Aakash Sabharwal, Yunzhong He
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of giving coding agents a complete task description once, this benchmark simulates a real developer's workflow where a user starts with vague instructions, gradually reveals requirements, and gives feedback until the task is done. It tests whether AI can figure out what a human actually wants and adapt as things change.
Problem solved: Current coding benchmarks measure single-shot task completion, but real developers work iteratively with unclear requirements that shift over time. This tests the actual experience: agents that seem good at isolated tasks often fail when they have to negotiate ambiguous goals, accept feedback, and refine work across multiple turns.
💤 Quiet · 2606.30561 · Jun 29, 2026 · ~8 min · cs.AI, cs.CV, cs.HC
The Human Creativity Benchmark
Aspen Hopkins, Allison Nulty, Alexandria Minetti, Anoop Pakki, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of forcing experts to agree on a single score for creative work, this benchmark captures both where they agree (technical correctness) and where they genuinely disagree (aesthetic taste), then shows that AI models need different strategies for each.
Problem solved: Creative AI evaluation today treats expert disagreement as noise and collapses it into one score, hiding whether a model fails at objective skills or just has a different aesthetic vision. Creators and teams need to know which is which.
💤 Quiet · 2606.30556 · Jun 29, 2026 · ~10 min · cs.CL
Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?
Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of asking an LLM to judge a poem directly, this method asks it to roleplay as the poem's author with context about the author's background and intent. This perspective-taking dramatically improves how well the LLM's judgments match what human experts would say.
Problem solved: Evaluating poetry understanding at scale has been impractical: traditional automatic metrics fail for nuanced literary analysis, and human judges are expensive and slow. This makes it hard to build and benchmark poetry-related AI systems. Poller enables fast, scalable evaluation that actually correlates with expert judgment.
💤 Quiet · 2606.30549 · Jun 29, 2026 · ~7 min · cs.HC, cs.AI, cs.SE
To Tab or Not to Tab: Measuring Critical Engagement in AI Code Completion Tools Using Behavioral Signals and Attention Checks
Jessica Hutchison, Ian Tyler Applebaum, Kenneth Angelikas, Kush Rakesh Patel, +5
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A tool that watches how students interact with AI code suggestions (like GitHub Copilot) and asks pop-up questions to check if they're actually thinking critically. It finds that students who blindly accept suggestions tend to fail the questions, while those who spend time reading code perform better.
Problem solved: Students using AI coding assistants often accept suggestions without understanding them, leading to shallow learning. This tool helps teachers see which students are genuinely engaged versus just rubber-stamping AI output, so instructors can intervene when students aren't thinking critically.
💤 Quiet · 2606.30498 · Jun 29, 2026 · ~10 min · cs.CV, cs.AI
On the Faithfulness of Post-Hoc Concept Bottleneck Models
Laines Schmalwasser, Jan Blunk, Niklas Penzel, Julia Niebling, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When AI models try to explain their decisions using human-readable concepts (like 'red belly' for identifying birds), they often cheat, picking patterns that predict well but don't actually mean what they claim. This paper catches those cheaters by measuring whether concepts are truly meaningful, not just accurate.
Problem solved: Current AI interpretability methods claim to use human concepts but actually learn meaningless shortcuts that happen to work. You can't tell if your 'explainable' model is truly understandable or just getting lucky, which defeats the purpose of interpretability for high-stakes decisions.
💤 Quiet · 2606.30491 · Jun 29, 2026 · ~13 min · cs.CL, cs.AI
SIMAX: A Scalable and Interpretable Framework for Multi-Fidelity and Annotated Clinician-Patient Dialogue Simulation
Zhuhan Bao, Rui Yang, Bohao Yang, Zhiyi Liu, +15
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A system that generates realistic fake doctor-patient conversations with detailed annotations about what communication behaviors happened in each dialogue. It's like a controllable simulator for medical talk that lets you test AI systems that automatically analyze how well doctors and patients communicate.
Problem solved: Healthcare organizations capture tons of doctor-patient recordings but manually labeling communication quality is expensive, inconsistent, and slow. This framework lets you create unlimited training and test data with perfect labels, so you can build and validate AI systems that score clinical communication without needing hundreds of human coders.
💤 Quiet · 2606.30481 · Jun 29, 2026 · ~9 min · cs.CY, cs.AI, cs.CL
Situation Perception: A Necessary Primitive to Artificial Superintelligence
Ziqin Yuan, Jaymari Chua
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Today's AI language models are just very good at pattern-matching in text. To reach true superintelligence, they need to build internal simulations of the world, imagining 'what if' scenarios over long periods and learning from them to achieve their own goals.
Problem solved: LLMs can generate coherent text but can't actually understand cause-and-effect, predict future consequences of actions, or pursue long-term goals the way humans do. This gap means they'll never become truly intelligent without fundamentally new capabilities.
💤 Quiet · 2606.30473 · Jun 29, 2026 · ~15 min · cs.CL, cs.AI, cs.IR
Field Order Should Not Matter: Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval
Aivin V. Solatorio, Olivier Dupriez, Rafael Macalaba
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: When you search through a database of structured records (like a table with columns), the order you list the columns shouldn't matter, but standard AI models forget this and memorize positions instead of field labels. This paper adds random field shuffling during training so models learn what each field means, not where it sits, making searches work reliably regardless of column order.
Problem solved: Companies and organizations publish data catalogs (like development statistics) that need to be searchable across languages, but standard embedding models break when the field order changes, losing 7+ points of accuracy. This fix lets small, self-hosted models stay reliable while reducing that penalty to near zero, crucial for AI assistants that need to find the right data source before answering.
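To make the field-shuffling idea concrete, here is a minimal sketch of what such a training-time augmentation could look like. It is not the paper's implementation: the `serialize_record` helper, the `label: value` text format, the ` | ` separator, and the example record are all assumptions made for illustration.

```python
import random


def serialize_record(record, rng=None, sep=" | "):
    """Turn a metadata record into 'label: value' text for an embedding model.

    Hypothetical helper: if `rng` is given, the field order is shuffled
    (the augmentation); otherwise the record's own order is kept.
    """
    items = list(record.items())
    if rng is not None:
        rng.shuffle(items)  # in-place shuffle of (label, value) pairs
    return sep.join(f"{label}: {value}" for label, value in items)


# Invented example record, loosely in the spirit of a data-catalog entry.
record = {"title": "GDP per capita", "source": "national accounts", "unit": "USD"}

# Canonical order, e.g. for indexing.
canonical = serialize_record(record)

# Shuffled variants, e.g. one fresh ordering per training step.
rng = random.Random(0)
variants = [serialize_record(record, rng) for _ in range(3)]
```

Every variant carries the same labelled fields as `canonical`, only in a different order, so a model fine-tuned on such variants has less opportunity to rely on field position.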
💤 Quiet · 2606.30454 · Jun 29, 2026 · ~10 min · physics.soc-ph, cs.AI
Collective cooperation without individual fidelity in LLM agents
Henrique Ferraz de Arruda, Carlos Gracia Lázaro, Alberto Aleta, Yamir Moreno
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Researchers tested whether large language models playing a cooperation game behave like humans. The models matched humans' overall cooperation patterns, but made decisions differently at the individual level, like getting the right answer for the wrong reasons.
Problem solved: Until now, it was unclear whether LLM agents in social simulations actually think like humans or just happen to reach similar outcomes. This matters for using AI to model or predict human behavior in networks and groups.
💤 Quiet · 2606.30452 · Jun 29, 2026 · ~7 min · cs.LG
Exploring Differences Between Tabular Enterprise Data and Public Benchmarks
Myung Jun Kim, Maximilian Schambach, Frank Essenberger, Andre Sres, +1
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A team discovered that AI models trained to excel at tabular data benchmarks often fail badly on real business data tables, because enterprise data has different properties than the public test sets researchers use.
Problem solved: Companies deploying tabular ML models face unexpected failures because benchmarks don't reflect actual business data characteristics. This work exposes why models that look good in academia flop in production.
