
All 50 · 🚀 Shipping 6 · 📈 Climbing 0 · 💤 Quiet 44 · Unscored 0

What do these badges mean?

🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper, so people are building on it.
📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
💤 Quiet: Published but no notable signal yet. Most papers live here and could become anything later.
🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal; it is deferred until Twitter/X data is wired up.

💤Quiet2607.09657·Jul 10, 2026·~8 mincs.CVcs.AIcs.MM
Scalable Visual Pretraining for Language Intelligence
Yiming Zhang, Zhonghan Zhao, Wenwei Zhang, Haiteng Zhao, +12
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of converting documents into plain text for training language models, this work shows that training directly on the visual layout and images of documents—charts, equations, page structure—gives better results than just using text alone.
Problem solvedLanguage models trained on text-only representations lose rich information from figures, equations, and document layouts. This wastes training data and limits what models can learn from visually complex sources like PDFs, papers, and web pages.
💤Quiet2607.09623·Jul 10, 2026·~11 mincs.CLcs.AI
Task-Specific Multimodal Question Answering Agents via Confidence Calibration and Incremental Reasoning for QANTA 2026
Nirjhar Das, Md. Al-Mamun Provath
⭐ 0 stars / 0 repos📚 0 cites
ELI5A system that answers trivia questions from partial clues (text + images) by using two specialized AI agents—one decides when to buzz in on tossup questions, the other carefully selects answers on bonus questions—using confidence scoring and reasoning rules instead of brute-force retrieval.
Problem solvedMultimodal trivia systems need to work fast with limited compute while handling two different question types with opposite constraints: tossup requires risk-aware timing (answer too soon = wrong, too late = someone else wins), bonus requires accuracy. This system wins the QANTA competition by building task-specific strategies rather than one generic approach.
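The tossup trade-off described above (buzz too early and risk a wrong answer, too late and lose the question) can be illustrated with a simple confidence threshold. This is a minimal sketch, not the paper's agent: `choose_buzz_point`, `toy_guesser`, and the threshold values are hypothetical names and numbers chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple


def choose_buzz_point(
    clues: List[str],
    guesser: Callable[[str], Tuple[str, float]],
    threshold: float = 0.8,
) -> Optional[Tuple[int, str, float]]:
    """Reveal clues one at a time and buzz at the first prefix whose
    confidence reaches the threshold. Returns None if it never does."""
    revealed = ""
    for i, clue in enumerate(clues, start=1):
        revealed = (revealed + " " + clue).strip()
        answer, confidence = guesser(revealed)
        if confidence >= threshold:
            return i, answer, confidence
    return None


def toy_guesser(text: str) -> Tuple[str, float]:
    # Hypothetical stand-in for a real model: confidence simply grows
    # with the number of words revealed so far.
    n_words = len(text.split())
    return "Paris", min(1.0, n_words / 10)


clues = ["This city hosts", "a famous iron tower", "on the river Seine in France"]
early = choose_buzz_point(clues, toy_guesser, threshold=0.65)
never = choose_buzz_point(clues, toy_guesser, threshold=1.1)
```

A lower threshold buzzes earlier on less evidence; a higher one waits for more clues and may never buzz at all.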
💤Quiet2607.08768·Jul 9, 2026·~12 mincs.CL
UniClawBench: A Universal Benchmark for Proactive Agents on Real-World Tasks
Zhekai Chen, Chengqi Duan, Kaiyue Sun, Bohao Li, +3
⭐ 0 stars / 0 repos📚 0 cites
ELI5A benchmark that tests AI agents on real-world computer tasks like filing documents or managing schedules. Instead of using fake answers, it runs tasks in live environments and checks if agents actually complete each step correctly.
Problem solvedExisting agent benchmarks use sandboxed fake setups and static answers, so they can't measure if agents actually work with real tools and handle multi-turn interactions. This makes it hard to debug why agents fail in the real world.
💤Quiet2607.08573·Jul 9, 2026·~13 mincs.AI
SHAP-Weighted Cross-Modal Expert Fusion for Emotion and Sentiment Recognition: Evidence and Limits
Adis Alihodzic, Selma Skopljakovic Hubljar
⭐ 0 stars / 0 repos📚 0 cites
ELI5When combining audio, video, and text to understand emotions, this paper shows that using SHAP explanations to decide which expert to trust works best when you preserve the total importance score across high- and low-dimensional modalities.
Problem solvedMultimodal emotion recognition struggles to balance modularity with cross-modal interaction—early fusion is accurate but rigid, late fusion is flexible but loses connections between modalities. This work shows how to weight different fusion strategies transparently.
💤Quiet2607.08497·Jul 9, 2026·~12 mincs.CVcs.AIcs.CL
Cognitive-structured Multimodal Agent for Multimodal Understanding, Generation, and Editing
Feng Wang, Canmiao Fu, Zhipeng Huang, Chen Li, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5A multimodal AI agent that remembers images and text from earlier in a conversation instead of re-reading everything every turn. It stores visual summaries in memory, retrieves what's relevant when needed, and decides what to do next—like a person who keeps notes instead of re-reading a whole book.
Problem solvedLong conversations with images get slow and unreliable because models stuff all prior images into context, causing token explosion and losing track of what was discussed. This agent solves it by selectively remembering only what matters, cutting inference time in half while improving accuracy.
💤Quiet2607.08489·Jul 9, 2026·~6 mincs.CVcs.AIcs.HC
VEGAS: Human-Aligned Video Caption Evaluation via Gaze
Shenghui Chen, Po-han Li, Ximeng Sun, Shijia Yang, +4
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of rating video captions the same way for everyone, this method uses eye-tracking data to pick captions that match what each individual viewer is actually looking at—like customizing summaries based on where someone's eyes go.
Problem solvedVideo captions today ignore what viewers actually pay attention to, making descriptions miss important details in people's focus area. This wastes the opportunity to personalize summaries and makes retrieval systems less accurate for real-world viewing patterns.
💤Quiet2607.08475·Jul 9, 2026·~10 mincs.LG
Frequency-Domain Multi-Modality Transportation Modeling
Jiewen Deng, Hangchen Liu, Junchen Li, Boyuan Zhang, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5This method predicts traffic and transit patterns by analyzing how different transportation modes (cars, buses, trains) interact with each other in the frequency domain—like tuning a radio to pick up the right signals from each mode instead of treating them all the same way.
Problem solvedTraffic forecasting systems struggle because different transportation modes have different rhythms and don't interact uniformly across time scales. Existing methods can't selectively learn which modes help predict each other at different frequencies, leading to poor predictions and wasted effort learning irrelevant cross-mode patterns.
💤Quiet2607.08470·Jul 9, 2026·~10 mincs.LG
MatBind: A Shared Embedding Space for Multimodal Materials Characterization
Le Yang, Anoop K. Chandran, Jona Östreicher, Evgenii Sovetkin, +7
⭐ 0 stars / 0 repos📚 0 cites
ELI5A system that learns to translate four different ways of describing materials (crystal structure, X-ray patterns, electronic properties, and text) into a shared language, so you can search for similar materials using any description and find matches across all the others.
Problem solvedMaterials scientists have to manually connect information about the same material stored in different formats and databases. This lets you query materials across those boundaries automatically—search by structure and find similar materials described in text, or vice versa.
💤Quiet2607.07708·Jul 8, 2026·~12 mincs.CLcs.AIcs.CE
Accurate, Interdisciplinary and Transparent Structure-property Understanding with Deep Native Structural Reasoning
Chen Tang, Yizhou Wang, Jianyu Wu, Lintao Wang, +25
⭐ 0 stars / 0 repos📚 0 cites
ELI5An AI model that reasons about proteins, molecules, and crystals by treating their 3D structures as discrete, inspectable building blocks, like assembling a transparent blueprint that shows *why* a prediction is correct, not just what the answer is.
Problem solvedScientists struggle to trust AI predictions in biology and chemistry because models operate as black boxes. This system makes structure-based reasoning transparent and explainable, so researchers can verify logic against known scientific principles instead of blindly trusting a number.
💤Quiet2607.07673·Jul 8, 2026·~14 mincs.CVcs.LG
MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models
Hyunjae Kim, Dain Kim, Pan Xiao, Serina S. Applebaum, +24
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers built an automated system to extract 11 million high-quality medical image-text pairs from scientific papers, then used this data to train better AI models for medical imaging tasks. The resulting models outperform existing medical AI systems on diverse medical benchmarks and real hospital data.
Problem solvedMedical AI models struggle because there's limited access to large, high-quality training data with images paired to accurate descriptions. Existing datasets from literature are messy and unreliable; this framework automatically cleans and validates them, making medical foundation models faster to build and more accurate.
💤Quiet2607.06565·Jul 7, 2026·~10 mincs.CVcs.AIcs.LG
ELSA3D: Elastic Semantic Anchoring for Unified 3D Understanding and Generation
Tianjiao Yu, Xinzhuo Li, Yifan Shen, Onkar Susladkar, +3
⭐ 0 stars / 0 repos📚 0 cites
ELI5A 3D AI model that understands both language and 3D shapes by treating them like different zoom levels—it smartly routes text descriptions to the right level of geometric detail instead of mixing everything together, making it faster and more accurate.
Problem solvedUnified 3D models struggle to align text and geometry effectively because they flatten both into one representation, losing detail and wasting compute. This approach bridges language and 3D structure intelligently, cutting inference time in half while improving results.
💤Quiet2607.06540·Jul 7, 2026·~13 mincs.CL
Hierarchical Acoustic-Semantic Modeling: Modality Separation and Semantic Coherence for Full-Duplex SLMs
Zhenyu Liu, Yunxin Li, Xuanyu Zhang, Qixun Teng, +9
⭐ 0 stars / 0 repos📚 0 cites
ELI5When AI tries to listen and speak at the same time using both audio and text, the two processes fight each other and make mistakes. This paper separates them into different neural pathways while keeping them coordinated, making the AI much better at real-time conversation.
Problem solvedFull-duplex spoken AI (listen-and-speak simultaneously) currently degrades in quality because audio processing and text understanding interfere with each other during training. This makes real-time conversational AI feel choppy and unintelligent.
💤Quiet2607.06531·Jul 7, 2026·~13 mincs.AIcs.LG
The Large Cancer Assistant (LCA): A Model-Agnostic Orchestration Framework for Scalable Clinical Decision Support in Oncology
Ghassen Marrakchi, Basarab Matei
⭐ 0 stars / 0 repos📚 0 cites
ELI5A flexible system that sits on top of any cancer diagnosis AI model, standardizing how patient data flows in and results flow out—so hospitals can swap AI tools without rebuilding the entire pipeline.
Problem solvedHospital cancer AI systems are locked into specific models and data formats, making it expensive to upgrade tools or integrate new ones. This framework decouples the plumbing from the AI engine, letting hospitals use any model and switch easily.
💤Quiet2607.02504·Jul 2, 2026·~8 mincs.CLcs.AIcs.CV
Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas
Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, +5
⭐ 0 stars / 0 repos📚 0 cites
ELI5When watching TV dramas, it's hard to figure out who's speaking in complex scenes. This paper trains an AI model that reasons through audio, text, and video clues together to accurately match voices to characters—like a detective piecing together clues from what it hears, sees, and knows about the story.
Problem solvedCurrent video understanding systems struggle to identify speakers in long TV shows, especially for short lines where voice recognition alone fails. This makes it hard to automatically caption, analyze, or index dramatic content accurately.
💤Quiet2607.02490·Jul 2, 2026·~9 mincs.CLcs.CV
Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
Liyan Tang, Fangcong Yin, Greg Durrett
⭐ 0 stars / 0 repos📚 0 cites
ELI5Vision-language models can get better at fixing their own mistakes by looking at images while they think through problems. This work teaches them to do this by showing them messed-up situations they have to recover from, making them actually use visual information instead of just talking about it.
Problem solvedVision-language models fail when images look different from training data because they don't properly use visual clues when correcting mistakes. Teams need models that can genuinely reference and learn from what they see, not just generate text about corrections.
💤Quiet2607.02371·Jul 2, 2026·~12 mincs.CVcs.AI
VisionAId: An Offline-First Multimodal Android Assistant for People with Visual Impairment, Featuring Personalized Object Retrieval
Cristian-Gabriel Florea, Stelian Spînu
⭐ 0 stars / 0 repos📚 0 cites
ELI5An Android app that lets blind and visually impaired people use their phone camera to identify objects around them, find their own belongings, and navigate—all running offline on the phone itself, with optional AI help for describing scenes.
Problem solvedVisually impaired people struggle with everyday navigation and object finding; existing apps need cloud connection, only recognize preset categories, or require special hardware. This works offline on any Android phone.
💤Quiet2607.02369·Jul 2, 2026·~5 mincs.CLcs.AI
World Wide Models: Literary Tools for Cultural AI
Nina Begus
⭐ 0 stars / 0 repos📚 0 cites
ELI5Literary scholars have techniques for understanding how texts carry cultural meaning across different languages and traditions—this paper argues those same techniques should guide how we build AI systems that work across cultures, not just optimize for one language.
Problem solvedMost large language models are trained primarily on English text and reflect Western perspectives, causing them to misunderstand or misrepresent other cultures. Literary analysis methods offer proven ways to spot these blind spots and build more culturally aware AI.
💤Quiet2606.30576·Jun 29, 2026·~10 mincs.CVcs.AI
Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization
Liyao Wang, Ruipu Wu, Haojun Xu, Lei Shi, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5A system that finds where a building or object is located in a satellite image by looking at a ground or drone photo of it — like reverse image search but in 3D, understanding camera angles and positions rather than just matching pixels.
Problem solvedFinding objects across different camera views (ground, drone, satellite) is hard because datasets are small and don't include camera geometry info. This paper provides a 220K+ image dataset with pose data and builds a model that works across view types without needing paired training data for each combination.
💤Quiet2606.30561·Jun 29, 2026·~8 mincs.AIcs.CVcs.HC
The Human Creativity Benchmark
Aspen Hopkins, Allison Nulty, Alexandria Minetti, Anoop Pakki, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of forcing experts to agree on a single score for creative work, this benchmark captures both where they agree (technical correctness) and where they genuinely disagree (aesthetic taste)—then shows that AI models need different strategies for each.
Problem solvedCreative AI evaluation today treats expert disagreement as noise and collapses it into one score, hiding whether a model fails at objective skills or just has a different aesthetic vision. Creators and teams need to know which is which.
💤Quiet2606.30374·Jun 29, 2026·~9 mincs.CVcs.AIcs.LG
Set-Inclusive Uncertainty Modeling for Robust Brain Tumor Segmentation
Seunghun Baek, Jihwan Park, Jaeyoon Sim, Hoseok Lee, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5When brain tumor scans are missing some imaging types, this method represents the model's knowledge as a range of possibilities (a distribution) rather than a single guess, so it can tell you when it's unsure versus confident about its predictions.
Problem solvedHospitals can't always acquire all MRI modalities due to cost, time, or equipment constraints. Existing segmentation models either fail or give wrong answers without acknowledging their uncertainty when data is missing—this approach quantifies that uncertainty so doctors know when to trust the output.
💤Quiet2606.30355·Jun 29, 2026·~11 mincs.CVcs.AI
Residual-Guided Expert Specialization for Incomplete Multimodal Learning
Seunghun Baek, Jihwan Park, Jaeyoon Sim, Minjae Jeong, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5When AI systems are missing some types of input data (like no audio in a video), this method trains separate specialist experts to handle each type of missing-data pattern, using clues from complete data during training to guide them.
Problem solvedReal-world systems often receive incomplete data at runtime—missing sensor inputs, corrupted modalities, or dropped streams. Existing methods struggle because representations shift when data is missing, and you can't rely on full data during deployment to fix it.
💤Quiet2606.30319·Jun 29, 2026·~9 mincs.CVcs.LG
BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language
Haitao Wu, Qirui Zhang, Zhouheng Yao, Shangquan Sun, +7
⭐ 0 stars / 0 repos📚 0 cites
ELI5A system that treats your brain like a language translator—it learns to convert between brain activity, images, and text in both directions using a single model, so you can decode what someone saw from their brain scan or predict brain patterns from pictures.
Problem solvedBrain imaging research is fragmented into separate encoding/decoding tasks with limited cross-modal understanding. This unified model enables bidirectional brain-to-image-to-text translation, making brain data more interpretable and opening new neuroscience applications without task-specific retraining.
💤Quiet2606.30296·Jun 29, 2026·~10 mincs.AI
ManimAgent: Self-Evolving Multimodal Agents for Visual Education
Wenjia Jiang, Zongyuan Cai, Yuanhang Shao, Chenru Wang, +6
⭐ 0 stars / 0 repos📚 0 cites
ELI5An AI agent learns to write code that creates math animations by remembering what worked and what failed across many tasks, building its own memory bank of successful examples and common mistakes without needing human guidance or model retraining.
Problem solvedAI agents typically forget lessons learned after each task ends, forcing them to re-solve similar problems from scratch. This wastes computation and prevents improvement across related tasks—especially for specialized domains like scientific visualization.
💤Quiet2606.30291·Jun 29, 2026·~10 mincs.AI
PromptGNN-sim: Deep Fusion and Alignment of GNN and LLMs for Text-Attributed Graph Learning
Zhifei Hu, Alexandra I. Cristea
⭐ 0 stars / 0 repos📚 0 cites
ELI5This method combines graph neural networks with large language models to better understand graphs where each node has text attached to it. Instead of processing text and graph structure separately, the two models talk to each other during learning—the graph model helps the language model focus on relevant text, and the language model helps the graph model understand semantic meaning.
Problem solvedText-attributed graphs (like citation networks where papers are connected) are hard to learn from because existing methods treat text and structure as independent inputs, missing opportunities for them to inform each other. This hurts performance when there aren't many connections or when applying a model to a new dataset.
💤Quiet2606.30256·Jun 29, 2026·~15 mincs.AIcs.CY
EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots
Camilo Chacón Sartori
⭐ 0 stars / 0 repos📚 0 cites
ELI5A testing system that checks if chatbots designed to help people in emotional crises actually stay safe and helpful across multiple languages and long conversations. It uses one AI to role-play someone needing help and another to grade how well the chatbot handles 19 different safety measures.
Problem solvedEmotional-support chatbots often fail in ways that simple, single-turn tests miss—especially across languages and real multi-turn conversations. This benchmark catches failures that fixed-prompt evaluations hide, and reveals that even identical conversations can produce wildly different responses from the same model.
💤Quiet2606.28274·Jun 26, 2026·~8 mincs.LGcs.AI
Parameter Efficient Hybrid Transformer (PEHT) for Network Traffic Prediction via Dynamic Urban Congestion Integration
Abdolazim Rezaei, Mehdi Sookhak, Mahboobeh Haghparast
⭐ 0 stars / 0 repos📚 0 cites
ELI5A smarter way to predict how crowded cell networks will be by combining phone traffic data with real-world city congestion patterns, using a lightweight transformer model that needs far fewer parameters to train.
Problem solvedTelecom companies struggle to allocate network resources efficiently because predicting demand requires understanding both user behavior and actual traffic congestion—existing models either miss urban context or are too expensive to run.
💤Quiet2606.28273·Jun 26, 2026·~10 mincs.CL
Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
Niclas Lietzow, Danielle Bitterman, Carsten Eickhoff, William Rudman, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers found the specific switches inside vision-language models that decide whether to trust what they see in an image or what they've memorized about the world. They discovered just a handful of attention heads (2-5%) act as gatekeepers controlling this choice.
Problem solvedVision-language models sometimes disagree with visual reality when their training data contradicts what's in the image. Understanding where this conflict happens and how to control it helps make multimodal AI more reliable and trustworthy for real applications.
💤Quiet2606.27330·Jun 25, 2026·~10 mincs.CLcs.AIcs.CV
Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning
Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5This paper trains smaller AI models to control websites and apps by having them explore environments themselves and learn from their mistakes, focusing on high-level task planning rather than low-level actions. A 7B model trained this way outperforms much larger models at generalizing to new websites.
Problem solvedSmall, cheap AI models are bad at breaking down complex GUI tasks and don't work well on websites they haven't seen before. This method fixes that by letting models learn from their own exploration and high-level task experiences, making them competitive with 32B models while being faster and cheaper.
💤Quiet2606.27187·Jun 25, 2026·~11 mincs.CVcs.CL
HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models
Jiajun Wu, Haoyu Kang, Yining Sun, Jiacheng Hou, +12
⭐ 0 stars / 0 repos📚 0 cites
ELI5A test for AI models that watch videos and decide if they're harmful. Instead of asking just yes/no, it asks deeper questions to see whether models understand *why* a video is harmful and whether they need surrounding context to judge it.
Problem solvedContent moderation AI often flags harmful videos for shallow reasons (one bad frame) rather than understanding real harm. This benchmark forces models to explain their reasoning and handle multi-layered context, catching shortcuts that fail in real moderation.
💤Quiet2606.27161·Jun 25, 2026·~11 mincs.AI
TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference
Tinghao Wang, Yichen Guo, Rui Huang, Zheng Lu, +10
⭐ 0 stars / 0 repos📚 0 cites
ELI5A technique that removes unnecessary image information before feeding it to multimodal AI models, keeping only the most important visual details relevant to the question being asked. This speeds up inference by 3-5x while maintaining accuracy.
Problem solvedMultimodal AI models process too many redundant image tokens, making them slow and expensive to run. Existing pruning methods either keep junk tokens or ignore what the user actually asked. TOPS cuts 75%+ of tokens without losing quality.
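To make the idea of question-aware pruning concrete, here is a minimal sketch that keeps the visual tokens most similar to a query vector. It is a generic cosine-similarity top-k filter, not the TOPS construction from the paper; `prune_tokens`, `keep_ratio`, and the toy vectors are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_tokens(
    tokens: List[Sequence[float]],
    query: Sequence[float],
    keep_ratio: float = 0.25,
) -> List[int]:
    """Return indices of the tokens to keep, in their original order."""
    k = max(1, round(len(tokens) * keep_ratio))
    scores = [cosine(t, query) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Four toy 2-D "visual tokens" and a query pointing along the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
kept_half = prune_tokens(tokens, query, keep_ratio=0.5)
kept_quarter = prune_tokens(tokens, query, keep_ratio=0.25)
```

With `keep_ratio=0.25` the sketch drops 75% of the tokens, matching the scale of reduction the summary mentions, though the selection rule here is deliberately simpler.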
💤Quiet2606.26079·Jun 24, 2026·~10 mincs.CLcs.CVcs.LG
Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models
Akshay Paruchuri, Sanmi Koyejo, Ehsan Adeli
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers tested whether AI models that process both text and images give the same answer when you rearrange the input order—like shuffling which image comes first or reordering answer choices. They found all 18 models tested fail this basic test, changing their answers 24-50% of the time depending on order.
Problem solvedCurrent AI benchmarks only test models once per question, missing a critical failure mode: models give different answers to the same question depending on input order. This hidden unreliability means evaluations overstate real-world reliability and make models look more consistent than they are.
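The order-sensitivity check described above can be sketched as a flip rate: the share of input orderings for which a model's answer differs from its answer under the original order. This is an illustrative sketch, not the paper's audit protocol; `flip_rate` and both toy models are hypothetical.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def flip_rate(options: Sequence[str], model: Callable[[List[str]], str]) -> float:
    """Fraction of orderings whose chosen option differs from the choice
    made under the original ordering."""
    baseline = model(list(options))
    orderings = list(permutations(options))
    flips = sum(1 for order in orderings if model(list(order)) != baseline)
    return flips / len(orderings)


def position_biased_model(opts: List[str]) -> str:
    # Toy model with pure position bias: always picks whatever comes first.
    return opts[0]


def order_invariant_model(opts: List[str]) -> str:
    # Toy model that picks by content only (alphabetically smallest option).
    return min(opts)


options = ["cat", "dog", "bird"]
biased_rate = flip_rate(options, position_biased_model)
invariant_rate = flip_rate(options, order_invariant_model)
```

A model that reads only the evidence should score 0.0; the position-biased toy model changes its answer in four of the six orderings.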
💤Quiet2606.26029·Jun 24, 2026·~12 mincs.CVcs.AI
TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs
Yu-Yang Chen, Lan-Zhe Guo
⭐ 0 stars / 0 repos📚 0 cites
ELI5A benchmark that tests how well AI vision models can understand 3D scenes by looking at them from multiple angles, with carefully controlled difficulty levels. It reveals that all tested models struggle similarly—they're great at simple tasks but fail badly when objects block each other or when they need to match objects across different camera views.
Problem solvedExisting vision benchmarks don't systematically test whether multimodal AI can reason about 3D structure and handle occlusion. This benchmark isolates why models fail at spatial reasoning tasks, showing the real problem isn't reasoning strategy but rather fundamental gaps in how models track objects across viewpoints.
💤Quiet2606.24849·Jun 23, 2026·~9 mincs.CVcs.AI
IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation
Zixuan Li, Haokun Lin, Yicheng Xiao, Zhiwei Li, +9
⭐ 0 stars / 0 repos📚 0 cites
ELI5This method helps image generators follow complex instructions better by having them think about the layout and structure of objects first (like sketching a plan), then fill in the details and colors—all without needing an actual sketch at inference time.
Problem solvedText-to-image models often miscount objects, get spatial relationships wrong, or fail to bind attributes correctly because they try to plan structure and render appearance at the same time. This separates those concerns into two stages.
💤Quiet2606.23679·Jun 22, 2026·~12 mincs.CVcs.AIcs.GR
Semantic Browsing: Controllable Diversity for Image Generation
Sara Dorfman, Maya Vishnevsky, Omer Dahary, Or Patashnik, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5Instead of asking an image generator to make random variations of an image, this method uses AI to systematically explore different meaningful design choices (like changing the scene's style, composition, or subject details) and organizes them into a browsable gallery you can explore.
Problem solvedText-to-image models generate boring variations: near-identical outputs that differ only by random noise. Users have no way to intentionally explore specific creative directions or understand what changed between images. This lets you browse structured variations where each change is a deliberate, interpretable choice.
💤Quiet2606.23678·Jun 22, 2026·~10 mincs.CVcs.AI
AIR: Adaptive Interleaved Reasoning with Code in MLLMs
Cong Han, Xiaohan Lan, Haibo Qiu, Yujie Zhong
⭐ 0 stars / 0 repos📚 0 cites
ELI5This paper teaches multimodal AI models to think step-by-step while writing and running code to solve math and numerical problems. It trains the model to decide when to use code tools adaptively, like a student choosing when to grab a calculator vs. working through logic.
Problem solvedExisting multimodal models struggle with numerical computation and rely on fixed rules for tool-use. This approach enables models to flexibly reason and compute answers to complex problems involving numbers and visual data simultaneously.
💤Quiet2606.20527·Jun 18, 2026·~11 mincs.CLcs.CV
StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5Researchers created a test showing that a few visual details like age, body type, and clothing style cause AI vision models to make biased social judgments about people—and just 15 of these details explain 80% of all the bias.
Problem solvedIt was hard to tell whether AI models' biased judgments came from seeing someone's actual identity or from surface-level appearance cues. This work isolates the specific visual details driving bias, making it possible to identify and fix the root causes.
💤Quiet2606.20523·Jun 18, 2026·~13 mincs.CVcs.AIcs.DB
SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm
Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, +1
⭐ 0 stars / 0 repos📚 0 cites
ELI5A new dataset pairing high-resolution radar images from space with matching optical photos and text descriptions. It's like a visual dictionary that teaches AI models to understand what radar and camera see in the same location—useful because radar works in clouds and darkness where cameras fail.
Problem solvedAI models trained on optical images alone struggle with radar data, which is crucial for all-weather satellite monitoring. This dataset fills that gap by providing aligned radar-optical-text triplets, enabling models to learn cross-modal understanding and work reliably regardless of weather or time of day.
💤Quiet2606.20477·Jun 18, 2026·~7 mincs.CVcs.CLcs.LG
Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, +3
⭐ 0 stars / 0 repos📚 0 cites
ELI5A medical imaging AI model learns to understand and describe CT and MR scans in multiple languages while also pointing to specific problem areas in the images—all trained on real hospital data without doctors having to manually label every location.
Problem solvedMedical AI systems struggle to explain their findings by pointing to specific regions in scans, and creating labeled training data requires expensive manual annotation by radiologists. This work enables spatial grounding at scale using automated techniques.
💤Quiet2606.20382·Jun 18, 2026·~10 mincs.LG
Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach
Zhengyu Wu, Hongchao Qin, Xunkai Li, Zekai Chen, +2
⭐ 0 stars / 0 repos📚 0 cites
ELI5When training AI models across multiple organizations that each have different types of data (images, text, etc.), some organizations might be missing certain data types entirely. This paper teaches the model to intelligently guess the missing data types by learning patterns from organizations that have them, without sharing raw data between organizations.
Problem solvedCompanies in federated learning setups often have incomplete data—some have images but no text, others vice versa. Existing solutions don't work well for graph-structured data across organizations. This paper fixes that by letting models synthesize missing modalities locally while keeping data private.
💤Quiet2606.20280·Jun 18, 2026·~13 mincs.IRcs.AI
ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval
Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, +7
⭐ 0 stars / 0 repos📚 0 cites
ELI5A system that helps AI models better understand complex search queries by learning to rank similar results differently rather than treating all non-matching results the same. It uses reinforcement learning rules instead of traditional reward models to improve multimodal search (text + images).
Problem solvedCurrent multimodal search systems treat all wrong answers equally, missing subtle differences in what a complex query is actually asking for. This causes them to miss relevant results when queries have multiple layers of meaning or specific details.
💤Quiet2606.19341·Jun 17, 2026·~10 mincs.CVcs.CLcs.SD
Native Active Perception as Reasoning for Omni-Modal Understanding
Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, +7
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Instead of passively watching an entire video, this system learns to actively decide which parts of the audio and video to attend to based on the question, like a smart student who skims the relevant sections instead of reading every word. This saves time and computation.
Problem solved: Long video understanding is expensive because models process every frame uniformly, making costs grow with video length. This system reduces that burden by selectively attending to relevant content, achieving better accuracy with less computation and enabling smaller models to outperform much larger ones.
💤 Quiet · 2606.19259 · Jun 17, 2026 · ~11 min · cs.CV cs.AI
A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2
Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: Researchers created a test set of 8,600+ fake images with text and layouts (like receipts, posters, screenshots) made by GPT-Image-2, then checked whether existing AI-detection tools could spot them. Most failed.
Problem solved: As AI image generators get better at creating realistic documents and structured designs with text, detecting these fakes becomes critical for fraud prevention and trust. Existing detection tools had been tested only on simple object photos, not text-heavy documents.
💤 Quiet · 2606.19256 · Jun 17, 2026 · ~11 min · cs.AI
X+Slides: Benchmarking Audience-Conditioned Slide Generation
Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, +5
⭐ 0 stars / 0 repos · 📚 0 cites
ELI5: A test suite that checks whether AI-generated slide decks match what different audiences actually need (executives want action items, researchers want proofs, etc.), rather than just measuring whether slides cover the source material.
Problem solved: Current slide generation tools don't tailor content to the audience, so users have to edit decks manually for each one. This benchmark checks whether systems actually deliver the information that matters to each viewer type.
💤 Quiet · 2606.11190 · Jun 9, 2026 · ~13 min · cs.LG
When to Align, When to Predict: A Phase Diagram for Multimodal Learning
Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, +1
⭐ 28 stars / 21 repos · 📚 0 cites
ELI5: Imagine you're trying to learn from two types of data (like images and text). Should you train them to look alike, or teach one to predict the other? This paper figures out which strategy works best depending on how much noise and correlation exist in your data.
Problem solved: Teams working with multiple data sources (medical scans + lab tests, telescope images + spectra) waste time trying standard multimodal methods that often fail. This framework lets them diagnose upfront which training approach is right, or whether combining data helps at all.
