What do these badges mean?
- 🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper, so people are building on it.
- 📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
- 💤 Quiet: Published but no notable signal yet. Most papers live here and could become anything later.
- 🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal; it is deferred until Twitter/X data is wired up.
- 💤 Quiet · 2605.18740 · May 18, 2026 · ~10 min · cs.CV cs.AI cs.CL
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, +3
⭐ 2 stars / 5 repos · 📚 0 cites
ELI5: A multimodal AI learns to spot fine details better by teaching itself. It is shown crops of important image regions and asked questions, and its answers on those crops then guide how it processes full images, helping it learn where to focus without outside supervision.
Problem solved: Multimodal AI models fail at detail-heavy tasks (like reading small text in images or spotting tiny objects) because they can't focus on relevant evidence in full images. This fix lets them learn where to look by studying their own successful answers on cropped details.
- 2605.18714 · May 18, 2026 · ~10 min · cs.CV cs.AI
Semantic Generative Tuning for Unified Multimodal Models
Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li
ELI5: A new training method that teaches AI models to be better at both understanding images and generating them by having them solve image segmentation tasks during training, which forces the model to learn structural layout information that helps both skills.
Problem solved: Multimodal models trained for both image understanding and generation develop misaligned internal representations, so each skill interferes with the other. This method aligns them by using segmentation as a bridge task that teaches useful structural knowledge.
- 2605.18678 · May 18, 2026 · ~9 min · cs.CV cs.AI
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, +9
ELI5: Lance is a single AI model that can understand images and videos and also create and edit them, like having one Swiss Army knife instead of separate tools. It uses a dual-pathway design where shared knowledge helps understanding and generation tasks strengthen each other.
Problem solved: Most multimodal models are good at either understanding or generation, not both, and they require huge scaling or awkward bolted-together designs. Lance addresses this by training one lightweight model that does all tasks well through task collaboration.
- 2605.18667 · May 18, 2026 · ~10 min · cs.CV cs.LG
Better Together: Evaluating the Complementarity of Earth Embedding Models
Thijs L van der Plas, Jacob JW Bakermans, Vishal Nedungadi, Gabrielė Tijūnaitytė, +2
ELI5: When you combine satellite image embeddings from different models, you get better results than using any single one alone. This paper measures how well different Earth observation models complement each other and shows that mixing them beats relying on just the best individual model.
Problem solved: Earth embedding models are typically evaluated separately, making it seem like one model is universally better. But the real value comes from combining multiple models, a capability that standard evaluation methods completely miss.
- 2605.18621 · May 18, 2026 · ~12 min · cs.CV cs.AI
CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark
Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, +3
ELI5: This work teaches AI models to understand scenes from multiple camera angles at once, like how you'd understand a room by looking at it from different corners. It provides a dataset, an evaluation benchmark, and a model that learns to match and reason about the same objects across different viewpoints.
Problem solved: Current AI vision models struggle when they need to reason about objects and their spatial relationships across multiple camera views. This matters for robotics, autonomous vehicles, and 3D scene understanding, where you can't rely on a single perspective.
- 2605.18547 · May 18, 2026 · ~12 min · cs.AI
VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation
Linan ZHU, Zihao Zhai, Xiao Han, Yuqian Fu, +3
ELI5: A system that watches videos of conversations and figures out how people feel by looking at the speaker's face and body language, without needing to retrain big vision models. It uses text and audio to help when the visual signals are unclear.
Problem solved: Current emotion recognition systems either miss non-verbal cues (text-only) or waste compute fine-tuning large models on irrelevant background details. This approach gets better emotion detection while being much cheaper to run.
- 2605.18430 · May 18, 2026 · ~8 min · cs.LG
Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation
Liang Wang, Heng Meng, Zekai Xiang, Jin Liu, +3
ELI5: A test suite for AI systems that turn written descriptions into 3D CAD design files. It includes 600 examples ranging from simple shapes to complex real-world parts, helping measure how well language models can actually create usable designs.
Problem solved: Current AI models can handle simple CAD generation but fail on complex designs with advanced features. Engineers and designers need a standardized way to measure whether AI can reliably convert natural language into production-ready parametric models across different industries.
- 2605.18257 · May 18, 2026 · ~7 min · cs.CV cs.AI cs.CL
CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook
Zeyu Chen, Jie Li, Kai Han
ELI5: CodeBind learns shared concepts across different types of data (text, images, audio, etc.) by splitting each data type into two parts: universal features everyone understands, and unique features that only matter for that data type. This lets the system align different modalities without needing every data type paired together.
Problem solved: Connecting different types of data (text, video, audio, 3D, thermal, etc.) is hard because they're fundamentally different and often don't have complete paired datasets. Strong modalities drown out weaker ones, and alignment spaces miss important unique details. CodeBind fixes this by handling missing data pairs and preserving each modality's distinctive properties.
- 2605.18221 · May 18, 2026 · ~13 min · cs.SD cs.CL cs.CV
SIREM: Speech-Informed MRI Reconstruction with Learned Sampling
Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, +5
ELI5: This system reconstructs blurry speech MRI videos by using the audio you hear to help fill in missing details. For example, it predicts where your tongue is based on what sound you're making, then combines that prediction with incomplete MRI data.
Problem solved: Real-time MRI of speech is too slow and blurry because you have to choose between speed and detail. By using the speech audio as a hint, this method reconstructs clearer vocal-tract images much faster, making it practical for clinics and research.
- 2605.18194 · May 18, 2026 · ~13 min · cs.AI cs.CV
Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
Yajing Zhou, Xiangyu Kong
ELI5: MLLMs can't tell what another agent believes about the world based on what that agent can actually see or hear from its position. This paper tests whether models can answer questions like 'if I'm behind you and make noise, what do you think my location is?', which requires them to think from someone else's perspective, sensory limits included.
Problem solved: Multi-agent AI systems need to predict what other agents believe, but current vision-language models rely on text-based reasoning that ignores physical constraints like occlusion and field-of-view. This breaks real robotic and simulation environments where agents must reason about others' limited perspectives.
- 2605.18188 · May 18, 2026 · ~12 min · cs.LG
UTOPYA: A Multimodal Deep Learning Framework for Physics-Informed Anomaly Detection and Time-Series Prediction
Robson W. S. Pessoa, Julien Amblard, Alessandra Russo, Idelfonso B. R. Nogueira
ELI5: A system that combines 8 different sensor readings to spot problems and predict future values in chemical batch processes, like distillation. It uses physics rules (smoothness, thermodynamics) to guide learning and works better than simpler methods, even with few examples of actual failures.
Problem solved: Batch processes like distillation are hard to monitor because you have many sensor types, few labeled failure examples, and things change over time unpredictably. Current anomaly detection misses problems because it ignores the physics and doesn't use all sensor data together effectively.
- 2605.18176 · May 18, 2026 · ~11 min · cs.CV cs.AI
MARS: Technical Report for the CASTLE Challenge at EgoVis 2026
Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, +3
ELI5: A system that answers questions about real-world activities by smartly choosing which of many data sources to look at (videos, transcripts, photos, eye gaze, heart rate) rather than trying to process everything at once.
Problem solved: Egocentric video understanding requires reasoning over massive amounts of multimodal data (4 days, 15 camera angles, multiple sensor streams). Models can't fit all of it in memory, so you need an agent that knows which evidence to pull and when to answer.
- 2605.18172 · May 18, 2026 · ~10 min · cs.AI
Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
Junyu Pan, Yansen Wang, Enze Zhang, Baoliang Lu, +2
ELI5: Instead of trying to convert brain signals (EEG) directly into words, researchers generate images from the brain data first, then feed those images into vision-language models. It's like translating brain activity through pictures instead of text, which preserves more detail.
Problem solved: EEG datasets with visual information are rare, forcing models to align brain signals with abstract text that loses perceptual details. This method recovers that lost information by generating visual proxies that let brain-understanding models leverage their existing visual knowledge.
- 2605.18160 · May 18, 2026 · ~11 min · cs.CV cs.AI
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, +2
ELI5: When AI models generate text about images, they often 'forget' the image as they write longer responses. This paper adds a lightweight module that keeps reminding the model to look back at the image during generation, like having a co-writer constantly point at the source material.
Problem solved: Multimodal AI models hallucinate and drift away from images in long-form outputs because visual information gets diluted among text tokens. This causes inaccurate descriptions and answers that don't match what's actually shown, limiting real-world reliability.
- 💤 Quiet · 2605.16258 · May 15, 2026 · ~9 min · cs.CV cs.AI cs.RO
IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation
Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, +3
⭐ 95 stars / 10 repos · 📚 0 cites
ELI5: A system that learns to understand 3D geometry from multiple 2D photos without knowing the camera positions, building a continuous 3D model that can render images, depth maps, and surface details from any angle.
Problem solved: Current 3D reconstruction methods require either precise camera poses or produce pixelated, discontinuous geometry. This approach reconstructs smooth, detailed 3D scenes from unposed images and handles multiple downstream tasks (rendering, depth, normals, pose estimation) with one model.
- 🚀 Shipping · 2605.16165 · May 15, 2026 · ~8 min · cs.CV cs.AI
Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
Yishun Lu, Wes Armour
⭐ 180 stars / 10 repos · 📚 0 cites
ELI5: When training AI models that handle both images and text together, the gradients from each modality fight each other during optimization. This paper uses a smarter optimizer that understands the geometry of gradients better, reducing that conflict and letting the model scale to bigger batches without falling apart.
Problem solved: Multimodal models struggle to train efficiently at large batch sizes because image and text tasks pull optimization in conflicting directions. This causes instability and wastes compute. The new optimizer solves this by using second-order information to balance the competing gradients, unlocking faster, more efficient training.