💤 Quiet · score 73.8 · May 15, 2026 · 2605.16258 · cs.CV, cs.AI, cs.RO

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

Narrative

IVGT replaces the standard approach of predicting explicit pointmaps (pixel-aligned 3D coordinate regression, as in DUSt3R/MASt3R) with a continuous implicit representation: signed distance values and colors are decoded from local features queried at arbitrary 3D positions in a canonical coordinate system, with the model trained jointly across multiple datasets. The system handles the full stack (mesh and point cloud reconstruction, novel view synthesis, depth and normal estimation, and camera pose estimation), all from unposed multi-view images, trained with 2D supervision plus 3D geometric regularization. The abstract gives no quantitative comparisons, so the claimed advantage over explicit geometry methods cannot be verified without reading the full paper.

No production traction yet. The GitHub references are all automated arXiv tracking pipelines with no meaningful implementation activity, citations are at zero, and no official code repository has surfaced. This is a very fresh preprint from Tsinghua (Zhou/Lu lab), which has a track record in this space, but IVGT is purely at the research stage right now.

Abstract

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D position, retrieving local features to predict signed distance function (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.
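
To make the "query an SDF at any 3D position, then render depth and normals" idea concrete, here is a minimal sketch. It is not IVGT's implementation: the analytic sphere in `sdf_sphere` is a hypothetical stand-in for the learned feature lookup and SDF decoder, and sphere tracing is a substituted classic technique for rendering from an SDF (the abstract does not say which renderer the paper uses). All function names, parameters, and the camera setup are assumptions made for illustration.

```python
import math

def sdf_sphere(p, center=(0.0, 0.0, 3.0), radius=1.0):
    """Hypothetical stand-in for a learned SDF decoder: signed distance to a sphere."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center))) - radius

def sphere_trace(sdf, origin, direction, max_steps=64, eps=1e-4, t_max=20.0):
    """March along a unit-length ray, stepping by the SDF value each time.

    Returns the ray distance t at the first surface hit, or None on a miss.
    """
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:      # close enough to the zero level set: surface hit
            return t
        t += dist           # safe step: no surface lies closer than dist
        if t > t_max:       # ray left the region of interest
            return None
    return None

def normal_at(sdf, p, h=1e-4):
    """Surface normal as the normalized central-difference gradient of the SDF."""
    grad = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad))
    return tuple(g / norm for g in grad)

def render_depth(sdf, width=3, height=3, focal=1.0):
    """Render a z-depth map from an assumed pinhole camera at the origin looking down +z."""
    origin = (0.0, 0.0, 0.0)
    depth = []
    for v in range(height):
        row = []
        for u in range(width):
            # Pixel centre mapped to [-1, 1] on the image plane at z = focal.
            x = (u + 0.5) / width * 2.0 - 1.0
            y = (v + 0.5) / height * 2.0 - 1.0
            n = math.sqrt(x * x + y * y + focal * focal)
            d = (x / n, y / n, focal / n)
            t = sphere_trace(sdf, origin, d)
            # Convert ray distance to depth along the optical axis.
            row.append(None if t is None else t * d[2])
        depth.append(row)
    return depth
```

With these defaults, `render_depth(sdf_sphere)` returns a 3×3 grid in which only the central ray hits the sphere, at a depth of about 2.0, and `normal_at(sdf_sphere, (0.0, 0.0, 2.0))` points back toward the camera along -z. Because the scene is a function rather than a fixed set of per-pixel points, the same `sdf` can be queried from any camera origin or at any resolution, which is the property the abstract contrasts with pixel-aligned pointmaps.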
