Efficient 2D Modality Fusion into Sparse Voxels for 3D Reconstruction

2025-02-21

This research presents an efficient 3D reconstruction method that fuses several 2D modalities (rendered depth, semantic segmentation results, and CLIP-style language features) into a pre-trained sparse voxel representation. The method follows the classical volume-fusion approach: per-view 2D observations are projected into the voxel grid and accumulated with a weighted running average, yielding a 3D sparse voxel field that carries depth, semantic, and language information. Examples use rendered depth for SDF-based mesh reconstruction, SegFormer for semantic segmentation, and RADIOv2.5 and LangSplat for vision and language feature extraction. Jupyter Notebook links are provided for reproducibility.
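To make the fusion step concrete, the classical running weighted average updates each voxel's feature F and accumulated weight W per view as F ← (W·F + wᵢ·fᵢ)/(W + wᵢ), W ← W + wᵢ. Below is a minimal NumPy sketch of this idea, not the authors' implementation: the `project` helper, the per-view `weight_map`, and the view dictionary layout are all assumptions made for illustration.

```python
import numpy as np

def project(xyz_world, K, world_to_cam):
    """Project world-space points into pixel coordinates; returns (u, v, depth)."""
    xyz_h = np.concatenate([xyz_world, np.ones((len(xyz_world), 1))], axis=1)
    xyz_cam = (world_to_cam @ xyz_h.T).T[:, :3]          # points in camera frame
    z = xyz_cam[:, 2]
    uv = (K @ xyz_cam.T).T[:, :2] / np.clip(z[:, None], 1e-8, None)
    return np.concatenate([uv, z[:, None]], axis=1)

def fuse_views_into_sparse_voxels(voxel_xyz, views, feat_dim):
    """Weighted running-average fusion of per-view 2D features into sparse voxels.

    voxel_xyz: (n, 3) world-space centers of the occupied (sparse) voxels.
    views: iterable of dicts with keys "K", "world_to_cam", "feature_map"
           (h, w, feat_dim), and "weight_map" (h, w) -- hypothetical layout.
    """
    n = voxel_xyz.shape[0]
    fused = np.zeros((n, feat_dim), dtype=np.float32)    # accumulated features F
    weights = np.zeros(n, dtype=np.float32)              # accumulated weights W

    for view in views:
        uvz = project(voxel_xyz, view["K"], view["world_to_cam"])
        u, v, z = uvz[:, 0].astype(int), uvz[:, 1].astype(int), uvz[:, 2]

        h, w = view["feature_map"].shape[:2]
        visible = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        idx = np.nonzero(visible)[0]

        # Sample the 2D modality (depth / semantics / language features).
        f = view["feature_map"][v[idx], u[idx]]          # (m, feat_dim)
        w_i = view["weight_map"][v[idx], u[idx]]         # (m,) per-pixel confidence

        # Classical running weighted average: F <- (W*F + w*f) / (W + w).
        W = weights[idx, None]
        fused[idx] = (W * fused[idx] + w_i[:, None] * f) / (W + w_i[:, None] + 1e-8)
        weights[idx] += w_i
    return fused, weights
```

In practice the per-pixel weight wᵢ might encode depth-based confidence or viewing angle, as in classical TSDF-style fusion; the same update applies unchanged whether the fused quantity is a signed distance, a semantic logit vector, or a language feature.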