SmallGS: Gaussian Splatting-based Camera Pose Estimation for Small-Baseline Videos

1University of Cambridge
2Meshcapade
PBVS Workshop - CVPR 2025

Abstract

Dynamic videos with small-baseline motion are ubiquitous in daily life, especially on social media. However, these videos challenge existing pose estimation frameworks because of ambiguous features, drift accumulation, and insufficient triangulation constraints. Gaussian splatting, which maintains an explicit scene representation, provides reliable novel-view rasterization when the viewpoint change is small.

Inspired by this, we propose SmallGS, a camera pose estimation framework designed specifically for small-baseline videos. SmallGS optimizes sequential camera poses using Gaussian splatting, which reconstructs the scene from the first frame of each video segment to provide a stable reference for the remaining frames. Because Gaussian splatting remains temporally consistent under limited viewpoint differences, it reduces the need for the sufficient depth variation that traditional camera pose estimation requires.

We further incorporate pretrained robust visual features, e.g. DINOv2, into Gaussian splatting, where rendering high-dimensional feature maps enhances the robustness of camera pose estimation. By freezing the Gaussian splatting and optimizing camera viewpoints based on rasterized features, SmallGS effectively learns camera poses without requiring explicit feature correspondences or strong parallax motion. We verify the effectiveness of SmallGS on small-baseline videos from the TUM-Dynamics sequences, where it achieves impressive camera pose accuracy compared to MonST3R and DROID-SLAM in dynamic scenes. Data and code will be released.

Pipeline

Our method follows the CF-3DGS pipeline, with camera poses estimated per video segment. Camera pose estimation proceeds as follows: 1) Use MonST3R to predict depth maps, confidence masks, and camera intrinsic parameters. 2) Lift the depth map of the first frame in the video segment to a dense point cloud, masking dynamic objects using the corresponding confidence mask, which serves as a semantic mask. 3) Initialize Gaussian splatting from this point cloud and update it based on the first frame of the segment. 4) Freeze the Gaussian splatting parameters and optimize the batched camera poses by minimizing the error between the feature maps rasterized under the learned camera poses and the corresponding feature maps extracted by DINOv2. The semantic masks are applied to both the rasterized and the extracted feature maps so that dynamic objects are excluded from the error.
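
As an illustration of steps 2 and 4, the following is a minimal, dependency-free sketch rather than the released implementation. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a boolean mask that is `True` on static pixels, and a mean L1 feature error; the function names `unproject_depth` and `masked_feature_loss`, the mask convention, and the choice of L1 are all hypothetical, since the text above only specifies "the error" between feature maps. A practical version would vectorize these loops on the GPU and backpropagate the loss to the camera poses through a differentiable rasterizer.

```python
def unproject_depth(depth, static_mask, fx, fy, cx, cy):
    """Lift a depth map to camera-frame 3D points (step 2, sketch).

    depth:       H x W nested list of depths
    static_mask: H x W nested list of bools, True where the pixel is kept
                 (assumed convention: dynamic objects are False)
    Returns a list of (x, y, z) tuples in row-major order.
    """
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, static_mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or z <= 0:
                continue  # skip dynamic pixels and invalid depths
            # Standard pinhole back-projection of pixel (u, v) at depth z.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def masked_feature_loss(rendered, target, static_mask):
    """Mean L1 error between two H x W x C feature maps (step 4, sketch).

    Only pixels where static_mask is True contribute, so dynamic objects
    are excluded from both the rendered and the target features.
    """
    total, count = 0.0, 0
    for r_row, t_row, m_row in zip(rendered, target, static_mask):
        for r_feat, t_feat, keep in zip(r_row, t_row, m_row):
            if keep:
                total += sum(abs(a - b) for a, b in zip(r_feat, t_feat))
                count += len(r_feat)
    return total / count if count else 0.0


# Toy 2 x 2 example with 2-channel features (values are made up).
depth = [[2.0, 2.0], [4.0, 0.0]]
mask = [[True, False], [True, True]]
points = unproject_depth(depth, mask, fx=1.0, fy=1.0, cx=0.5, cy=0.5)

rendered = [[[1.0, 2.0], [0.0, 0.0]], [[1.0, 1.0], [5.0, 5.0]]]
target = [[[1.0, 1.0], [9.0, 9.0]], [[0.0, 1.0], [5.0, 5.0]]]
loss = masked_feature_loss(rendered, target, mask)
```

In the toy example the masked pixel in the top-right corner contributes to neither the point cloud nor the loss, which mirrors how the semantic mask keeps dynamic objects out of both the scene reconstruction and the pose objective.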

Result Videos

Sequence 1

Sequence 2

Sequence 3

Sequence 4

BibTeX

@misc{yao2025smallgsgaussiansplattingbasedcamera,
  title={SmallGS: Gaussian Splatting-based Camera Pose Estimation for Small-Baseline Videos},
  author={Yuxin Yao and Yan Zhang and Zhening Huang and Joan Lasenby},
  year={2025},
  eprint={2504.17810},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.17810},
}