SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting


MICCAI 2025



1The Chinese University of Hong Kong, Hong Kong SAR, China, 2Shenzhen Research Institute, CUHK, Shenzhen, China,
3Technical University of Munich, Munich, Germany, 4University of Strasbourg & IHU Strasbourg, Strasbourg, France, 5University College London, London, United Kingdom
* Equal Contribution, Corresponding Authors



SurgTPGS achieves state-of-the-art performance in text-promptable 3D segmentation and reconstruction of surgical scenes, supporting real-time semantic 3D queries alongside novel-view rendering.

Abstract

In contemporary surgical research and practice, accurately comprehending 3D surgical scenes with text-promptable capabilities is particularly crucial for surgical planning and real-time intra-operative guidance, where precisely identifying and interacting with surgical tools and anatomical structures is paramount. However, existing works address surgical vision-language models (VLMs), 3D reconstruction, and segmentation separately, lacking support for real-time text-promptable 3D queries. In this paper, we present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap. We introduce a 3D semantic feature learning strategy that incorporates the Segment Anything Model (SAM) and state-of-the-art vision-language models. We extract segmented language features for 3D surgical scene reconstruction, enabling a deeper understanding of the complex surgical environment. We also propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing more precise reconstruction of both texture and semantics. Furthermore, we present semantic region-aware optimization, which uses region-based semantic information to supervise training, improving both reconstruction quality and semantic smoothness. We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods, highlighting its potential to revolutionize surgical practices.
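To make the first stage concrete, below is a minimal sketch of per-pixel language feature extraction from SAM masks. It assumes Meta's segment-anything package and OpenAI's CLIP (ViT-B/32); the exact models, crop strategy, and feature dimensionality used in SurgTPGS may differ, and pixel_language_features is a hypothetical helper for illustration only.

    # Per-pixel language features from SAM masks + CLIP image embeddings.
    # Assumptions (not from the paper): OpenAI CLIP ViT-B/32 and SAM vit_b;
    # SurgTPGS's actual models, crops, and feature dims may differ.
    import numpy as np
    import torch
    import torch.nn.functional as F
    import clip
    from PIL import Image
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").to(device)
    mask_gen = SamAutomaticMaskGenerator(sam)

    def pixel_language_features(image_rgb: np.ndarray) -> torch.Tensor:
        """Return an (H, W, 512) map with one CLIP feature per SAM mask."""
        h, w = image_rgb.shape[:2]
        feat_map = torch.zeros(h, w, 512)          # 512 = ViT-B/32 embed dim
        for m in mask_gen.generate(image_rgb):     # image_rgb: (H, W, 3) uint8
            x, y, bw, bh = m["bbox"]               # XYWH box around the region
            crop = Image.fromarray(image_rgb[y:y + bh, x:x + bw])
            with torch.no_grad():
                f = clip_model.encode_image(
                    clip_preprocess(crop).unsqueeze(0).to(device))
            f = F.normalize(f.float(), dim=-1).cpu()[0]
            feat_map[torch.from_numpy(m["segmentation"])] = f  # splat onto mask
        return feat_map

Overlapping SAM masks are resolved naively here by letting later masks overwrite earlier ones; in the paper's pipeline, such 2D feature maps supervise the semantic features attached to the 3D Gaussians during reconstruction.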


Architecture

Illustration of the proposed SurgTPGS. We first extract semantic embeddings from SAM and a VLM; then, we train the deformable Gaussians with semantic-aware deformation tracking and semantic region-aware optimization. SurgTPGS supports real-time semantic 3D query and novel-view rendering simultaneously.
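As a companion sketch, the real-time text-promptable 3D query can be viewed as cosine-similarity scoring between a CLIP text embedding and per-Gaussian semantic features. The function below is a hypothetical illustration, assuming unit-normalized (N, 512) features gaussian_feats learned during reconstruction; SurgTPGS's actual relevancy scoring and thresholding may differ.

    # Text-promptable 3D query: score each Gaussian's learned semantic
    # feature against a CLIP text embedding (hypothetical sketch; the
    # paper's relevancy scoring and threshold may differ).
    import torch
    import torch.nn.functional as F
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)

    @torch.no_grad()
    def query_gaussians(prompt: str, gaussian_feats: torch.Tensor,
                        threshold: float = 0.25) -> torch.Tensor:
        """gaussian_feats: (N, 512) unit-norm semantic feature per Gaussian.
        Returns a boolean mask over Gaussians matching the text prompt."""
        t = clip_model.encode_text(clip.tokenize([prompt]).to(device)).float()
        t = F.normalize(t, dim=-1)                        # (1, 512)
        sims = gaussian_feats.to(device) @ t.T            # cosine similarity
        return sims.squeeze(-1) > threshold               # hard threshold

    # Usage: keep only instrument Gaussians before rasterizing a view, e.g.
    #   mask = query_gaussians("surgical forceps", gaussian_feats)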


Visualization

Text-Prompt 3D Segmentation


Qualitative results on the CholecSeg8K and EndoVis18 datasets. Our method outperforms state-of-the-art baseline methods.


Results on CholecSeg8K.


Results on EndoVis18.

BibTeX


@misc{huang2025surgtpgssemantic3dsurgical,
  title={SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting},
  author={Yiming Huang and Long Bai and Beilei Cui and Kun Yuan and Guankun Wang and Mobarakol Islam and Nicolas Padoy and Nassir Navab and Hongliang Ren},
  year={2025},
  eprint={2506.23309},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2506.23309},
  publisher={arXiv},
}