To get started, create a virtual environment using the provided environment.yml file:
git clone https://github.com/ngailapdi/SplatTalk.git
cd SplatTalk
conda env create -f environment.yml
conda activate splattalk
This environment should work for systems with CUDA 12.X.
Troubleshooting
The Gaussian splatting CUDA code (diff-gaussian-rasterization) must be compiled with the same CUDA version that PyTorch was built against. If your system does not use CUDA 12.X by default, you can try the options below.
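First, it may help to confirm which CUDA version your PyTorch build was compiled against. The snippet below is a minimal check and only assumes that torch is importable:
# Print the CUDA version this PyTorch build targets and whether a GPU is visible
import torch
print("PyTorch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)        # e.g. '12.1' or '11.8'
print("CUDA available: ", torch.cuda.is_available())
If the reported CUDA version does not match the toolkit installed on your system, one of the options below should resolve the mismatch: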
- Install a version of PyTorch that was built using your CUDA version. For example, to get PyTorch with CUDA 11.8, use the following command (more details here):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- Install CUDA Toolkit 12.X on your system. One approach (try this at your own risk!) is to install a second CUDA Toolkit version using the runfile (local) option. For instance, to install CUDA Toolkit 12.1, download from here. When you run the installer, disable the options that install GPU drivers and update the default CUDA symlinks. If you do this, you can point your system to CUDA 12.1 during installation as follows:
LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64 pip install -r requirements.txt
# If everything else was installed but you're missing diff-gaussian-rasterization, do:
LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64 pip install git+https://github.com/dcharatan/diff-gaussian-rasterization-modified
SplatTalk is trained using scenes from ScanNet.
The downloaded dataset under datasets/ should look like:
datasets
├─ scannet
│  ├─ train
│  │  ├─ sceneXXXX_XX
│  │  │  ├─ color (RGB images)
│  │  │  ├─ depth (depth images)
│  │  │  ├─ intrinsic (intrinsics)
│  │  │  └─ extrinsics.npy (camera extrinsics)
│  │  ├─ sceneYYYY_YY
│  │  ├─ ...
│  ├─ test
│  │  ├─
│  │  ├─ ...
│  ├─ train_idx.txt (training scenes list)
│  └─ test_idx.txt (testing scenes list)
└─
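Once the data is in place (including the extrinsics.npy files produced by the conversion step below), you can quickly check that each scene folder matches the layout above. The snippet below is a small sketch, not part of the repository, and assumes the directory and file names shown in the tree:
# Sketch: report training scenes that are missing any of the expected entries
import os

root = os.path.join("datasets", "scannet", "train")
required = ["color", "depth", "intrinsic", "extrinsics.npy"]

for scene in sorted(os.listdir(root)):
    scene_dir = os.path.join(root, scene)
    if not os.path.isdir(scene_dir):
        continue
    missing = [name for name in required if not os.path.exists(os.path.join(scene_dir, name))]
    if missing:
        print(f"{scene}: missing {missing}")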
To obtain extrinsics.npy from the raw ScanNet data, run
python convert_poses.py
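For reference, extrinsics.npy presumably aggregates the per-frame camera poses of a scene into a single array. The sketch below only illustrates that general idea; it is not the repository script, it assumes the raw ScanNet export stores each frame's 4x4 camera-to-world pose as pose/<frame_id>.txt (the format produced by the official SensReader), and the array layout actually expected by the code may differ:
# Illustrative sketch (not convert_poses.py): stack per-frame 4x4 pose matrices into one array
import os
import numpy as np

def collect_poses(scene_dir):
    pose_dir = os.path.join(scene_dir, "pose")
    frame_ids = sorted(int(f.split(".")[0]) for f in os.listdir(pose_dir) if f.endswith(".txt"))
    poses = [np.loadtxt(os.path.join(pose_dir, f"{i}.txt")) for i in frame_ids]
    return np.stack(poses, axis=0)   # shape: (num_frames, 4, 4)

# e.g. np.save(os.path.join(scene_dir, "extrinsics.npy"), collect_poses(scene_dir))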
Pre-trained weights for the self-supervised/zero-shot model can be found here.
Pre-trained weights for the autoencoder can be found here.
The main entry point is src/main.py. To train on 100 views, run the following command:
python -m src.main +experiment=scannet/fvt +output_dir=train_fvt_full_100v
You can modify the number of training views with the following command (replace XX with your desired number of views):
python -m src.main +experiment=scannet/fvt +output_dir=train_fvt_full_100v dataset.view_sampler.num_context_views=XX
The output will be saved in outputs/<output_dir>.
We trained our model using one H100 GPU for 7 days.
To evaluate a pre-trained model in the [N]-view setting on [DATASET], you can run:
python -m src.main +experiment=scannet/fvt +output_dir=[OUTPUT_PATH] mode=test dataset/view_sampler=evaluation checkpointing.load=[PATH_TO_CHECKPOINT] dataset.view_sampler.num_context_views=[N]
Please refer to the SplatTalk-LLaVA-Inference codebase for instructions.
If you find our work helpful, please consider citing our paper. Thank you!
@article{thai2025splattalk,
  title={SplatTalk: 3D VQA with Gaussian Splatting},
author={Thai, Anh and Peng, Songyou and Genova, Kyle and Guibas, Leonidas and Funkhouser, Thomas},
journal={arXiv preprint arXiv:2503.06271},
year={2025}
}
Our code is largely based on FreeSplat. Thanks for their great work!