
Inconsistent Outputs: Pretrained Weights (28 Feb 2025 vs. 12 Jun 2024) #128

@DengKaiCQ


Hi, my friend.

Your metric depth estimation model performs very well on large-scale outdoor sequences such as KITTI. My recent project GigaSLAM uses the UniDepth v2 module and achieves really good performance on such sequences. Recently I noticed that UniDepth v2 has received several updates on both GitHub and the Hugging Face Hub.

However, as mentioned in this link, the latest version of the code and pretrained weights behaves differently from the old version. GitHub commit bebc4b2 on 10 Mar 2025 changed 40 files and about 1.5k lines, which is quite a big update, and I suspect it introduced some unnoticed changes that affect performance.

To elaborate, I tested the latest UniDepth code (8d8cfe4 on GitHub) with the latest pretrained weights (52b349b on Hugging Face) against the older version (code cloned on 04 Nov 2024, most likely bebc4b2 on GitHub, with weights 1d0d3c5 on Hugging Face). Both were run on KITTI Sequence 06 (~1100 images), and the outputs differ significantly.

The following code was used in this small experiment (latest unidepth-v2-vitl14, 8d8cfe4 on GitHub and 52b349b on Hugging Face):

import os
import time
import numpy as np
import torch
from PIL import Image
from unidepth.models import UniDepthV2
from tqdm import tqdm
import matplotlib.pyplot as plt

snapshot_dir = "lpiccinelli/unidepth-v2-vitl14"  
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniDepthV2.from_pretrained(snapshot_dir)
model = model.to(device)

fx, fy, cx, cy = 707.0912, 707.0912, 601.8873, 183.1104  # kitti 04-10
intrinsics = torch.tensor([[fx, 0, cx],
                           [0, fy, cy],
                           [0, 0, 1]], dtype=torch.float32).to(device)

image_dir = "/media/deng/Data/KITTIdataset/data_odometry_color/dataset/sequences/06/image_2"
output_dir = "./output"
os.makedirs(output_dir, exist_ok=True)

min_depths = []
median_depths = []
mean_depths = []
max_depths = []
frame_ids = []

image_files = [f for f in os.listdir(image_dir) if f.endswith(".png")]
image_files.sort()

for i, filename in enumerate(tqdm(image_files, desc="Processing images", unit="image")):
    image_path = os.path.join(image_dir, filename)

    rgb = Image.open(image_path).convert("RGB")
    rgb = torch.from_numpy(np.array(rgb)).permute(2, 0, 1).to(device)  # C, H, W

    begin = time.time()
    predictions = model.infer(rgb, intrinsics)
    depth = predictions["depth"].cpu().detach().numpy()
    end = time.time()
    
    min_depth = depth.min()
    median_depth = np.median(depth)
    mean_depth = depth.mean()
    max_depth = depth.max()
    
    min_depths.append(min_depth)
    median_depths.append(median_depth)
    mean_depths.append(mean_depth)
    max_depths.append(max_depth)
    frame_ids.append(i)
    
    print(f'Frame {i}: inference time: {(end - begin) * 1000:.2f} ms')
    print(f'min: {min_depth:.2f}, median: {median_depth:.2f}, mean: {mean_depth:.2f}, max: {max_depth:.2f}')

global_min_stats = {
    'min': np.min(min_depths),
    'median': np.median(min_depths),
    'mean': np.mean(min_depths),
    'max': np.max(min_depths)
}

global_median_stats = {
    'min': np.min(median_depths),
    'median': np.median(median_depths),
    'mean': np.mean(median_depths),
    'max': np.max(median_depths)
}

global_mean_stats = {
    'min': np.min(mean_depths),
    'median': np.median(mean_depths),
    'mean': np.mean(mean_depths),
    'max': np.max(mean_depths)
}

global_max_stats = {
    'min': np.min(max_depths),
    'median': np.median(max_depths),
    'mean': np.mean(max_depths),
    'max': np.max(max_depths)
}

plt.figure(figsize=(24, 6))

plt.subplot(1, 4, 1)
plt.scatter(frame_ids, min_depths, color='blue', s=10)
plt.xlabel('frame id')
plt.ylabel('Depth Value')
plt.title(f"Minimum Depth\nmin={global_min_stats['min']:.2f} med={global_min_stats['median']:.2f}\nmean={global_min_stats['mean']:.2f} max={global_min_stats['max']:.2f}")
plt.grid(True)

plt.subplot(1, 4, 2)
plt.scatter(frame_ids, median_depths, color='green', s=10)
plt.xlabel('frame id')
plt.ylabel('Depth Value')
plt.title(f"Median Depth\nmin={global_median_stats['min']:.2f} med={global_median_stats['median']:.2f}\nmean={global_median_stats['mean']:.2f} max={global_median_stats['max']:.2f}")
plt.grid(True)

plt.subplot(1, 4, 3)
plt.scatter(frame_ids, mean_depths, color='orange', s=10)
plt.xlabel('frame id')
plt.ylabel('Depth Value')
plt.title(f"Mean Depth\nmin={global_mean_stats['min']:.2f} med={global_mean_stats['median']:.2f}\nmean={global_mean_stats['mean']:.2f} max={global_mean_stats['max']:.2f}")
plt.grid(True)

plt.subplot(1, 4, 4)
plt.scatter(frame_ids, max_depths, color='red', s=10)
plt.xlabel('frame id')
plt.ylabel('Depth Value')
plt.title(f"Maximum Depth\nmin={global_max_stats['min']:.2f} med={global_max_stats['median']:.2f}\nmean={global_max_stats['mean']:.2f} max={global_max_stats['max']:.2f}")
plt.grid(True)

plt.tight_layout()

plot_path = os.path.join(output_dir, 'depth_statistics_4plots.png')
plt.savefig(plot_path, dpi=300, bbox_inches='tight')
plt.close()

print(f"Depth statistics plot saved to {plot_path}")

The older version (bebc4b2 on GitHub, I guess, and 1d0d3c5 on Hugging Face) uses the same experiment code, changing only the lines below:

...

snapshot_dir = "lpiccinelli/unidepth-v2-vitl14"  
commit_hash = "1d0d3c52f60b5164629d279bb9a7546458e6dcc4"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniDepthV2.from_pretrained(
    snapshot_dir,
    revision=commit_hash
)
model = model.to(device)

...
(the rest remains the same)
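
To make sure later runs cannot silently pick up newer weights, the pinned revision can also be cached locally via huggingface_hub (an optional sketch; only snapshot_download with a repo id and revision is assumed here):

from huggingface_hub import snapshot_download

# Download the pinned weights revision once and keep a local copy,
# so future runs keep using exactly these files.
local_dir = snapshot_download(
    "lpiccinelli/unidepth-v2-vitl14",
    revision="1d0d3c52f60b5164629d279bb9a7546458e6dcc4",
)
print(f"Pinned weights cached at: {local_dir}")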

The images in KITTI Seq 06 are typical car-driving scenes.

The scale of the depth outputs should remain consistent across the entire sequence.
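
A simple way to check this, reusing the median_depths list collected by the script above, is to track each frame's median depth relative to frame 0 (only a rough indicator, since scene content also varies along the sequence):

# Rough scale-consistency check: ratio of each frame's median depth to frame 0's.
# For a metrically consistent model this ratio should hover around 1.0 with some noise,
# not trend steadily downward as the frame id increases.
scale_ratio = np.array(median_depths) / median_depths[0]
print(f"scale ratio: mean={scale_ratio.mean():.3f}, std={scale_ratio.std():.3f}, "
      f"first-to-last change={scale_ratio[-1] - scale_ratio[0]:+.3f}")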

You can see the outputs of this small experiment below, along with the point clouds generated from each depth output and the GT poses.
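
For reference, the point clouds are built by back-projecting each depth map with the camera intrinsics and then transforming the points with the corresponding GT pose; a simplified sketch of that standard back-projection (variable names are illustrative):

import numpy as np

def depth_to_world_points(depth, K, pose):
    """depth: (H, W) metric depth, K: (3, 3) intrinsics, pose: (4, 4) camera-to-world GT pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]   # back-project pixels to camera coordinates
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    pts_world = (pose @ pts_cam.T).T[:, :3]  # move into the world frame using the GT pose
    return pts_world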

  1. Latest version (unidepth-v2-vitl14, 8d8cfe4 on GitHub and 52b349b on Hugging Face):

As the frame_id of the input image sequence increases, you can clearly see that the depth scale decreases. This leads to inconsistent depth scales across different frames within the same sequence. This phenomenon can be seen more intuitively in the point cloud visualizations.

[Images: per-frame depth statistics plots and point cloud visualization for the latest version]

  2. Older version (unidepth-v2-vitl14, bebc4b2 on GitHub and 1d0d3c5 on Hugging Face):

Although the statistics graph appears slightly noisy, the depth scale remains consistent. In the corresponding point cloud built with the GT poses, this version performs very well on the sequence: the cars and buildings are clearly identifiable.

[Images: per-frame depth statistics plots and point cloud visualization for the older version]

Based on this small experiment, it appears that there are some undocumented differences between the older and newer versions of UniDepth v2.

I wonder if you could shed some light on the specific changes that might affect the consistency of the depth scale across frames. I'm not sure whether this is intended behavior or just an edge case in my setup. Were there any known modifications to the scale estimation, the training procedure, or the post-processing logic that could explain it? For now, a quick workaround is to use the older version of UniDepth v2.

To sum up, the scale consistency of 1d0d3c5 works perfectly for robotics applications, while the new version 52b349b exhibits scale drift that may break downstream tasks.

I really appreciate your work on UniDepth and look forward to your insights!
