Commit 6d227b4

Merge pull request #32 from chanzuckerberg/vcp-dataio
feat: vcp dataio
2 parents eb84fec + 6016555 commit 6d227b4

29 files changed

+1449
-558
lines changed

docs/api/quickstart2d.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,8 @@ This quickstart guide shows you how to use SABER's API to segment 2D micrographs
1515
Before starting, ensure you have SABER installed and import the necessary modules. SABER supports various file formats commonly used in microscopy:
1616
```python
1717
from saber.segmenters.micro import cryoMicroSegmenter
18-
from saber.classifier.models import common
1918
from saber.visualization import classifier as viz
19+
from saber.classifier.models import common
2020
from saber.utils import io
2121
import numpy as np
2222
import torch

docs/assets/saber_gui.png

-966 KB

docs/getting-started/import-tomos.md

Lines changed: 7 additions & 8 deletions
@@ -77,16 +77,15 @@ The `copick config filesystem` command assumes local paths, but you can edit the
7777

7878
</details>
7979

80-
<details markdown="1">
81-
<summary><strong> 💡 Understanding the `--objects` flag</strong></summary>
80+
</details>
8281

83-
The `--objects` flag accepts 2-4 elements separated by commas:
82+
!!! info
83+
The `--objects` flag accepts 2-4 elements separated by commas:
8484

85-
1. **Particle name** (required): e.g., `ribosome`
86-
2. **Is pickable** (required): `True` for particles, `False` for continuous segmentations
87-
3. **Particle radius** (optional): in Ångströms, e.g., `130`
88-
4. **PDB ID** (optional): reference structure, e.g., `6QZP`
89-
</details>
85+
1. **Particle name** (required): e.g., `ribosome`
86+
2. **Is pickable** (required): `True` for particles, `False` for continuous segmentations
87+
3. **Particle radius** (optional): in Ångströms, e.g., `130`
88+
4. **PDB ID** (optional): reference structure, e.g., `6QZP`
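
As a rough illustration of the 2-4 element format described above, the sketch below parses one `--objects` value into its fields. `ObjectSpec` and `parse_objects_flag` are hypothetical names introduced here for illustration; they are not taken from copick or SABER, and the real CLI may validate its input differently.

```python
from typing import NamedTuple, Optional


class ObjectSpec(NamedTuple):
    """Hypothetical container for one parsed --objects value."""
    name: str
    is_pickable: bool
    radius: Optional[float] = None  # Angstroms, optional
    pdb_id: Optional[str] = None    # reference structure, optional


def parse_objects_flag(value: str) -> ObjectSpec:
    """Split a comma-separated value into name, pickable flag, radius, PDB ID."""
    parts = [p.strip() for p in value.split(",")]
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2-4 comma-separated elements, got {len(parts)}")
    if parts[1] not in ("True", "False"):
        raise ValueError("second element must be 'True' or 'False'")
    radius = float(parts[2]) if len(parts) > 2 and parts[2] else None
    pdb_id = parts[3] if len(parts) > 3 and parts[3] else None
    return ObjectSpec(parts[0], parts[1] == "True", radius, pdb_id)


# A pickable particle with radius and PDB ID, and a continuous segmentation.
ribosome = parse_objects_flag("ribosome,True,130,6QZP")
membrane = parse_objects_flag("membrane,False")
```

The object name `membrane` is only an example of a non-pickable target; any name accepted by your copick project would work the same way.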
9089

9190
This structure supports both particle picking for sub-tomogram averaging and broader 3D segmentation tasks. Our deep learning platform [Octopi 🐙](https://github.com/chanzuckerberg/octopi) is designed to train models from copick projects for:
9291

docs/getting-started/quickstart.md

Lines changed: 26 additions & 16 deletions
@@ -15,62 +15,72 @@ For reference, you can skip steps 1 and 2 to visualize the raw SAM2 segmentation
1515
## 🧩 Phase 1: Curating Training Labels and Training a Domain Expert Classifier
1616

1717
### Producing Initial SAM2 Segmentations
18-
Use `prepare-tomogram-training` to generate 2D segmentations from a tomogram using SAM2-style slab-based inference. These masks act as a rough initialization for downstream curation and model training.
18+
Use `prep3d` to generate 2D segmentations from a tomogram using SAM2-style slab-based inference. These masks act as a rough initialization for downstream curation and model training.
1919

2020
#### For tomogram data:
2121
```bash
22-
saber classifier prepare-tomogram-training \
22+
saber classifier prep3d \
2323
--config config.json \
2424
--voxel-size 10 --tomo-alg denoised \
25-
--num-slabs 3 --output training_data.zarr \
25+
--num-slabs 3 --output training.zarr
2626
```
2727
This will save slab-wise segmentations in a Zarr volume that can be reviewed or refined further.
2828

2929
#### For electron micrograph/single-particle data:
3030
```bash
31-
saber classifier prepare-micrograph-training \
31+
saber classifier prep2d \
3232
--input path/to/folder/*.mrc \
33-
--ouput training_data.zarr \
33+
--output training.zarr \
3434
--target-resolution 10
3535
```
3636

37-
In the case of referencing MRC files from single particle datasets use `prepare-micrograph-training` instead.
37+
When referencing MRC files from single-particle datasets, use `prep2d` instead.
3838

3939
### 🎨 Annotating Segmentations for the Classifier with the Interactive GUI
4040

4141
Launch an interactive labeling session to annotate the initial SAM2 segmentations and assign class labels.
4242
```
43-
saber gui \
44-
--input output_zarr_fname.zarr \
45-
--output curated_labels.zarr \
46-
--class-names carbon,lysosome,artifacts
43+
saber gui --input training.zarr
4744
```
4845

49-
For transfering the data between machines, its recommended ziping (compressing) the zarr file prior to data transfer (e.g. `zip -r curated_labels.zarr.zip curated_labels.zarr`).
46+
When transferring data between machines, we recommend zipping (compressing) the Zarr file first (e.g. `zip -r training.zarr.zip training.zarr`).
5047

51-
Once annotations are complete, split the dataset into training and validation sets:
48+
After you download the annotated JSON file, you can apply the annotations to the original Zarr file.
49+
50+
```bash
51+
saber classifier labeler \
52+
--input training.zarr \
53+
--labels labels.json \
54+
--classes class1,class2,class3 \
55+
--output labeled.zarr
56+
```
57+
58+
Once the labeled zarr is available, split the dataset into training and validation sets:
5259

5360
```
5461
saber classifier split-data \
55-
--input curated_labels.zarr \
62+
--input labeled.zarr \
5663
--ratio 0.8
5764
```
58-
This generates `curated_labels_train.zarr` and `curated_labels_val.zarr` for use in model training.
65+
This generates `labeled_train.zarr` and `labeled_val.zarr` for use in model training.
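
To make the `--ratio 0.8` setting concrete, here is a minimal sketch of a ratio-based split over run IDs. It assumes the split is made per run with a seeded shuffle; that is an assumption for illustration, not a description of how `split-data` is implemented, and `split_run_ids` is a hypothetical helper.

```python
import random


def split_run_ids(run_ids, ratio=0.8, seed=0):
    """Shuffle run IDs reproducibly and split them into train/validation lists."""
    ids = sorted(run_ids)                # sort first so the seed fully determines the order
    random.Random(seed).shuffle(ids)     # local RNG, does not touch global random state
    cut = int(round(len(ids) * ratio))   # number of runs assigned to training
    return ids[:cut], ids[cut:]


# Ten hypothetical runs with ratio 0.8 give 8 training and 2 validation runs.
train_ids, val_ids = split_run_ids([f"run_{i:03d}" for i in range(10)], ratio=0.8)
```

Splitting by run rather than by individual mask keeps all masks from one image on the same side of the split, which is one common way to avoid leakage between training and validation data.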
66+
67+
!!! info "Learn More"
68+
For detailed annotation instructions, see the [Annotation and Labeling](../tutorials/preprocessing.md#-annotation-with-the-saber-gui) section.
5969

6070
## 🧠 Phase 2: Train a Domain Expert Classifier
6171

6272
Train a classifier using your curated annotations. This model improves segmentation accuracy beyond zero-shot results by learning from expert-provided labels.
6373
```
6474
saber classifier train \
65-
--train curated_labels_train.zarr --validate curated_labels_val.zarr \
75+
--train labeled_train.zarr --validate labeled_val.zarr \
6676
--num-epochs 75 --num-classes 4
6777
```
6878
The number of classes should be 1 greater than the number of class names provided during annotation (to account for background).
6979
Training logs, model weights, and evaluation metrics will be saved under `results/`.
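
The class-count rule above can be checked with a few lines of Python. This is only a sketch of the arithmetic (class names plus one background class); `num_classes_for` is a hypothetical helper, not a SABER function.

```python
def num_classes_for(class_names: str) -> int:
    """Return the value to pass as --num-classes for a comma-separated class list."""
    names = [n.strip() for n in class_names.split(",") if n.strip()]
    # One extra class accounts for background, as stated in the text above.
    return len(set(names)) + 1


# Three annotated classes correspond to --num-classes 4.
n = num_classes_for("class1,class2,class3")
```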
7080

7181
## 🔍 Phase 3: Inference
7282

73-
### 🖼️ Producting 2D Segmentations with SABER
83+
### 🖼️ Producing 2D Segmentations with SABER
7484

7585
SABER operates in two modes depending on your input: interactive mode when processing a single image, and batch processing mode when you provide a file path pattern (like `--input 'path/to/*.mrc'`) to process entire datasets automatically.
7686

docs/tutorials/preprocessing.md

Lines changed: 34 additions & 17 deletions
@@ -69,10 +69,10 @@ This preview helps you understand what structures SAM2 naturally identifies in y
6969

7070
## 🧬 Pre-processing Electron Micrographs
7171

72-
For single-particle datasets, ADF/BF signals from S/TEM, or FIB-SEM micrographs -- use the `saber classifier prepare-micrograph-training` command:
72+
For single-particle datasets, ADF/BF signals from S/TEM, or FIB-SEM micrographs -- use the `saber classifier prep2d` command:
7373

7474
```bash
75-
saber classifier prepare-micrograph-training \
75+
saber classifier prep2d \
7676
--input 'path/to/*.mrc' \
7777
--output training.zarr
7878
```
@@ -92,7 +92,7 @@ Traditional workflows require you to manually draw every mask from scratch. SABE
9292

9393
Generate comprehensive slab-based segmentations that maintain 3D context:
9494
```bash
95-
saber classifier prepare-tomogram-training \
95+
saber classifier prep3d \
9696
--config config.json \
9797
--zarr-path output_zarr_fname.zarr \
9898
--num-slabs 3
@@ -113,31 +113,48 @@ Small objects or sparse structures might not be present in a single slab project
113113

114114
---
115115

116-
## 🎨 Next Step: Annotation with the SABER GUI
116+
## 🎨 Annotation with the SABER GUI
117117

118+
Launch the GUI to begin annotating your pre-processed data:
119+
```bash
120+
saber gui --input output_zarr_fname.zarr
121+
```
118122
Once preprocessing is complete, SABER's unique annotation workflow begins. Instead of drawing masks from scratch, you simply:
119123

120-
1. **Point and Click** on the precomputed segmentations.
121-
2. **Assign Class Labels** using the dropdown menu.
124+
!!! info "How the GUI works:"
125+
1. **Point and Click** on the precomputed SAM2 segmentations.
126+
2. **Assign Class Labels** using the menu on the right.
127+
3. **Save the Annotations** as a JSON file using the button in the bottom-right corner.
122128

123129
![SABER GUI](../assets/saber_gui.png)
124130

125-
```bash
126-
saber gui \
127-
--input output_zarr_fname.zarr \
128-
--output curated_labels.zarr \
129-
--class-names carbon,lysosome,artifacts
130-
```
131-
132-
**Class Configuration**: The `--class-names` flag defines the biological classes present in your data. For binary classification (object vs. background), you can omit this flag for a simple two-class system.
133-
134-
**💡 How Many Micrographs / Tomograms Should I Annotate?** In general we recommend annotating 20-40 runs per dataset. In cases where there are several objects per image/slab the lower range should be sufficient. If only a few instances are available, the higher range is recommended.
131+
!!! tip "Annotation Guidelines - How Many Images to Annotate?"
132+
- We recommend 20-40 runs per dataset
133+
- Lower range (20): When multiple objects appear per image/slab
134+
- Higher range (40): When only a few instances are available
135+
- Consistency is key: Maintain uniform criteria across all annotations
136+
- Handle ambiguous segments: When uncertain, prefer skipping over mislabeling
135137

136138
**Tip:** For transferring data between machines, it's recommended to compress your Zarr files:
137139
```bash
138140
zip -r curated_labels.zarr.zip curated_labels.zarr
139141
```
140142

143+
## 🏷️ Applying Annotations for Classifier Training
144+
145+
Once you've completed annotations in the GUI, use the `labeler` command to apply your JSON annotations to the SAM2 masks, creating a training-ready dataset. The labeler converts your point-and-click annotations into properly indexed training data, handling class ordering automatically or according to your specifications.
146+
147+
!!! example "Basic Usage"
148+
```bash
149+
saber classifier labeler \
150+
--input training.zarr \
151+
--labels labels.json \
152+
--classes lysosome,carbon,edge \
153+
--output labeled.zarr
154+
```
155+
156+
We can either control the ordering of the labels or apply a subset of the labels with the `--classes` flag. If the flag is omitted, all classes are used in alphabetical order.
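
The ordering behaviour described for `--classes` can be sketched as follows. `resolve_class_order` is a hypothetical helper written for illustration; in particular, raising an error for a requested class that was never annotated is an assumption, and the actual `labeler` command may handle that case differently.

```python
def resolve_class_order(annotated_classes, classes=None):
    """Return the class names to use, in the order they should be indexed."""
    available = set(annotated_classes)
    if classes is None:
        # Flag omitted: use every annotated class, alphabetically.
        return sorted(available)
    requested = [c.strip() for c in classes.split(",") if c.strip()]
    missing = [c for c in requested if c not in available]
    if missing:
        # Assumed behaviour: reject names that do not appear in the annotations.
        raise ValueError(f"classes not found in annotations: {missing}")
    # Flag given: keep the user's order, which may also be a subset.
    return requested


annotated = ["lysosome", "carbon", "edge", "artifact"]
default_order = resolve_class_order(annotated)
custom_order = resolve_class_order(annotated, "lysosome,carbon,edge")
```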
157+
141158
---
142159

143-
_Ready to move on? Check out the [Training a Classifier](training.md) tutorial!_
160+
_Ready to move on? Check out the [Training a Classifier](training.md) tutorial!_

saber/analysis/organelle_statistics.py

Lines changed: 27 additions & 24 deletions
@@ -1,13 +1,22 @@
11
from skimage.measure import regionprops
2+
from copick_utils.io import writers
23
import numpy as np
34

4-
def extract_organelle_statistics(run, mask, organelle_name, session_id, user_id, voxel_size, save_copick = True, zfile=None, xyz_order=True):
5+
def extract_organelle_statistics(
6+
run, mask, organelle_name, session_id, user_id,
7+
voxel_size, save_copick = True, save_statistics=True, xyz_order=True):
8+
"""
9+
Extract statistics and return CSV rows.
10+
11+
Returns:
12+
List of CSV rows if save_statistics is True, empty list otherwise
13+
"""
514

615
unique_labels = np.unique(mask)
716
unique_labels = unique_labels[unique_labels > 0] # Ignore background (label 0)
817

918
coordinates = {}
10-
results = {}
19+
csv_rows = []
1120
for label in unique_labels:
1221

1322
component_mask = (mask == label).astype("int")
@@ -19,45 +28,39 @@ def extract_organelle_statistics(run, mask, organelle_name, session_id, user_id,
1928
centroid = centroid[::-1]
2029
coordinates[str(label)] = centroid
2130

22-
if zfile is not None:
31+
if save_statistics:
2332

2433
# Compute Volume in nm^3
2534
volume = np.sum(component_mask) * (voxel_size/10)**3 # Convert from Angstrom to nm^3
2635

2736
# Sort axes to identify the first (Z-biased) and two in-plane dimensions
28-
axes_lengths = sorted([rprops.axis_major_length, rprops.axis_minor_length, rprops.axis_minor_length])
37+
axes_lengths = sorted([rprops.axis_major_length, rprops.axis_minor_length,
38+
rprops.axis_minor_length])
2939

3040
# Convert to physical units (nm)
3141
axis_x = axes_lengths[1] * (voxel_size/10) # Likely an in-plane axis
3242
axis_y = axes_lengths[2] * (voxel_size/10) # Likely an in-plane axis
3343
diameter = (axis_x + axis_y) / 2
3444

35-
# Save Statistics in a structured dictionary
36-
results[str(label)] = {'volume': volume, 'diameter': diameter, 'coordinates': centroid}
45+
# Prepare row for CSV
46+
csv_row = [
47+
run.name,
48+
int(label),
49+
volume,
50+
diameter,
51+
]
52+
csv_rows.append(csv_row)
3753

38-
# Save to Copick
54+
# Save Statistics to CSV File
3955
if len(coordinates) > 0:
40-
41-
# Save to Copick
56+
# Save Coordinates to Copick
4257
if save_copick:
43-
save_coordinates_to_copick(run, coordinates, organelle_name, session_id, user_id, voxel_size)
44-
45-
# Save Statistics into Zarr File
46-
if zfile is not None:
47-
group = zfile.create_group(run.name)
48-
# Save metadata as an array
49-
labels = np.array(list(results.keys()), dtype=int)
50-
volumes = np.array([r["volume"] for r in results.values()], dtype=float)
51-
diameters = np.array([r["diameter"] for r in results.values()], dtype=float)
52-
coordinates = np.array([r["coordinates"] for r in results.values()], dtype=float)
53-
54-
group.create_dataset("labels", data=labels, overwrite=True)
55-
group.create_dataset("volumes", data=volumes, overwrite=True)
56-
group.create_dataset("diameters", data=diameters, overwrite=True)
57-
group.create_dataset("coordinates", data=coordinates, overwrite=True)
58+
save_coordinates_to_copick(run, coordinates, organelle_name,
59+
session_id, user_id, voxel_size)
5860
else:
5961
print(f"{run.name} didn't have any organelles present!")
6062

63+
return csv_rows
6164

6265
def save_coordinates_to_copick(run, coordinates, organelle_name, session_id, user_id, voxel_size):
6366
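
Since `extract_organelle_statistics` now returns rows of `[run.name, label, volume, diameter]` instead of writing to Zarr, a caller has to collect and write them. The sketch below shows one way to do that with the standard `csv` module. The header names and `write_statistics_csv` are assumptions made for illustration; the commit shown here does not include the code that writes the file.

```python
import csv
import io

# Hypothetical column names matching the row layout built in the function above.
HEADER = ["run_name", "label", "volume_nm3", "diameter_nm"]


def write_statistics_csv(rows, handle):
    """Write a header followed by the accumulated per-organelle rows."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Rows as they might be accumulated across runs (values are made up).
all_rows = [["run_001", 1, 1000.0, 12.5], ["run_001", 2, 250.0, 7.0]]
buffer = io.StringIO()  # stands in for open("statistics.csv", "w", newline="")
write_statistics_csv(all_rows, buffer)
```

When writing to a real file, opening it with `newline=""` is the usage recommended by the `csv` module documentation.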

saber/classifier/cli.py

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@
33
from saber.classifier.preprocess.tomogram_training_prep import prepare_tomogram_training
44
from saber.classifier.preprocess.split_merge_data import split_data, merge_data
55
from saber.classifier.preprocess.training_data_info import class_info
6+
from saber.classifier.preprocess.apply_labels import labeler
67
from saber.classifier.inference import predict, predict_slurm
78
from saber.classifier.train import train, train_slurm
89
from saber.classifier.evaluator import evaluate
@@ -22,6 +23,7 @@ def classifier_routines():
2223
classifier_routines.add_command(prepare_micrograph_training)
2324
classifier_routines.add_command(evaluate)
2425
classifier_routines.add_command(class_info)
26+
classifier_routines.add_command(labeler)
2527

2628
@click.group(name="classifier")
2729
def slurm_classifier_routines():

saber/classifier/datasets/singleZarrDataset.py

Lines changed: 9 additions & 8 deletions
@@ -32,19 +32,20 @@ def __init__(self, zarr_path, mode='train', transform=None, min_area = 250):
3232
self.samples = []
3333
for run_id in tqdm(self.run_ids):
3434
group = self.zfile[run_id]
35-
image = group['image'][:]
36-
37-
if 'masks' in group:
38-
# Process candidate masks
39-
candidate_masks = group['masks'][:] # [Nclass, Nx, Ny]
35+
image = group['0'][:]
36+
labels = group['labels']
37+
38+
# Process candidate masks
39+
if '0' in labels:
40+
candidate_masks = labels['0'][:] # [Nclass, Nx, Ny]
4041
self._process_masks(candidate_masks, image)
4142
else:
4243
continue
4344

4445
# Check if "rejected" exists under labels before accessing
45-
if 'rejected_masks' in group:
46+
if 'rejected' in labels:
4647
# Process rejected masks
47-
rejected_masks = group['rejected_masks'][::negative_class_reduction]
48+
rejected_masks = labels['rejected'][::negative_class_reduction]
4849
self._process_masks(rejected_masks, image, is_negative_mask=True)
4950

5051
def _process_masks(self, masks, image, is_negative_mask = False):
@@ -66,7 +67,7 @@ def _process_masks(self, masks, image, is_negative_mask = False):
6667
self.samples.append({
6768
'image': image,
6869
'mask': component_mask,
69-
'label': 0 if is_negative_mask else class_idx + 1 # Assign labels properly
70+
'label': 0 if is_negative_mask else class_idx # Assign labels properly
7071
})
7172

7273
def __len__(self):
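
The hunk above changes the per-run layout the dataset expects: the image moves from `image` to `0`, candidate masks from `masks` to `labels/0`, and rejected masks from `rejected_masks` to `labels/rejected`. The sketch below mirrors that lookup with nested dictionaries standing in for Zarr groups, so it runs without the `zarr` package. `read_run` is a hypothetical helper, and unlike the dataset code it does not skip runs that lack candidate masks.

```python
def read_run(group):
    """Return (image, candidate_masks, rejected_masks) for one run group.

    Either mask entry is None when the corresponding key is absent.
    """
    image = group["0"]
    labels = group["labels"]
    candidates = labels["0"] if "0" in labels else None
    rejected = labels["rejected"] if "rejected" in labels else None
    return image, candidates, rejected


# Plain dictionaries and lists stand in for Zarr groups and arrays here.
run_group = {
    "0": [[0, 1], [1, 0]],                            # image
    "labels": {"0": [[[1, 0], [0, 0]]],               # candidate masks
               "rejected": [[[0, 0], [0, 1]]]},       # rejected masks
}
image, candidates, rejected = read_run(run_group)
```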
