Study the diffusability of latent spaces, starting from synthetic datasets with controllable geometry, then moving toward real audio and vision datasets.
This repo is Hydra-driven and uses Astral uv for environment and dependency management. Use `uv add` for dependencies. For config-driven construction, use `hydra.utils.instantiate`; do not wire instantiation manually with OmegaConf.
Sync the environment with:

```
uv sync
```

Main references:

- https://docs.astral.sh/uv/
- https://hydra.cc/docs/intro/
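As an illustration of the instantiate rule above, here is a minimal sketch; the `_target_` path is an arbitrary torch class chosen for the example, not a repo component.

```
# Minimal sketch: a config node with a _target_ key is built by
# hydra.utils.instantiate, so no manual OmegaConf wiring is needed.
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {"model": {"_target_": "torch.nn.Linear", "in_features": 8, "out_features": 8}}
)
model = instantiate(cfg.model)  # equivalent to torch.nn.Linear(8, 8)
```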
Key files:

```
conf/config.yaml                           # main experiment entrypoint
conf/data/*.yaml                           # dataset / datamodule configs
conf/model/*.yaml                          # model configs
conf/trainer/*.yaml                        # trainer configs
SiT/train.py                               # Hydra entrypoint + Lightning Trainer wiring
SiT/lightning_module.py                    # training / validation / test module
SiT/eval_runner.py                         # evaluation sampling + metric orchestration
datamodules/synthetic_pointclouds.py       # synthetic vector dataset + datamodule
utils/plot_distribution.py                 # synthetic dataset visualization
utils/validation_distribution_plots.py     # validation-time GT vs generated comparison plots
utils/plot_anisotropy_intrinsic_sweep.py   # aggregate sweep plots from saved runs
utils/evaluate_checkpoint_metrics.py       # evaluate checkpoints every N epochs and write test_loss_by_class.json
program.md                                 # active execution tracker for the current task
docs/synthetic_pointcloud_dataset.md
docs/synthetic_pointcloud_math_foundations.md
```
conf/config.yaml is the main training config. Its defaults are:
```
data: synth_pc_datamodule
model: mini_mlp
trainer: sit_trainer
```
The current synthetic setup is a single-class affine-subspace dataset with:
- one sample = one vector in `R^D`
- `ambient_dim` defined once in `conf/config.yaml`
- `intrinsic_dim` defined once in `conf/config.yaml`
- `anisotropy_max_scale` defined once in `conf/config.yaml`
Those shared parameters are propagated via Hydra interpolations (see the sketch after this list):
- `ambient_dim` -> `data.in_channels`
- `ambient_dim` -> `model.in_channels`
- `ambient_dim` -> `class_sweeps[*].base.D`
- `intrinsic_dim` -> `class_sweeps[*].base.d`
- `anisotropy_max_scale` -> `class_sweeps[*].sweep.anisotropy.max_scale`
- `data_thickness` -> `class_sweeps[*].base.thickness`
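A minimal OmegaConf sketch of how these interpolations resolve (the values and the truncated config tree are illustrative, not the real config):

```
# Illustrative only: top-level values are referenced via ${...} interpolations,
# so each shared parameter is defined once and resolved lazily on access.
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "ambient_dim": 8,
        "intrinsic_dim": 6,
        "data": {"in_channels": "${ambient_dim}"},
        "model": {"in_channels": "${ambient_dim}"},
    }
)
assert cfg.data.in_channels == 8 and cfg.model.in_channels == 8
```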
conf/data/synth_pc.yaml uses class_sweeps with a single base class and a sweepable anisotropy value. In practice, Hydra multirun produces one training job per anisotropy setting.
datamodules/synthetic_pointclouds.py now works in the point-wise setting:
- each sample is a single vector `[D]`, not a cloud `[N, D]`
- class geometry is sampled once per class
- per-sample randomness only resamples latent coordinates, component choice, and additive noise
Supported geometric families (see the sketch after this list):
- `affine_subspace`
- `sine_warp_subspace`
- `mog`
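A schematic numpy sketch of the affine-subspace family under these conventions; this is not the repo's actual sampler, and the scale schedule and noise model here are assumptions:

```
# Schematic only: fix class geometry once (basis U, anisotropic scales s,
# offset b), then draw per-sample latent coords z and additive ambient noise.
import numpy as np

rng = np.random.default_rng(0)
D, d, max_scale, thickness = 8, 6, 4.0, 0.01

# class geometry, sampled once per class
U, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal basis of the subspace
s = np.geomspace(1.0, max_scale, d)                # assumed anisotropic scale schedule
b = rng.standard_normal(D)                         # affine offset

# per-sample randomness: latent coordinates + additive noise
z = rng.standard_normal(d) * s
x = U @ z + b + thickness * rng.standard_normal(D) # one vector in R^D
```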
The default datamodule computes:
- SWD
- Exact-W2
- Energy-U
- Feature-MMD
These metrics are computed directly on vector samples `[N, D]`; the legacy cloud-level metric path has been removed.
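For orientation, a minimal torch sketch of the sliced Wasserstein distance on two equal-sized batches; this is illustrative only, and the repo's implementations of these metrics may differ in detail:

```
# Illustrative SWD: project both batches onto random unit directions,
# sort along each slice, and average the 1-D squared transport cost.
import torch

def sliced_w2(x: torch.Tensor, y: torch.Tensor, n_proj: int = 128) -> torch.Tensor:
    # x, y: [N, D] with the same N
    proj = torch.randn(x.shape[1], n_proj, device=x.device)
    proj = proj / proj.norm(dim=0, keepdim=True)   # unit projection directions
    px = (x @ proj).sort(dim=0).values
    py = (y @ proj).sort(dim=0).values
    return ((px - py) ** 2).mean().sqrt()
```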
Per-run artifacts are written under results/<experiment>/metrics/:
- `class_registry.json`
- `val_loss_by_class.jsonl`
- `test_loss_by_class.json`
Validation-time distribution plots are written under results/<experiment>/plots/val/:
- `distribution_comparison_epochXXX_stepXXXXXXX.png`
- `manifest.jsonl`
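Since `manifest.jsonl` is line-delimited JSON, downstream tooling can load it directly; the record schema itself is whatever the plotting utility writes and is not assumed here:

```
# One JSON record per line; substitute the actual experiment directory.
import json
from pathlib import Path

manifest = Path("results/<experiment>/plots/val/manifest.jsonl")
records = [json.loads(line) for line in manifest.read_text().splitlines() if line.strip()]
```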
Each validation pass logs the same ground truth vs generated comparison figure to W&B.
These figures use the same seaborn-style visual language already used by the repo plotting utilities, with a shared blue density palette and a shared density scale between the left and right panels for each class.
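A hedged sketch of that per-validation logging step; the metric key and figure layout below are assumptions, not the repo's exact names:

```
# Requires an active wandb run (wandb.init), as set up by
# conf/trainer/sit_trainer.yaml; the key name here is assumed.
import matplotlib.pyplot as plt
import wandb

fig, (ax_gt, ax_gen) = plt.subplots(1, 2, figsize=(9, 4), sharex=True, sharey=True)
# ... draw the GT density on ax_gt and the generated density on ax_gen ...
wandb.log({"val/distribution_comparison": wandb.Image(fig)})
plt.close(fig)
```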
Single run with the current defaults:
```
uv run python SiT/train.py
```

Single run with explicit synthetic controls:

```
uv run python SiT/train.py \
ambient_dim=8 \
intrinsic_dim=6 \
anisotropy_max_scale=4.0 \
trainer.strategy=auto
```

Ambient-8 anisotropy sweep over 5 levels:

```
CUDA_VISIBLE_DEVICES=0 uv run python SiT/train.py -m \
ambient_dim=8 \
intrinsic_dim=6 \
anisotropy_max_scale=1.0,2.0,4.0,8.0,16.0 \
trainer.results_dir=results/anisotropy_sweep_ambient8_d6 \
trainer.strategy=auto
```

Ambient-16 anisotropy sweep over the same 5 levels:

```
CUDA_VISIBLE_DEVICES=0 uv run python SiT/train.py -m \
ambient_dim=16 \
intrinsic_dim=6 \
anisotropy_max_scale=1.0,2.0,4.0,8.0,16.0 \
trainer.results_dir=results/anisotropy_sweep_ambient16_d6 \
trainer.strategy=auto
```

Notes:
- `trainer.strategy=auto` is the safest override for single-GPU runs.
- `model.num_classes` is resolved automatically at runtime from the instantiated datamodule.
- W&B is enabled by default through `conf/trainer/sit_trainer.yaml`.
- `trainer.run_name` controls the human-readable run label used for local experiment naming.
- `trainer.wandb_run_name` can override the W&B display name when needed.
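A hedged sketch of what that `num_classes` resolution amounts to; the attribute and key names are assumptions:

```
# Assumed names: the datamodule is instantiated first, then the model
# config inherits its class count before the model is built.
from hydra.utils import instantiate

def build_modules(cfg):
    datamodule = instantiate(cfg.data)
    cfg.model.num_classes = datamodule.num_classes  # hypothetical attribute
    return datamodule, instantiate(cfg.model)
```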
Evaluate checkpoints every 5 epochs (defaults: W2 at 2048 samples/class, SWD/MMD/L2 at 10000):
```
uv run python utils/evaluate_checkpoint_metrics.py

# Example override:
uv run python utils/evaluate_checkpoint_metrics.py \
roots='[results/gaussian_anisotropy_sweep]' \
epoch_stride=10
```
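A hedged sketch of the epoch-stride selection idea; the checkpoint naming convention below is an assumption, and the actual script may work differently:

```
# Assumed "epoch=NNN"-style checkpoint names; keep every epoch_stride-th epoch.
import re
from pathlib import Path

def select_checkpoints(root: Path, epoch_stride: int = 5) -> list[Path]:
    by_epoch = {}
    for ckpt in root.rglob("*.ckpt"):
        match = re.search(r"epoch[=_-]?(\d+)", ckpt.name)
        if match and int(match.group(1)) % epoch_stride == 0:
            by_epoch[int(match.group(1))] = ckpt
    return [by_epoch[epoch] for epoch in sorted(by_epoch)]
```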
Visualize the current synthetic dataset:

```
uv run python utils/plot_distribution.py
```

Regenerate the documentation figures:

```
uv run python utils/plot_distribution.py --config-name plot_dataset_docs
uv run python utils/plot_distribution.py --config-name plot_dataset_docs_anis
```

Aggregate anisotropy sweep results:

```
uv run python utils/plot_anisotropy_intrinsic_sweep.py \
results_root=results/anisotropy_sweep_ambient8_d6
uv run python utils/plot_anisotropy_intrinsic_sweep.py \
results_root=results/anisotropy_sweep_ambient16_d6
```

The sweep plotting utility reads local training artifacts from results/... and matches them with local W&B logs under wandb/. It now summarizes and plots both `val/feature_mmd_mean` and `val/swd_mean` when available.
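A hedged sketch of the artifact-matching half, assuming one run subdirectory per sweep point with the per-run metrics layout documented earlier; the W&B-log matching is omitted:

```
# Assumes <results_root>/<run>/metrics/test_loss_by_class.json per run.
import json
from pathlib import Path

def collect_run_metrics(results_root: str) -> dict[str, dict]:
    metrics_by_run = {}
    for f in Path(results_root).glob("*/metrics/test_loss_by_class.json"):
        metrics_by_run[f.parent.parent.name] = json.loads(f.read_text())
    return metrics_by_run
```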
Sampling is configured in conf/model/*.yaml:
```
sampling:
  mode: ODE              # or SDE
  ode:
    method: dopri5       # dopri5, euler, heun
    num_steps: 50
    atol: 1.0e-6
    rtol: 1.0e-3
  sde:
    method: Euler        # Euler, Heun
    num_steps: 250
    diffusion_form: SBDM
    diffusion_norm: 1.0
    last_step: Mean
    last_step_size: 0.04
```

When repo behavior changes, update this README so commands, config names, and outputs remain accurate.
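Finally, for orientation: a minimal sketch of what ODE sampling with method: euler amounts to, assuming a learned velocity field integrated from noise at t=0 to data at t=1; the repo's actual samplers (e.g. adaptive dopri5) are more involved:

```
# Hedged sketch, not the repo's sampler: plain Euler integration of
# dx/dt = velocity(x, t) over num_steps uniform steps.
import torch

@torch.no_grad()
def sample_ode_euler(velocity, x0: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity(x, t)  # one Euler step along the learned field
    return x
```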