Feature Request: Calculate and export global dataset statistics for ML normalization #98
Description of the issue
When generating training-ready .zarr datasets for downstream models (like those in the neural-lam repository), the neural networks require normalized input data (typically zero mean and unit variance).
Currently, generating the dataset and calculating these normalization statistics are two separate processes. Users have to use mllam-data-prep to write the .zarr dataset to disk, and then make a completely separate secondary pass over that generated data to compute the global statistics (mean, standard deviation, min, max) for each variable. Because weather datasets are often massive and out-of-core, this secondary read pass is a huge I/O bottleneck, wasting valuable time and compute resources.
Solution
I propose adding an optional feature to calculate these global statistics dynamically during the initial data preparation phase.
Since mllam-data-prep is already loading, chunking, and processing the data via xarray and Dask, we can append these statistical reductions to the existing computation graph. This means the statistics are calculated as the data flows through the pipeline, rather than requiring a second read from the disk.
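A minimal sketch of the proposed approach, assuming Dask-backed variables (the helper name `build_statistics` is hypothetical, not an existing mllam-data-prep function): the reductions are declared lazily, so they become extra nodes in the existing task graph and are evaluated in the same pass over the chunks.

```python
import dask
import numpy as np
import xarray as xr

def build_statistics(ds: xr.Dataset) -> dict:
    """Build lazy global statistics for every variable (hypothetical helper).

    With Dask-backed variables, nothing is evaluated here; each reduction
    is appended to the existing task graph and shares chunk reads with
    the rest of the pipeline.
    """
    return {
        name: {
            "mean": da.mean(),
            "std": da.std(),
            "min": da.min(),
            "max": da.max(),
        }
        for name, da in ds.data_vars.items()
    }

# Small chunked example standing in for the real pipeline output:
ds = xr.Dataset(
    {"t2m": ("time", np.array([270.0, 275.0, 280.0]))}
).chunk({"time": 2})

stats = build_statistics(ds)       # still lazy at this point
(computed,) = dask.compute(stats)  # single pass over the chunks
```

`dask.compute` traverses the nested dict and evaluates all reductions together, so each chunk is read once; `computed["t2m"]["mean"]` is then a 0-d array whose `.item()` gives a plain float.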
Because the statistics are computed in the same pass as the dataset itself, this avoids a second full read of the data and substantially improves pipeline efficiency.
Action Plan
Configuration: Add an optional parameter (e.g., compute_statistics: true or a statistics block) to the data preparation YAML config.
Graph Building: During the pipeline execution, xarray builds the Dask task graph for the variables and derived variables.
Reduction: Append dask.array reduction tasks (mean, std, min, max) for each feature to the task graph before triggering the final compute.
Export: Once computed, these statistics are saved directly as global attributes within the output .zarr store, or exported as a companion statistics.json file.
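The export step above could be sketched as follows (the values and the `statistics.json` layout are illustrative, not a fixed schema):

```python
import json

# Computed statistics as plain floats (values here are illustrative):
stats = {
    "t2m": {"mean": 275.0, "std": 4.082, "min": 270.0, "max": 280.0},
}

# Option A: write a companion file next to the output .zarr store:
with open("statistics.json", "w") as f:
    json.dump(stats, f, indent=2)

# Option B: attach the statistics as global attributes on the dataset
# before the final .to_zarr(); zarr attributes must be JSON-serializable,
# so a nested dict of plain floats would work, e.g.:
#   ds.attrs["statistics"] = stats
```

Either form gives downstream consumers such as neural-lam the normalization constants without touching the bulk data again.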
Expected Outcome
Eliminates redundant I/O: Removes the need to re-read terabytes of data just to calculate normalization metrics.
Saves compute time: Leverages the existing Dask graph to compute stats efficiently in a single pass.
Improves Developer Experience: Tightly couples dataset creation with the exact input requirements of neural-lam, providing researchers with a "ready-to-train" dataset right out of the box.
If this feature aligns with the project's goals and direction, I would be happy to work on it; please feel free to assign it to me.