Feature Request: Calculate and export global dataset statistics for ML normalization #98
Description of the issue
When generating training-ready .zarr datasets for downstream models (like those in the neural-lam repository), the neural networks require normalized input data (typically zero mean and unit variance).
Currently, generating the dataset and calculating these normalization statistics are two separate processes. Users have to use mllam-data-prep to write the .zarr dataset to disk, and then make a completely separate secondary pass over that generated data to compute the global statistics (mean, standard deviation, min, max) for each variable. Because weather datasets are often massive and out-of-core, this secondary read pass is a huge I/O bottleneck, wasting valuable time and compute resources.
Solution
I propose adding an optional feature to calculate these global statistics dynamically during the initial data preparation phase.
Since mllam-data-prep is already loading, chunking, and processing the data via xarray and Dask, we can append these statistical reductions to the existing computation graph. This means the statistics are calculated as the data flows through the pipeline, rather than requiring a second read from the disk.
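A minimal sketch of the proposed approach, assuming Dask-backed variables (the helper name `build_statistics` is hypothetical, not an existing mllam-data-prep function): the reductions are declared lazily, so they become extra nodes in the existing task graph and are evaluated in the same pass over the chunks.

```python
import dask
import numpy as np
import xarray as xr

def build_statistics(ds: xr.Dataset) -> dict:
    """Build lazy global statistics for every variable (hypothetical helper).

    With Dask-backed variables, nothing is evaluated here; each reduction
    is appended to the existing task graph and shares chunk reads with
    the rest of the pipeline.
    """
    return {
        name: {
            "mean": da.mean(),
            "std": da.std(),
            "min": da.min(),
            "max": da.max(),
        }
        for name, da in ds.data_vars.items()
    }

# Small chunked example standing in for the real pipeline output:
ds = xr.Dataset(
    {"t2m": ("time", np.array([270.0, 275.0, 280.0]))}
).chunk({"time": 2})

stats = build_statistics(ds)       # still lazy at this point
(computed,) = dask.compute(stats)  # single pass over the chunks
```

`dask.compute` traverses the nested dict and evaluates all reductions together, so each chunk is read once; `computed["t2m"]["mean"]` is then a 0-d array whose `.item()` gives a plain float.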
Because the statistics are computed in the same pass as the dataset itself, this avoids a second full read of the data and substantially improves pipeline efficiency.
Action Plan
Configuration: Add an optional parameter (e.g., compute_statistics: true or a statistics block) to the data preparation YAML config.
Graph Building: During the pipeline execution, xarray builds the Dask task graph for the variables and derived variables.
Reduction: Append dask.array reduction tasks (mean, std, min, max) for each feature to the task graph before triggering the final compute.
Export: Once computed, these statistics are saved directly as global attributes within the output .zarr store, or exported as a companion statistics.json file.
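The export step above could be sketched as follows (the values and the `statistics.json` layout are illustrative, not a fixed schema):

```python
import json

# Computed statistics as plain floats (values here are illustrative):
stats = {
    "t2m": {"mean": 275.0, "std": 4.082, "min": 270.0, "max": 280.0},
}

# Option A: write a companion file next to the output .zarr store:
with open("statistics.json", "w") as f:
    json.dump(stats, f, indent=2)

# Option B: attach the statistics as global attributes on the dataset
# before the final .to_zarr(); zarr attributes must be JSON-serializable,
# so a nested dict of plain floats would work, e.g.:
#   ds.attrs["statistics"] = stats
```

Either form gives downstream consumers such as neural-lam the normalization constants without touching the bulk data again.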
Expected Outcome
Eliminates redundant I/O: Removes the need to re-read terabytes of data just to calculate normalization metrics.
Saves compute time: Leverages the existing Dask graph to compute stats efficiently in a single pass.
Improves Developer Experience: Tightly couples dataset creation with the exact input requirements of neural-lam, providing researchers with a "ready-to-train" dataset right out of the box.
If this feature aligns with the project's goals and direction, I would be happy to work on it; please feel free to assign it to me.