
Feature Request: Calculate and export global dataset statistics for ML normalization #98

@Aditya200247


Description of the issue
When generating training-ready .zarr datasets for downstream models (like those in the neural-lam repository), the neural networks require normalized input data (typically zero mean and unit variance).

Currently, generating the dataset and calculating these normalization statistics are two separate processes. Users first run mllam-data-prep to write the .zarr dataset to disk, and then make a completely separate second pass over the generated data to compute the global statistics (mean, standard deviation, minimum, maximum) for each variable. Because weather datasets are often massive and out-of-core, this second read pass is a significant I/O bottleneck that wastes time and compute resources.

Solution
I propose adding an optional feature to calculate these global statistics dynamically during the initial data preparation phase.

Since mllam-data-prep is already loading, chunking, and processing the data via xarray and Dask, we can append these statistical reductions to the existing computation graph. This means the statistics are calculated as the data flows through the pipeline, rather than requiring a second read from the disk.

This avoids a second full read of the dataset and improves end-to-end efficiency.

Action Plan
Configuration: Add an optional parameter (e.g., compute_statistics: true or a statistics block) to the data preparation YAML config.
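
As a rough sketch, the config extension could look like the following. All key names here (`statistics`, `compute`, `ops`, `output`) are hypothetical and only illustrate the idea; they are not existing mllam-data-prep options:

```yaml
# Hypothetical addition to the mllam-data-prep YAML config (illustrative only)
statistics:
  compute: true
  ops: [mean, std, min, max]
  output: statistics.json   # or "attrs" to store them in the zarr attributes
```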

Graph Building: During the pipeline execution, xarray builds the Dask task graph for the variables and derived variables.

Reduction: Append dask.array reduction tasks (mean, std, min, max) for each feature to the task graph before triggering the final compute.
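
A minimal sketch of this idea, using a small in-memory dataset as a stand-in for the prepared output (variable name `t2m` and the dataset shape are illustrative assumptions):

```python
import dask
import numpy as np
import xarray as xr

# Small dask-backed dataset standing in for the prepared output
# (in mllam-data-prep this would be the dataset about to be written to zarr).
ds = xr.Dataset(
    {"t2m": (("time", "grid_index"), np.arange(12.0).reshape(3, 4))}
).chunk({"time": 1})

# Append the statistical reductions to the existing task graph; nothing is
# read or computed yet -- these are lazy dask-backed arrays.
lazy_stats = {
    "mean": ds["t2m"].mean(),
    "std": ds["t2m"].std(),
    "min": ds["t2m"].min(),
    "max": ds["t2m"].max(),
}

# A single compute call traverses the shared graph once for all reductions.
(computed,) = dask.compute(lazy_stats)
print({name: float(value) for name, value in computed.items()})
```

Because all four reductions share the same underlying chunks, Dask can schedule them so each input chunk is visited once, rather than once per statistic.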

Export: Once computed, these statistics are saved directly as global attributes within the output .zarr store, or exported as a companion statistics.json file.
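
Both export paths could be sketched as follows; the attribute name `normalization_stats` and the statistics values are illustrative assumptions:

```python
import json

# Computed statistics per variable (values here are illustrative).
stats = {"t2m": {"mean": 5.5, "std": 3.45, "min": 0.0, "max": 11.0}}

# Option A: attach as a global attribute before writing the zarr store, e.g.:
#   ds.attrs["normalization_stats"] = json.dumps(stats)
#   ds.to_zarr("output.zarr")

# Option B: export a companion statistics.json next to the store.
with open("statistics.json", "w") as f:
    json.dump(stats, f, indent=2)
```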

Expected Outcome
Eliminates redundant I/O: Removes the need to re-read terabytes of data just to calculate normalization metrics.

Saves compute time: Leverages the existing Dask graph to compute stats efficiently in a single pass.

Improves Developer Experience: Tightly couples dataset creation with the exact input requirements of neural-lam, providing researchers with a "ready-to-train" dataset right out of the box.
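
For example, a downstream consumer such as neural-lam could then normalize features directly from the exported statistics (the values below are illustrative):

```python
import numpy as np

# Hypothetical downstream use: zero-mean / unit-variance normalization
# using the exported statistics for one variable.
stats = {"mean": 5.5, "std": 3.45}
x = np.array([2.05, 5.5, 8.95])
x_norm = (x - stats["mean"]) / stats["std"]
print(x_norm)  # approximately [-1, 0, 1]
```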

If this feature aligns with the project's goals, I would be happy to work on it if you assign it to me.
