feat: synthetic gridded dataset by frazane · Pull Request #637 · ecmwf/anemoi-datasets

frazane · 2026-05-22T12:42:44Z

Motivation

Testing and prototyping anemoi graphs, training, and inference (and sometimes datasets itself) needs a dataset. Today that means building a real Zarr store from a recipe, data sources, and compute, then keeping it on disk. There is no lightweight stand-in that can be used the same way for quick experiments, CI and unit tests, or benchmarking. A synthetic dataset could also open up other applications, such as more theoretical machine learning research questions.

This PR adds an in-memory synthetic gridded dataset, opened with open_dataset(synthetic={...}). It needs no data, disk, or network.

What it adds

open_dataset(synthetic={...}) builds a SyntheticGriddedDataset — a GriddedZarr over a lazy in-memory store, so it inherits the full dataset contract.

ds = open_dataset(
    synthetic={
        "geography": {"bbox": [60, -10, 35, 30], "resolution": 0.25},  # or e.g. {"named": "o96"}
        "dates": {"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"},
        "layout": "gridded",
        "variables": [
            {"name": "2t", "values": {"constant": 273.15}},
            "msl",          # uses the dataset default generator
            "insolation",   # a computed forcing
        ],
    },
)

Geography: bbox, named, icon, unstructured — a one-of dict resolving to flat (latitudes, longitudes).
Variables (per-variable configuration): each entry is a name string, or a dict with a mandatory name plus optional values, metadata, statistics and tendencies_statistics. Per-variable metadata is surfaced on the dataset's variables_metadata.
Value generators: {"constant": 273.15} (value given directly) or {"random": {"mean": 0, "std": 1}} (seeded, reproducible Gaussian). A bare scalar is shorthand for constant, a bare string names a generator with its defaults. The top-level values is the default for variables without their own. Values are generated on the fly — only a requested batch ds[idx] is in memory at any time.
Computed forcings: a variable named insolation, cos_latitude, sin_julian_day, … is generated through earthkit's forcings source from the grid and dates (no template field needed). It owns its own generation, so it takes no values block.
layout: gridded is implemented; tabular / trajectories are reserved (raise NotImplementedError), leaving room for non-gridded layouts without an API break.
Composes with the usual transform keywords (cutout, select, start, rename, …), so a synthetic dataset can replace a real one in an existing open_dataset spec.
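
To illustrate the value generators listed above, here is a minimal sketch of how a batch could be produced on the fly. It is not the code from this PR: `constant_batch`, `random_batch` and the per-date seeding scheme are assumptions made for illustration. The idea is that each timestep is derived deterministically from the seed and the date index, so nothing has to be stored.

```python
import numpy as np


def constant_batch(value, n_points, dtype="float32"):
    """Return one timestep filled with a single value (hypothetical helper)."""
    return np.full(n_points, value, dtype=dtype)


def random_batch(date_index, n_points, mean=0.0, std=1.0, seed=0, dtype="float32"):
    """Return one timestep of Gaussian noise (hypothetical helper).

    The generator is re-seeded from (seed, date_index), so the same index
    always yields the same values without keeping anything in memory.
    """
    rng = np.random.default_rng([seed, date_index])
    return rng.normal(mean, std, size=n_points).astype(dtype)


first = random_batch(date_index=3, n_points=5, seed=42)
again = random_batch(date_index=3, n_points=5, seed=42)   # same index, same values
other = random_batch(date_index=4, n_points=5, seed=42)   # different index
t2m = constant_batch(273.15, n_points=5)
```

Seeding per date index is only one possible design; it trades a small re-seeding cost for random access to any timestep.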

Self-contained: one new module plus a small dispatch hook in open_dataset; no existing dataset code is changed. Covered by 66 tests.

Possible future avenues

SyntheticTabularDataset: a tabular counterpart over TabularZarr, reusing the value generators and config parsing — the layout switch already reserves the seam.
More value generators: structured/smooth spatial fields, time-varying signals, per-variable physical ranges, NaNs. Long shot: values generated on the fly from other programs (...for distillation?)
Missing dates: currently not supported.
Variables metadata: support could be improved, e.g. default metadata based on the ECMWF parameter database.

This change was developed in tandem with AI coding agents.

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

📚 Documentation preview 📚: https://anemoi-datasets--637.org.readthedocs.build/en/637/

_SyntheticArray reimplemented date-axis indexing by hand: out-of-range indices silently fabricated a nonexistent timestep, a boolean mask was cast to integer positions, and a numpy array inside a tuple index raised an ambiguous-truth-value ValueError from _expand_index. Delegate date-axis resolution to numpy by indexing arange(n_dates): this reproduces numpy's bounds-checking, negative wrapping, and fancy/boolean indexing, so an out-of-range index now raises IndexError as a real zarr-backed dataset does. _expand_index uses identity checks instead of == against Ellipsis. _resolve_dates is removed.
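
The delegation described above can be sketched in a few lines. The helper name `resolve_date_positions` is hypothetical and the snippet only shows the numpy behaviour being relied on, not the PR's actual code.

```python
import numpy as np


def resolve_date_positions(index, n_dates):
    """Map any numpy-style index on the date axis to integer positions.

    Hypothetical helper: indexing arange(n_dates) lets numpy do the
    bounds-checking, negative wrapping and fancy/boolean indexing.
    """
    return np.arange(n_dates)[index]


n_dates = 4
last = resolve_date_positions(-1, n_dates)             # negative index wraps
window = resolve_date_positions(slice(1, 3), n_dates)  # plain slice
masked = resolve_date_positions(np.array([True, False, True, False]), n_dates)

try:
    resolve_date_positions(10, n_dates)
    out_of_range_raises = False
except IndexError:
    # numpy rejects the index instead of fabricating a timestep
    out_of_range_raises = True
```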

An integer dtype combined with a 'random' or fractional 'constant' value mode silently truncated the generated values, while the analytic statistics kept reporting the un-truncated moments -- so statistics() disagreed with the data the dataset returned. Add _check_value_dtype, which rejects an integer dtype whenever a generator produces non-integer data, alongside the existing _check_index_dtype overflow guard.
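
A minimal sketch of such a guard follows, assuming a generator is described by a mode ('constant' or 'random') and an optional value. The function below is a hypothetical stand-in, not the `_check_value_dtype` added by the PR.

```python
import numpy as np


def check_value_dtype(dtype, mode, value=None):
    """Reject an integer dtype when the generator yields non-integer data.

    Hypothetical stand-in for the check described above.
    """
    if not np.issubdtype(np.dtype(dtype), np.integer):
        return  # floating dtypes can hold any generated value
    if mode == "random":
        raise ValueError(f"dtype {dtype!r} would truncate random Gaussian values")
    if mode == "constant" and not float(value).is_integer():
        raise ValueError(f"dtype {dtype!r} would truncate constant {value}")


# The silent truncation the check is meant to prevent:
truncated = np.array([273.15]).astype("int32")  # fractional part is dropped

check_value_dtype("float32", "random")       # accepted
check_value_dtype("int32", "constant", 3)    # accepted: 3 is integral
```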

GriddedZarr.constant_fields always recomputes the constant fields from the data and ignores the stored attribute. With a single date it cannot distinguish a constant field from a varying one and marks every field constant. Override constant_fields on SyntheticGriddedDataset to return the answer the synthetic config already knows exactly, recorded on the store.
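
The single-date ambiguity can be reproduced with a small sketch. `constant_fields_from_data` is a hypothetical data-driven check written for illustration; it does not reproduce GriddedZarr's implementation.

```python
import numpy as np


def constant_fields_from_data(data, names):
    """Flag variables whose values never change along the date axis.

    `data` has shape (dates, variables, points). Hypothetical helper
    mimicking a data-driven check.
    """
    unchanged = (data == data[:1]).all(axis=(0, 2))
    return [name for name, flag in zip(names, unchanged) if flag]


rng = np.random.default_rng(0)
names = ["lsm", "2t"]
data = np.empty((3, 2, 5))
data[:, 0, :] = rng.normal(size=5)       # same field at every date
data[:, 1, :] = rng.normal(size=(3, 5))  # changes from date to date

many_dates = constant_fields_from_data(data, names)       # only "lsm" qualifies
single_date = constant_fields_from_data(data[:1], names)  # every field qualifies
```

With one date there is nothing to compare against, so the synthetic config is the only reliable source for the answer.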

_resolve_bbox built the grid axes with a float-step np.arange, which left floating-point fuzz on the edges -- e.g. a 0.1-degree grid over [0, 10] ended at 3.5e-14 instead of 0.0, breaking any downstream equality comparison of coordinates. Use np.linspace with a computed point count, which pins both endpoints exactly.
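
The endpoint-pinning approach can be sketched as follows. `bbox_axis` is a hypothetical helper and the rounding rule for the point count is an assumption; the PR's `_resolve_bbox` may compute it differently.

```python
import numpy as np


def bbox_axis(start, stop, resolution):
    """Build one grid axis with both endpoints pinned exactly.

    Hypothetical helper: the point count is computed once, then
    np.linspace places the points, instead of accumulating a float step.
    """
    n_points = int(round(abs(stop - start) / resolution)) + 1
    return np.linspace(start, stop, n_points)


latitudes = bbox_axis(10.0, 0.0, 0.1)       # north to south
longitudes = bbox_axis(-10.0, 30.0, 0.25)   # west to east
```

Because np.linspace takes both endpoints as arguments, the first and last coordinates compare equal to the requested bounds.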

The synthetic config carried start/end/frequency as flat top-level keys, inconsistent with the dataset-building recipe API, where those live in a nested 'dates' block. Move them under a required 'dates' dict, validated like the top-level config (must be a dict, must hold exactly start/end/frequency). This is the only synthetic config block with a clean recipe analogue; grid, variables and the value keys stay flat.
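
A minimal sketch of the validation described above, assuming the config is a plain dict. `validate_dates_block` and its error messages are hypothetical and not taken from the PR.

```python
REQUIRED_DATE_KEYS = {"start", "end", "frequency"}


def validate_dates_block(config):
    """Return the 'dates' block after checking its shape (hypothetical helper)."""
    if "dates" not in config:
        raise ValueError("synthetic config requires a 'dates' block")
    dates = config["dates"]
    if not isinstance(dates, dict):
        raise TypeError(f"'dates' must be a dict, got {type(dates).__name__}")
    keys = set(dates)
    if keys != REQUIRED_DATE_KEYS:
        missing = sorted(REQUIRED_DATE_KEYS - keys)
        unknown = sorted(keys - REQUIRED_DATE_KEYS)
        raise ValueError(
            f"'dates' needs exactly start/end/frequency "
            f"(missing={missing}, unknown={unknown})"
        )
    return dates


nested = {"dates": {"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"}}
dates = validate_dates_block(nested)
```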

open_dataset(synthetic=...) rejected every other keyword, so swapping a real dataset for a synthetic one in an existing spec meant rewriting the whole spec instead of changing one line. A SyntheticGriddedDataset is a genuine GriddedZarr, so the transform wrappers apply to it like any dataset. synthetic_factory now consumes only the 'synthetic' key and leaves the rest for _subset, making synthetic= a drop-in replacement for dataset=. Combination keywords (cutout, join, ...) are matched earlier in _open_dataset and still cannot co-occur with synthetic=.
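
The keyword handling described above can be sketched as a simple split of the spec. `split_synthetic_kwargs` is a hypothetical helper; the actual `synthetic_factory` and `_subset` signatures are not shown in this PR description.

```python
def split_synthetic_kwargs(kwargs):
    """Separate the 'synthetic' config from the remaining keywords.

    Hypothetical helper: the remaining keywords (select, start, rename, ...)
    are returned untouched so the usual subsetting step can apply them.
    """
    remaining = dict(kwargs)  # copy, so the caller's spec is not mutated
    synthetic = remaining.pop("synthetic")
    return synthetic, remaining


spec = {
    "synthetic": {"variables": ["2t", "msl"]},
    "select": ["2t"],
    "start": "2020-06-01",
}
synthetic, remaining = split_synthetic_kwargs(spec)
```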

floriankrb · 2026-05-26T09:11:46Z

Interesting use case and solution.

Did you see that for testing anemoi-datasets, we already have something similar here:

anemoi-datasets/tests/test_data.py

Line 100 in fbe30e2

def create_zarr(

This may not be enough for what you have in mind, though: it lives in the test code rather than in the anemoi-datasets package, and I think you want this to test other packages or to develop models. We could extend this functionality and expose it in open_dataset.

floriankrb · 2026-05-26T09:14:22Z

Also, the dictionary inside open_dataset()

        "dates": {"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"},
        "grid": {"bbox": [60, -10, 35, 30], "resolution": 0.25},  # or e.g. {"named": "o96"}
        "variables": ["2t", "msl", "z_500"],
        "values": {
            "default": {"mode": "random", "mean": 0.0, "std": 1.0},
            "2t": {"mode": "constant", "value": 273.15},
        },

Looks a little like a recipe to build a dataset.

What about following the same syntax as in a recipe?

open_dataset(
    recipe={
        "dates": {"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"},
        "input": {
            "synthetic": {
                "variables": ["2t", "msl", "z_500"],
                "grid": {"bbox": [60, -10, 35, 30], "resolution": 0.25},  # or e.g. {"named": "o96"}
                "values": {
                    "default": {"mode": "random", "mean": 0.0, "std": 1.0},
                    "2t": {"mode": "constant", "value": 273.15},
                },
            },
        },
        "output": {"layout": "gridded"},
    },
)

Note that I am not suggesting that we should implement building datasets on the fly with any kind of recipe/source: everything in synthetic must be given explicitly by the config.
This feature should not reuse the dataset-building code. Keeping building and using datasets separate is important for now (think of them as two separate packages: introducing a dependency between them is not a light decision).

This would need more discussion.

frazane · 2026-05-26T09:21:22Z

> Interesting use case and solution.
>
> Did you see that for testing anemoi-datasets, we already have something similar here:
>
> anemoi-datasets/tests/test_data.py
>
> Line 100 in fbe30e2
>
> def create_zarr(
>
> This may not be enough for what you have in mind, though: it lives in the test code rather than in the anemoi-datasets package, and I think you want this to test other packages or to develop models. We could extend this functionality and expose it in open_dataset.

Hi Florian, I did see it and took some inspiration from it, but like you said, it does not really cover what the PR proposes. I think a "declarative synthetic dataset" API is a pretty good idea and something we needed; I am just not 100% sure about the implementation, so I am more than open to suggestions. I also wonder how we can mitigate the risk of this class "drifting" from the rest of the code (inheriting from GriddedZarr already helps in that respect, but maybe we can do more).

frazane · 2026-05-26T09:24:52Z

> Looks a little like a recipe to build a dataset.
>
> What about following the same syntax as in a recipe?

Yes! I was kind of going in that direction. I think in a way it would make total sense to do so. I agree with you it's not realistic to use a real recipe, but something a bit close to it, definitely. Happy to discuss this further.

cathalobrien · 2026-05-27T09:44:10Z

@frazane very nice feature. I was wondering, have you tried using this as an input during a training run, as a way to fake the data loading? This would also be useful for benchmarking, to determine if the filesystem is bottlenecking the GPUs.

frazane · 2026-05-27T10:15:08Z

> @frazane very nice feature. I was wondering, have you tried using this as an input during a training run, as a way to fake the data loading? This would also be useful for benchmarking, to determine if the filesystem is bottlenecking the GPUs.

I haven't tested it yet, but it's definitely one of the main goals here. I hadn't thought about the benchmarking aspect though, that's also very interesting.

I can give it a try.

frazane · 2026-05-27T11:14:51Z

@cathalobrien Oh wow, it worked out of the box. Very minimal (few variables, small model, callbacks disabled) config: synthetic_dataset.yaml.

frazane added 12 commits May 22, 2026 09:56

feat: add value generators for the synthetic dataset

1e18fef

feat: add grid resolvers for the synthetic dataset

9304570

feat: add synthetic dataset config parsing and validation

a00151c

feat: add the lazy SyntheticGriddedDataset leaf class

53af62f

feat: open synthetic datasets with open_dataset(synthetic=...)

cfbbfc9

docs: document open_dataset(synthetic=...)

94117a3

frazane added the enhancement and ATS Approval not needed labels May 22, 2026

github-project-automation Bot added this to Anemoi-dev May 22, 2026

github-project-automation Bot moved this to To be triaged in Anemoi-dev May 22, 2026

github-actions Bot added the bug, documentation and tests labels and removed the bug label May 22, 2026

frazane changed the title from "feat: in-memory synthetic gridded dataset" to "feat: synthetic gridded dataset" May 22, 2026

frazane removed the bug label May 22, 2026

github-actions Bot added the bug label May 22, 2026

frazane requested review from aaron-hopkinson, b8raoult and floriankrb May 22, 2026 14:17

Merge branch 'main' into feature/synthetic-dataset

14427e6

frazane added 2 commits May 27, 2026 15:02

Merge branch 'main' into feature/synthetic-dataset

ed00df8

Merge branch 'main' into feature/synthetic-dataset

a9f9ab2

frazane added ATS approval needed and removed ATS Approval not needed labels Jun 4, 2026

frazane marked this pull request as draft June 10, 2026 10:29

Merge branch 'main' into feature/synthetic-dataset

a937a8a

frazane mentioned this pull request Jun 10, 2026

In-memory synthetic datasets for testing and prototyping #657

Closed

frazane added ATS approved and removed ATS approval needed labels Jun 10, 2026

frazane and others added 5 commits June 10, 2026 16:03

Merge branch 'main' into feature/synthetic-dataset

85b9a09

Merge branch 'main' into feature/synthetic-dataset

21081de

Merge branch 'main' into feature/synthetic-dataset

98826da

rework synthetic open_dataset surface to v2

f6ef973

Merge branch 'main' into feature/synthetic-dataset

5707da4

frazane marked this pull request as ready for review June 16, 2026 15:28

floriankrb approved these changes Jun 17, 2026

github-project-automation Bot moved this from To be triaged to For merging in Anemoi-dev Jun 17, 2026

floriankrb merged commit 8cbfc01 into main Jun 17, 2026
73 checks passed

floriankrb deleted the feature/synthetic-dataset branch June 17, 2026 05:51

github-project-automation Bot moved this from For merging to Done in Anemoi-dev Jun 17, 2026

DeployDuck mentioned this pull request Jun 16, 2026

chore(main): Release 0.5.40 #667

Draft

floriankrb merged 21 commits into main from feature/synthetic-dataset

3 participants
