feat: synthetic gridded dataset #637

Merged
floriankrb merged 21 commits into main from feature/synthetic-dataset
Jun 17, 2026

@frazane

@frazane frazane commented May 22, 2026

Contributor

Motivation

Testing and prototyping anemoi graphs, training, and inference (and sometimes anemoi-datasets itself) requires a dataset, and today that means building a real Zarr store from a recipe, data sources, and compute, then keeping it on disk. There is no lightweight stand-in that can be used the same way for quick experiments, CI and unit tests, or benchmarking. A synthetic dataset could also open up other applications, such as more theoretical machine learning research questions.

This PR adds an in-memory synthetic gridded dataset, opened with open_dataset(synthetic={...}). It needs no data, disk, or network.

What it adds

open_dataset(synthetic={...}) builds a SyntheticGriddedDataset — a GriddedZarr over a lazy in-memory store, so it inherits the full dataset contract.

from anemoi.datasets import open_dataset

ds = open_dataset(
    synthetic={
        "geography": {"bbox": [60, -10, 35, 30], "resolution": 0.25},  # or e.g. {"named": "o96"}
        "dates": {"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"},
        "layout": "gridded",
        "variables": [
            {"name": "2t", "values": {"constant": 273.15}},
            "msl",          # uses the dataset default generator
            "insolation",   # a computed forcing
        ],
    },
)
  • Geography: bbox, named, icon, unstructured — a one-of dict resolving to flat (latitudes, longitudes).
  • Variables: each entry of variables is a name string, or a dict with a mandatory name plus optional values, metadata, statistics and tendencies_statistics. Per-variable metadata is surfaced on the dataset's variables_metadata.
  • Value generators: {"constant": 273.15} (value given directly) or {"random": {"mean": 0, "std": 1}} (seeded, reproducible Gaussian). A bare scalar is shorthand for constant; a bare string names a generator with its defaults. The top-level values is the default for variables without their own. Values are generated on the fly: only a requested batch ds[idx] is in memory at any time.
  • Computed forcings: a variable named insolation, cos_latitude, sin_julian_day, … is generated through earthkit's forcings source from the grid and dates (no template field needed). It owns its own generation, so it takes no values block.
  • layout: gridded is implemented; tabular / trajectories are reserved (raise NotImplementedError), leaving room for non-gridded layouts without an API break.
  • Composes with the usual transform keywords (cutout, select, start, rename, …), so a synthetic dataset can replace a real one in an existing open_dataset spec.
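The value-generator shorthand in the list above can be pictured with a small normalizer. This is only a sketch under stated assumptions: normalize_values, generate_batch and GENERATOR_DEFAULTS are hypothetical names, not functions from this PR, and seeding from the date index is just one possible way to get reproducible values without storing anything.

```python
import numpy as np

# Hypothetical defaults for a generator named by a bare string (illustration only).
GENERATOR_DEFAULTS = {"random": {"mean": 0.0, "std": 1.0}}


def normalize_values(spec):
    """Expand the shorthand forms into a one-key {generator: params} dict."""
    if isinstance(spec, bool):
        raise TypeError("a boolean is not a valid values spec")
    if isinstance(spec, (int, float)):
        return {"constant": float(spec)}  # bare scalar -> constant
    if isinstance(spec, str):
        if spec not in GENERATOR_DEFAULTS:
            raise ValueError(f"unknown generator: {spec!r}")
        return {spec: dict(GENERATOR_DEFAULTS[spec])}  # bare string -> defaults
    if isinstance(spec, dict) and len(spec) == 1:
        return spec  # already in the explicit form
    raise TypeError(f"cannot interpret values spec: {spec!r}")


def generate_batch(spec, date_index, n_points, seed=0):
    """Generate one date's field on demand; nothing is kept between calls."""
    (kind, params), = normalize_values(spec).items()
    if kind == "constant":
        return np.full(n_points, params, dtype=np.float64)
    if kind == "random":
        # Seeding from (seed, date_index) makes a given index reproducible.
        rng = np.random.default_rng([seed, date_index])
        return rng.normal(params["mean"], params["std"], size=n_points)
    raise ValueError(f"unknown generator: {kind!r}")
```

With this shape, asking twice for the same date index returns identical values, while memory use stays bounded by the size of the requested batch.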

Self-contained: one new module plus a small dispatch hook in open_dataset; no existing dataset code is changed. Covered by 66 tests.

Possible future avenues

  • SyntheticTabularDataset: a tabular counterpart over TabularZarr, reusing the value generators and config parsing — the layout switch already reserves the seam.
  • More value generators: structured/smooth spatial fields, time-varying signals, per-variable physical ranges, NaNs. Long shot: values generated on the fly from other programs (...for distillation?)
  • Missing dates: currently not supported.
  • Variables metadata: support could be improved, e.g. default metadata based on the ECMWF parameter database.

This change was developed in tandem with AI coding agents.


As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.


📚 Documentation preview 📚: https://anemoi-datasets--637.org.readthedocs.build/en/637/


frazane added 12 commits May 22, 2026 09:56
_SyntheticArray reimplemented date-axis indexing by hand: out-of-range
indices silently fabricated a nonexistent timestep, a boolean mask was
cast to integer positions, and a numpy array inside a tuple index raised
an ambiguous-truth-value ValueError from _expand_index.

Delegate date-axis resolution to numpy by indexing arange(n_dates): this
reproduces numpy's bounds-checking, negative wrapping, and fancy/boolean
indexing, so an out-of-range index now raises IndexError as a real
zarr-backed dataset does. _expand_index uses identity checks instead of
== against Ellipsis. _resolve_dates is removed.
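The delegation described in this commit message can be illustrated in a few lines. This is a minimal sketch, not the code from the PR: resolve_date_index is a hypothetical helper name, and it only covers the date axis.

```python
import numpy as np


def resolve_date_index(index, n_dates):
    """Map any numpy-style index on the date axis to integer positions."""
    # Indexing arange(n_dates) lets numpy handle bounds checks, negative
    # wrapping, slices, and fancy/boolean indexing.
    return np.arange(n_dates)[index]


print(resolve_date_index(-1, 4))                 # 3
print(resolve_date_index(slice(1, None, 2), 4))  # [1 3]
mask = np.array([True, False, True, False])
print(resolve_date_index(mask, 4))               # [0 2]
try:
    resolve_date_index(4, 4)  # one past the end
except IndexError as exc:
    print("IndexError:", exc)
```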

An integer dtype combined with a 'random' or fractional 'constant' value
mode silently truncated the generated values, while the analytic
statistics kept reporting the un-truncated moments -- so statistics()
disagreed with the data the dataset returned.

Add _check_value_dtype, which rejects an integer dtype whenever a
generator produces non-integer data, alongside the existing
_check_index_dtype overflow guard.
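A guard of this kind might look roughly as follows. This is a sketch only: check_value_dtype and its signature are assumptions made for illustration, and the real _check_value_dtype may be structured differently.

```python
import numpy as np


def check_value_dtype(dtype, generator, params=None):
    """Reject an integer dtype when the generator yields non-integer data."""
    dtype = np.dtype(dtype)
    if not np.issubdtype(dtype, np.integer):
        return  # floating dtypes can hold any generated value
    if generator == "random":
        raise ValueError(f"'random' values need a floating dtype, got {dtype}")
    if generator == "constant" and not float(params).is_integer():
        raise ValueError(f"constant {params} would be truncated by dtype {dtype}")


check_value_dtype("float32", "random")     # fine
check_value_dtype("int32", "constant", 3)  # fine, the value is integral
try:
    check_value_dtype("int32", "constant", 273.15)
except ValueError as exc:
    print(exc)
```

Failing early keeps the analytic statistics and the returned data consistent, instead of letting a silent cast change the values.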

GriddedZarr.constant_fields always recomputes the constant fields from
the data and ignores the stored attribute. With a single date it cannot
distinguish a constant field from a varying one and marks every field
constant.

Override constant_fields on SyntheticGriddedDataset to return the answer the
synthetic config already knows exactly, recorded on the store.
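The single-date ambiguity is easy to reproduce outside the library. The sketch below uses hypothetical helpers (constant_fields_from_data, constant_fields_from_config) and a simplifying assumption: only variables with a 'constant' generator count as constant, which ignores time-invariant computed forcings such as cos_latitude.

```python
import numpy as np


def constant_fields_from_data(data, names):
    """Data-driven detection: constant if the field never changes over dates."""
    # data has shape (dates, variables, grid points)
    return [n for i, n in enumerate(names) if np.all(data[:, i] == data[:1, i])]


def constant_fields_from_config(variables):
    """Config-driven answer (simplified): only 'constant' generators qualify."""
    return [v["name"] for v in variables if "constant" in v["values"]]


variables = [
    {"name": "2t", "values": {"constant": 273.15}},
    {"name": "msl", "values": {"random": {"mean": 0.0, "std": 1.0}}},
]
names = [v["name"] for v in variables]

# One date only: shape (1, 2, 5)
rng = np.random.default_rng(0)
data = np.stack([np.full((1, 5), 273.15), rng.normal(size=(1, 5))], axis=1)

print(constant_fields_from_data(data, names))   # ['2t', 'msl']
print(constant_fields_from_config(variables))   # ['2t']
```

With a single date every field trivially equals itself, so only the configuration can tell the two variables apart.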

_resolve_bbox built the grid axes with a float-step np.arange, which
left floating-point fuzz on the edges -- e.g. a 0.1-degree grid over
[0, 10] ended at 3.5e-14 instead of 0.0, breaking any downstream
equality comparison of coordinates.

Use np.linspace with a computed point count, which pins both endpoints
exactly.
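The linspace approach can be sketched as below. grid_axis is a hypothetical helper, not _resolve_bbox itself, and the bbox is assumed to be ordered north, west, south, east as in the example config.

```python
import numpy as np


def grid_axis(start, stop, resolution):
    """Evenly spaced axis from start to stop (either direction)."""
    # Compute the number of points once, then let linspace place them;
    # linspace returns start and stop themselves as the end values.
    n_points = int(round(abs(stop - start) / resolution)) + 1
    return np.linspace(start, stop, n_points)


north, west, south, east = 60.0, -10.0, 35.0, 30.0  # assumed bbox order
lats = grid_axis(north, south, 0.25)
lons = grid_axis(west, east, 0.25)

print(len(lats), len(lons))  # 101 161
print(lats[0], lats[-1])     # 60.0 35.0
print(lons[0], lons[-1])     # -10.0 30.0
```

If the extent is not a whole multiple of the resolution, this sketch keeps the endpoints and adjusts the spacing slightly; a real implementation may prefer to reject such input.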

The synthetic config carried start/end/frequency as flat top-level keys,
inconsistent with the dataset-building recipe API, where those live in a
nested 'dates' block.

Move them under a required 'dates' dict, validated like the top-level
config (must be a dict, must hold exactly start/end/frequency). This is
the only synthetic config block with a clean recipe analogue; grid,
variables and the value keys stay flat.
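The validation rule stated here (must be a dict, must hold exactly start/end/frequency) could be sketched like this. validate_dates is a hypothetical name and the error messages are illustrative, not taken from the PR.

```python
def validate_dates(config):
    """Check that a synthetic config carries a well-formed 'dates' block."""
    dates = config.get("dates")
    if not isinstance(dates, dict):
        raise TypeError("'dates' must be a dict with start, end and frequency")
    expected = {"start", "end", "frequency"}
    if set(dates) != expected:
        missing = sorted(expected - set(dates))
        extra = sorted(set(dates) - expected)
        raise ValueError(f"invalid 'dates' block: missing={missing}, unexpected={extra}")
    return dates


ok = {"dates": {"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"}}
print(validate_dates(ok)["frequency"])  # 6h

try:
    # The old flat layout no longer passes.
    validate_dates({"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"})
except TypeError as exc:
    print(exc)
```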

open_dataset(synthetic=...) rejected every other keyword, so swapping a
real dataset for a synthetic one in an existing spec meant rewriting the
whole spec instead of changing one line.

A SyntheticGriddedDataset is a genuine GriddedZarr, so the transform
wrappers apply to it like any dataset. synthetic_factory now consumes
only the 'synthetic' key and leaves the rest for _subset, making
synthetic= a drop-in replacement for dataset=. Combination keywords
(cutout, join, ...) are matched earlier in _open_dataset and still
cannot co-occur with synthetic=.
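The keyword handling described in this commit message amounts to taking one key and passing the rest along. The sketch below is an assumption about the shape of that logic: split_synthetic_kwargs and COMBINATION_KEYS are hypothetical names, and only the two combination keywords named in the message are listed.

```python
# Combination keywords named in the commit message; the real list is longer.
COMBINATION_KEYS = {"cutout", "join"}


def split_synthetic_kwargs(kwargs):
    """Return (synthetic_config, remaining_kwargs) without mutating the input."""
    clash = COMBINATION_KEYS & set(kwargs)
    if clash:
        raise ValueError(f"synthetic= cannot be combined with {sorted(clash)}")
    remaining = dict(kwargs)
    synthetic = remaining.pop("synthetic")
    return synthetic, remaining  # remaining goes on to the subsetting step


spec = {
    "synthetic": {"variables": ["2t", "msl"]},
    "select": ["2t"],
    "start": "2020-06-01",
}
synthetic, rest = split_synthetic_kwargs(spec)
print(sorted(rest))  # ['select', 'start']
```

Under this reading, an existing spec only needs its dataset= entry swapped for synthetic=, and transform keywords such as select or start keep working unchanged.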

@github-project-automation github-project-automation Bot moved this to To be triaged in Anemoi-dev May 22, 2026
@github-actions github-actions Bot added the bug, documentation, and tests labels and removed the bug label May 22, 2026
@frazane frazane changed the title feat: in-memory synthetic gridded dataset feat: synthetic gridded dataset May 22, 2026
@frazane frazane removed the bug label May 22, 2026
@github-actions github-actions Bot added the bug label May 22, 2026
@floriankrb

Member

Interesting use case and solution.

Did you see that for testing anemoi-datasets, we already have something similar here:

def create_zarr(

This may not be enough for what you have in mind, though, especially because it lives in the test code rather than in the anemoi-datasets package, and I think you want this to test other packages or to develop models. We could extend this functionality and expose it in open_dataset.

@floriankrb

floriankrb commented May 26, 2026

Member

Also, the dictionary inside open_dataset()

    "dates": {"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"},
    "grid": {"bbox": [60, -10, 35, 30], "resolution": 0.25},  # or e.g. {"named": "o96"}
    "variables": ["2t", "msl", "z_500"],
    "values": {
        "default": {"mode": "random", "mean": 0.0, "std": 1.0},
        "2t": {"mode": "constant", "value": 273.15},
    },

Looks a little like a recipe to build a dataset.

What about following the same syntax as in a recipe?

open_dataset(
    recipe={
        "dates": {"start": "2020-01-01", "end": "2020-12-31", "frequency": "6h"},
        "input": {
            "synthetic": {
                "variables": ["2t", "msl", "z_500"],
                "grid": {"bbox": [60, -10, 35, 30], "resolution": 0.25},  # or e.g. {"named": "o96"}
                "values": {
                    "default": {"mode": "random", "mean": 0.0, "std": 1.0},
                    "2t": {"mode": "constant", "value": 273.15},
                },
            },
        },
        "output": {"layout": "gridded"},
    },
)

Note that I am not suggesting that we should implement building datasets on the fly with any kind of recipe/source, and all that is in synthetic must be explicitly given by the config.
This feature should not reuse the code building datasets, and keeping the two parts of building vs using datasets separate is important for now (like having two separate packages and introducing a dependency, this is not a light decision).

This would need more discussion.

@frazane

frazane commented May 26, 2026

Contributor Author

> Interesting use case and solution.
>
> Did you see that for testing anemoi-datasets, we already have something similar here:
>
> def create_zarr(
>
> This may not be enough for what you have in mind, though, especially because it lives in the test code rather than in the anemoi-datasets package, and I think you want this to test other packages or to develop models. We could extend this functionality and expose it in open_dataset.

Hi Florian, I did see it and took some inspiration from it but like you said it does not really cover what the PR proposes. I think this is a pretty good idea and we needed that "declarative synthetic dataset" API, I am just not 100% sure about the implementation, so I am more than open to suggestions. I also wonder how we can mitigate the risk of this class "drifting" from the rest of the code (inheriting from GriddedZarr already helps in that respect, but maybe we can do more).

@frazane

frazane commented May 26, 2026

Contributor Author

> Looks a little like a recipe to build a dataset.
>
> What about following the same syntax as in a recipe?

Yes! I was kind of going in that direction. I think in a way it would make total sense to do so. I agree with you it's not realistic to use a real recipe, but something a bit close to it, definitely. Happy to discuss this further.

@cathalobrien

Contributor

@frazane Very nice feature. I was wondering, have you tried using this as an input during a training run, as a way to fake the data loading? This would also be useful for benchmarking, to determine whether the filesystem is bottlenecking the GPUs.

@frazane

frazane commented May 27, 2026

Contributor Author

> @frazane Very nice feature. I was wondering, have you tried using this as an input during a training run, as a way to fake the data loading? This would also be useful for benchmarking, to determine whether the filesystem is bottlenecking the GPUs.

I haven't tested it yet, but it's definitely one of the main goals here. I hadn't thought about the benchmarking aspect though, that's also very interesting.

I can give it a try.

@frazane

frazane commented May 27, 2026

Contributor Author

@cathalobrien Oh wow, it worked out of the box. Very minimal (few variables, small model, callbacks disabled) config: synthetic_dataset.yaml.

@frazane frazane marked this pull request as draft June 10, 2026 10:29
@frazane frazane marked this pull request as ready for review June 16, 2026 15:28
@github-project-automation github-project-automation Bot moved this from To be triaged to For merging in Anemoi-dev Jun 17, 2026
@floriankrb floriankrb merged commit 8cbfc01 into main Jun 17, 2026
73 checks passed
@floriankrb floriankrb deleted the feature/synthetic-dataset branch June 17, 2026 05:51
@github-project-automation github-project-automation Bot moved this from For merging to Done in Anemoi-dev Jun 17, 2026

Labels

ATS approved, bug, documentation, enhancement, tests

Projects

Status: Done


3 participants