feat: synthetic gridded dataset #637
_SyntheticArray reimplemented date-axis indexing by hand: out-of-range indices silently fabricated a nonexistent timestep, a boolean mask was cast to integer positions, and a numpy array inside a tuple index raised an ambiguous-truth-value ValueError from _expand_index. Delegate date-axis resolution to numpy by indexing arange(n_dates): this reproduces numpy's bounds-checking, negative wrapping, and fancy/boolean indexing, so an out-of-range index now raises IndexError as a real zarr-backed dataset does. _expand_index uses identity checks instead of == against Ellipsis. _resolve_dates is removed.
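A minimal sketch of the delegation described above, not the code from this PR: `resolve_date_positions` and `has_ellipsis` are hypothetical names chosen for illustration. Indexing `np.arange(n_dates)` lets numpy do the bounds checks, negative wrapping and mask handling, and the identity test avoids comparing an array against `Ellipsis`.

```python
import numpy as np


def resolve_date_positions(index, n_dates):
    """Return the integer date positions selected by `index`.

    Hypothetical helper: numpy itself validates and resolves the index.
    """
    return np.arange(n_dates)[index]


def has_ellipsis(index_tuple):
    """Identity check: safe even when the tuple holds numpy arrays."""
    return any(item is Ellipsis for item in index_tuple)


print(resolve_date_positions(-1, 5))            # negative index wraps
print(resolve_date_positions(slice(1, 4), 5))   # plain slice
mask = np.array([True, False, True, False, False])
print(resolve_date_positions(mask, 5))          # boolean mask selects positions

try:
    resolve_date_positions(5, 5)                # one past the end
except IndexError as exc:
    print("IndexError:", exc)
```

Using `==` instead of `is` inside `has_ellipsis` would compare each array element-wise, which is the kind of ambiguous-truth-value failure the commit message describes.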
An integer dtype combined with a 'random' or fractional 'constant' value mode silently truncated the generated values, while the analytic statistics kept reporting the un-truncated moments -- so statistics() disagreed with the data the dataset returned. Add _check_value_dtype, which rejects an integer dtype whenever a generator produces non-integer data, alongside the existing _check_index_dtype overflow guard.
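The guard could look roughly like the sketch below. The function name, its signature and the choice of `ValueError` are assumptions made for illustration; the actual `_check_value_dtype` may differ.

```python
import numpy as np


def check_value_dtype(dtype, mode, value=None):
    """Reject an integer dtype when the generator yields non-integer data.

    Hypothetical stand-in for the guard described above.
    """
    if not np.issubdtype(np.dtype(dtype), np.integer):
        return  # floating dtypes can hold any generated value
    if mode == "random":
        raise ValueError(f"dtype {dtype!r} would truncate Gaussian samples")
    if mode == "constant" and not float(value).is_integer():
        raise ValueError(f"dtype {dtype!r} would truncate constant {value!r}")


# The silent truncation the guard prevents:
print(np.full(3, 273.15).astype("int32"))  # fractional part is dropped

check_value_dtype("float32", "random")     # accepted
check_value_dtype("int32", "constant", 3)  # accepted: the value is integral
```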
GriddedZarr.constant_fields always recomputes the constant fields from the data and ignores the stored attribute. With a single date it cannot distinguish a constant field from a varying one and marks every field constant. Override constant_fields on SyntheticGriddedDataset to return the answer the synthetic config already knows exactly, recorded on the store.
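To illustrate why a single date is ambiguous, here is a toy detector. It is an assumption about how constancy might be inferred from data (each field compared against its first date) and is not the GriddedZarr implementation; `detect_constant_fields` is a hypothetical name.

```python
import numpy as np


def detect_constant_fields(data, names):
    """Toy detector: a field is 'constant' if it equals its first date everywhere.

    `data` has shape (dates, variables, gridpoints).
    """
    return [
        name
        for i, name in enumerate(names)
        if np.all(data[:, i] == data[:1, i])
    ]


names = ["t", "z"]
data = np.array(
    [
        [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]],  # date 0
        [[4.0, 5.0, 6.0], [5.0, 6.0, 7.0]],  # date 1: "t" changed, "z" did not
    ]
)

print(detect_constant_fields(data, names))      # two dates: only "z"
print(detect_constant_fields(data[:1], names))  # one date: every field qualifies
```

With one date there is nothing to compare against, so any data-driven check of this kind marks every field constant, which is why the synthetic dataset returns the answer it already knows from its config.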
_resolve_bbox built the grid axes with a float-step np.arange, which left floating-point fuzz on the edges -- e.g. a 0.1-degree grid over [0, 10] ended at 3.5e-14 instead of 0.0, breaking any downstream equality comparison of coordinates. Use np.linspace with a computed point count, which pins both endpoints exactly.
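A sketch of the linspace approach, assuming the point count is derived by rounding the span divided by the step; `grid_axis` is a hypothetical helper, not the body of `_resolve_bbox`.

```python
import numpy as np


def grid_axis(start, stop, step):
    """Build an axis from `start` to `stop` with both endpoints pinned exactly.

    The point count is computed up front, so no float step is accumulated.
    """
    n_points = int(round(abs(stop - start) / step)) + 1
    return np.linspace(start, stop, n_points)


lats = grid_axis(10.0, 0.0, 0.1)  # descending latitudes on a 0.1-degree grid
print(len(lats), lats[0], lats[-1])
```

Because `np.linspace` includes the endpoint by default, the first and last coordinates equal `start` and `stop` exactly, so equality comparisons on the grid edges are safe.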
The synthetic config carried start/end/frequency as flat top-level keys, inconsistent with the dataset-building recipe API, where those live in a nested 'dates' block. Move them under a required 'dates' dict, validated like the top-level config (must be a dict, must hold exactly start/end/frequency). This is the only synthetic config block with a clean recipe analogue; grid, variables and the value keys stay flat.
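The validation rules stated above (must be a dict, must hold exactly start/end/frequency) can be sketched as follows. The function name, the exception types and the example date values are assumptions for illustration only.

```python
REQUIRED_DATE_KEYS = {"start", "end", "frequency"}


def validate_dates_block(config):
    """Return the nested 'dates' dict after checking its shape.

    Hypothetical validator mirroring the rules in the commit message.
    """
    if "dates" not in config:
        raise ValueError("synthetic config requires a 'dates' block")
    dates = config["dates"]
    if not isinstance(dates, dict):
        raise TypeError("'dates' must be a dict")
    missing = REQUIRED_DATE_KEYS - set(dates)
    unexpected = set(dates) - REQUIRED_DATE_KEYS
    if missing or unexpected:
        raise ValueError(
            f"'dates' must hold exactly {sorted(REQUIRED_DATE_KEYS)}; "
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )
    return dates


# Illustrative values only; the accepted date and frequency formats are not shown here.
config = {"dates": {"start": "2020-01-01", "end": "2020-01-31", "frequency": "6h"}}
print(validate_dates_block(config))
```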
open_dataset(synthetic=...) rejected every other keyword, so swapping a real dataset for a synthetic one in an existing spec meant rewriting the whole spec instead of changing one line. A SyntheticGriddedDataset is a genuine GriddedZarr, so the transform wrappers apply to it like any dataset. synthetic_factory now consumes only the 'synthetic' key and leaves the rest for _subset, making synthetic= a drop-in replacement for dataset=. Combination keywords (cutout, join, ...) are matched earlier in _open_dataset and still cannot co-occur with synthetic=.
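The keyword handling can be pictured with the sketch below. `split_synthetic_kwargs` is a hypothetical name, the dataset name is a placeholder, and the synthetic config is abridged; the real `synthetic_factory` builds the dataset and hands the remaining keywords to `_subset`.

```python
def split_synthetic_kwargs(kwargs):
    """Pop only the 'synthetic' key; everything else is left for later wrappers."""
    remaining = dict(kwargs)  # copy, so the caller's spec is not mutated
    synthetic_config = remaining.pop("synthetic")
    return synthetic_config, remaining


# An existing spec for a real dataset (placeholder name).
real_spec = {"dataset": "some-real-dataset", "select": ["2t"], "start": 2020}

# The drop-in swap: only the 'dataset' line changes.
synthetic_spec = {k: v for k, v in real_spec.items() if k != "dataset"}
synthetic_spec["synthetic"] = {"variables": ["2t"]}  # abridged placeholder config

config, rest = split_synthetic_kwargs(synthetic_spec)
print(config)
print(rest)
```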
Interesting use case and solution. Did you see that for testing anemoi-datasets, we already have something similar here: anemoi-datasets/tests/test_data.py (line 100 in fbe30e2)? This may not be enough for what you have in mind, though, especially because it is in the test code and not available in the anemoi-datasets package, and I think you want to do this to test other packages or to develop models. We could extend this functionality and expose it in open_dataset.
Also, the dictionary inside Looks a little like a recipe to build a dataset. What about following the same syntax as in a recipe? Note that I am not suggesting that we should implement building datasets on the fly with any kind of recipe/source, and all that is in synthetic must be explicitly given by the config. This would need more discussion.
Hi Florian, I did see it and took some inspiration from it, but like you said, it does not really cover what the PR proposes. I think this is a pretty good idea and we needed that "declarative synthetic dataset" API. I am just not 100% sure about the implementation, so I am more than open to suggestions. I also wonder how we can mitigate the risk of this class "drifting" from the rest of the code (inheriting from GriddedZarr already helps in that respect, but maybe we can do more).
Yes! I was kind of going in that direction. I think in a way it would make total sense to do so. I agree with you that it's not realistic to use a real recipe, but something a bit close to it, definitely. Happy to discuss this further.
@frazane very nice feature. I was wondering, have you tried using this as an input during a training run, as a way to fake the data loading? This would also be useful for benchmarking, to determine if the filesystem is bottlenecking the GPUs.
I haven't tested it yet, but it's definitely one of the main goals here. I hadn't thought about the benchmarking aspect though; that's also very interesting. I can give it a try.
@cathalobrien Oh wow, it worked out of the box. Very minimal (few variables, small model, callbacks disabled) config: synthetic_dataset.yaml.
Motivation
Testing and prototyping anemoi graphs/training/inference (and sometimes datasets itself) needs a dataset, and today that means building a real Zarr store from a recipe, data sources, and compute, then keeping it on disk. There is no lightweight stand-in that can be used the same way for quick experiments, CI and unit tests, or benchmarking. A synthetic dataset could also open the door to other applications, such as more theoretical machine learning research questions.
This PR adds an in-memory synthetic gridded dataset, opened with open_dataset(synthetic={...}). It needs no data, disk, or network.
What it adds
- open_dataset(synthetic={...}) builds a SyntheticGriddedDataset: a GriddedZarr over a lazy in-memory store, so it inherits the full dataset contract.
- bbox, named, icon, unstructured: a one-of dict resolving to flat (latitudes, longitudes).
- variables: each entry is a name string, or a dict with a mandatory name plus optional values, metadata, statistics and tendencies_statistics. Per-variable metadata is surfaced on the dataset's variables_metadata.
- {"constant": 273.15} (value given directly) or {"random": {"mean": 0, "std": 1}} (seeded, reproducible Gaussian). A bare scalar is shorthand for constant, a bare string names a generator with its defaults. The top-level values is the default for variables without their own. Values are generated on the fly: only a requested batch ds[idx] is in memory at any time.
- insolation, cos_latitude, sin_julian_day, …: generated through earthkit's forcings source from the grid and dates (no template field needed). It owns its own generation, so it takes no values block.
- layout: gridded is implemented; tabular/trajectories are reserved (raise NotImplementedError), leaving room for non-gridded layouts without an API break.
- Accepts other open_dataset keywords (cutout, select, start, rename, …), so a synthetic dataset can replace a real one in an existing open_dataset spec.
Self-contained: one new module plus a small dispatch hook in open_dataset; no existing dataset code is changed. Covered by 66 tests.
Possible future avenues
- SyntheticTabularDataset: a tabular counterpart over TabularZarr, reusing the value generators and config parsing; the layout switch already reserves the seam.
This change was developed in tandem with AI coding agents.
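The value shorthands listed under "What it adds" (a bare scalar means constant, a bare string names a generator with its defaults) and the on-the-fly, seeded generation could be sketched as below. Everything here is an assumption for illustration: `normalise_values` and `generate_batch` are hypothetical names, seeding per date index is one possible way to get reproducibility, and the default mean 0 / std 1 are guesses at the generator defaults.

```python
import numpy as np


def normalise_values(spec):
    """Expand the shorthands into a single-key {mode: params} dict."""
    if isinstance(spec, bool):
        raise TypeError("a boolean is not a valid values spec")
    if isinstance(spec, (int, float)):
        return {"constant": spec}   # bare scalar -> constant
    if isinstance(spec, str):
        return {spec: {}}           # bare string -> generator with defaults
    if isinstance(spec, dict) and len(spec) == 1:
        return spec
    raise TypeError(f"unsupported values spec: {spec!r}")


def generate_batch(spec, date_index, n_points, seed=0):
    """Generate one date's values on demand; nothing is stored between calls."""
    ((mode, params),) = normalise_values(spec).items()
    if mode == "constant":
        return np.full(n_points, params, dtype="float64")
    if mode == "random":
        # Assumed scheme: seed from (seed, date_index) so each date is reproducible.
        rng = np.random.default_rng([seed, date_index])
        return rng.normal(params.get("mean", 0.0), params.get("std", 1.0), n_points)
    raise ValueError(f"unknown value mode: {mode!r}")


print(normalise_values(273.15))
print(normalise_values("random"))
print(generate_batch({"random": {"mean": 0, "std": 1}}, date_index=3, n_points=4))
```

Requesting the same date twice yields the same samples without holding any array in memory between requests, which matches the "only a requested batch is in memory" property described above.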
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.
📚 Documentation preview 📚: https://anemoi-datasets--637.org.readthedocs.build/en/637/