Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

unreleased

Changed

consider full time information for derived calculation of TOA radiation #84 @observingClouds

Fixes

fix bug where coordinate selection of an unshared dimension isn't applied to subsequent ouput variables when an output variable without this dimension is processed before the others #90 @zweihuehner & @leifdenby

v0.7.0

All changes

This release adds support for cropping a dataset using the convex hull of the lat/lon coordinates of another dataset, which can be used for creating boundary data in Limited Area Modelling setups.

Added

add support for cropping a dataset using the convex hull of the lat/lon coordinates of another dataset (can be used for creating boundary data in Limited Area Modelling setups) #45, @leifdenby

v0.6.1

All changes

This release contains bugfixes to update tests to use newer version of pre-commit, use correct python version, and remove uses of incompatible typing notation.

Fixes

use old union typing notation compatible with all required python versions #77 @SimonKamuk

Maintenance

update pre-commit action to v3.0.1 #77 @SimonKamuk
fix tests to use expected python version from test matrix #77 @SimonKamuk

This release adds the ability to slice input data by any coordinate, derive variables from input datasets, and store config in created datasets. It also adds support for zarr 3.0.0 and above, and a mypy typing action to pre-commit hooks. In addition a number of bugs were fixed related to adding unwanted dimensions to the dataset, chunk size estimates, and derived functions. The release also includes a number of maintenance updates including updating the DANRA test dataset to v0.2.0 (which smaller, leading to faster test execution) and updating the dataclass-wizard dependency to at least v0.29.2.

Added

add functionality to slice input data by any coordinate #55@matschreiner
add ability to derive variables from input datasets #34, @ealerskans, @mafdmi
add github PR template to guide development process on github #44, @leifdenby
add support for zarr 3.0.0 and above #51, @kashif
warn if the user tries to load a non-YAML file #50, @j6k4m8
add mypy typing action to pre-commit hooks #67, @observingClouds
add support for storing config in created datasets and option to only overwrite zarr dataset of config change #64, @leifdenby

Fixes

fix bug which adds unwanted dimensions to the dataset #60, @ealerskans, @observingClouds
correct chunk size estimate #59, @ealerskans
fix bug arising when variables provided to derived functions are renamed #56, @leifdenby
ensure config fields defaulting to None are typed as Optional and fields defaulting to {} are given a default-factory so that serialization with default values works correctly #63, @leifdenby
fix reading of exported config files #67, @observingClouds

Maintenance

update DANRA test dataset to v0.2.0 which uses a smaller cropped domain #62, @leifdenby
update dataclass-wizard dependency to at least v0.29.2 allowing for use of Union types together with check for unmatched keys in config yaml #73, @leifdenby

v0.5.0

All changes

This release adds support for an optional extra section in the config file (for user-defined extra information that is ignored by mllam-data-prep) and fixes a few minor issues. Note that to use extra section in the config file the schema version in the config file must be increased to v0.5.0.

Added

Add optional section called extra to config file to allow for user-defined extra information that is ignored by mllam-data-prep but can be used by downstream applications. , @leifdenby

Changed

remove f-string from name_format in config examples #35
replace global config for dataclass_wizard on mllam_data_prep.config.Config with config specific to that dataclass (to avoid conflicts with other uses of dataclass_wizard) #36
Schema version bumped to v0.5.0 to match release version that supports optional extra section in config #18

v0.4.0

All changes

This release adds support for defining the output path in the command line interface and addresses bugs around optional dependencies for dask.distributed.

Added

add access to CLI via mllam_data_prep and add tests for CLI with/without dask.distributed $\25$ .
add optional output path argument to parser.

Changed

fix bug by making dependency distributed optional
change config example to call validation split val instead of validation #28
fix typo in install dependency distributed
add missing psutil requirement. #21.

v0.3.0

All changes

Added

add support for parallel processing using dask.distributed with command line flags --dask-distributed-local-core-fraction and --dask-distributed-local-memory-fraction to control the number of cores and memory to use on the local machine.

v0.2.0

All changes

Added

add support for creating dataset splits (e.g. train, validation, test) through output.splitting section in the config file, and support for optionally compute statistics for a given split (with output.splitting.splits.{split_name}.compute_statistics). .
include units and long_name attributes for all stacked variables as {output_variable}_units and {output_variable}_long_name .
include version of mllam-data-prep in output

Changed

split dataset creation and storage to zarr into separate functions mllam_data_prep.create_dataset(...) and mllam_data_prep.create_dataset_zarr(...) respectively
changes to spec from v0.1.0:
- the architecture section has been renamed output to make it clearer that this section defines the properties of the output of mllam-data-prep
- sampling_dim removed from output (previously architecture) section of spec, this is not needed to create the training data
- the variables (and their dimensions) of the output definition has been renamed from architecture.input_variables to output.variables
- coordinate value ranges for the dimensions of the output (i.e. what that the architecture expects as input) has been renamed from architecture.input_ranges to output.coord_ranges to make the use more clear
- selection on variable coordinates values is now set with inputs.{dataset_name}.variables.{variable_name}.values rather than inputs.{dataset_name}.variables.{variable_name}.sel
- when dimension-mapping method stack_variables_by_var_name is used the formatting string for the new variable is now called name_format rather than name
- when dimension-mapping is done by simply renaming a dimension this configuration now needs to be set by providing the named method (rename) explicitly through the method key, i.e. rather than {to_dim}: {from_dim} it is now {to_dim}: {method: rename, dim: {from_dim}} to match the signature of the other dimension-mapping methods.
- attribute inputs.{dataset_name}.name attribute has been removed, with the key dataset_name this is superfluous
relax minimuim python version requirement to >3.8 to simplify downstream usage

v0.1.0

First tagged release of mllam-data-prep which includes functionality to declaratively (in a yaml-config file) describe how the variables and coordinates of a set of zarr-based source datasets are mapped to a new set of variables with new coordinates to single a training dataset and write this resulting single dataset to a new zarr dataset. This explicit mapping gives the flexibility to target different different model architectures (which may require different inputs with different shapes between architectures).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changelog

unreleased

Changed

Fixes

v0.7.0

Added

v0.6.1

Fixes

Maintenance

v0.6.0

Added

Fixes

Maintenance

v0.5.0

Added

Changed

v0.4.0

Added

Changed

v0.3.0

Added

v0.2.0

Added

Changed

v0.1.0

FilesExpand file tree

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

Changed

Fixes

Added

Fixes

Maintenance

Added

Fixes

Maintenance

Added

Changed

Added

Changed

Added

Added

Changed