All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- consider full time information for derived calculation of TOA radiation #84 @observingClouds
- fix bug where coordinate selection of an unshared dimension isn't applied to subsequent output variables when an output variable without this dimension is processed before the others #90 @zweihuehner & @leifdenby
This release adds support for cropping a dataset using the convex hull of the lat/lon coordinates of another dataset, which can be used for creating boundary data in Limited Area Modelling setups.
- add support for cropping a dataset using the convex hull of the lat/lon coordinates of another dataset (can be used for creating boundary data in Limited Area Modelling setups) #45, @leifdenby
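The cropping technique can be sketched independently of the package's API: compute the convex hull of one dataset's lat/lon points and keep only the points of another dataset that fall inside it. A minimal pure-Python sketch of that idea (function names are made up for illustration; this is not mllam-data-prep's implementation, which operates on xarray datasets):

```python
def cross(o, a, b):
    """2D cross product of vectors OA and OB; > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone-chain convex hull, returned counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, p):
    """True if point p lies inside (or on) the counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def crop_to_hull(boundary_points, target_points):
    """Keep only (lon, lat) points that fall inside the hull of boundary_points."""
    hull = convex_hull(boundary_points)
    return [p for p in target_points if inside_hull(hull, p)]
```

In a Limited Area Modelling setup the boundary points would come from the limited-area dataset and the cropped points from the larger host dataset.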
This release contains bugfixes that update tests to use a newer version of pre-commit, use the correct python version, and remove uses of incompatible typing notation.
- use old union typing notation compatible with all required python versions #77 @SimonKamuk
- update pre-commit action to v3.0.1 #77 @SimonKamuk
- fix tests to use expected python version from test matrix #77 @SimonKamuk
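The union-typing fix above refers to notation like the following. `parse_port` is a made-up helper showing the `typing`-module spelling that works on all supported Python versions, whereas the PEP 604 `str | int` syntax requires Python 3.10+:

```python
from typing import Optional, Union

# Old-style union notation from the `typing` module works on all
# supported Python versions; `str | int` annotations need 3.10+.
def parse_port(value: Union[str, int], default: Optional[int] = None) -> Optional[int]:
    """Coerce a port given as str or int; fall back to `default` if empty."""
    if value == "" or value is None:
        return default
    return int(value)
```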
This release adds the ability to slice input data by any coordinate, derive variables from input datasets, and store config in created datasets. It also adds support for zarr 3.0.0 and above, and a mypy typing action to pre-commit hooks. In addition, a number of bugs were fixed related to adding unwanted dimensions to the dataset, chunk size estimates, and derived functions. The release also includes a number of maintenance updates including updating the DANRA test dataset to v0.2.0 (which is smaller, leading to faster test execution) and updating the dataclass-wizard dependency to at least v0.29.2.
- add functionality to slice input data by any coordinate #55, @matschreiner
- add ability to derive variables from input datasets #34, @ealerskans, @mafdmi
- add github PR template to guide development process on github #44, @leifdenby
- add support for zarr 3.0.0 and above #51, @kashif
- warn if the user tries to load a non-YAML file #50, @j6k4m8
- add mypy typing action to pre-commit hooks #67, @observingClouds
- add support for storing config in created datasets and option to only overwrite zarr dataset on config change #64, @leifdenby
- fix bug which adds unwanted dimensions to the dataset #60, @ealerskans, @observingClouds
- correct chunk size estimate #59, @ealerskans
- fix bug arising when variables provided to derived functions are renamed #56, @leifdenby
- ensure config fields defaulting to `None` are typed as `Optional` and fields defaulting to `{}` are given a default-factory so that serialization with default values works correctly #63, @leifdenby
- fix reading of exported config files #67, @observingClouds
- update DANRA test dataset to v0.2.0 which uses a smaller cropped domain #62, @leifdenby
- update `dataclass-wizard` dependency to at least v0.29.2 allowing for use of `Union` types together with check for unmatched keys in config yaml #73, @leifdenby
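The config-serialization fix above is the standard dataclass pattern: a field defaulting to `None` is typed `Optional`, and a mutable `{}` default gets a default-factory so each instance receives its own dict. `OutputConfig` and its fields are made-up illustrations, not the package's actual config classes:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class OutputConfig:
    # defaults to None, so the type must be Optional[str]
    domain_name: Optional[str] = None
    # a mutable default needs a factory so instances don't share one dict
    chunking: dict = field(default_factory=dict)
```

Without the factory, mutating `chunking` on one instance would silently change the default seen by every later instance, breaking round-trip serialization of default values.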
This release adds support for an optional extra section in the config file (for user-defined extra information that is ignored by mllam-data-prep) and fixes a few minor issues. Note that to use extra section in the config file the schema version in the config file must be increased to v0.5.0.
- Add optional section called `extra` to config file to allow for user-defined extra information that is ignored by `mllam-data-prep` but can be used by downstream applications, @leifdenby
- remove f-string from `name_format` in config examples #35
- replace global config for `dataclass_wizard` on `mllam_data_prep.config.Config` with config specific to that dataclass (to avoid conflicts with other uses of `dataclass_wizard`) #36
- Schema version bumped to `v0.5.0` to match release version that supports optional `extra` section in config #18
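A hedged sketch of a config using the new `extra` section. The contents of `extra` are arbitrary by design (the keys shown are invented examples), and the surrounding key names are assumptions for illustration only:

```yaml
schema_version: v0.5.0
extra:
  # free-form section: ignored by mllam-data-prep,
  # available to downstream applications
  experiment_name: my-baseline-run
  contact: someone@example.com
```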
This release adds support for defining the output path in the command line
interface and addresses bugs around optional dependencies for
dask.distributed.
- add access to CLI via `mllam_data_prep` and add tests for CLI with/without `dask.distributed`.
- add optional output path argument to parser.
- fix bug by making dependency `distributed` optional
- change config example to call validation split `val` instead of `validation` #28
- fix typo in install dependency `distributed`
- add missing `psutil` requirement #21.
- add support for parallel processing using `dask.distributed` with command line flags `--dask-distributed-local-core-fraction` and `--dask-distributed-local-memory-fraction` to control the number of cores and memory to use on the local machine.
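A hedged sketch of how a parser with the optional output path and the two dask flags above could be defined with `argparse`. The flag names are taken from the notes; the positional argument, short option, defaults, and help strings are assumptions, not mllam-data-prep's actual parser:

```python
import argparse

def build_parser():
    # hypothetical CLI sketch, not the package's real argument parser
    parser = argparse.ArgumentParser(prog="mllam_data_prep")
    parser.add_argument("config", help="path to the yaml config file")
    parser.add_argument("-o", "--output", default=None,
                        help="optional output zarr path")
    parser.add_argument("--dask-distributed-local-core-fraction",
                        type=float, default=0.0,
                        help="fraction of local cores to use "
                             "(0 disables dask.distributed)")
    parser.add_argument("--dask-distributed-local-memory-fraction",
                        type=float, default=0.9,
                        help="fraction of local memory to use")
    return parser
```

argparse turns the dashed flags into attributes such as `args.dask_distributed_local_core_fraction`, which is how the fractions would then be handed to a local `dask.distributed` cluster.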
- add support for creating dataset splits (e.g. train, validation, test) through `output.splitting` section in the config file, and support for optionally computing statistics for a given split (with `output.splitting.splits.{split_name}.compute_statistics`).
- include `units` and `long_name` attributes for all stacked variables as `{output_variable}_units` and `{output_variable}_long_name`.
- split dataset creation and storage to zarr into separate functions `mllam_data_prep.create_dataset(...)` and `mllam_data_prep.create_dataset_zarr(...)` respectively
- changes to spec from v0.1.0:
  - the `architecture` section has been renamed `output` to make it clearer that this section defines the properties of the output of `mllam-data-prep`
  - `sampling_dim` removed from `output` (previously `architecture`) section of spec, this is not needed to create the training data
  - the variables (and their dimensions) of the output definition have been renamed from `architecture.input_variables` to `output.variables`
  - coordinate value ranges for the dimensions of the output (i.e. what the architecture expects as input) have been renamed from `architecture.input_ranges` to `output.coord_ranges` to make the use clearer
  - selection on variable coordinate values is now set with `inputs.{dataset_name}.variables.{variable_name}.values` rather than `inputs.{dataset_name}.variables.{variable_name}.sel`
  - when the dimension-mapping method `stack_variables_by_var_name` is used, the formatting string for the new variable is now called `name_format` rather than `name`
  - when dimension-mapping is done by simply renaming a dimension, this configuration now needs to be set by providing the named method (`rename`) explicitly through the `method` key, i.e. rather than `{to_dim}: {from_dim}` it is now `{to_dim}: {method: rename, dim: {from_dim}}` to match the signature of the other dimension-mapping methods
  - the `inputs.{dataset_name}.name` attribute has been removed; with the key `dataset_name` this is superfluous
- relax minimum python version requirement to `>3.8` to simplify downstream usage
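The dimension-rename spec change can be shown as a before/after config fragment. The key layout follows the note above; the dimension name `time` is a made-up example:

```yaml
# v0.1.0 spec: a bare rename via {to_dim}: {from_dim}
dim_mapping:
  time: time
```

```yaml
# new spec: the rename method is named explicitly
dim_mapping:
  time:
    method: rename
    dim: time
```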
First tagged release of mllam-data-prep which includes functionality to
declaratively (in a yaml-config file) describe how the variables and
coordinates of a set of zarr-based source datasets are mapped to a new set of
variables with new coordinates in a single training dataset, and write this
resulting dataset to a new zarr dataset. This explicit mapping gives the
flexibility to target different model architectures (which may
require different inputs with different shapes between architectures).