This repository was archived by the owner on Sep 11, 2023. It is now read-only.

Machine-readable schema & validator for xarray.Dataset #211

Closed
Tracked by #213
JackKelly opened this issue Oct 8, 2021 · 5 comments · Fixed by #229
Labels: enhancement, refactoring

Comments

@JackKelly
Member

JackKelly commented Oct 8, 2021

Detailed Description

If we can find an off-the-shelf schema & validator for xarray.Dataset then we can, hopefully, combine the best of pydantic.BaseModel and xarray.Dataset. The ultimate aims are:

  • Have a single source of truth for the precise structure of the data that flows through nowcasting_dataset. This can be used for:
    • Humans to understand the structure of the data
    • Machines to automatically validate the data

Context

@peterdudfield has done excellent work in pull request #195 using Pydantic to define schemas for our data. Building on @peterdudfield's great work, we might be able to get the same advantages from something like pandera, with less effort on our part :) (I'm lazy!)

This is also related to #209

Related

I'll look into this this morning :)

@JackKelly JackKelly added enhancement New feature or request refactoring labels Oct 8, 2021
@JackKelly JackKelly self-assigned this Oct 8, 2021
@JackKelly
Member Author

Hmm, I'm no longer confident that pandera is capable of supporting n-dimensional numpy arrays (and hence n-dimensional xr.DataArrays). I've asked the pandera folks.

But it looks like Pydantic might be able to. See this Pydantic issue and PR. Also, take a look at how SQLModel combines Pydantic and SQLAlchemy (thanks to Benoît Bovy for suggesting this on twitter!)

@JackKelly
Member Author

JackKelly commented Oct 8, 2021

Let me flesh out what I hope is possible (but I haven't tested yet!) This is adapted from pydantic/pydantic#667, and inspired by @peterdudfield's work in PR #195. (Also see Pydantic's docs on custom data types)

from typing import Any

import xarray as xr


class PydanticXArrayDataset(xr.Dataset):
    """Abstract base class for validating xr.Dataset objects."""
    # From https://github.com/samuelcolvin/pydantic/issues/667

    # xarray expects subclasses of its data structures to declare __slots__.
    __slots__ = ()

    @classmethod
    def __get_validators__(cls):
        # Pydantic calls this hook to discover the validators for a custom type.
        yield cls.validate

    @classmethod
    def validate(cls, v: Any) -> Any:
        """Validate data.  Must be overridden by child classes."""
        raise NotImplementedError()


class Satellite(PydanticXArrayDataset):
    __slots__ = ()

    @classmethod
    def validate(cls, v: Any) -> Any:
        # validate Satellite data...
        return v

The above code is all that's required (I think) when pre-preparing on-disk batches because, after #202 is implemented, the individual modalities wouldn't be squished together into a single batch object. Instead, each modality would pass through nowcasting_dataset independently, and be written to disk independently.
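
As a rough illustration of what the `# validate Satellite data...` placeholder above might contain, here is a minimal sketch written as a standalone function. It is not taken from nowcasting_dataset: the dimension names, the `validate_satellite` name and the no-NaN rule are assumptions made up for the example. It follows the same contract as `validate` above, returning the value on success and raising on failure.

```python
from typing import Any

import numpy as np
import xarray as xr

# Hypothetical dimension names; the real Satellite layout is not given in this thread.
REQUIRED_DIMS = ("time", "x", "y")


def validate_satellite(v: Any) -> xr.Dataset:
    """Return `v` unchanged if it passes the checks, otherwise raise."""
    if not isinstance(v, xr.Dataset):
        raise TypeError(f"expected xr.Dataset, got {type(v).__name__}")
    # Every required dimension must be present on the dataset.
    missing = [dim for dim in REQUIRED_DIMS if dim not in v.sizes]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    # Assumed rule for this sketch: no data variable may contain NaNs.
    for name, array in v.data_vars.items():
        if bool(array.isnull().any()):
            raise ValueError(f"variable {name!r} contains NaNs")
    return v


good = xr.Dataset({"data": (REQUIRED_DIMS, np.zeros((2, 3, 3)))})
validate_satellite(good)  # passes and returns the same object
```

The body of `validate_satellite` could be moved into `Satellite.validate` more or less unchanged.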

When we load the batches of each modality from disk, then we could squish them into a Pydantic model like the code below, but I'd be a little worried about hurting performance, especially when all the pre-prepared batches should have been validated when they were created!

class Example(pydantic.BaseModel):
    satellite: Satellite

@JackKelly
Member Author

OK, here's a functional, but very rough example of using xarray with pydantic.

This code validates a few things, but it isn't ideal as a human-readable specification of the structure of xr.DataArrays or xr.Datasets.
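
For comparison, a more declarative layout might look something like the sketch below, where the schema is plain data that a human can read and a generic function checks a dataset against it. This is only an illustration under stated assumptions: `VariableSpec`, `SATELLITE_SCHEMA` and `find_schema_errors` are hypothetical names, and the variable name, dims and dtype are invented rather than taken from the real data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np
import xarray as xr


@dataclass(frozen=True)
class VariableSpec:
    """Expected layout of one data variable (hypothetical helper, not a library type)."""

    dims: Tuple[str, ...]
    dtype_kind: str  # NumPy dtype kind: "f" float, "i" signed int, "u" unsigned int


# Hypothetical schema: the variable name, dims and dtype are made up for illustration.
SATELLITE_SCHEMA: Dict[str, VariableSpec] = {
    "data": VariableSpec(dims=("time", "x", "y", "channels"), dtype_kind="f"),
}


def find_schema_errors(ds: xr.Dataset, schema: Dict[str, VariableSpec]) -> List[str]:
    """Return human-readable mismatches; an empty list means the dataset conforms."""
    errors = []
    for name, spec in schema.items():
        if name not in ds.data_vars:
            errors.append(f"{name}: missing variable")
            continue
        array = ds[name]
        if array.dims != spec.dims:
            errors.append(f"{name}: dims {array.dims} != {spec.dims}")
        if array.dtype.kind != spec.dtype_kind:
            errors.append(f"{name}: dtype {array.dtype} is not kind {spec.dtype_kind!r}")
    return errors


ds = xr.Dataset(
    {"data": (("time", "x", "y", "channels"), np.zeros((1, 2, 2, 3), dtype=np.float32))}
)
print(find_schema_errors(ds, SATELLITE_SCHEMA))
```

Returning a list of messages rather than raising on the first problem is a design choice for this sketch: it reports every mismatch in one pass, and a caller that wants an exception can raise when the list is non-empty.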

@JackKelly
Member Author

To quote @cosmicBboy from this comment:

once pandera schemas can be used as valid pydantic types, #453 is supported, the solution you outline [two comments above] would be pretty straightforward to port over to pandera

@flowirtz flowirtz moved this to Todo in Nowcasting Oct 15, 2021
@JackKelly JackKelly linked a pull request Oct 19, 2021 that will close this issue
11 tasks
@JackKelly
Member Author

I think this is implemented in PR #229 (thanks @peterdudfield!)

Repository owner moved this from Todo to Done in Nowcasting Oct 19, 2021