Conversation

@frankie567 (Collaborator) commented on Sep 13, 2022:

Description

The goal of these changes is to implement an abstraction layer for querying the database, so that we always get proper Pydantic models to return from the API.

Basically, we have a BaseRepository class containing the common, generic logic for querying the database. Notice that this class expects the Pydantic model class and the name of the index as class variables.
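
For illustration, here is a minimal sketch of what such a class can look like, assuming the opensearch-py client (the exact method set and signatures in the PR may differ):

from typing import Any, ClassVar, Generic, Optional, TypeVar

from opensearchpy import OpenSearch
from opensearchpy.exceptions import NotFoundError
from pydantic import BaseModel

M = TypeVar("M", bound=BaseModel)


class BaseRepository(Generic[M]):
    # Subclasses provide the Pydantic model class and the index name.
    model_class: ClassVar[type[BaseModel]]
    index: ClassVar[str]

    def __init__(self, client: OpenSearch) -> None:
        self.client = client

    def query(self, query: dict[str, Any], size: int = 10) -> list[M]:
        # Run an arbitrary OpenSearch query and return parsed Pydantic objects.
        response = self.client.search(index=self.index, body=query, size=size)
        return [
            self._get_object_from_dict(result["_source"])
            for result in response["hits"]["hits"]
        ]

    def get(self, key: str) -> Optional[M]:
        # Fetch a single document by its key, or None if it doesn't exist.
        try:
            result = self.client.get(index=self.index, id=key)
        except NotFoundError:
            return None
        return self._get_object_from_dict(result["_source"])

    def _get_object_from_dict(self, d: dict[str, Any]) -> M:
        return self.model_class.parse_obj(d)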

For each model, we'll have a dedicated repository extending BaseRepository. To prove the concept, I currently just implemented DatasourceRepository.

With this pattern, it's easy to add specific queries or operations we want to reuse, like get_by_name in this example. This way, we avoid having OpenSearch queries leak into every part of the codebase: everything stays in the repository class.
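
Building on the sketch above, a dedicated repository could look like this (the index name and the queried field are illustrative assumptions):

from typing import Optional


class DatasourceRepository(BaseRepository[Datasource]):
    model_class = Datasource
    index = "datasource"  # illustrative index name

    def get_by_name(self, name: str) -> Optional[Datasource]:
        # The OpenSearch query lives here, not in the endpoint code.
        results = self.query(
            {"query": {"term": {"datasource_name.keyword": name}}},
            size=1,
        )
        return results[0] if results else None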

To instantiate those repositories, we define callable dependencies for FastAPI, like get_datasource_repository. It's a good pattern that may help us in the long run, especially if we want to write unit tests. For now, the underlying OpenSearch client is hard-wired, but it could also be made a dependency for convenience.
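
A sketch of such a dependency (the hard-wired client shown here stands in for however the app actually builds it):

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["http://localhost:9200"])  # hard-wired for now


def get_datasource_repository() -> DatasourceRepository:
    return DatasourceRepository(client)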

Finally, we can use it in our API endpoints. By injecting the repository in the datasource endpoints, we are able to directly query the DB and get proper Pydantic objects.

For convenience, I've also implemented a generic shortcut get_by_key_or_404, which can get an object by key or automatically raise a 404 if not found.
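
A sketch of that helper and of an endpoint using the injected repository (the route path and signatures are illustrative; router is an existing APIRouter):

from fastapi import Depends, HTTPException, status


class BaseRepository(Generic[M]):
    # ...continuing the sketch from above
    def get_by_key_or_404(self, key: str) -> M:
        obj = self.get(key)
        if obj is None:
            raise HTTPException(status_code=status.HTTP_404_NOT_FOUND)
        return obj


@router.get("/{datasource_key}")
def get_datasource(
    datasource_key: str,
    repository: DatasourceRepository = Depends(get_datasource_repository),
):
    return repository.get_by_key_or_404(datasource_key)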

Please let me know what you think about this before I implement this pattern for the other models 😄

Type of change

  • Refactoring

Checklist:

  • I have performed a self-review of my own code
  • All GitHub workflows have passed
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

On Sep 16, 2022, @frankie567 changed the title from "[WIP] Implement database abstraction" to "Implement database abstraction".
@frankie567 (Collaborator, Author) commented:

So, it turns out this became quite a big refactoring! Here is a summary of what I did:

  • Implementation of a Repository pattern, with base methods to query, create, update and delete data in the DB
  • For the Datasource, Dataset and Expectation models, I implemented dedicated repositories, adding specific methods where needed, so that all queries stay in one place.
  • In the Datasource, Dataset and Expectation endpoints, I removed all direct use of the OpenSearch client wherever possible, in favor of the repository helpers

I also took this opportunity to improve the structure of the Pydantic models:

  • Implementation of a KeyModel mixin: models inheriting from this one will get a key property. If not provided, a UUID4 is automatically generated (see the sketch after this list).
  • Implementation of a CreateUpdateDateModel mixin: models inheriting from this one will get create_date and modified_date properties. Both will be automatically assigned to the current time if not provided.
    • The repository takes care of updating modified_date automatically during update.
  • Implementation of create and update model variations for Datasource, Dataset and Expectation. This allows us to better control the fields the user can and can't set, in particular the automatic ones like key, create_date and modified_date.
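
A rough sketch of those mixins (the defaults are my reading of the behavior described above, not necessarily the exact implementation):

import uuid
from datetime import datetime

from pydantic import BaseModel, Field


class KeyModel(BaseModel):
    # A UUID4 key is generated automatically when none is provided.
    key: str = Field(default_factory=lambda: str(uuid.uuid4()))


class CreateUpdateDateModel(BaseModel):
    # Both dates default to the current time; the repository bumps
    # modified_date automatically on every update.
    create_date: datetime = Field(default_factory=datetime.utcnow)
    modified_date: datetime = Field(default_factory=datetime.utcnow)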

Admittedly, this is quite a big PR. I've tested the changes as much as I could and noticed no breaking changes. Waiting for your feedback on this :)

@KentonParton (Contributor) left a comment:

The changes look great! Significantly cleaner, and this will make for a much easier implementation of a client when we get to it.

Left some comments and suggested updates.



@router.post("", response_model=Datasource)
@router.post("")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@router.post("")
@router.post("", response_model=Datasource)

@@ -41,52 +40,22 @@ def list_supported_expectations():
     return JSONResponse(status_code=status.HTTP_200_OK, content=content)


-@router.put("/{expectation_id}/enable", response_model=Expectation)
+@router.put("/{expectation_id}/enable")
@KentonParton (Contributor) commented:
Are you wanting to add response models for Expectation in another update?

@frankie567 (Collaborator, Author) replied:

The problem is the same as stated above for Datasource: since we actually have objects inheriting from Expectation, we'll lose all their specific fields if we set the response_model, the output will be "down-casted" to a base Expectation.
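
To illustrate the issue with a contrived example (not the actual Swiple models): FastAPI filters the returned object through response_model, so subclass-specific fields are silently dropped.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Expectation(BaseModel):
    expectation_type: str


class ExpectColumnToExist(Expectation):
    column: str


@app.get("/demo", response_model=Expectation)
def demo():
    # The response is serialized as a plain Expectation,
    # so `column` never reaches the client.
    return ExpectColumnToExist(
        expectation_type="expect_column_to_exist", column="passenger_id"
    )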

On Sep 18, 2022, @KentonParton added the improvement (Improvement to application) label.
@frankie567 (Collaborator, Author) commented:

@KentonParton In 18f8f89, I implemented the discriminator approach we talked about. In the end, things went quite well.

There is one small thing, however, which is surprising but actually works very well: the model we use for annotations is actually the Union of all classes, not the discriminator model. This has several benefits:

  1. OpenAPI schema works
  2. Type annotation works well, with proper type hinting and auto-completion from the IDE

Hence, I named the discriminator model ExpectationInput and the union Expectation. When working with an ExpectationInput, the code takes care of returning its __root__.
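
A sketch of the pattern (the two concrete expectation classes are stand-ins; discriminated unions require Pydantic 1.9+):

from typing import Literal, Union

from pydantic import BaseModel, Field


class ExpectColumnToExist(BaseModel):
    expectation_type: Literal["expect_column_to_exist"] = "expect_column_to_exist"
    column: str


class ExpectTableRowCountToEqual(BaseModel):
    expectation_type: Literal["expect_table_row_count_to_equal"] = "expect_table_row_count_to_equal"
    value: int


# The union: used for type annotations, so OpenAPI and IDEs see all variants.
Expectation = Union[ExpectColumnToExist, ExpectTableRowCountToEqual]


class ExpectationInput(BaseModel):
    # The discriminator model: parsing picks the concrete class,
    # which is then available as `.__root__`.
    __root__: Expectation = Field(..., discriminator="expectation_type")


concrete = ExpectationInput.parse_obj(
    {"expectation_type": "expect_column_to_exist", "column": "passenger_id"}
).__root__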


We are also able to get rid of type_map, since we can list the available classes directly from the Union type:

for expectation in get_args(Expectation):
    ...


Another small refinement I made is to modify the schema so we have the actual value of expectation_type at hand, instead of manually fetching the first value of the enum. It helps us both in the backend and in the UI:

def schema_extra(schema: dict[str, Any], model: type['ExpectationBase']) -> None:
    expectation_type_schema = schema.get('properties', {}).get("expectation_type")
    if expectation_type_schema is not None:
        expectation_type_schema["value"] = expectation_type_schema["enum"][0]
        schema["properties"]["expectation_type"] = expectation_type_schema

On the backend side:

expectation_type = json_schema['properties']['expectation_type']['value']

And in the UI:

const transformExpectationsPayload = (payload) => {
  const cleanedPayload = clean(payload);
  const expectation = expectationsJsonSchema.filter((item) => (
    item.properties.expectation_type.value === payload.expectation_type))[0].properties;
  delete cleanedPayload.expectation_type;
  return {
    datasource_id: dataset.datasource_id,
    dataset_id: dataset.key,
    expectation_type: expectation.expectation_type.value,
    kwargs: {
      ...cleanedPayload,
    },
  };
};


Let me know what you think about it, and we can go forward with the same approach for Datasource.

@KentonParton (Contributor) commented:

This is looking great @frankie567 💪

I like that:

  1. we are able to get rid of the type_map
  2. we have input and response models
  3. code is much cleaner

Let's do the same for datasource 👍

P.S. I am getting a model validation error for GET /datasets

swiple_api             |   File "/code/./app/api/api_v1/endpoints/dataset.py", line 55, in list_datasets
swiple_api             |     return repository.query(query, size=1000)
swiple_api             |   File "/code/./app/repositories/base.py", line 27, in query
swiple_api             |     return [
swiple_api             |   File "/code/./app/repositories/base.py", line 28, in <listcomp>
swiple_api             |     self._get_object_from_dict(result["_source"]) for result in results
swiple_api             |   File "/code/./app/repositories/base.py", line 90, in _get_object_from_dict
swiple_api             |     return self.model_class.parse_obj(d)
swiple_api             |   File "/usr/local/lib/python3.9/site-packages/pydantic/main.py", line 521, in parse_obj
swiple_api             |     return cls(**obj)
swiple_api             |   File "/usr/local/lib/python3.9/site-packages/pydantic/main.py", line 341, in __init__
swiple_api             |     raise validation_error
swiple_api             | pydantic.error_wrappers.ValidationError: 1 validation error for Dataset
swiple_api             | key
swiple_api             |   none is not an allowed value (type=type_error.none.not_allowed)

@frankie567 (Collaborator, Author) replied:

> P.S. I am getting a model validation error for GET /datasets

Fixed!

So, here we are: Datasource is also ported to the discriminator approach. It looks much, much cleaner! OpenAPI is working well with proper annotations.

@KentonParton (Contributor) commented:

Nice work, LGTM!

@KentonParton merged commit e8c7c51 into main on Sep 22, 2022.
@KentonParton deleted the db-abstraction branch on Sep 22, 2022 at 10:46.
Labels: improvement (Improvement to application)
Projects: Status: Done