Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.

Refactor Artifacts Parser to be Native so it's less brittle with each dbt version change #688

Merged
merged 40 commits into from
Sep 12, 2023

Conversation

sungchun12
Copy link
Contributor

@sungchun12 sungchun12 commented Aug 25, 2023

Problem: Each time a new dbt version is released, it results in running fire drills to fix breaking changes to the dbt integration specifically because of version upgrades to run_results.json and manifest.json. This is pronounced because data-diff relies on a solo-developer maintained package to parse the json docs into python objects. https://github.com/yu-iskw/dbt-artifacts-parser

Solution: Build native json/dictionary parsing within data-diff and focus on stable fields so that new version upgrades are immaterial to how data-diff interacts with dbt.

Technical Approach
I create native pydantic schema enforcers for both the manifest.json and run_results.json with only the fields necessary for data-diff to work properly. The fields data-diff reads in are relatively stable across artifact schema changes and can handle malformed artifact schemas if they're manipulated by a process outside of dbt. This makes this piece of the codebase lighter weight as the dbt-artifacts-parser enforces more artifact schema validation than needed. When future bugs arise because of breaking changes to the artifact schemas, we can simply adjust the native schema validator in data-diff vs. relying on the solo-developer maintained package.

Testing Approach
I installed multiple versions of dbt-snowflake and dbt-core to produce manifest.json and run_results.json artifacts for version >=1.0.0. dbt versions <1.0.0 are no longer supported by dbt Labs, so they are not tested. Testing project: https://github.com/datafold/datafold-demo-sung

Note: Only one version of run_results.json is maintained by dbt Labs which is v4.

@sungchun12 sungchun12 changed the title helpful notes for sung Refactor Artifacts Parser to be Native so it's less brittle with each dbt version change Aug 25, 2023
@sungchun12 sungchun12 self-assigned this Aug 25, 2023
Sung Won Chung and others added 7 commits August 28, 2023 16:30
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
sungchun12 and others added 2 commits August 30, 2023 16:10
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sung Won Chung and others added 4 commits August 30, 2023 16:11
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@dlawin
Copy link
Contributor

dlawin commented Aug 31, 2023

Approach looks great, this will simplify things in an elegant way, appreciate it @sungchun12

@sungchun12 sungchun12 linked an issue Sep 6, 2023 that may be closed by this pull request
@sungchun12 sungchun12 marked this pull request as ready for review September 6, 2023 20:41
@sungchun12 sungchun12 requested a review from dlawin September 6, 2023 20:41
Copy link
Contributor

@dlawin dlawin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, just need to remove the submodule before we merge

@sungchun12 sungchun12 merged commit 9c6ed97 into master Sep 12, 2023
sungchun12 added a commit that referenced this pull request Oct 2, 2023
… dbt version change (#688)

* helpful notes for sung

* v1 of native run results parser

* remove debug comments

* remove from import

* Update data_diff/dbt_parser.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update data_diff/dbt_parser.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* remove an import

* remove another print

* add schema validation for specific fields

* stricter validation

* replaced manifest parser with native one

* Apply suggestions from code review for spacing

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Apply suggestions from code review for double quotes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* create space

* Apply suggestions from code review for more formatting

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* add more necessary fields

* something to think through

* better type hints

* remove comment

* separation of duties

* remove mock call

* draft unit tests

* first draft of unit tests

* passing tests

* more pythonic

* remove nested git repo

* require name

* add strictness

* black formatting

* reduce scope of changes

* fix imports

* update patches

* fix mocking

* fix test failure

* fix mock tests

* remove submodule

* update toml

* remove submodule again

* add pydantic back in

---------

Co-authored-by: Sung Won Chung <sung@datafold.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dan Lawin <daniel@datafold.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Not a soft of manifest.json
2 participants