
Conversation

@krivard (Contributor) commented Feb 28, 2025

Overview

Three tables require a group_mean_continuity_check, which does the following:

  1. Group by a specified column
  2. Compute the mean of each group
  3. Compute the percent change between successive groups
  4. Check against a per-column threshold

This PR contains a draft for a possible implementation.
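The four steps above can be sketched in plain Python (a hypothetical illustration only; the PR itself implements the check as a dbt macro in SQL/Jinja, and the function name, argument names, and sample columns here are assumptions, not the PR's actual code):

```python
from statistics import mean

def group_mean_continuity_check(rows, group_key, value_key, max_pct_change):
    """Hypothetical sketch: flag groups whose mean changes by more than
    max_pct_change relative to the previous group (groups sorted by key)."""
    # 1. Group rows by the specified column
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    # 2. Compute the mean of each group, in group-key order
    means = [(key, mean(values)) for key, values in sorted(groups.items())]
    # 3. Percent change between successive groups,
    # 4. checked against the per-column threshold
    violations = []
    for (_, prev_mean), (key, cur_mean) in zip(means, means[1:]):
        pct_change = abs(cur_mean - prev_mean) / prev_mean
        if pct_change > max_pct_change:
            violations.append((key, pct_change))
    return violations

# Example data (invented): 2021's group mean (42.0) jumps 2.5x from 2020's (12.0)
rows = [
    {"report_year": 2019, "capacity_mw": 10.0},
    {"report_year": 2019, "capacity_mw": 12.0},
    {"report_year": 2020, "capacity_mw": 11.0},
    {"report_year": 2020, "capacity_mw": 13.0},
    {"report_year": 2021, "capacity_mw": 40.0},
    {"report_year": 2021, "capacity_mw": 44.0},
]
print(group_mean_continuity_check(rows, "report_year", "capacity_mw", 0.5))
# → [(2021, 2.5)]
```

Note that step 3 assumes the grouping column has a meaningful order (here, years), which is the point raised in the review discussion below.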

What did you change?

  • New macro, group_mean_continuity_check(group_column, max_pct_change)
  • Demo of macro in use for _core_eia923__cooling_system_information, using columns and thresholds from transform/eia923.py

Debatable design choices

  • This is written as a column-level check
    • pros: macro source is simple (no loops); keeps all checks for a column together in the column entry
  • cons: the grouping column and number of acceptable outliers are replicated in each data-test entry
  • options: could rewrite as a table-level check with a list of columns and thresholds; see #4116 (Add model/table-level macro for group_mean_continuity_check)

@krivard added the dbt (Issues related to the data build tool aka dbt) and data-validation (Issues related to checking whether data meets our quality expectations.) labels Feb 28, 2025
@krivard krivard requested a review from marianneke February 28, 2025 18:51
@krivard krivard linked an issue Mar 4, 2025 that may be closed by this pull request
@marianneke (Collaborator) left a comment

I have a question: you mention pct change between successive groups. This suggests the group column is inherently ordered, right? Is this something you would generally expect to be the case?

I guess what I'm saying is: to me it is not intuitive that a "group" is always an ordered thing. Maybe this variable name should be changed to something that makes it clear that it is both a partition and something that has an inherent order to it.

@aesharpe (Member) commented

I'm inclined to say that this test should remain a column-level test (rather than a table test as in #4116). I think this because it doesn't have to do with the relationship between columns in a table or the relationship between a table and another table. For that reason, I think it makes sense to have tests that implicate a particular column be column level (unless, perhaps, it's something that implicates every single column in a table). But this to me feels like it should remain at the column-level.

@krivard (Contributor, Author) commented Mar 12, 2025

@aesharpe even though 2 of the 3 arguments for the test are intended to be identical across all the column tests within a table? and if they ever change, we have to update N lines instead of just one?

I can live with that; I just want to make sure the impact is clear before we decide.

@aesharpe (Member) commented

> @aesharpe even though 2 of the 3 arguments for the test are intended to be identical across all the column tests within a table? and if they ever change, we have to update N lines instead of just one?
>
> I can live with that; I just want to make sure the impact is clear before we decide.

Yes, this is good clarification. I think it's important to be able to customize things like max_fr_change and n_outliers_allowed to each column, hence, column level test :)

@krivard force-pushed the krivard/dbt-migrations_group_mean_continuity branch from e0da49a to 4c8c677 on March 19, 2025 16:55


Projects

Status: Icebox

Development

Successfully merging this pull request may close these issues.

Migrate group_mean_continuity_check validation tests to dbt

4 participants