DBT Snapshot filling in dbt_valid_to for most recent record #52
Thanks @dhodge250 for the detailed report. I am trying to reproduce this.
I think the issue could be related to duplicate values of dbt_scd_id, which is the key used to merge the staging table into the snapshot table:

MERGE INTO target t
USING staging s
ON (s.dbt_scd_id = t.dbt_scd_id)

dbt-oracle uses ORA_HASH to compute dbt_scd_id. I checked dbt Labs' implementation for PostgreSQL and it seems that they use MD5. Let me test with the STANDARD_HASH function Oracle DB provides; it supports SHA256 and other strong algorithms. I will share a macro here which you can include in your dbt project and that should override the default hashing implementation included in dbt-oracle.
Could you include the following macro in your dbt project and then test dbt snapshot? You can create a new file in your project's macros directory with the following contents:

{% macro oracle__snapshot_hash_arguments(args) -%}
    STANDARD_HASH({%- for arg in args -%}
        coalesce(cast({{ arg }} as varchar(4000) ), '')
        {% if not loop.last %} || '|' || {% endif %}
    {%- endfor -%}, 'SHA256')
{%- endmacro %}

As we are changing the hashing method for dbt_scd_id, let me know if you have any questions.
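For illustration, with a snapshot whose unique key is snap_id and whose updated_at column is update_dt (the columns discussed later in this thread), the macro renders to roughly:

STANDARD_HASH(coalesce(cast(snap_id as varchar(4000) ), '') || '|' || coalesce(cast(update_dt as varchar(4000) ), ''), 'SHA256')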
@aosingh thank you for the update and quick response! I will add the macro to our project and start testing and let you know what I see for results.
After adding the oracle__snapshot_hash_arguments macro to our project and re-running dbt snapshot, we ran into problems with our existing snapshot tables.

For additional context, we are utilizing a separate surrogate_key macro to build the unique key for each snapshot:

{%- macro surrogate_key(field_list) -%}
{#-- Combine strings from list into single string, create hash column--#}
{%- set key_as_string = field_list | join("|| '-' ||") -%}
{{ "rawtohex(standard_hash(" ~ key_as_string ~ ", 'MD5'))" }}
{%- endmacro -%}

To my knowledge, these IDs should not be responsible for the dbt_valid_to values we are seeing.
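As an illustration (the field names here are hypothetical), calling surrogate_key(['device_id', 'serial_no']) renders to:

rawtohex(standard_hash(device_id|| '-' ||serial_no, 'MD5'))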
By default (old method) dbt_scd_id is computed with ORA_HASH, which returns a NUMBER, while STANDARD_HASH returns a RAW value. This is due to the difference between the ORA_HASH and STANDARD_HASH functions. However, I highly recommend using STANDARD_HASH because of the strong hash algorithms it supports and to avoid hash collisions. There are 2 options going forward.

Option 1: Drop the existing snapshot table and let dbt snapshot rebuild it with the new hashing method.

Option 2: Run an UPDATE on the dbt_scd_id column of the snapshot table, recomputing it with the new method:

STANDARD_HASH(coalesce(cast(snap_id as varchar(4000) ), '') ||
 '|' || coalesce(cast(update_dt as varchar(4000) ), ''), 'SHA256')
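A minimal sketch of that UPDATE, assuming a snapshot table named dbt.device_ad_snapshot (the real table name is not given in this thread):

UPDATE dbt.device_ad_snapshot   -- hypothetical snapshot table name
SET dbt_scd_id = STANDARD_HASH(
      coalesce(cast(snap_id as varchar(4000)), '') ||
      '|' || coalesce(cast(update_dt as varchar(4000)), ''), 'SHA256');
-- Depending on the current data type of dbt_scd_id, the column may first need to be
-- altered (for example to RAW(32)) so it can hold the 32-byte SHA-256 output.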
I've opted for your suggestion with Option 2 to save the data from our existing snapshot tables.
Did these records exist prior to running dbt snapshot with the new hashing macro?
No, these incorrect records did not exist prior to running dbt snapshot with the new hashing macro.
Thank you for confirming. I think at this point, it would be good to check what is in the staging table. We update the snapshot table using the staging table. I have added the SQL to create the staging table below. It would be good to check how many entries we have for the above record in the staging table.
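If it helps, the count per change type for that record can be pulled with something like the following sketch (this assumes the staging table keeps dbt's usual dbt_change_type and dbt_unique_key columns; the table name is the temporary table created by the attached SQL):

SELECT dbt_change_type, COUNT(*) AS entry_count
FROM o$pt_tmp_device_ad152327     -- temporary staging table from the attached SQL
WHERE dbt_unique_key = :snap_id   -- bind the record's unique key here
GROUP BY dbt_change_type;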
@dhodge250 The attached SQL in the above comment looks like this:

CREATE GLOBAL TEMPORARY TABLE o$pt_tmp_device_ad152327
ON COMMIT PRESERVE ROWS
AS
WITH snapshot_query AS (
    SELECT ad.*
    FROM dbt.stg_device_ad ad
.....

Let me know if you have any questions.
I'll try creating the staging table for you in a bit using the SQL you've provided me with. I'll need to remove some of the PII from the table before I submit, but it will still include the required columns.
Thank you @dhodge250 for sharing the staging table records. This was really helpful. As I had suspected, the same record is getting picked up for staging more than once.

These 2 rows should not show up in the staging table, because there were no updates and the record is still valid, i.e. dbt_valid_to is null. Rows are picked up for staging only if they are new, have changed since the last snapshot, or have been hard-deleted from the source. I am trying to understand how these got picked up. Also, does the update_dt column store the time of the update, and does anything other than dbt modify these snapshot records?
Your recent comment has me thinking more about the update_dt column and how these records are maintained.

No, I don't believe the time of the update is being preserved in update_dt for these records.

No, the script only sets active records in the snapshot tables, i.e. those that have a null dbt_valid_to.
Thanks for the answers @dhodge250.

Having the time of the update in update_dt matters for the timestamp strategy. If you don't save the time part in update_dt, then two versions of a record updated on the same day will end up with identical dbt_valid_from and dbt_scd_id values.

Going back to the 2 rows, the column I want to look at next is update_dt, compared against the snapshot's dbt_valid_from. Rows are picked up for staging if they are new, if the source update_dt is greater than the snapshot's dbt_valid_from (the row-changed check), or if they have been hard-deleted from the source.
Apologies for the delay. I've had a chance to review the date values in our snapshot tables, and the time is included in them, so the formatting looks correct. Sorry for any confusion my earlier comment led to.
Closing this issue as per previous comment.
What is the resolution to this issue? In my previous comment I was just stating that the date values were correct in their formatting, based on the time being included in the date value. The issue I opened with snapshots is still present for us, unless you're saying that the time formatting is the issue here.
@dhodge250 sorry for wrongly closing this. Reopening. |
Let's compare the update_dt value in the source table with the dbt_valid_from value stored for the same record. I have picked one record as an example:

SELECT update_dt
FROM dbt.stg_device_ad ad
WHERE ad.snap_id = '0124CA8A277B605072F81615D49B47AC'
UNION
SELECT dbt_valid_from AS update_dt
FROM dbt.tmp_device_ad
WHERE dbt_unique_key = '0124CA8A277B605072F81615D49B47AC'
AND dbt_valid_to IS NULL

This is to check if update_dt and dbt_valid_from hold the same value for this record; if they do, the UNION returns a single row.
This looks correct. Would you be open to discussing this issue on a Zoom call?
I think the issue is identified along with its fix. It is related to the computation of dbt_scd_id: casting the update_dt date to a varchar truncates the time part, so two versions of a record updated on the same day produce the same dbt_scd_id. There are 2 steps to follow to test the fix:

1. Re-run the UPDATE on the dbt_scd_id column, this time formatting update_dt with TO_CHAR so the time part is preserved:

STANDARD_HASH(coalesce(cast(snap_id as varchar(4000) ), '') ||
 '|' || coalesce(cast(TO_CHAR(update_dt, 'MM/DD/YY HH:MI:SS A.M.') as varchar(4000) ), ''), 'SHA256')

2. Override the snapshot_timestamp_strategy macro in your dbt project so that new snapshot runs compute dbt_scd_id the same way:
{% macro snapshot_timestamp_strategy(node, snapshotted_rel, current_rel, config, target_exists) %}
{% set primary_key = config['unique_key'] %}
{% set updated_at = config['updated_at'] %}
{% set invalidate_hard_deletes = config.get('invalidate_hard_deletes', false) %}
{% set row_changed_expr -%}
({{ snapshotted_rel }}.dbt_valid_from < {{ current_rel }}.{{ updated_at }})
{%- endset %}
{% set scd_id_expr = snapshot_hash_arguments([primary_key, 'TO_CHAR(' ~ updated_at ~ ', \'MM/DD/YY HH:MI:SS A.M.\')']) %}
{% do return({
"unique_key": primary_key,
"updated_at": updated_at,
"row_changed": row_changed_expr,
"scd_id": scd_id_expr,
"invalidate_hard_deletes": invalidate_hard_deletes
}) %}
{% endmacro %}
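To see why the TO_CHAR formatting matters, here is a small illustrative check (it assumes the session's NLS_DATE_FORMAT is a date-only format such as the default DD-MON-RR):

SELECT CAST(SYSDATE AS varchar(4000)) AS cast_value,                 -- e.g. '01-DEC-22': the time part is lost
       TO_CHAR(SYSDATE, 'MM/DD/YY HH:MI:SS A.M.') AS to_char_value   -- e.g. '12/01/22 05:42:13 P.M.': the time part is kept
FROM dual;
-- With the plain CAST, two updates made on the same day produce the same string,
-- and therefore the same dbt_scd_id hash; the TO_CHAR form keeps them distinct.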
My schedule is a little sporadic right now, so I'm just seeing your message from yesterday. I would be available for a Zoom call on 12/15 if you would still like to discuss this issue. I was able to follow your recommendations and run an UPDATE on the dbt_scd_id column using the TO_CHAR expression. It appears that this issue was caused by the time part being truncated while casting the update_dt date to a varchar.
- Added module and client_identifier as dbt connection string
- Bugfix for dbt snapshots #52
- oracledb dependency upgraded to 1.2.1
- Global temp relation name (time format) includes milliseconds
This is merged in the release described above.
@aosingh Thank you for the support on this!
Is there an existing issue for this?
Current Behavior
After running the dbt snapshot command and building a snapshot table with the invalidate_hard_deletes parameter set to True, various records are having their dbt_valid_to dates populated with the latest dbt_updated_at value, even though these records are still considered valid and have not been removed from their source (staging) tables. We are using the following code to build each snapshot:
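A representative snapshot definition with these settings (the names here are illustrative, not the exact code from this project) looks roughly like:

{% snapshot device_ad_snapshot %}
{{
    config(
      target_schema='dbt',
      unique_key='snap_id',
      strategy='timestamp',
      updated_at='update_dt',
      invalidate_hard_deletes=True
    )
}}
select * from {{ ref('stg_device_ad') }}
{% endsnapshot %}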
The number of records this is affecting is inconsistent between snapshots (sometimes a few dozen, sometimes a few hundred, from tables of up to 30k records), but it appears to affect the same records on each run.
Expected Behavior
When building a dbt snapshot table with the invalidate_hard_deletes parameter set to True, I expect dbt to fill in the dbt_valid_to value ONLY if a record no longer exists in the source table, OR if a record has changed in the source table, in which case a new record is created in the snapshot table and the previous record is marked as invalid. An active record in the snapshot table SHOULD NOT have a dbt_valid_to date; it should instead be null.
Steps To Reproduce
Relevant log output using --debug flag enabled
No response
Environment
What Oracle database version are you using dbt with?
19c
Additional Context
This issue looks to be similar, if not identical, to dbt-core issue #2390 from a few years ago, which was resolved in v0.17.0. I've created a fork to play around with this and see if the two issues are related.