-
Notifications
You must be signed in to change notification settings - Fork 210
Add iceberg support to table_diff #4441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sqlmesh/core/engine_adapter/base.py
Outdated
@@ -2203,6 +2203,19 @@ def temp_table( | |||
query_or_df, columns_to_types=columns_to_types, target_table=name | |||
) | |||
|
|||
if hasattr(self, "_query_table_type_or_raise") and hasattr(self, "source_table"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this logic be in athena.py
where these attributes are available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I... dont actually see how this works. Where does self.source_table
come from? This isn't an engine adapter property.
table_diff calls EngineAdapter.temp_table(...)
and passes the query that it's trying to store the results of, which looks something like:
with source as (select ... from source_table),
target as (select ... from target_table),
joined as (select ... from source full outer join target)
select * from joined
I actually think it's table_diff that should be supplying the table format, not the engine adapter introspecting the query to try and look up all the tables referenced in the query to see if any of them are Iceberg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erindru Something more like this previous commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly. With perhaps a comment to explain why there is an Athena-specific check and the addition of some tests
@williebsweet thanks for the submission! Some tests will be helpful |
This reverts commit d237c3c.
@erindru I moved the logic back into
@izeigerman I also added tests for the scenarios above. |
Forgot to mention - I also tested this with my company's SQLMesh project (installing the |
# Sets the temp table's format to Iceberg. | ||
# If neither source nor target table is Iceberg, it defaults to Hive (Athena's default). | ||
elif source_table_type == "iceberg" or target_table_type == "iceberg": | ||
raise SQLMeshError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is right. Table diff relies on copying a sample of data from each table into a single table, which on most engines is fine because they have a consistent table format.
I know that timestamp types in particular are handled differently between Athena/Iceberg and Athena/Hive so I dont think we can say "if Iceberg is detected - make the diff table Iceberg" because copying Hive data into it verbatim can still cause an error
Thanks @williebsweet ! This looks much better |
Resolves #4208
PR contains the following: