Skip to content

Add iceberg support to table_diff #4441

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 26, 2025
Merged

Conversation

williebsweet
Copy link
Contributor

@williebsweet williebsweet commented May 16, 2025

Resolves #4208

PR contains the following:

  • Added a check of both source and target table formats
  • Added an error, should someone try to diff an Iceberg table and a Hive table
  • Set the table format for a Iceberg/Iceberg diff to Iceberg
  • Tests for various combinations of source/target/temp table formats

@CLAassistant
Copy link

CLAassistant commented May 16, 2025

CLA assistant check
All committers have signed the CLA.

@@ -2203,6 +2203,19 @@ def temp_table(
query_or_df, columns_to_types=columns_to_types, target_table=name
)

if hasattr(self, "_query_table_type_or_raise") and hasattr(self, "source_table"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this logic be in athena.py where these attributes are available?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I... dont actually see how this works. Where does self.source_table come from? This isn't an engine adapter property.

table_diff calls EngineAdapter.temp_table(...) and passes the query that it's trying to store the results of, which looks something like:

with source as (select ... from source_table),
target as (select ... from target_table),
joined as (select ... from source full outer join target)
select * from joined

I actually think it's table_diff that should be supplying the table format, not the engine adapter introspecting the query to try and look up all the tables referenced in the query to see if any of them are Iceberg.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erindru Something more like this previous commit?

Copy link
Collaborator

@erindru erindru May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. With perhaps a comment to explain why there is an Athena-specific check and the addition of some tests

@izeigerman
Copy link
Member

@williebsweet thanks for the submission! Some tests will be helpful

@williebsweet
Copy link
Contributor Author

@erindru I moved the logic back into table_diff. I did the following:

  • Added a check of both source and target table formats
  • Added an error, should someone try to diff an Iceberg table and a Hive table
  • Set the table format for a Iceberg/Iceberg diff to Iceberg

@izeigerman I also added tests for the scenarios above.

@williebsweet
Copy link
Contributor Author

Forgot to mention - I also tested this with my company's SQLMesh project (installing the --editable package) and I was able to run a diff. 🎆

# Sets the temp table's format to Iceberg.
# If neither source nor target table is Iceberg, it defaults to Hive (Athena's default).
elif source_table_type == "iceberg" or target_table_type == "iceberg":
raise SQLMeshError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is right. Table diff relies on copying a sample of data from each table into a single table, which on most engines is fine because they have a consistent table format.

I know that timestamp types in particular are handled differently between Athena/Iceberg and Athena/Hive so I dont think we can say "if Iceberg is detected - make the diff table Iceberg" because copying Hive data into it verbatim can still cause an error

@erindru
Copy link
Collaborator

erindru commented May 25, 2025

Thanks @williebsweet ! This looks much better

@izeigerman izeigerman merged commit 76f52e6 into TobikoData:main May 26, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

table_diff on Athena fails for Iceberg tables
4 participants