You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository was archived by the owner on May 17, 2024. It is now read-only.
Describe the bug
When connecting to Motherduck, query_table_schema() throws an assertion error if the table path exists in more than 1 database.
I have a motherduck project with multiple databases, 1 per environment. I use these environments to materialize dbt models. In this example, I have a schema called dbt_marts with a table called orders. This exists in 3 databases, namely production, development, and local_test. The schema in all 3 tables is exactly the same.
I came across this bug when experimenting with data-diff.
then I ran the command: data-diff --conf data-diff-config.toml --run test2
which throws:
Traceback (most recent call last):
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/bin/data-diff", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 344, in main
return _data_diff(
^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 482, in _data_diff
schemas = list(differ._thread_map(_get_schema, safezip(dbs, table_paths)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 77, in _get_schema
return db.query_table_schema(table_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/databases/base.py", line 1042, in query_table_schema
assert len(d) == len(rows)
AssertionError
I traced it back to this line. It takes the output of the information_schema query over here and converts it to a dictionary. This line verifies that the number of rows in the dictionary (representing columns in the schema) is the same as the number of rows in the original result.
This fails in my case, because the information_schema returns the columns from all that meet WHERE table_name = '{name}' AND table_schema = '{schema}', regardless of the database in which they're in. During the dictionary conversion however, the set of rows is reduced to unique ones, so the assertion fails.
Not sure how this should be fixed. Not sure if it's expected that the information_schema returns columns from tables in other databases than the one specified in the connection parameters. One way to avoid this error is modifying the where clause as follows:
WHERE table_name = '{name}' AND table_schema = '{schema}' AND table_catalog = '[database-name]'
Describe the environment
Running MacOS with data-diff version v0.10.1 and duckdb 0.9.2
The text was updated successfully, but these errors were encountered:
I replicated this bug with the example configs below. You made a great suggestion on the fix and that was all it took. Note that you'll need to provide the full table path: database.schema.table vs. your original example: schema.table to make this work correctly. This makes sense given your situation with different database names but consistent schema and table names.
Can you try installing from my public feature branch and verify it looks and feels the way you want before I wrote some tests?
# in your working git repo run these commands
git clone https://github.com/datafold/data-diff.git #clone sourcecd data-diff # change directory to source
git checkout fix-motherduck-bug # checkout feature branchecd .. # go back to your dbt project
pip install -e ../data-diff # install from source
data-diff --conf data-diff-config.toml --run test2 --debug # run your test command
Describe the bug
When connecting to Motherduck,
query_table_schema()
throws an assertion error if the table path exists in more than 1 database.I have a motherduck project with multiple databases, 1 per environment. I use these environments to materialize dbt models. In this example, I have a schema called
dbt_marts
with a table calledorders
. This exists in 3 databases, namelyproduction
,development
, andlocal_test
. The schema in all 3 tables is exactly the same.I came across this bug when experimenting with data-diff.
data-diff-config.toml
:then I ran the command:
data-diff --conf data-diff-config.toml --run test2
which throws:
I traced it back to this line. It takes the output of the information_schema query over here and converts it to a dictionary. This line verifies that the number of rows in the dictionary (representing columns in the schema) is the same as the number of rows in the original result.
This fails in my case, because the information_schema returns the columns from all that meet
WHERE table_name = '{name}' AND table_schema = '{schema}'
, regardless of the database in which they're in. During the dictionary conversion however, the set of rows is reduced to unique ones, so the assertion fails.Not sure how this should be fixed. Not sure if it's expected that the information_schema returns columns from tables in other databases than the one specified in the connection parameters. One way to avoid this error is modifying the where clause as follows:
WHERE table_name = '{name}' AND table_schema = '{schema}' AND table_catalog = '[database-name]'
Describe the environment
Running MacOS with data-diff version v0.10.1 and duckdb 0.9.2
The text was updated successfully, but these errors were encountered: