Metadata Query Feature for ConnectorX #704
pangjunrong
started this conversation in
Ideas
Replies: 1 comment
-
If this makes sense, I can look into implementing it and raising a PR for the change. At the moment, I'm using a workaround by placing a dummy metadata query at the start of my list of queries and removing the first row of the resultset when it is returned. It works but I figured maybe this approach can be integrated given that DBs like Oracle often choose terrible execution plans with complex queries. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Hey @wangxiaoying, so I'm noticing that the fetching metadata process uses whatever query is passed into connector-x and for cases with a list of queries, the first is chosen for the task.
While this works fine for a generic
SELECT * FROM TABLE
query, there is substantial overhead when a user passes a query which performs full table scans. For example, if a user wants to provide a list of queries because they are performing the partition logic themselves by constructing each query with specific partition conditions, the fetching metadata process will append aLIMIT 1
or equivalent clause to it.Using Oracle as an example, if the first query was something like the below where the user wants to fetch the first partition as 5 million rows:
SELECT * FROM (SELECT SQ.*, ROWNUM AS RN FROM TABLE WHERE ROWNUM < 5000000) WHERE RN >= 1
The metadata query will execute:
SELECT * FROM (SELECT SQ.*, ROWNUM AS RN FROM TABLE WHERE ROWNUM < 5000000) WHERE RN >= 1) FETCH FIRST 1 ROW ONLY
As such, even before the actual list of queries is executed to retrieve data, the metadata fetching process takes a long time to execute given that at best, it still needs to scan 5 million rows first before fetching the first row for the schema since the optimizer for Oracle will retrieve the full resultset of the inner query first.
I was wondering if we could have an optional argument for the
read_sql
method to specify a custom metadata query? This way, even if users have complex queries, they can explicitly override the internal logic for fetching metadata to use a more optimized query instead. So in the case above, we would have something like this for the arguments:conn = 'some connection string', query = ['query1', 'query2', 'query3'], metadata_query = 'SELECT * FROM TABLE'
Beta Was this translation helpful? Give feedback.
All reactions