Metadata Query Feature for ConnectorX #704

pangjunrong · 2024-11-15T06:19:05Z

pangjunrong
Nov 15, 2024
Collaborator

Hey @wangxiaoying, so I'm noticing that the fetching metadata process uses whatever query is passed into connector-x and for cases with a list of queries, the first is chosen for the task.

While this works fine for a generic SELECT * FROM TABLE query, there is substantial overhead when a user passes a query which performs full table scans. For example, if a user wants to provide a list of queries because they are performing the partition logic themselves by constructing each query with specific partition conditions, the fetching metadata process will append a LIMIT 1 or equivalent clause to it.

Using Oracle as an example, if the first query was something like the below where the user wants to fetch the first partition as 5 million rows:
SELECT * FROM (SELECT SQ.*, ROWNUM AS RN FROM TABLE WHERE ROWNUM < 5000000) WHERE RN >= 1

The metadata query will execute:
SELECT * FROM (SELECT SQ.*, ROWNUM AS RN FROM TABLE WHERE ROWNUM < 5000000) WHERE RN >= 1) FETCH FIRST 1 ROW ONLY

As such, even before the actual list of queries is executed to retrieve data, the metadata fetching process takes a long time to execute given that at best, it still needs to scan 5 million rows first before fetching the first row for the schema since the optimizer for Oracle will retrieve the full resultset of the inner query first.

I was wondering if we could have an optional argument for the read_sql method to specify a custom metadata query? This way, even if users have complex queries, they can explicitly override the internal logic for fetching metadata to use a more optimized query instead. So in the case above, we would have something like this for the arguments:

conn = 'some connection string', query = ['query1', 'query2', 'query3'], metadata_query = 'SELECT * FROM TABLE'

pangjunrong · 2024-11-15T06:23:23Z

pangjunrong
Nov 15, 2024
Collaborator Author

If this makes sense, I can look into implementing it and raising a PR for the change. At the moment, I'm using a workaround by placing a dummy metadata query at the start of my list of queries and removing the first row of the resultset when it is returned. It works but I figured maybe this approach can be integrated given that DBs like Oracle often choose terrible execution plans with complex queries.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Metadata Query Feature for ConnectorX #704

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Metadata Query Feature for ConnectorX #704

Uh oh!

pangjunrong Nov 15, 2024 Collaborator

Replies: 1 comment

Uh oh!

pangjunrong Nov 15, 2024 Collaborator Author

pangjunrong
Nov 15, 2024
Collaborator

pangjunrong
Nov 15, 2024
Collaborator Author