feat(arrow): Allow record batches output from read_sql #819
Conversation
connectorx-python/src/arrow.rs
Outdated
    pub fn to_ptrs<'py>(&self, py: Python<'py>) -> Bound<'py, PyAny> {
        let ptrs = py.allow_threads(
            || -> Result<(Vec<String>, Vec<Vec<(uintptr_t, uintptr_t)>>), ConnectorXPythonError> {
                let rbs = vec![self.0.clone()];
Is this okay, or do you suggest any workarounds? It doesn't work without `.clone()`; it breaks with the following:

    cannot move out of `self` which is behind a shared reference
    move occurs because `self.0` has type `arrow::array::RecordBatch`, which does not implement the `Copy` trait
I think we can wrap an `Option<RecordBatch>` instead of `RecordBatch` and use `take` to resolve this.
Also, since we are using an iterator to generate one batch at a time, we do not need to wrap a vector of batches.
@wangxiaoying for your review.
Thanks @chitralverma for the PR! I will take a look at it by the end of this week.
        *,
        return_type: Literal[
-           "pandas", "polars", "arrow", "modin", "dask"
+           "pandas", "polars", "arrow", "modin", "dask", "arrow_record_batches"
Maybe use `arrow_stream` instead of `arrow_record_batches` for simplicity?
    elif return_type in {"arrow", "polars", "arrow_record_batches"}:
        try_import_module("pyarrow")

    record_batch_size = int(kwargs.get("record_batch_size", 10000))
Maybe `batch_size` instead of `record_batch_size` for simplicity?
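For illustration, a minimal sketch of how the call would look under each naming proposal. Both spellings come from this thread, and the connection string, query, and keyword names are placeholders rather than the final API:

```python
import connectorx as cx

conn = "postgresql://user:pass@localhost:5432/db"  # placeholder connection string
query = "SELECT * FROM lineitem"                   # placeholder query

# Names as originally proposed in this PR:
reader = cx.read_sql(conn, query, return_type="arrow_record_batches",
                     record_batch_size=10000)

# Names as suggested in review:
reader = cx.read_sql(conn, query, return_type="arrow_stream", batch_size=10000)
```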
Hi @chitralverma, thanks for waiting! The code looks good in general to me. I have made some changes to the code.
I also left a few comments on the API. Can you take a look at my changes and the comments? If everything looks good, we can update the documentation and have a new release!
Getting this error with the new implementation on a PostgreSQL table.
You are right @kevinbds. I think we can have a separate PR for completing the arrow stream types as well as the type conversions.
I have merged the PR and released an alpha version.
Hi @wangxiaoying, it seems like version 0.4.4a2 only has the ARM build available, so I can't run tests on this version. Besides that, this issue #819 (comment) will still happen, right?
Thanks for the reminder. The upload failed because the space limit was reached. I have deleted some old alpha versions on PyPI and rerun the upload action. All compiled wheel files should be available on PyPI now.
Yes.
Have tested this in the context of dlt extraction pipelines, but I am getting this error:


Changes
- Added `new_record_batch_iter` to expose a pyarrow `RecordBatchReader` on the Python side
- Added `kwargs` to `read_sql`; users can pass `record_batch_size` to control the number of records in each record batch
- Fixed `unwrap`s causing issues
- Updated the `RecordBatchReader` trait to support `Send` (helps offload the `RecordBatchReader` to multi-threaded consumers like DuckDB)

Usage / Example
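A minimal usage sketch, assuming the reviewer's suggested names (`arrow_stream` / `batch_size`) and a placeholder PostgreSQL connection string; the names in the released version may differ:

```python
import connectorx as cx

reader = cx.read_sql(
    "postgresql://user:pass@localhost:5432/db",  # placeholder connection string
    "SELECT * FROM lineitem",                    # placeholder query
    return_type="arrow_stream",                  # "arrow_record_batches" in the original proposal
    batch_size=10_000,                           # "record_batch_size" in the original proposal
)

# The result behaves like a pyarrow.RecordBatchReader: record batches are
# produced lazily, one at a time, instead of materializing the full result set.
for batch in reader:
    print(batch.num_rows)

# Because the underlying iterator supports Send, the reader can also be handed
# to multi-threaded consumers such as DuckDB (hypothetical, relying on DuckDB's
# Arrow replacement scan over local Python variables):
# import duckdb
# duckdb.sql("SELECT count(*) FROM reader").show()
```

The batch-at-a-time behaviour is the point of the new return type, in contrast to `return_type="arrow"`, which materializes the whole result as a single pyarrow Table before returning.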
closes #278