Conversation

@chitralverma (Contributor) commented Jun 20, 2025

Changes

  • Builds on the existing new_record_batch_iter to expose a pyarrow RecordBatchReader on the Python side
  • Supports fully lazy iteration over the arrow stream destination
  • Added kwargs to read_sql; users can pass record_batch_size to control the number of records in each record batch
  • Fixed a few unwraps that were causing issues
  • Updated the RecordBatchReader trait to require Send, which helps offload the reader to multi-threaded consumers like DuckDB (see the sketch after this list)
  • Left existing implementations as is; ideally those can also rely on the record batch approach
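A minimal sketch of why the Send bound matters, assuming a reader that implements arrow's RecordBatchReader trait (the function below is illustrative, not the connectorx implementation):

```rust
use std::thread;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatchReader;

// Because the reader is `Send`, it can be moved into another thread and
// drained there, e.g. by a multi-threaded consumer such as DuckDB.
fn count_rows_on_worker<R>(reader: R) -> Result<usize, ArrowError>
where
    R: RecordBatchReader + Send + 'static,
{
    thread::spawn(move || {
        let mut rows = 0;
        for batch in reader {
            rows += batch?.num_rows();
        }
        Ok(rows)
    })
    .join()
    .expect("worker thread panicked")
}
```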

Usage / Example

import connectorx as cx

conn = "mysql://username:password@server:port/database/"
query = "SELECT * FROM employees"

rb_iter = cx.read_sql(
    conn,
    query,
    return_type="arrow_record_batches",
    record_batch_size=120333,
)
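For context, a minimal sketch of how the returned object could be consumed, assuming it behaves like a standard pyarrow.RecordBatchReader (variable names taken from the example above):

```python
import pyarrow as pa

# Batches are pulled lazily, one at a time, so memory stays bounded by
# record_batch_size rather than the size of the full result set.
for batch in rb_iter:
    assert isinstance(batch, pa.RecordBatch)
    print(batch.num_rows)

# Alternatively, the remaining stream can be materialized in one go:
# table = rb_iter.read_all()
```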

closes #278

pub fn to_ptrs<'py>(&self, py: Python<'py>) -> Bound<'py, PyAny> {
    let ptrs = py.allow_threads(
        || -> Result<(Vec<String>, Vec<Vec<(uintptr_t, uintptr_t)>>), ConnectorXPythonError> {
            let rbs = vec![self.0.clone()];
@chitralverma (Contributor, Author) commented Jun 20, 2025

Is this okay or do you suggest any workarounds? It doesn't work without `.clone()`; it breaks with the following error:

cannot move out of `self` which is behind a shared reference
move occurs because `self.0` has type `arrow::array::RecordBatch`, which does not implement the `Copy` trait

Reply (Contributor):

I think we can wrap over Option<RecordBatch> instead of RecordBatch along with take to resolve this.

Also, since we are using an iterator to generate a batch at a time, we do not need to wrap over a vector of batches.
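A minimal sketch of the Option-plus-take idea, assuming the wrapper can take `&mut self` (illustrative names, not the actual connectorx code):

```rust
use arrow::record_batch::RecordBatch;

// Wrapping the batch in an Option lets it be moved out exactly once with
// `take()`, leaving `None` behind, so no `.clone()` is required.
pub struct PyRecordBatch(Option<RecordBatch>);

impl PyRecordBatch {
    pub fn new(rb: RecordBatch) -> Self {
        Self(Some(rb))
    }

    // `take()` swaps the inner value for `None` and returns the owned batch.
    pub fn take_batch(&mut self) -> Option<RecordBatch> {
        self.0.take()
    }
}
```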

@chitralverma (Contributor, Author)

@wangxiaoying for your review.
If this seems OK, I'll update the PR with documentation, examples, and such.

@chitralverma changed the title from "Allow record batches output from read_sql" to "feat(arrow): Allow record batches output from read_sql" on Jun 20, 2025
@wangxiaoying (Contributor)

Thanks @chitralverma for the PR! I will take a look at it by the end of this week.

@wangxiaoying self-requested a review on June 28, 2025 at 19:33
*,
return_type: Literal[
-    "pandas", "polars", "arrow", "modin", "dask"
+    "pandas", "polars", "arrow", "modin", "dask", "arrow_record_batches"
Review comment (Contributor):

Maybe use arrow_stream instead of arrow_record_batches for simplicity?

elif return_type in {"arrow", "polars", "arrow_record_batches"}:
    try_import_module("pyarrow")

    record_batch_size = int(kwargs.get("record_batch_size", 10000))
Review comment (Contributor):

Maybe batch_size instead of record_batch_size for simplicity?


@wangxiaoying (Contributor)

Hi @chitralverma, thanks for waiting!

The code looks good in general to me. I have made some changes to the code, including:

  1. Adding unit tests for getting an arrow stream in Python: connectorx-python/connectorx/tests/test_arrow.py
  2. Resolving the CI by keeping the old arrow interface for the arrow and polars destinations.
  3. Avoiding the clone when generating arrow by wrapping the record batch in an Option.

I also left a few comments on the API. Can you take a look at my changes and the comments? If everything looks good, we can update the documentation and have a new release!
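For illustration, such a test might look roughly like the sketch below (hypothetical; the fixture name, table, and assertions are assumptions, not the contents of test_arrow.py):

```python
import connectorx as cx
import pyarrow as pa


def test_arrow_record_batches(postgres_url):  # hypothetical connection-string fixture
    reader = cx.read_sql(
        postgres_url,
        "SELECT * FROM test_table",  # hypothetical table
        return_type="arrow_record_batches",
        record_batch_size=2,
    )
    batches = list(reader)
    assert all(isinstance(b, pa.RecordBatch) for b in batches)
    assert sum(b.num_rows for b in batches) > 0
```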

@kevinbds

@wangxiaoying

Getting this error with the new implementation on a PostgreSQL table with an array<str> column:

called `Result::unwrap()` on an `Err` value: ConnectorX(NoConversionRule("TextArray(true)", "connectorx::destinations::arrowstream::typesystem::ArrowTypeSystem")

The "arrow" implementation works fine with the same table. Seems like a missing conversion rule for PostgreSQL text arrays.

@wangxiaoying (Contributor)

> @wangxiaoying
>
> Getting this error with the new implementation on a PostgreSQL table with an array<str> column:
>
> called `Result::unwrap()` on an `Err` value: ConnectorX(NoConversionRule("TextArray(true)", "connectorx::destinations::arrowstream::typesystem::ArrowTypeSystem")
>
> The "arrow" implementation works fine with the same table. Seems like a missing conversion rule for PostgreSQL text arrays.

You are right @kevinbds. The TextArray conversion should be added to the postgres_arrowstream transport file (similar to the postgres_arrow transport file). Before that, we also need to add the Utf8Array type to arrowstream (similar to the same type in arrow).

I think we can have a separate PR for completing the arrowstream types as well as type conversions.

@wangxiaoying merged commit da319be into sfu-db:main on Jul 12, 2025
2 checks passed
@wangxiaoying (Contributor)

I have merged the PR and released an alpha version, 0.4.4-alpha.2, for this; please feel free to try it out!

@kevinbds

Hi @wangxiaoying,

It seems like version 0.4.4a2 only has the ARM build available, so I can't run tests on this version.

Besides that, this issue #819 (comment) will still happen, right?

@wangxiaoying (Contributor)

> Hi @wangxiaoying,
>
> It seems like version 0.4.4a2 only has the ARM build available, so I can't run tests on this version.

Thanks for the reminder. The upload failed because the space limit was reached. I have deleted some old alpha versions on PyPI and rerun the upload action. All compiled wheel files should be available on PyPI now.

> Besides that, this issue #819 (comment) will still happen, right?

Yes.

@SebZbp

SebZbp commented Jul 16, 2025

Have tested this in the context of dlt extraction pipelines as is, but I am getting this error:
dlt-hub/dlt#2840 (comment)
