fix: support results with STRUCT and ARRAY columns containing JSON subfields in to_pandas_batches() (#2216)
This reverts commit 8c34512.
tswast left a comment:
Thanks! Looks good to me once the typing and presubmit failures are addressed.
bigframes/core/blocks.py
Outdated
```python
for col in itertools.chain(self.value_columns, self.index_columns):
    dtype = self.expr.get_column_type(col)
    if bigframes.dtypes.contains_db_dtypes_json_dtype(dtype):
        # Due to a limitation in Apache Arrow (#45262), JSON columns are not
        # natively supported by the to_pandas_batches() method, which is
        # used by the anywidget backend.
        # Workaround for https://github.com/googleapis/python-bigquery-dataframes/issues/1273
        # PyArrow doesn't support creating an empty array with
        # db_dtypes.JSONArrowType, especially when nested.
        # Create with string type and then cast.

        # MyPy doesn't automatically narrow the type of 'dtype' here,
        # so we add an explicit check.
        if isinstance(dtype, pd.ArrowDtype):
            safe_pa_type = bigframes.dtypes._replace_json_arrow_with_string(
                dtype.pyarrow_dtype
            )
            safe_dtype = pd.ArrowDtype(safe_pa_type)
            series_map[col] = pd.Series([], dtype=safe_dtype).astype(dtype)
        else:
            # This branch should ideally not be reached if
            # contains_db_dtypes_json_dtype is accurate,
            # but it's here for MyPy's sake.
            series_map[col] = pd.Series([], dtype=dtype)
```
@chelsea-lin I assume we have similar code that does this, right? Maybe there's something that could be reused here?
Yeah, we have something similar in the loader component but they're slightly different.
```python
def _validate_dtype_can_load(name: str, column_type: bigframes.dtypes.Dtype):
```
Also, I agree that we can simplify the logic a little bit, for example:

```python
dtype = pd.ArrowDtype(pa.list_(pa.struct([("key", db_dtypes.JSONArrowType())])))
try:
    s = pd.Series([], dtype=dtype)
except pa.ArrowNotImplementedError:
    s = pd.Series(
        [], dtype=pd.ArrowDtype(_replace_json_arrow_with_string(dtype.pyarrow_dtype))
    ).astype(dtype)
```
The logic has been simplified
Thanks! The new logic looks even better!
Nit: I renamed the PR to be a little more user-oriented. Users don't care as much about the internal limitations. What changed from their perspective is that they can read results containing these columns with to_pandas_batches().
bigframes/dtypes.py
Outdated
```python
return contains_db_dtypes_json_arrow_type(dtype.pyarrow_dtype)


def _replace_json_arrow_with_string(pa_type: pa.DataType) -> pa.DataType:
```
This function may be similar to the following two methods. Can you help remove the one in loader.py?
```python
def _has_json_arrow_type(arrow_type: pa.DataType) -> bool:
def contains_db_dtypes_json_arrow_type(type_):
```
Since I removed this function, this code refactor is no longer relevant to this PR. I will start a new PR (#2221) for this code refactor.
Simplified the logic in bigframes/core/blocks.py, and removed the now-unused function from bigframes/dtypes.py. Replaced the construction of the empty DataFrame with the more robust try...except block that leverages to_pyarrow() and empty_table().
bigframes/core/blocks.py
Outdated
```python
try:
    empty_arrow_table = self.expr.schema.to_pyarrow().empty_table()
except pa.ArrowNotImplementedError:
    # Bug with some pyarrow versions: empty_table only supports base
    # storage types, not extension types.
```
Nit: can you please add the bug ID in the comment?
🤖 I have created a release *beep* *boop*

---

## [2.29.0](v2.28.0...v2.29.0) (2025-11-10)

### Features

* Add bigframes.bigquery.st_regionstats to join raster data from Earth Engine ([#2228](#2228)) ([10ec52f](10ec52f))
* Add DataFrame.resample and Series.resample ([#2213](#2213)) ([c9ca02c](c9ca02c))
* SQL Cell no longer escapes formatted string values ([#2245](#2245)) ([d2d38f9](d2d38f9))
* Support left_index and right_index for merge ([#2220](#2220)) ([da9ba26](da9ba26))

### Bug Fixes

* Correctly iterate over null struct values in ManagedArrowTable ([#2209](#2209)) ([12e04d5](12e04d5))
* Simplify UnsupportedTypeError message ([#2212](#2212)) ([6c9a18d](6c9a18d))
* Support results with STRUCT and ARRAY columns containing JSON subfields in `to_pandas_batches()` ([#2216](#2216)) ([3d8b17f](3d8b17f))

### Documentation

* Switch API reference docs to pydata theme ([#2237](#2237)) ([9b86dcf](9b86dcf))
* Update notebook for JSON subfields support in to_pandas_batches() ([#2138](#2138)) ([5663d2a](5663d2a))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
This commit addresses an issue where creating empty DataFrames with nested JSON columns would fail due to PyArrow's inability to create empty arrays with db_dtypes.JSONArrowType (Apache Arrow issue #45262).

Changes:
- This workaround is specifically needed for the anywidget backend, which uses to_pandas_batches().
Fixes #<456577463> 🦕