
BUG/TST: fix arrow roundtrip / parquet tests for recent pyarrow #30077

Merged
merged 2 commits into pandas-dev:master on Dec 18, 2019

Conversation

jorisvandenbossche
Member

Closes #29976

@jorisvandenbossche added the Bug, ExtensionArray, and Testing labels on Dec 5, 2019
@jorisvandenbossche added this to the 1.0 milestone on Dec 5, 2019
results = []
for arr in chunks:
    # TODO should optimize this without going through object array
    bool_arr = BooleanArray._from_sequence(np.array(arr))
Member

Suggested change
- bool_arr = BooleanArray._from_sequence(np.array(arr))
+ bool_arr = BooleanArray._from_sequence(np.asarray(arr))

No?

Member Author

A conversion from a pyarrow boolean array (which is bit-packed) to a numpy array can never be done without a copy, so I think it shouldn't matter whether np.array or np.asarray is used here.
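
To illustrate why the copy is unavoidable, here is a minimal pure-Python sketch. It assumes the bit-packed, least-significant-bit-first layout that the Arrow format documents for boolean data, and it uses neither pyarrow nor numpy; `pack_bits` and `unpack_bits` are hypothetical helper names for illustration only.

```python
def pack_bits(values):
    """Pack booleans eight per byte, least-significant bit first."""
    packed = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, length):
    """Expand a bit-packed buffer to one byte per value (a numpy-bool-like layout)."""
    return bytes((packed[i // 8] >> (i % 8)) & 1 for i in range(length))


values = [True, False, True, True, False, False, False, True, True, False]
packed = pack_bits(values)
unpacked = unpack_bits(packed, len(values))

# Ten values occupy 2 bytes when bit-packed but 10 bytes at one byte per value,
# so the one-byte-per-value form has to be a newly allocated buffer.
print(len(packed), len(unpacked))
```

Because the two layouts differ in size, the unpacked buffer cannot be a view onto the packed one, which is the point being made about np.array versus np.asarray.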

chunks = array.chunks

results = []
for arr in chunks:
Contributor

Can you write this method generically and put it on the base class or an Arrow mixin class?

as it already looks like it would work for any extension type (except the final use of BooleanArray)

Member Author

The IntegerDtype implementation is different, though. The implementation here is indeed more or less the same as for the StringArray, but once we fix the mentioned TODO (to avoid going through object dtype), the Boolean one will also be custom.

The StringDtype one could still be put in a base mixin (it needs to be a mixin, and not directly in the base ExtensionDtype class, as for pyarrow the presence or absence of this method is relevant), but no one else would be using it for now.
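
A rough sketch of the mixin idea, using stand-in classes rather than real pandas or pyarrow objects. Every name here (`ArrowRoundTripMixin`, `_array_from_chunk`, `FakeChunkedArray`, and the two dtype classes) is hypothetical, and the `hasattr` check is only an assumption about how a caller might test for the method's presence.

```python
class ArrowRoundTripMixin:
    """Hypothetical mixin supplying a generic __from_arrow__ (illustration only)."""

    def _array_from_chunk(self, chunk):
        # Hook each dtype would override to build its own array type.
        raise NotImplementedError

    def __from_arrow__(self, array):
        # A chunked array exposes .chunks; a plain array is treated as one chunk.
        chunks = getattr(array, "chunks", [array])
        results = []
        for chunk in chunks:
            results.extend(self._array_from_chunk(chunk))
        return results


class PlainDtype:
    """Stand-in for a dtype that does not support Arrow round-tripping."""


class StringLikeDtype(ArrowRoundTripMixin, PlainDtype):
    """Stand-in for a dtype that opts in by inheriting the mixin."""

    def _array_from_chunk(self, chunk):
        return [str(item) for item in chunk]


class FakeChunkedArray:
    """Stand-in for a chunked array: just a holder of a list of chunks."""

    def __init__(self, chunks):
        self.chunks = chunks


# Only the dtype that inherits the mixin exposes the method.
print(hasattr(StringLikeDtype(), "__from_arrow__"))
print(hasattr(PlainDtype(), "__from_arrow__"))
print(StringLikeDtype().__from_arrow__(FakeChunkedArray([["a", "b"], ["c"]])))
```

Keeping the method out of the common base class means a dtype without Arrow support simply does not have it, so a presence check still distinguishes the two cases.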

Contributor

ok, yeah, we want to avoid repeating this code as much as possible, so having a generic (but working) impl would be good, and using helpers / properties to ease the burden on each dtype would also be great. Sure, this could be done later as well.

@@ -101,6 +101,24 @@ def __repr__(self) -> str:
    def _is_boolean(self) -> bool:
        return True

    def __from_arrow__(self, array):
Member

can this be annotated? or at least have types in the docstring

Member Author

@jorisvandenbossche jorisvandenbossche Dec 5, 2019

The type is mentioned in the docstring (just not in a parameters section, but can turn it into a more fully fledged docstring). No one except for pyarrow should be calling this method though.

Question for annotating: how does it work to annotate it with pyarrow objects that cannot necessarily be imported? (since it's an optional dependency)

Member Author

Any idea about the annotation?
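
One common pattern for annotating with types from an optional dependency is to import the package only under `typing.TYPE_CHECKING` and write the annotation as a string. The sketch below is not taken from this PR: `ExampleDtype` is a hypothetical class, the placeholder body does not build a real extension array, and the assumption that the argument is a `pyarrow.Array` or `pyarrow.ChunkedArray` is mine.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime,
    # so this module imports fine when pyarrow is not installed.
    import pyarrow


class ExampleDtype:
    """Hypothetical dtype, for illustration only."""

    def __from_arrow__(
        self, array: "Union[pyarrow.Array, pyarrow.ChunkedArray]"
    ) -> list:
        # Placeholder body: a real implementation would build an extension array.
        return list(array)


# The annotation is stored as a plain string and is not evaluated at runtime.
annotation = ExampleDtype.__from_arrow__.__annotations__["array"]
print(annotation)
```

Since the annotation stays a string at runtime, nothing tries to resolve `pyarrow` unless a type checker or an explicit `typing.get_type_hints` call does so.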

@jbrockmendel
Member

couple of questions, none blockers, LGTM

Contributor

@jreback jreback left a comment

lgtm. add types if you can; ideally followup to move this code generically to base class of EA

@jorisvandenbossche
Member Author

Going to merge this, as it fixes the tests. Will revisit extracting base functionality / utility function in the period/interval PR.

@jorisvandenbossche jorisvandenbossche merged commit 4e807a2 into pandas-dev:master Dec 18, 2019
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Testing pandas testing functions or related to the test suite
Development

Successfully merging this pull request may close these issues.

test_additional_extension_arrays fails with pd.NA
4 participants