Skip to content

ENH: Categorical.empty #40602

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Apr 13, 2021
Merged

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

xref #39776
xref dask/fastparquet#576 (comment)

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche looks like TestCategoricalConcat::test_categorical_concat with ArrayManager is broken on master

@jorisvandenbossche
Copy link
Member

looks like TestCategoricalConcat::test_categorical_concat with ArrayManager is broken on master

But it's not failing in CI on master? (but it's a bit strange how this PR would impact that test ..)

@jbrockmendel
Copy link
Member Author

But it's not failing in CI on master? (but it's a bit strange how this PR would impact that test ..)

IIUC that file isnt enabled for the ArrayManager tests, so it isnt such a problem that the test fails locally on master, mainly a mystery as to why it was run on this particular CI build?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 24, 2021

Ah, I think that's the same pytest mystery bug that I saw with the pytables tests. It's sometimes not skipping the first test of a file when using a marker for a full directory (in the __init__.py). For PyTables I solved it by adding the mark to each of the individual files.

@jorisvandenbossche
Copy link
Member

#39612 will also solve that pytest-mark issue

@jorisvandenbossche
Copy link
Member

On the actual PR: I very much would like to see such functionality. But:

@jbrockmendel
Copy link
Member Author

I think if adding it, it should be a general interface method (and not only added to a subset of our internal EAs), so xref #39776

I agree, but don't have any bright ideas for a general-case implementation. If it has obj._can_hold_na we can use obj.take([-1]*length).

I would add it as a method on the dtype, and not the array (you will typically have the dtype as a start, and then that avoids doing dtype.construct_array_type().empty(shape, dtype=dtype), where dtype.empty(shape) is easier)

I thought about this, but that leaves out DatetimeArray[naive] and TimedeltaArray. The workarounds for that are't that bad though, so id be open to this.

(in general i think many EA methods would make more sense as EADtype methods, xref #40574)

@jbrockmendel
Copy link
Member Author

If it has obj._can_hold_na we can use obj.take([-1]*length).

Tried this and got failures with ArrowBoolArray and StringArray

@jreback jreback added the Categorical Categorical Data Type label Mar 29, 2021
@jreback
Copy link
Contributor

jreback commented Mar 29, 2021

on making this a general purpose method (in this PR), is this possible?

I think this is ok on the array itself, this follows numpy convention.

@jreback jreback added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Mar 29, 2021
@jorisvandenbossche
Copy link
Member

I think this is ok on the array itself, this follows numpy convention.

Numpy doesn't have an empty() method (only a function)

I thought about this, but that leaves out DatetimeArray[naive] and TimedeltaArray. The workarounds for that are't that bad though, so id be open to this.

Ah, yes. Now, since those are not proper EAs, they IMO shouldn't direct the EA design, so if the workarounds are doable, I would personally still prefer it as a method on the dtype.

@jorisvandenbossche
Copy link
Member

If it has obj._can_hold_na we can use obj.take([-1]*length).

Tried this and got failures with ArrowBoolArray and StringArray

You can override the base one with custom implementations for those cases where it doesn't work.
But at least for StringArray it seems to work for me, though (pd.array([], dtype="string").take([-1]*10, allow_fill=True))

@jbrockmendel
Copy link
Member Author

they IMO shouldn't direct the EA design, so if the workarounds are doable, I would personally still prefer it as a method on the dtype.

Fair enough. I'll give it a go.

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche getting the ArrowExtensionArray test working is going to require making its _from_sequence not-ignore the dtype arg. can i get your help with that? (im fine xfailing that for now)

@jorisvandenbossche
Copy link
Member

Will take a look tomorrow

@jorisvandenbossche
Copy link
Member

I might be missing something (I didn't run actual code / the tests), but you can just pass through the dtype, and handle it (cast if specified)? The only thing that's missing is a mapping of the ExtensionDtype to the arrow type:

--- a/pandas/tests/extension/arrow/arrays.py
+++ b/pandas/tests/extension/arrow/arrays.py
@@ -35,6 +35,7 @@ class ArrowBoolDtype(ExtensionDtype):
     kind = "b"
     name = "arrow_bool"
     na_value = pa.NULL
+    arrow_type = pa.bool_()
 
     @classmethod
     def construct_array_type(cls) -> type_t[ArrowBoolArray]:
@@ -59,6 +60,7 @@ class ArrowStringDtype(ExtensionDtype):
     kind = "U"
     name = "arrow_string"
     na_value = pa.NULL
+    arrow_type = pa.string()
 
     @classmethod
     def construct_array_type(cls) -> type_t[ArrowStringArray]:
@@ -76,8 +78,10 @@ class ArrowExtensionArray(OpsMixin, ExtensionArray):
     _data: pa.ChunkedArray
 
     @classmethod
-    def from_scalars(cls, values):
+    def from_scalars(cls, values, dtype=None):
         arr = pa.chunked_array([pa.array(np.asarray(values))])
+        if dtype is not None:
+            arr = arr.cast(dtype.arrow_type)
         return cls(arr)
 
     @classmethod
@@ -87,7 +91,7 @@ class ArrowExtensionArray(OpsMixin, ExtensionArray):
 
     @classmethod
     def _from_sequence(cls, scalars, dtype=None, copy=False):
-        return cls.from_scalars(scalars)
+        return cls.from_scalars(scalars, dtype=dtype)
 
     def __repr__(self):
         return f"{type(self).__name__}({repr(self._data)})"

@jorisvandenbossche
Copy link
Member

To recap from above:

  • I am -1 on adding it as a public method on the array class (it doesn't need the array class but the dtype instance, and it's not consistent with numpy)
  • An aspect that I brought up in API: ExtensionArray interface method to create an empty / all-NA array for a given dtype #39776 is that we should specify if it expects an "uninitialized" array (like np.empty) or an all-missing array. Probably both are useful, and it could maybe be a keyword argument to ask for the all-missing?

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche implemented the arrow suggestions, got:

__________________________________________ TestConstructors.test_empty ___________________________________________

self = <pandas.tests.extension.arrow.test_bool.TestConstructors object at 0x137094790>
dtype = <pandas.tests.extension.arrow.arrays.ArrowBoolDtype object at 0x1370945b0>

    def test_empty(self, dtype):
>       super().test_empty(dtype)

pandas/tests/extension/arrow/test_bool.py:87: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/extension/base/constructors.py:128: in test_empty
    result = cls._empty((4,), dtype=dtype)
pandas/core/arrays/base.py:1315: in _empty
    result = obj.take(taker, allow_fill=True)
pandas/tests/extension/arrow/arrays.py:155: in take
    result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
pandas/core/algorithms.py:1510: in take
    result = take_nd(
pandas/core/array_algos/take.py:100: in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
pandas/core/series.py:859: in take
    nv.validate_take((), kwargs)
pandas/compat/numpy/function.py:76: in __call__
    validate_kwargs(fname, kwargs, self.defaults)
pandas/util/_validators.py:152: in validate_kwargs
    _check_for_invalid_keys(fname, kwargs, compat_args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

fname = 'take', kwargs = {'allow_fill': True, 'fill_value': <pyarrow.NullScalar: None>}
compat_args = {'mode': 'raise', 'out': None}

    def _check_for_invalid_keys(fname, kwargs, compat_args):
        """
        Checks whether 'kwargs' contains any keys that are not
        in 'compat_args' and raises a TypeError if there is one.
        """
        # set(dict) --> set of the dictionary's keys
        diff = set(kwargs) - set(compat_args)
    
        if diff:
            bad_arg = list(diff)[0]
>           raise TypeError(f"{fname}() got an unexpected keyword argument '{bad_arg}'")
E           TypeError: take() got an unexpected keyword argument 'allow_fill'

pandas/util/_validators.py:126: TypeError

@jbrockmendel
Copy link
Member Author

updated to change .empty -> ._empty

@jreback
Copy link
Contributor

jreback commented Apr 12, 2021

how does this help when this is a private method?

@jbrockmendel
Copy link
Member Author

how does this help when this is a private method?

it doesn't until we decide on somewhere we want to expose it. I'm thinking we put an empty function in core.construction (though it might be nice to have a namespace akin to com for things like that)

@jreback
Copy link
Contributor

jreback commented Apr 12, 2021

how does this help when this is a private method?

it doesn't until we decide on somewhere we want to expose it. I'm thinking we put an empty function in core.construction (though it might be nice to have a namespace akin to com for things like that)

ok so this doesn't address the original issue (e.g. categorical is not using this yet)?

@jbrockmendel
Copy link
Member Author

ok so this doesn't address the original issue (e.g. categorical is not using this yet)?

correct

@jreback jreback added this to the 1.3 milestone Apr 13, 2021
@jreback jreback merged commit 4c86d35 into pandas-dev:master Apr 13, 2021
@jreback
Copy link
Contributor

jreback commented Apr 13, 2021

thanks @jbrockmendel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants