ENH: Categorical.empty #40602

jbrockmendel · 2021-03-23T23:28:00Z

closes #xxxx
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

xref #39776
xref dask/fastparquet#576 (comment)

jbrockmendel · 2021-03-24T00:28:33Z

@jorisvandenbossche looks like TestCategoricalConcat::test_categorical_concat with ArrayManager is broken on master

jorisvandenbossche · 2021-03-24T14:54:13Z

> looks like TestCategoricalConcat::test_categorical_concat with ArrayManager is broken on master

But it's not failing in CI on master? (but it's a bit strange how this PR would impact that test ..)

jbrockmendel · 2021-03-24T14:59:49Z

> But it's not failing in CI on master? (but it's a bit strange how this PR would impact that test ..)

IIUC that file isn't enabled for the ArrayManager tests, so it isn't much of a problem that the test fails locally on master. The mystery is mainly why it was run on this particular CI build.

jorisvandenbossche · 2021-03-24T15:04:29Z

Ah, I think that's the same pytest mystery bug that I saw with the pytables tests. It's sometimes not skipping the first test of a file when using a marker for a full directory (in the __init__.py). For PyTables I solved it by adding the mark to each of the individual files.

jorisvandenbossche · 2021-03-24T15:07:06Z

#39612 will also solve that pytest-mark issue

jorisvandenbossche · 2021-03-24T15:11:16Z

On the actual PR: I very much would like to see such functionality. But:

- I think if adding it, it should be a general interface method (and not only added to a subset of our internal EAs), so xref API: ExtensionArray interface method to create an empty / all-NA array for a given dtype #39776
- I would add it as a method on the dtype, and not the array (you will typically have the dtype as a start, and then that avoids doing dtype.construct_array_type().empty(shape, dtype=dtype), where dtype.empty(shape) is easier)
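
To illustrate the point about starting from the dtype, here is a minimal sketch of the array-class route using only existing pandas calls (`construct_array_type`, `_from_sequence`). The zero-length array merely stands in for the proposed `empty`; `dtype.empty(shape)` is the suggestion under discussion, not an existing API.

```python
import pandas as pd

# The dtype instance carries the parametrization (here, the categories).
dtype = pd.CategoricalDtype(["a", "b"])

# Going through the array class still requires passing the dtype back in,
# because the class alone does not know the categories.
array_cls = dtype.construct_array_type()
zero_length = array_cls._from_sequence([], dtype=dtype)

print(array_cls.__name__)  # Categorical
print(len(zero_length))    # 0
```

The round trip dtype -> array class -> dtype again is the redundancy a dtype-level method would avoid.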

jbrockmendel · 2021-03-24T15:17:51Z

> I think if adding it, it should be a general interface method (and not only added to a subset of our internal EAs), so xref #39776

I agree, but don't have any bright ideas for a general-case implementation. If it has obj._can_hold_na we can use obj.take([-1]*length).

> I would add it as a method on the dtype, and not the array (you will typically have the dtype as a start, and then that avoids doing dtype.construct_array_type().empty(shape, dtype=dtype), where dtype.empty(shape) is easier)

I thought about this, but that leaves out DatetimeArray[naive] and TimedeltaArray. The workarounds for that aren't that bad though, so I'd be open to this.

(In general I think many EA methods would make more sense as EADtype methods, xref #40574.)

jbrockmendel · 2021-03-27T02:38:57Z

> If it has obj._can_hold_na we can use obj.take([-1]*length).

Tried this and got failures with ArrowBoolArray and StringArray

jreback · 2021-03-29T15:20:46Z

on making this a general purpose method (in this PR), is this possible?

I think this is ok on the array itself, this follows numpy convention.

pandas/tests/arrays/test_ndarray_backed.py

jorisvandenbossche · 2021-03-31T20:39:40Z

> I think this is ok on the array itself, this follows numpy convention.

Numpy doesn't have an empty() method (only a function)

> I thought about this, but that leaves out DatetimeArray[naive] and TimedeltaArray. The workarounds for that aren't that bad though, so I'd be open to this.

Ah, yes. Now, since those are not proper EAs, they IMO shouldn't direct the EA design, so if the workarounds are doable, I would personally still prefer it as a method on the dtype.
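
For context on why those two are "left out": a small check, using only public pandas and NumPy calls, showing that tz-naive DatetimeArray and TimedeltaArray report a plain NumPy dtype, so there is no ExtensionDtype instance on which to hang a dtype-level method. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype

naive = pd.DatetimeIndex(["2021-03-31"]).array             # DatetimeArray, tz-naive
aware = pd.DatetimeIndex(["2021-03-31"], tz="UTC").array   # DatetimeArray, tz-aware
deltas = pd.TimedeltaIndex(["1 day"]).array                # TimedeltaArray

# tz-naive datetimes and timedeltas are backed by numpy dtypes ...
naive_is_numpy = isinstance(naive.dtype, np.dtype)
deltas_is_numpy = isinstance(deltas.dtype, np.dtype)
# ... while tz-aware datetimes use DatetimeTZDtype, an ExtensionDtype.
aware_is_extension = isinstance(aware.dtype, ExtensionDtype)

print(naive_is_numpy, deltas_is_numpy, aware_is_extension)
```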

jorisvandenbossche · 2021-03-31T20:42:19Z

> > If it has obj._can_hold_na we can use obj.take([-1]*length).
>
> Tried this and got failures with ArrowBoolArray and StringArray

You can override the base one with custom implementations for the cases where it doesn't work.
At least for StringArray it seems to work for me, though: pd.array([], dtype="string").take([-1]*10, allow_fill=True)
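
The take-based fallback discussed in this exchange could be sketched roughly as follows. `all_na_from_dtype` is a hypothetical helper, not a pandas API, and it assumes the dtype's array type can build a zero-length array via `_from_sequence` and supports `take` with `allow_fill=True`; as noted above, some arrays may need their own override.

```python
import pandas as pd


def all_na_from_dtype(dtype, length):
    """Hypothetical helper: all-NA array of `length` for an extension dtype."""
    array_cls = dtype.construct_array_type()
    # Zero-length seed array that already has the target dtype.
    seed = array_cls._from_sequence([], dtype=dtype)
    # With allow_fill=True, an index of -1 means "insert the NA value".
    return seed.take([-1] * length, allow_fill=True)


ints = all_na_from_dtype(pd.Int64Dtype(), 4)
cats = all_na_from_dtype(pd.CategoricalDtype(["a", "b"]), 4)

print(len(ints), len(cats))
```

Taking only `-1` indices from an empty array is allowed because no real element is ever looked up.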

jbrockmendel · 2021-03-31T20:46:03Z

> they IMO shouldn't direct the EA design, so if the workarounds are doable, I would personally still prefer it as a method on the dtype.

Fair enough. I'll give it a go.

jbrockmendel · 2021-04-05T22:05:10Z

@jorisvandenbossche getting the ArrowExtensionArray test working is going to require making its _from_sequence not ignore the dtype arg. Can I get your help with that? (I'm fine xfailing that for now.)

jorisvandenbossche · 2021-04-06T19:27:53Z

Will take a look tomorrow

jorisvandenbossche · 2021-04-07T18:46:04Z

I might be missing something (I didn't run actual code / the tests), but you can just pass through the dtype, and handle it (cast if specified)? The only thing that's missing is a mapping of the ExtensionDtype to the arrow type:

--- a/pandas/tests/extension/arrow/arrays.py
+++ b/pandas/tests/extension/arrow/arrays.py
@@ -35,6 +35,7 @@ class ArrowBoolDtype(ExtensionDtype):
     kind = "b"
     name = "arrow_bool"
     na_value = pa.NULL
+    arrow_type = pa.bool_()
 
     @classmethod
     def construct_array_type(cls) -> type_t[ArrowBoolArray]:
@@ -59,6 +60,7 @@ class ArrowStringDtype(ExtensionDtype):
     kind = "U"
     name = "arrow_string"
     na_value = pa.NULL
+    arrow_type = pa.string()
 
     @classmethod
     def construct_array_type(cls) -> type_t[ArrowStringArray]:
@@ -76,8 +78,10 @@ class ArrowExtensionArray(OpsMixin, ExtensionArray):
     _data: pa.ChunkedArray
 
     @classmethod
-    def from_scalars(cls, values):
+    def from_scalars(cls, values, dtype=None):
         arr = pa.chunked_array([pa.array(np.asarray(values))])
+        if dtype is not None:
+            arr = arr.cast(dtype.arrow_type)
         return cls(arr)
 
     @classmethod
@@ -87,7 +91,7 @@ class ArrowExtensionArray(OpsMixin, ExtensionArray):
 
     @classmethod
     def _from_sequence(cls, scalars, dtype=None, copy=False):
-        return cls.from_scalars(scalars)
+        return cls.from_scalars(scalars, dtype=dtype)
 
     def __repr__(self):
         return f"{type(self).__name__}({repr(self._data)})"
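
The pattern in the diff above (build from scalars, then cast only when a dtype is given) can be sketched without pyarrow. `ToyArray` is a hypothetical NumPy-backed stand-in, not the ArrowExtensionArray from the test suite; `astype` plays the role that `arr.cast(dtype.arrow_type)` plays in the diff.

```python
import numpy as np


class ToyArray:
    """Hypothetical stand-in used only to illustrate dtype pass-through."""

    def __init__(self, values):
        self._data = values

    @classmethod
    def from_scalars(cls, values, dtype=None):
        arr = np.asarray(values)
        if dtype is not None:
            # Without this cast, the dtype argument would be silently ignored.
            arr = arr.astype(dtype)
        return cls(arr)

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls.from_scalars(scalars, dtype=dtype)


inferred = ToyArray._from_sequence([])
typed = ToyArray._from_sequence([], dtype=np.bool_)

print(inferred._data.dtype, typed._data.dtype)  # float64 bool
```

An empty input has nothing to infer a type from, which is why a constructor that ignores `dtype` cannot produce a correctly typed zero-length seed.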

jorisvandenbossche · 2021-04-07T18:50:33Z

To recap from above:

- I am -1 on adding it as a public method on the array class (it doesn't need the array class but the dtype instance, and it's not consistent with numpy)
- An aspect that I brought up in API: ExtensionArray interface method to create an empty / all-NA array for a given dtype #39776 is that we should specify if it expects an "uninitialized" array (like np.empty) or an all-missing array. Probably both are useful, and it could maybe be a keyword argument to ask for the all-missing?
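
The "uninitialized" versus "all-missing" distinction can be sketched with plain NumPy. `empty_float` and its `all_na` keyword are hypothetical names chosen for illustration; nothing in the thread settles what the keyword would be called.

```python
import numpy as np


def empty_float(length, all_na=False):
    """Hypothetical helper contrasting the two meanings of "empty"."""
    # np.empty allocates without initializing: the contents are arbitrary.
    out = np.empty(length, dtype="float64")
    if all_na:
        # All-missing variant: every slot is explicitly set to NaN.
        out[:] = np.nan
    return out


uninitialized = empty_float(4)
all_missing = empty_float(4, all_na=True)

print(all_missing)
```

Only the all-missing result has values a caller can rely on; the uninitialized one is meant to be overwritten before use.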

jbrockmendel · 2021-04-11T23:00:55Z

@jorisvandenbossche implemented the arrow suggestions, got:

__________________________________________ TestConstructors.test_empty ___________________________________________

self = <pandas.tests.extension.arrow.test_bool.TestConstructors object at 0x137094790>
dtype = <pandas.tests.extension.arrow.arrays.ArrowBoolDtype object at 0x1370945b0>

    def test_empty(self, dtype):
>       super().test_empty(dtype)

pandas/tests/extension/arrow/test_bool.py:87: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/extension/base/constructors.py:128: in test_empty
    result = cls._empty((4,), dtype=dtype)
pandas/core/arrays/base.py:1315: in _empty
    result = obj.take(taker, allow_fill=True)
pandas/tests/extension/arrow/arrays.py:155: in take
    result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
pandas/core/algorithms.py:1510: in take
    result = take_nd(
pandas/core/array_algos/take.py:100: in take_nd
    return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
pandas/core/series.py:859: in take
    nv.validate_take((), kwargs)
pandas/compat/numpy/function.py:76: in __call__
    validate_kwargs(fname, kwargs, self.defaults)
pandas/util/_validators.py:152: in validate_kwargs
    _check_for_invalid_keys(fname, kwargs, compat_args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

fname = 'take', kwargs = {'allow_fill': True, 'fill_value': <pyarrow.NullScalar: None>}
compat_args = {'mode': 'raise', 'out': None}

    def _check_for_invalid_keys(fname, kwargs, compat_args):
        """
        Checks whether 'kwargs' contains any keys that are not
        in 'compat_args' and raises a TypeError if there is one.
        """
        # set(dict) --> set of the dictionary's keys
        diff = set(kwargs) - set(compat_args)
    
        if diff:
            bad_arg = list(diff)[0]
>           raise TypeError(f"{fname}() got an unexpected keyword argument '{bad_arg}'")
E           TypeError: take() got an unexpected keyword argument 'allow_fill'

pandas/util/_validators.py:126: TypeError

jbrockmendel · 2021-04-11T23:02:01Z

updated to change .empty -> ._empty

jreback · 2021-04-12T12:38:26Z

how does this help when this is a private method?

jbrockmendel · 2021-04-12T16:14:42Z

> how does this help when this is a private method?

it doesn't until we decide on somewhere we want to expose it. I'm thinking we put an empty function in core.construction (though it might be nice to have a namespace akin to com for things like that)

jreback · 2021-04-12T16:18:33Z

> > how does this help when this is a private method?
>
> it doesn't until we decide on somewhere we want to expose it. I'm thinking we put an empty function in core.construction (though it might be nice to have a namespace akin to com for things like that)

ok so this doesn't address the original issue (e.g. categorical is not using this yet)?

jbrockmendel · 2021-04-12T16:30:14Z

> ok so this doesn't address the original issue (e.g. categorical is not using this yet)?

correct

jreback · 2021-04-13T14:17:34Z

thanks @jbrockmendel

ENH: Categorical.empty

967c8d0

Merge branch 'master' into enh-categorical-empty

7ccbddc

Merge branch 'master' into enh-categorical-empty

83dbbec

jreback added the Categorical Categorical Data Type label Mar 29, 2021

jreback reviewed Mar 29, 2021

pandas/tests/arrays/test_ndarray_backed.py (outdated)

jreback added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Mar 29, 2021

jbrockmendel added 3 commits March 29, 2021 09:17

Merge branch 'master' into enh-categorical-empty

e9e43b0

remove unused

9565f8f

Merge branch 'master' into enh-categorical-empty

0654fbf

Merge branch 'master' into enh-categorical-empty

1c20fcf

jbrockmendel added 4 commits April 5, 2021 15:48

ENH: implement EA.empty

0fc8215

Merge branch 'master' into enh-categorical-empty

af77718

Merge branch 'master' into enh-categorical-empty

d674661

post-merge fixup

ec77a14

Merge branch 'master' into enh-categorical-empty

6731ec1

empty -> _empty

7d8328b

jreback added this to the 1.3 milestone Apr 13, 2021

jreback merged commit 4c86d35 into pandas-dev:master Apr 13, 2021

jbrockmendel deleted the enh-categorical-empty branch April 13, 2021 14:25

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

ENH: Categorical.empty (pandas-dev#40602)

7563d06

jbrockmendel mentioned this pull request Dec 21, 2021

API: ExtensionArray interface method to create an empty / all-NA array for a given dtype #39776

Open
