API: allow nan-likes in StringArray constructor #41412

lithomas1 · 2021-05-10T23:21:07Z

closes API/ENH: Accept nan-likes in StringArray constructor #40839
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Looking at OP #30980 this was done to ensure that a double pass is not made over data, but I don't think I'm doing that here.
This is a precursor for #40687
Marking as needs discussion since it might be controversial which nan-likes to allow in the constructor.

pandas/_libs/lib.pyi

pandas/_libs/lib.pyx

jreback

Looks good; please rebase. One question, though: should we accept NaT / Decimal(NaN) here? I am kind of -0 on this.

pandas/_libs/lib.pyi

pandas/core/arrays/string_.py

doc/source/whatsnew/v1.3.0.rst

pandas/core/arrays/string_.py

jorisvandenbossche · 2021-05-22T12:58:16Z

> One question, though: should we accept NaT / Decimal(NaN) here? I am kind of -0 on this.

I am also -1 on this. I certainly understand the reason to allow np.nan since that's what we generally use now (although I am personally not sure why StringArray(..) needs to accept this, if the more flexible pd.array(..) already does this), but I would not start accepting NaT/Decimal(NaN).

lithomas1 · 2021-05-22T17:52:40Z

I am happy to remove NaT and Decimal('NaN'), as I really only want None and np.nan to be accepted in addition to pd.NA.

The problem with pd.array is that it will cast non-string elements to strings, whereas the StringArray constructor validates the inputs to make sure they are strings or pd.NA. This difference is important for #40687, where we don't want to cast elements to strings if they are not strings (we want an error to be raised that we can catch).
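
A minimal sketch of the difference described above, assuming a pandas version in which `pd.arrays.StringArray` validates its input and `pd.array(..., dtype="string")` goes through `_from_sequence`. The input values are illustrative and are not taken from the PR.

```python
import numpy as np
import pandas as pd

# Illustrative input: an object array with one non-string element.
values = np.array(["a", 1, "b"], dtype=object)

# pd.array with the "string" dtype coerces, so the integer 1 is cast
# to the string "1" instead of being rejected.
coerced = pd.array(values, dtype="string")
print(list(coerced))

# The StringArray constructor validates instead of casting, so the same
# input raises an error that a caller can catch.
try:
    pd.arrays.StringArray(values)
    constructor_raised = False
except ValueError as err:
    constructor_raised = True
    print(f"ValueError: {err}")
```

The `try`/`except` branch is the behaviour the comment above relies on: a caller can attempt strict construction first and fall back to something else when it fails.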

lithomas1

Thanks for the reviews, guys. I'm not sure how this is going to affect ArrowStringArray, but I think it shouldn't, since ArrowStringArray is constructed from pyarrow Arrays directly and both _from_sequence methods (which pd.array uses) coerce.

pandas/core/arrays/string_.py

doc/source/whatsnew/v1.3.0.rst

pandas/_libs/lib.pyx

jreback · 2021-10-04T00:24:15Z

status here?

pandas/core/arrays/string_.py

jreback · 2021-10-17T22:46:41Z

pandas/_libs/lib.pyx

-        If False, existing na values will be used unchanged in the new array.
+    coerce : {'all', 'null', 'non-null', None}, default 'all'
+        Whether to coerce non-string elements to strings.
+            - 'all' will convert null values and non-null non-string values.


what does 'all' do for null values?

converts them to na_value.

pandas/_libs/lib.pyx

lithomas1 · 2021-11-27T01:24:23Z

Gonna circle back to this one soon in the next few weeks. Unfortunately, this is probably going to be a pain because I need to deal with 2D StringArrays :(.

lithomas1 · 2021-12-28T01:41:23Z

@jreback @jorisvandenbossche @jbrockmendel @simonjayhawkins
This is ready for another look now. The major change is that we now also accept float("nan"), in addition to pd.NA, np.nan, and None. Originally only those three were accepted, but checking for np.nan alone did not catch all the NumPy NaNs (for example, np.float64("nan")). We are not using np.isnan because np.datetime64("nat") would also be accepted, and that is not desirable.
I'll try to benchmark again soon.
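
A rough illustration of the trade-off described above. The commit list says the PR uses `util.is_nan`; the helper below, `is_float_nan`, is a hypothetical pure-Python stand-in for that idea, not the pandas function.

```python
import numpy as np


def is_float_nan(value) -> bool:
    """Hypothetical helper: True only for floating-point NaN values."""
    # NaN is the only float that compares unequal to itself; the
    # isinstance guard keeps datetime-like NaT values out.
    return isinstance(value, (float, np.floating)) and bool(value != value)


# An identity check against np.nan misses other NaN objects.
print(float("nan") is np.nan)       # False
print(np.float64("nan") is np.nan)  # False

# np.isnan is broader than wanted: per the comment above, it also
# accepts np.datetime64("nat").
print(np.isnan(np.datetime64("NaT")))

# The type-restricted check accepts float NaNs and rejects NaT.
for candidate in (np.nan, float("nan"), np.float64("nan"), np.datetime64("NaT")):
    print(type(candidate).__name__, is_float_nan(candidate))
```

None and pd.NA are not floats, so a validator built on this check would still need to test for them separately.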

lithomas1 · 2021-12-28T01:49:49Z

pandas/core/arrays/string_.py

        if self._ndarray.dtype != "object":
            raise ValueError(
                "StringArray requires a sequence of strings or pandas.NA. Got "
                f"'{self._ndarray.dtype}' dtype instead."
            )
+        try:
+            lib.ensure_string_array(
+                self._ndarray.ravel("K"),


I'm pretty sure this is broken for 2D EAs, even with the ravel.
@jbrockmendel Do you know how I can create a 2D StringArray to test this? Would the correct way to iterate over a 2D EA be to use the PyArray_GETITEM and PyArray_ITER_DATA combo, like the Validator in libs does? Then I would replace values by calling PyArray_SETITEM on the array?
(I'm not at all familiar with the NumPy C API, so I'm going to need some help here.)

> Do you know how I can create a 2D StringArray to test this?

pd.array(["foo", "bar"] * 5).reshape(5, 2)

> Would the correct way to iterate over a 2D EA be using the PyArray_GETITEM and PyArray_ITER_DATA combo like the Validator in libs does?

Not sure I understand the question. The 2D EA has a working __iter__ method, but you shouldn't be passing any EA to any of the cython functions.

> Then I would replace values by PyArray_SETITEMing on the array?

I'm still getting the hang of this myself (I intend to make array_to_timedelta64 handle 2D directly to avoid ravels), but I would advise NOT trying to handle it in the Cython code.
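
One way to follow that advice is to keep the element-wise routine one-dimensional and do the flattening and reshaping in Python. The sketch below uses plain NumPy; `validate_strings_1d` and `validate_strings` are hypothetical names, and the accepted nulls (None, pd.NA, float NaN) follow the rules discussed in this thread rather than the PR's actual code.

```python
import numpy as np
import pandas as pd


def validate_strings_1d(flat: np.ndarray) -> np.ndarray:
    """Hypothetical 1-D pass: keep strings, map nan-likes to pd.NA."""
    out = np.empty(len(flat), dtype=object)
    for i, item in enumerate(flat):
        if isinstance(item, str):
            out[i] = item
        elif item is None or item is pd.NA:
            out[i] = pd.NA
        elif isinstance(item, float) and item != item:
            # Float NaN (np.nan, float("nan"), np.float64("nan")).
            out[i] = pd.NA
        else:
            raise ValueError(f"Element {item!r} is not a string or valid null.")
    return out


def validate_strings(values: np.ndarray) -> np.ndarray:
    """Flatten in C order, validate in 1-D, then restore the shape."""
    flat = values.ravel()
    return validate_strings_1d(flat).reshape(values.shape)


two_d = np.array([["a", None], [np.nan, "b"]], dtype=object)
print(validate_strings(two_d))
```

Flattening and reshaping in the same (C) order keeps each element in its original position, so the 1-D routine never needs to know about the second dimension.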

jreback · 2021-12-30T00:27:15Z

pandas/core/arrays/string_.py


    @classmethod
-    def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False):
+    def _from_sequence(
+        cls, scalars, *, dtype: Dtype | None = None, copy=False, coerce=True


is coerce: bool enough here?

This is like errors='coerce' for coerce=True and errors='raise' for coerce=False; I guess 'ignore' would be meaningless.

But I still think the errors= keyword is better for flexibility.

will take a look soon-ish. I'm wary of adding keywords here

jreback · 2021-12-30T00:27:52Z

pandas/tests/arrays/string_/test_string.py

-    with pytest.raises(ValueError, match=msg):
-        cls(np.array(["a", pd.NaT], dtype=object))
+@pytest.mark.parametrize("na", [np.nan, np.float64("nan"), float("nan"), None, pd.NA])
+def test_constructor_nan_like(na):


can you parameterize over list as well (assume same result).

pandas/_libs/lib.pyx

jbrockmendel · 2021-12-30T16:44:51Z

pandas/core/arrays/string_.py

        if self._ndarray.dtype != "object":
            raise ValueError(
                "StringArray requires a sequence of strings or pandas.NA. Got "
                f"'{self._ndarray.dtype}' dtype instead."
            )
+        try:
+            lib.ensure_string_array(


why is this going through ensure_string_array instead of e.g. is_string_array? For the latter, the ravel would be unnecessary.

is_string_array will not convert the other NaNs (None, np.nan, etc.) to the correct na_value of pd.NA. FWIW, switching to ensure_string_array will also align us with _from_sequence.

Does any of this become simpler if done in conjunction with the edits in #45057?

I will try to split this one up.

I'm thinking of sticking with is_string_array and then making another pass over the data to convert the NaNs that are not pd.NA, as a quick short-term fix to unblock you. This will probably cause a perf regression, but since is_string_array got sped up in #44495 on master (not 1.3.x), all I have to do is keep the regression smaller than that improvement, so that it is not user-visible.

jbrockmendel · 2021-12-30T16:47:34Z

pandas/tests/arrays/string_/test_string.py

+    with pytest.raises(
+        ValueError,
+        match="coerce argument must be one of "
+        "'all'|'strict-null'|'null'|'non-null'|None, not abcd",


do the pipes here need backslashes?
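
For context on this question: `pytest.raises(match=...)` applies the pattern with `re.search`, so an unescaped `|` is treated as regex alternation. A small sketch using only the standard library; the `unrelated` message is made up for illustration.

```python
import re

# The message text from the test above, used verbatim as a pattern.
message = (
    "coerce argument must be one of "
    "'all'|'strict-null'|'null'|'non-null'|None, not abcd"
)

# Unescaped, each "|" splits the pattern into alternatives, so any text
# containing just one alternative (here "'null'") already matches.
unrelated = "got 'null' where a string was expected"
print(re.search(message, unrelated) is not None)  # True

# Escaping the pattern makes the pipes literal characters.
escaped = re.escape(message)
print(re.search(escaped, unrelated) is not None)  # False
print(re.search(escaped, message) is not None)    # True
```

The unescaped pattern still matches the real error message (through its first alternative), which is why such a test passes, but it would also pass for many unintended messages.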

jbrockmendel · 2021-12-30T16:49:06Z

pandas/tests/arrays/string_/test_string.py

+def test_from_sequence_no_coerce_invalid(cls, values):
+    with pytest.raises(
+        ValueError,
+        match="Element .* is not a string or valid null."


missing whitespace at the end of this line (and the next)?

jbrockmendel · 2021-12-30T16:58:30Z

Big picture, is there any way to avoid adding a keyword to _from_sequence? Maybe a) use pd.array? b) make the less-strict behavior the only behavior? c) implement _from_sequence_not_strict like DTA/TDA, which is code smell but at least it'll be a matching code smell?

lithomas1 · 2021-12-31T17:08:06Z

> Big picture, is there any way to avoid adding a keyword to _from_sequence? Maybe a) use pd.array? b) make the less-strict behavior the only behavior? c) implement _from_sequence_not_strict like DTA/TDA, which is code smell but at least it'll be a matching code smell?

I'm thinking about a variant of c. The use case for this was in #40687 (comment).

pd.array doesn't work because it always casts everything to strings, even NaNs. The less strict behavior (casting non-strings) is already the default.
e.g.

>>> pd.array(np.array(["a", np.nan, "b"]))
<StringArray>
['a', 'nan', 'b']
Length: 3, dtype: string
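
One detail about this particular input: `np.array(["a", np.nan, "b"])` is created without `dtype=object`, so NumPy itself stores the string "nan" before pandas sees the data. A small sketch of both cases, assuming a pandas version where `dtype="string"` coerces through `_from_sequence` as described earlier in the thread:

```python
import numpy as np
import pandas as pd

# Without dtype=object, NumPy picks a unicode dtype and stringifies np.nan.
as_str = np.array(["a", np.nan, "b"])
print(as_str.dtype.kind, as_str.tolist())  # U ['a', 'nan', 'b']

# With dtype=object the NaN survives as a float, and the coercing path
# converts it to a missing value instead of the string "nan".
as_obj = np.array(["a", np.nan, "b"], dtype=object)
result = pd.array(as_obj, dtype="string")
print(pd.isna(result).tolist())  # [False, True, False]
```

Non-string, non-null elements are still cast to strings on this path, which is the behaviour the comment above wants to avoid.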

jbrockmendel · 2022-02-02T04:16:03Z

@lithomas1 you've made some progress here via other PRs, is this still active?

jreback · 2022-03-06T23:20:17Z

@lithomas1 this looked ok, can you merge master

mroeschke · 2022-04-24T03:19:17Z

Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have time to merge the main branch and address the review comments.

shortorian · 2022-05-12T18:46:06Z

If anyone is looking for something to do I'd just like to say I'm one user who would very much appreciate being able to use the new string dtype with arbitrary null dtypes. I'd pick this up if I had the skills!

lithomas1 added 2 commits May 10, 2021 16:19

API: allow nan-likes in StringArray constructor

3e1784d

Revert weird changes & Fix stuff

96ff1da

lithomas1 added the ExtensionArray, Missing-data, and Strings labels May 11, 2021

lithomas1 requested review from jreback, TomAugspurger, simonjayhawkins and jorisvandenbossche May 11, 2021 03:31

lithomas1 marked this pull request as ready for review May 11, 2021 04:08

lithomas1 added the Needs Discussion label May 11, 2021

Remove failing test

418e1d2

jreback requested changes May 11, 2021

pandas/_libs/lib.pyi (outdated)

pandas/_libs/lib.pyx (outdated)

lithomas1 added 3 commits May 19, 2021 16:23

Changes from code review

25a6c4d

Merge branch 'master' of https://github.com/pandas-dev/pandas into stringarray-nan

47d68f7

typo

8257dbd

lithomas1 requested a review from jreback May 20, 2021 22:27

jreback added this to the 1.3 milestone May 21, 2021

jreback requested changes May 21, 2021

pandas/_libs/lib.pyi (outdated)

pandas/core/arrays/string_.py (outdated)

Update lib.pyi

922436a

simonjayhawkins reviewed May 22, 2021

doc/source/whatsnew/v1.3.0.rst (outdated)

pandas/core/arrays/string_.py

lithomas1 commented May 22, 2021

pandas/core/arrays/string_.py

pandas/core/arrays/string_.py (outdated)

doc/source/whatsnew/v1.3.0.rst (outdated)

jreback requested changes May 25, 2021

pandas/_libs/lib.pyx (outdated)

lithomas1 and others added 5 commits May 29, 2021 11:03

Update lib.pyx

2f28086

Update lib.pyx

3ee2198

Merge branch 'master' of https://github.com/pandas-dev/pandas into stringarray-nan

9426a52

Updates

3ee55f2

Update lib.pyx

fe4981a

lithomas1 added 2 commits October 4, 2021 07:32

Merge branch 'master' into stringarray-nan

889829a

typo

358000f

lithomas1 mentioned this pull request Oct 16, 2021

ENH: Add nullable dtypes to read_csv #40687

Closed

4 tasks

Merge branch 'master' into stringarray-nan

c649b1d

jreback requested changes Oct 17, 2021

alexreg mentioned this pull request Oct 23, 2021

BUG: astype("str") converts NA values to strings #44156

Closed

3 tasks

Merge branch 'master' into stringarray-nan

5e5aa9c

lithomas1 removed the Needs Discussion label Nov 27, 2021

lithomas1 added 4 commits December 17, 2021 20:19

Merge branch 'master' into stringarray-nan

eb7d8f2

Merge branch 'master' into stringarray-nan

2426319

address comments

20817a7

accept any float nan w/ util.is_nan

33d8f9a

lithomas1 commented Dec 28, 2021

lithomas1 requested a review from jreback December 29, 2021 03:12

lithomas1 mentioned this pull request Dec 29, 2021

PERF: avoid copies in lib.infer_dtype #45057

Merged

4 tasks

jreback requested changes Dec 30, 2021

jbrockmendel reviewed Dec 30, 2021

pandas/_libs/lib.pyx

jbrockmendel reviewed Dec 30, 2021

mroeschke closed this Apr 24, 2022
