Skip to content

API: allow nan-likes in StringArray constructor #41412

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 41 commits into from

Conversation

lithomas1
Copy link
Member

@lithomas1 lithomas1 commented May 10, 2021

Looking at OP #30980 this was done to ensure that a double pass is not made over data, but I don't think I'm doing that here.
This is a precursor for #40687
Marking as needs discussion since it might be controversial which nan-likes to allow in the constructor.

@lithomas1 lithomas1 added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data labels May 11, 2021
@lithomas1 lithomas1 marked this pull request as ready for review May 11, 2021 04:08
@lithomas1 lithomas1 added the Needs Discussion Requires discussion from core team before further action label May 11, 2021
@lithomas1 lithomas1 requested a review from jreback May 20, 2021 22:27
@jreback jreback added this to the 1.3 milestone May 21, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. pls rebase. question for about should we accept NaT / Decimal(NaN) here though. I am kind of -0 on this.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 22, 2021

question for about should we accept NaT / Decimal(NaN) here though. I am kind of -0 on this.

I am also -1 on this. I certainly understand the reason to allow np.nan since that's what we generally use now (although I am personally not sure why StringArray(..) needs to accept this, if the more flexible pd.array(..) already does this), but I would not start accepting NaT/Decimal(NaN).

@lithomas1
Copy link
Member Author

I am happy to remove NaT and Decimal('NaN'), as I really only want None and np.nan to be accepted in addition to pd.NA.

The problem with pd.array is that it will cast non-string elements to strings, whereas the StringArray constructor will validate the inputs to make sure they are strings or pd.NA. This difference is important for #40687, where we don't want to cast the elements to strings if they are not strings(We want an error to be raised that we can catch).

Copy link
Member Author

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews guys. I'm not sure how this is going to affect ArrowStringArray, but I think that this shouldn't affect it since it's constructed from pyarrow Arrays directly, and both _from_sequence(which pd.array uses) methods coerce.

@jreback
Copy link
Contributor

jreback commented Oct 4, 2021

status here?

If False, existing na values will be used unchanged in the new array.
coerce : {'all', 'null', 'non-null', None}, default 'all'
Whether to coerce non-string elements to strings.
- 'all' will convert null values and non-null non-string values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does 'all' do for null values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

converts them to na_value.

@lithomas1 lithomas1 removed the Needs Discussion Requires discussion from core team before further action label Nov 27, 2021
@lithomas1
Copy link
Member Author

Gonna circle back to this one soon in the next few weeks. Unfortunately, this is probably going to be a pain because I need to deal with 2D StringArrays :(.

@lithomas1
Copy link
Member Author

lithomas1 commented Dec 28, 2021

@jreback @jorisvandenbossche @jbrockmendel @simonjayhawkins
This is ready for another look now. The major change now is that we also accept float("nan") now in addition to pd.NA, np.nan, and None now. Originally, only pd.NA, np.nan, and None were accepted, but np.nan wasn't catching all the numpy nans. We are not using np.isnan because np.datetime64("nat") would also be accepted and that is not desirable.
I'll try to benchmark again soon.

if self._ndarray.dtype != "object":
raise ValueError(
"StringArray requires a sequence of strings or pandas.NA. Got "
f"'{self._ndarray.dtype}' dtype instead."
)
try:
lib.ensure_string_array(
self._ndarray.ravel("K"),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this is broken for 2D EAs even with the ravel.
@jbrockmendel Do you know how I can create a 2D StringArray to test this? Would the correct way to iterate over a 2D EA be using the PyArray_GETITEM and PyArray_ITER_DATA combo like the Validator in libs does? Then I would replace values by PyArray_SETITEMing on the array?
(Not familiar at all with Numpy C API, so going to need some help here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how I can create a 2D StringArray to test this?

pd.array(["foo", "bar"] * 5).reshape(5, 2)

Would the correct way to iterate over a 2D EA be using the PyArray_GETITEM and PyArray_ITER_DATA combo like the Validator in libs does?

Not sure I understand the question. The 2D EA has a working __iter__ method, but you shouldn't be passing any EA to any of the cython functions.

Then I would replace values by PyArray_SETITEM ing on the array?

I'm still getting the hang of this (intending to make array_to_timedelta64 handle 2D directly to avoid ravels), would advise NOT trying to handle it in the cython code.


@classmethod
def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False):
def _from_sequence(
cls, scalars, *, dtype: Dtype | None = None, copy=False, coerce=True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is coerce: bool enough here?

this is like errors='coerce' for coerce=True and errors='raise' for coerce=False, i guess 'ignore' would be meaningless.

but I still think the errors= keyword is better for flexiblity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will take a look soon-ish. I'm wary of adding keywords here

with pytest.raises(ValueError, match=msg):
cls(np.array(["a", pd.NaT], dtype=object))
@pytest.mark.parametrize("na", [np.nan, np.float64("nan"), float("nan"), None, pd.NA])
def test_constructor_nan_like(na):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize over list as well (assume same result).

if self._ndarray.dtype != "object":
raise ValueError(
"StringArray requires a sequence of strings or pandas.NA. Got "
f"'{self._ndarray.dtype}' dtype instead."
)
try:
lib.ensure_string_array(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this going through ensure_string_array instead of e.g. is_string_array? For the latter, the ravel would be unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_string_array will not convert the other nans(None, np.nan, etc.) to the correct na_value of pd.NA. FWIW, switching to ensure_string_array will also align us with _from_sequence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does any of this become simpler if done in conjunction with the edits in #45057

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to split this one up.

I'm thinking of maybe sticking to is_string_array and then doing another pass over the data to convert the not pd.NA nans, as a quick short term fix to unblock you. This will probably give a perf regression, but since is_string_array got sped up in #44495 on master(not 1.3.x), all I have to do is make the perf regression less than the perf improvement there, so that the regression is not user visible.

with pytest.raises(
ValueError,
match="coerce argument must be one of "
"'all'|'strict-null'|'null'|'non-null'|None, not abcd",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the pipes here need backslashes?

def test_from_sequence_no_coerce_invalid(cls, values):
with pytest.raises(
ValueError,
match="Element .* is not a string or valid null."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing whitespace at the end of this line (and the next)?

@jbrockmendel
Copy link
Member

Big picture, is there any way to avoid adding a keyword to _from_sequence? Maybe a) use pd.array? b) make the less-strict behavior the only behavior? c) implement _from_sequence_not_strict like DTA/TDA, which is code smell but at least it'll be a matching code smell?

@lithomas1
Copy link
Member Author

lithomas1 commented Dec 31, 2021

Big picture, is there any way to avoid adding a keyword to _from_sequence? Maybe a) use pd.array? b) make the less-strict behavior the only behavior? c) implement _from_sequence_not_strict like DTA/TDA, which is code smell but at least it'll be a matching code smell?

I'm thinking about a variant of c. The use case for this was in #40687 (comment).

pd.array doesn't work b/c it always casts stuff to string, even nans. The less-strict(casting non-strings) behavior is already the default behavior.
e.g.

>>> pd.array(np.array(["a", np.nan, "b"]))
<StringArray>
['a', 'nan', 'b']
Length: 3, dtype: string

@jbrockmendel
Copy link
Member

@lithomas1 you've made some progress here via other PRs, is this still active?

@jreback
Copy link
Contributor

jreback commented Mar 6, 2022

@lithomas1 this looked ok, can you merge master

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have time to merge the main branch and address the review comments.

@mroeschke mroeschke closed this Apr 24, 2022
@shortorian
Copy link

If anyone is looking for something to do I'd just like to say I'm one user who would very much appreciate being able to use the new string dtype with arbitrary null dtypes. I'd pick this up if I had the skills!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API/ENH: Accept nan-likes in StringArray constructor
8 participants