-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: allow nan-likes in StringArray constructor #41412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. pls rebase. question for about should we accept NaT / Decimal(NaN) here though. I am kind of -0 on this.
I am also -1 on this. I certainly understand the reason to allow |
I am happy to remove The problem with pd.array is that it will cast non-string elements to strings, whereas the StringArray constructor will validate the inputs to make sure they are strings or pd.NA. This difference is important for #40687, where we don't want to cast the elements to strings if they are not strings(We want an error to be raised that we can catch). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews guys. I'm not sure how this is going to affect ArrowStringArray, but I think that this shouldn't affect it since it's constructed from pyarrow Arrays directly, and both _from_sequence(which pd.array uses) methods coerce.
status here? |
pandas/_libs/lib.pyx
Outdated
If False, existing na values will be used unchanged in the new array. | ||
coerce : {'all', 'null', 'non-null', None}, default 'all' | ||
Whether to coerce non-string elements to strings. | ||
- 'all' will convert null values and non-null non-string values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does 'all' do for null values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
converts them to na_value
.
Gonna circle back to this one soon in the next few weeks. Unfortunately, this is probably going to be a pain because I need to deal with 2D StringArrays :(. |
@jreback @jorisvandenbossche @jbrockmendel @simonjayhawkins |
if self._ndarray.dtype != "object": | ||
raise ValueError( | ||
"StringArray requires a sequence of strings or pandas.NA. Got " | ||
f"'{self._ndarray.dtype}' dtype instead." | ||
) | ||
try: | ||
lib.ensure_string_array( | ||
self._ndarray.ravel("K"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this is broken for 2D EAs even with the ravel.
@jbrockmendel Do you know how I can create a 2D StringArray to test this? Would the correct way to iterate over a 2D EA be using the PyArray_GETITEM
and PyArray_ITER_DATA
combo like the Validator
in libs does? Then I would replace values by PyArray_SETITEM
ing on the array?
(Not familiar at all with Numpy C API, so going to need some help here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know how I can create a 2D StringArray to test this?
pd.array(["foo", "bar"] * 5).reshape(5, 2)
Would the correct way to iterate over a 2D EA be using the PyArray_GETITEM and PyArray_ITER_DATA combo like the Validator in libs does?
Not sure I understand the question. The 2D EA has a working __iter__
method, but you shouldn't be passing any EA to any of the cython functions.
Then I would replace values by
PyArray_SETITEM
ing on the array?
I'm still getting the hang of this (intending to make array_to_timedelta64 handle 2D directly to avoid ravels), would advise NOT trying to handle it in the cython code.
|
||
@classmethod | ||
def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False): | ||
def _from_sequence( | ||
cls, scalars, *, dtype: Dtype | None = None, copy=False, coerce=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is coerce: bool
enough here?
this is like errors='coerce'
for coerce=True
and errors='raise'
for coerce=False
, i guess 'ignore' would be meaningless.
but I still think the errors=
keyword is better for flexiblity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will take a look soon-ish. I'm wary of adding keywords here
with pytest.raises(ValueError, match=msg): | ||
cls(np.array(["a", pd.NaT], dtype=object)) | ||
@pytest.mark.parametrize("na", [np.nan, np.float64("nan"), float("nan"), None, pd.NA]) | ||
def test_constructor_nan_like(na): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize over list as well (assume same result).
if self._ndarray.dtype != "object": | ||
raise ValueError( | ||
"StringArray requires a sequence of strings or pandas.NA. Got " | ||
f"'{self._ndarray.dtype}' dtype instead." | ||
) | ||
try: | ||
lib.ensure_string_array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this going through ensure_string_array instead of e.g. is_string_array? For the latter, the ravel would be unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_string_array will not convert the other nans(None, np.nan, etc.) to the correct na_value of pd.NA. FWIW, switching to ensure_string_array will also align us with _from_sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does any of this become simpler if done in conjunction with the edits in #45057
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to split this one up.
I'm thinking of maybe sticking to is_string_array and then doing another pass over the data to convert the not pd.NA nans, as a quick short term fix to unblock you. This will probably give a perf regression, but since is_string_array got sped up in #44495 on master(not 1.3.x), all I have to do is make the perf regression less than the perf improvement there, so that the regression is not user visible.
with pytest.raises( | ||
ValueError, | ||
match="coerce argument must be one of " | ||
"'all'|'strict-null'|'null'|'non-null'|None, not abcd", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the pipes here need backslashes?
def test_from_sequence_no_coerce_invalid(cls, values): | ||
with pytest.raises( | ||
ValueError, | ||
match="Element .* is not a string or valid null." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing whitespace at the end of this line (and the next)?
Big picture, is there any way to avoid adding a keyword to _from_sequence? Maybe a) use pd.array? b) make the less-strict behavior the only behavior? c) implement _from_sequence_not_strict like DTA/TDA, which is code smell but at least it'll be a matching code smell? |
I'm thinking about a variant of c. The use case for this was in #40687 (comment). pd.array doesn't work b/c it always casts stuff to string, even nans. The less-strict(casting non-strings) behavior is already the default behavior.
|
@lithomas1 you've made some progress here via other PRs, is this still active? |
@lithomas1 this looked ok, can you merge master |
Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have time to merge the main branch and address the review comments. |
If anyone is looking for something to do I'd just like to say I'm one user who would very much appreciate being able to use the new string dtype with arbitrary null dtypes. I'd pick this up if I had the skills! |
Looking at OP #30980 this was done to ensure that a double pass is not made over data, but I don't think I'm doing that here.
This is a precursor for #40687
Marking as needs discussion since it might be controversial which nan-likes to allow in the constructor.