Initial Backport of string changes for 2.3 release #59513

Merged
merged 52 commits into from
Oct 9, 2024
b1b8eed
PDEP-14: Dedicated string data type for pandas 3.0 (#58551)
jorisvandenbossche Jul 24, 2024
5778049
TST / string dtype: add env variable to enable future_string and add …
jorisvandenbossche Jul 26, 2024
a494ed8
REF (string dtype): rename using_pyarrow_string_dtype to using_string…
jorisvandenbossche Jul 26, 2024
06dbb7a
TST (string dtype): clean-up xpasssing tests with future string dtype…
jorisvandenbossche Jul 27, 2024
925c21c
String dtype: rename the storage options and add `na_value` keyword i…
jorisvandenbossche Jul 29, 2024
431b246
TST (string dtype): xfail all currently failing tests with future.inf…
WillAyd Sep 20, 2024
6882ef9
TST (string dtype): follow-up on GH-59329 fixing new xfails (#59352)
jorisvandenbossche Jul 30, 2024
99ebd18
TST (string dtype): change any_string_dtype fixture to use actual dty…
jorisvandenbossche Jul 31, 2024
1566042
TST (string dtype): remove usage of arrow_string_storage fixture (#59…
jorisvandenbossche Jul 31, 2024
1d77d0e
TST (string dtype): replace string_storage fixture with explicit stor…
jorisvandenbossche Jul 31, 2024
2465a6d
String dtype: restrict options.mode.string_storage to python|pyarrow …
jorisvandenbossche Aug 1, 2024
35ebe68
API/TST: expand tests for string any/all reduction + fix pyarrow-base…
jorisvandenbossche Aug 6, 2024
463fd91
String dtype: implement object-dtype based StringArray variant with N…
WillAyd Aug 14, 2024
397cb09
REF (string dtype): de-duplicate _str_map methods (#59443)
WillAyd Aug 14, 2024
dd2680c
String dtype: use 'str' string alias and representation for NaN-varia…
WillAyd Sep 20, 2024
a9fd6f1
String dtype: fix alignment sorting in case of python storage (#59448)
jorisvandenbossche Aug 8, 2024
bf7fb01
TST (string dtype): add test build with future strings enabled withou…
WillAyd Aug 14, 2024
81850c8
REF (string dtype): de-duplicate _str_map (2) (#59451)
jbrockmendel Aug 9, 2024
078c5a0
REF (string): de-duplicate str_map_nan_semantics (#59464)
jbrockmendel Aug 9, 2024
fdbd473
BUG (string dtype): convert dictionary input to materialized string a…
jorisvandenbossche Aug 12, 2024
2346acf
String dtype: fix convert_dtypes() to convert NaN-string to NA-string…
jorisvandenbossche Aug 12, 2024
1bd3ce8
String dtype: honor mode.string_storage option (and change default to…
jorisvandenbossche Aug 12, 2024
7e50b16
BUG (string): ArrowEA comparisons with mismatched types (#59505)
jbrockmendel Aug 13, 2024
fa14a19
TST (string dtype): clean up construction of expected string arrays (…
jorisvandenbossche Aug 14, 2024
036e9da
TST (string dtype): clean up construction of expected string arrays (…
WillAyd Aug 22, 2024
4d26bed
TST (string dtype): fix IO dtype_backend tests for storage of str dty…
WillAyd Aug 22, 2024
31153c1
REF (string): Move StringArrayNumpySemantics methods to base class (#…
jbrockmendel Aug 14, 2024
721bf1e
REF (string): remove _str_na_value (#59515)
jbrockmendel Aug 15, 2024
ceee52d
REF (string): move ArrowStringArrayNumpySemantics methods to base cla…
jbrockmendel Aug 15, 2024
38f5b61
API (string): return str dtype for .dt methods, DatetimeIndex methods…
jbrockmendel Aug 16, 2024
a35481f
Pick required fix from 2542674ee9 #56709
WillAyd Aug 27, 2024
172af49
Pick required fix from f4232e7 #58006
WillAyd Aug 22, 2024
7946df1
Pick required fix from #55901 and #59581
WillAyd Sep 20, 2024
6909c47
Remove .pre-commit check for pytest ref #56671
WillAyd Aug 22, 2024
b70cd48
Skip niche issue
WillAyd Aug 22, 2024
1718e4b
Add required skip from #58467
WillAyd Aug 27, 2024
3467d26
Remove tests that will fail without backport of #58437
WillAyd Sep 20, 2024
9142e5e
additional test fixes (for tests that changed or no longer exist on m…
jorisvandenbossche Sep 20, 2024
b61bd23
String dtype: still return nullable NA-variant in object inference (`…
jorisvandenbossche Aug 21, 2024
c3d3980
Enable CoW in the string test build
jorisvandenbossche Sep 23, 2024
e3728c7
Skip test if pyarrow not installed in test_numeric_only
WillAyd Sep 24, 2024
732aa90
pick out stringarray keepdims changes from #59234
lithomas1 Sep 9, 2024
66e26d1
Fix: avoid object dtype inference warning in to_datetime
jorisvandenbossche Oct 2, 2024
e9806c1
xfail tests that trigger dtype inference warnings
jorisvandenbossche Oct 2, 2024
db9aa77
avoid dtype inference warnings by removing explicit dtype=object
jorisvandenbossche Oct 2, 2024
b3257e7
un-xfail tests for replace/fillna downcasting
jorisvandenbossche Oct 2, 2024
cecef0e
xfail tests triggering empty concat warning
jorisvandenbossche Oct 2, 2024
4c0d118
Update xfails for 2.3.x
jorisvandenbossche Oct 2, 2024
fc6bd39
Fix string dtype comparison in value_counts dtype inference deprecation
jorisvandenbossche Oct 2, 2024
bae9be1
string[pyarrow_numpy] -> str
jorisvandenbossche Oct 2, 2024
94b797d
Fix cow ref tracking in replace with list and regex
jorisvandenbossche Oct 3, 2024
a10c5c0
suppress pylint errors
jorisvandenbossche Oct 3, 2024
6 changes: 6 additions & 0 deletions .github/actions/setup-conda/action.yml
@@ -16,3 +16,9 @@ runs:
condarc-file: ci/.condarc
cache-environment: true
cache-downloads: true

- name: Uninstall pyarrow
if: ${{ env.REMOVE_PYARROW == '1' }}
run: |
micromamba remove -y pyarrow
shell: bash -el {0}
15 changes: 13 additions & 2 deletions .github/workflows/unit-tests.yml
@@ -29,6 +29,7 @@ jobs:
env_file: [actions-39.yaml, actions-310.yaml, actions-311.yaml, actions-312.yaml]
# Prevent the include jobs from overriding other jobs
pattern: [""]
pandas_future_infer_string: ["0"]
include:
- name: "Downstream Compat"
env_file: actions-311-downstream_compat.yaml
@@ -85,6 +86,14 @@ jobs:
env_file: actions-39.yaml
pattern: "not slow and not network and not single_cpu"
pandas_copy_on_write: "warn"
- name: "Future infer strings"
env_file: actions-312.yaml
pandas_future_infer_string: "1"
pandas_copy_on_write: "1"
- name: "Future infer strings (without pyarrow)"
env_file: actions-311.yaml
pandas_future_infer_string: "1"
pandas_copy_on_write: "1"
- name: "Pypy"
env_file: actions-pypy-39.yaml
pattern: "not slow and not network and not single_cpu"
@@ -103,16 +112,18 @@ jobs:
LANG: ${{ matrix.lang || 'C.UTF-8' }}
LC_ALL: ${{ matrix.lc_all || '' }}
PANDAS_COPY_ON_WRITE: ${{ matrix.pandas_copy_on_write || '0' }}
PANDAS_CI: ${{ matrix.pandas_ci || '1' }}
PANDAS_CI: '1'
PANDAS_FUTURE_INFER_STRING: ${{ matrix.pandas_future_infer_string || '0' }}
TEST_ARGS: ${{ matrix.test_args || '' }}
PYTEST_WORKERS: ${{ matrix.pytest_workers || 'auto' }}
PYTEST_TARGET: ${{ matrix.pytest_target || 'pandas' }}
NPY_PROMOTION_STATE: ${{ matrix.env_file == 'actions-311-numpydev.yaml' && 'weak' || 'legacy' }}
# Clipboard tests
QT_QPA_PLATFORM: offscreen
REMOVE_PYARROW: ${{ matrix.name == 'Future infer strings (without pyarrow)' && '1' || '0' }}
concurrency:
# https://github.community/t/concurrecy-not-work-for-push/183068/7
group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.pattern }}-${{ matrix.extra_apt || '' }}-${{ matrix.pandas_copy_on_write || '' }}
group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.pattern }}-${{ matrix.extra_apt || '' }}-${{ matrix.pandas_copy_on_write || '' }}-${{ matrix.pandas_future_infer_string }}
cancel-in-progress: true

services:
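
The workflow changes above drive the new test builds through two environment variables. As a rough sketch of the logic they encode: the first helper mirrors the `REMOVE_PYARROW` expression (`matrix.name == '...' && '1' || '0'`), the second shows one plausible way a test session could read `PANDAS_FUTURE_INFER_STRING`. Both function names are hypothetical, and how pandas itself consumes the variable is not shown in this diff, so treat the `== "1"` check as an assumption.

```python
import os


def remove_pyarrow_flag(job_name: str) -> str:
    # Mirrors the workflow expression: only the job named
    # "Future infer strings (without pyarrow)" gets '1'.
    return "1" if job_name == "Future infer strings (without pyarrow)" else "0"


def future_infer_string_enabled(environ=None) -> bool:
    # Assumption: the feature is on only when the variable is exactly "1";
    # the workflow defaults it to '0' when the matrix entry is unset.
    env = os.environ if environ is None else environ
    return env.get("PANDAS_FUTURE_INFER_STRING", "0") == "1"
```

Passing a plain dict as `environ` keeps the helper testable without touching the real process environment.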
7 changes: 0 additions & 7 deletions .pre-commit-config.yaml
@@ -274,13 +274,6 @@ repos:
language: python
types: [rst]
files: ^doc/source/(development|reference)/
- id: unwanted-patterns-bare-pytest-raises
name: Check for use of bare pytest raises
language: python
entry: python scripts/validate_unwanted_patterns.py --validation-type="bare_pytest_raises"
types: [python]
files: ^pandas/tests/
exclude: ^pandas/tests/extension/
- id: unwanted-patterns-private-function-across-module
name: Check for use of private functions across modules
language: python
2 changes: 1 addition & 1 deletion pandas/_config/__init__.py
@@ -52,6 +52,6 @@ def using_nullable_dtypes() -> bool:
return _mode_options["nullable_dtypes"]


def using_pyarrow_string_dtype() -> bool:
def using_string_dtype() -> bool:
_mode_options = _global_config["future"]
return _mode_options["infer_string"]
6 changes: 3 additions & 3 deletions pandas/_libs/lib.pyx
@@ -37,7 +37,7 @@ from cython cimport (
floating,
)

from pandas._config import using_pyarrow_string_dtype
from pandas._config import using_string_dtype

from pandas._libs.missing import check_na_tuples_nonequal

@@ -2725,10 +2725,10 @@ def maybe_convert_objects(ndarray[object] objects,
seen.object_ = True

elif seen.str_:
if using_pyarrow_string_dtype() and is_string_array(objects, skipna=True):
if using_string_dtype() and is_string_array(objects, skipna=True):
from pandas.core.arrays.string_ import StringDtype

dtype = StringDtype(storage="pyarrow_numpy")
dtype = StringDtype(na_value=np.nan)
return dtype.construct_array_type()._from_sequence(objects, dtype=dtype)

seen.object_ = True
10 changes: 7 additions & 3 deletions pandas/_testing/__init__.py
@@ -14,6 +14,7 @@

import numpy as np

from pandas._config import using_string_dtype
from pandas._config.localization import (
can_set_locale,
get_locales,
@@ -110,7 +111,10 @@
ALL_FLOAT_DTYPES: list[Dtype] = [*FLOAT_NUMPY_DTYPES, *FLOAT_EA_DTYPES]

COMPLEX_DTYPES: list[Dtype] = [complex, "complex64", "complex128"]
STRING_DTYPES: list[Dtype] = [str, "str", "U"]
if using_string_dtype():
STRING_DTYPES: list[Dtype] = [str, "U"]
else:
STRING_DTYPES: list[Dtype] = [str, "str", "U"] # type: ignore[no-redef]
COMPLEX_FLOAT_DTYPES: list[Dtype] = [*COMPLEX_DTYPES, *FLOAT_NUMPY_DTYPES]

DATETIME64_DTYPES: list[Dtype] = ["datetime64[ns]", "M8[ns]"]
@@ -527,14 +531,14 @@ def shares_memory(left, right) -> bool:
if (
isinstance(left, ExtensionArray)
and is_string_dtype(left.dtype)
and left.dtype.storage in ("pyarrow", "pyarrow_numpy") # type: ignore[attr-defined]
and left.dtype.storage == "pyarrow" # type: ignore[attr-defined]
):
# https://github.com/pandas-dev/pandas/pull/43930#discussion_r736862669
left = cast("ArrowExtensionArray", left)
if (
isinstance(right, ExtensionArray)
and is_string_dtype(right.dtype)
and right.dtype.storage in ("pyarrow", "pyarrow_numpy") # type: ignore[attr-defined]
and right.dtype.storage == "pyarrow" # type: ignore[attr-defined]
):
right = cast("ArrowExtensionArray", right)
left_pa_data = left._pa_array
28 changes: 26 additions & 2 deletions pandas/_testing/asserters.py
@@ -593,13 +593,19 @@ def raise_assert_detail(

if isinstance(left, np.ndarray):
left = pprint_thing(left)
elif isinstance(left, (CategoricalDtype, NumpyEADtype, StringDtype)):
elif isinstance(left, (CategoricalDtype, NumpyEADtype)):
left = repr(left)
elif isinstance(left, StringDtype):
# TODO(infer_string) this special case could be avoided if we have
# a more informative repr https://github.com/pandas-dev/pandas/issues/59342
left = f"StringDtype(storage={left.storage}, na_value={left.na_value})"

if isinstance(right, np.ndarray):
right = pprint_thing(right)
elif isinstance(right, (CategoricalDtype, NumpyEADtype, StringDtype)):
elif isinstance(right, (CategoricalDtype, NumpyEADtype)):
right = repr(right)
elif isinstance(right, StringDtype):
right = f"StringDtype(storage={right.storage}, na_value={right.na_value})"

msg += f"""
[left]: {left}
@@ -805,6 +811,24 @@ def assert_extension_array_equal(
left_na, right_na, obj=f"{obj} NA mask", index_values=index_values
)

# Specifically for StringArrayNumpySemantics, validate here we have a valid array
if (
isinstance(left.dtype, StringDtype)
and left.dtype.storage == "python"
and left.dtype.na_value is np.nan
):
assert np.all(
[np.isnan(val) for val in left._ndarray[left_na]] # type: ignore[attr-defined]
), "wrong missing value sentinels"
if (
isinstance(right.dtype, StringDtype)
and right.dtype.storage == "python"
and right.dtype.na_value is np.nan
):
assert np.all(
[np.isnan(val) for val in right._ndarray[right_na]] # type: ignore[attr-defined]
), "wrong missing value sentinels"

left_valid = left[~left_na].to_numpy(dtype=object)
right_valid = right[~right_na].to_numpy(dtype=object)
if check_exact:
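
The check added to `assert_extension_array_equal` above verifies that, for the python-storage NaN variant, every masked slot actually holds NaN rather than some other missing-value object. A simplified, stdlib-only sketch of the same idea follows. It works on plain lists instead of `_ndarray`, the helper name is hypothetical, and unlike `np.isnan` it returns `False` for non-float sentinels instead of raising.

```python
import math


def missing_slots_are_nan(values, na_mask) -> bool:
    # Look only at positions flagged as missing and require a float NaN there.
    return all(
        isinstance(value, float) and math.isnan(value)
        for value, is_na in zip(values, na_mask)
        if is_na
    )
```

A list such as `["a", None]` with mask `[False, True]` fails this check, which is the kind of wrong sentinel the assertion in the diff is meant to surface.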
2 changes: 2 additions & 0 deletions pandas/compat/__init__.py
@@ -25,6 +25,7 @@
import pandas.compat.compressors
from pandas.compat.numpy import is_numpy_dev
from pandas.compat.pyarrow import (
HAS_PYARROW,
pa_version_under10p1,
pa_version_under11p0,
pa_version_under13p0,
@@ -190,6 +191,7 @@ def get_bz2_file() -> type[pandas.compat.compressors.BZ2File]:
"pa_version_under14p1",
"pa_version_under16p0",
"pa_version_under17p0",
"HAS_PYARROW",
"IS64",
"ISMUSL",
"PY310",
2 changes: 2 additions & 0 deletions pandas/compat/pyarrow.py
@@ -17,6 +17,7 @@
pa_version_under15p0 = _palv < Version("15.0.0")
pa_version_under16p0 = _palv < Version("16.0.0")
pa_version_under17p0 = _palv < Version("17.0.0")
HAS_PYARROW = True
except ImportError:
pa_version_under10p1 = True
pa_version_under11p0 = True
@@ -27,3 +28,4 @@
pa_version_under15p0 = True
pa_version_under16p0 = True
pa_version_under17p0 = True
HAS_PYARROW = False
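
The `HAS_PYARROW` flag added above follows the usual optional-dependency pattern: attempt the import once at module load and record the outcome. A minimal sketch of that pattern and one way a caller might use it is below. `pick_storage` is a hypothetical helper, and the "prefer pyarrow when installed" default is an assumption for illustration, not something this diff shows.

```python
try:
    import pyarrow  # noqa: F401

    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def pick_storage(requested=None) -> str:
    # Hypothetical helper: resolve a storage name from the import-time flag.
    if requested is None:
        # Assumed default: use pyarrow if it is importable, else python.
        return "pyarrow" if HAS_PYARROW else "python"
    if requested == "pyarrow" and not HAS_PYARROW:
        raise ImportError("pyarrow storage requested but pyarrow is not installed")
    return requested
```

Checking a module-level boolean is cheaper and easier to read than repeating the `try`/`except` at every call site, which is presumably why the flag is exported from `pandas.compat`.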
53 changes: 43 additions & 10 deletions pandas/conftest.py
@@ -1248,7 +1248,6 @@ def nullable_string_dtype(request):
params=[
"python",
pytest.param("pyarrow", marks=td.skip_if_no("pyarrow")),
pytest.param("pyarrow_numpy", marks=td.skip_if_no("pyarrow")),
]
)
def string_storage(request):
@@ -1257,7 +1256,25 @@ def string_storage(request):

* 'python'
* 'pyarrow'
* 'pyarrow_numpy'
"""
return request.param


@pytest.fixture(
params=[
("python", pd.NA),
pytest.param(("pyarrow", pd.NA), marks=td.skip_if_no("pyarrow")),
pytest.param(("pyarrow", np.nan), marks=td.skip_if_no("pyarrow")),
("python", np.nan),
]
)
def string_dtype_arguments(request):
"""
Parametrized fixture for StringDtype storage and na_value.

* 'python' + pd.NA
* 'pyarrow' + pd.NA
* 'pyarrow' + np.nan
"""
return request.param

@@ -1306,20 +1323,36 @@ def object_dtype(request):

@pytest.fixture(
params=[
"object",
"string[python]",
pytest.param("string[pyarrow]", marks=td.skip_if_no("pyarrow")),
pytest.param("string[pyarrow_numpy]", marks=td.skip_if_no("pyarrow")),
]
np.dtype("object"),
("python", pd.NA),
pytest.param(("pyarrow", pd.NA), marks=td.skip_if_no("pyarrow")),
pytest.param(("pyarrow", np.nan), marks=td.skip_if_no("pyarrow")),
("python", np.nan),
],
ids=[
"string=object",
"string=string[python]",
"string=string[pyarrow]",
"string=str[pyarrow]",
"string=str[python]",
],
)
def any_string_dtype(request):
"""
Parametrized fixture for string dtypes.
* 'object'
* 'string[python]'
* 'string[pyarrow]'
* 'string[python]' (NA variant)
* 'string[pyarrow]' (NA variant)
* 'str' (NaN variant, with pyarrow)
* 'str' (NaN variant, without pyarrow)
"""
return request.param
if isinstance(request.param, np.dtype):
return request.param
else:
# need to instantiate the StringDtype here instead of in the params
# to avoid importing pyarrow during test collection
storage, na_value = request.param
return pd.StringDtype(storage, na_value)


@pytest.fixture(params=tm.DATETIME64_DTYPES)
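
The reworked `any_string_dtype` fixture above keeps its params as cheap `(storage, na_value)` tuples and builds the dtype only inside the fixture body, so that collecting tests does not import pyarrow. A stdlib-only sketch of that deferral, with the id naming taken from the `ids` list in the diff, is below. `NA`, `StringDtypeSpec`, `param_id`, and `materialize` are hypothetical stand-ins for illustration, not pandas or pytest API.

```python
import math
from dataclasses import dataclass

NA = object()  # hypothetical stand-in for pd.NA


@dataclass(frozen=True)
class StringDtypeSpec:
    # Hypothetical stand-in for the object pd.StringDtype(storage, na_value) builds.
    storage: str
    na_value: object


# Params stay plain tuples, so defining them imports nothing heavy.
PARAMS = [("python", NA), ("pyarrow", NA), ("pyarrow", math.nan), ("python", math.nan)]


def param_id(param) -> str:
    # NA variant is labelled "string[...]", NaN variant "str[...]", as in the diff's ids.
    storage, na_value = param
    prefix = "string" if na_value is NA else "str"
    return f"string={prefix}[{storage}]"


def materialize(param) -> StringDtypeSpec:
    # Construction is deferred to call time, mirroring the fixture body.
    storage, na_value = param
    return StringDtypeSpec(storage, na_value)
```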
2 changes: 1 addition & 1 deletion pandas/core/algorithms.py
@@ -935,7 +935,7 @@ def value_counts_internal(
idx = idx.astype(object)
elif (
idx.dtype != keys.dtype # noqa: PLR1714 # # pylint: disable=R1714
and idx.dtype != "string[pyarrow_numpy]"
and idx.dtype != "string"
):
warnings.warn(
# GH#56161
15 changes: 11 additions & 4 deletions pandas/core/arrays/arrow/array.py
@@ -570,10 +570,11 @@ def __getitem__(self, item: PositionalIndexer):
if isinstance(item, np.ndarray):
if not len(item):
# Removable once we migrate StringDtype[pyarrow] to ArrowDtype[string]
if self._dtype.name == "string" and self._dtype.storage in (
"pyarrow",
"pyarrow_numpy",
if (
isinstance(self._dtype, StringDtype)
and self._dtype.storage == "pyarrow"
):
# TODO(infer_string) should this be large_string?
pa_dtype = pa.string()
else:
pa_dtype = self._dtype.pyarrow_dtype
@@ -703,7 +704,13 @@ def _cmp_method(self, other, op):
if isinstance(
other, (ArrowExtensionArray, np.ndarray, list, BaseMaskedArray)
) or isinstance(getattr(other, "dtype", None), CategoricalDtype):
result = pc_func(self._pa_array, self._box_pa(other))
try:
result = pc_func(self._pa_array, self._box_pa(other))
except pa.ArrowNotImplementedError:
# TODO: could this be wrong if other is object dtype?
# in which case we need to operate pointwise?
result = ops.invalid_comparison(self, other, op)
result = pa.array(result, type=pa.bool_())
elif is_scalar(other):
try:
result = pc_func(self._pa_array, self._box_pa(other))
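
The `_cmp_method` change above wraps the vectorized comparison in a `try` and falls back to `ops.invalid_comparison` when no kernel exists for the operand types. A stdlib-only sketch of that control flow follows. All names are hypothetical, `KernelNotImplemented` stands in for `pa.ArrowNotImplementedError`, and the fallback semantics (all-False for `==`, all-True for `!=`, `TypeError` for ordering) reflect my understanding of `invalid_comparison` rather than anything shown in this diff.

```python
import operator


class KernelNotImplemented(Exception):
    """Hypothetical stand-in for pa.ArrowNotImplementedError."""


def invalid_comparison_sketch(left, right, op):
    # Mismatched types: never equal, always unequal, ordering is an error.
    if op is operator.eq:
        return [False] * len(left)
    if op is operator.ne:
        return [True] * len(left)
    raise TypeError(
        f"Invalid comparison between {type(left).__name__} and {type(right).__name__}"
    )


def strict_kernel(left, right, op):
    # Toy "fast path" that refuses element pairs of different types.
    if any(type(a) is not type(b) for a, b in zip(left, right)):
        raise KernelNotImplemented("no kernel for mixed types")
    return [op(a, b) for a, b in zip(left, right)]


def compare(left, right, op, kernel=strict_kernel):
    # Try the fast path first; fall back only when the kernel is unsupported.
    try:
        return kernel(left, right, op)
    except KernelNotImplemented:
        return invalid_comparison_sketch(left, right, op)
```

As the TODO in the diff notes, this fallback may be too coarse when the other operand is object dtype and a pointwise comparison would be more appropriate.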
6 changes: 6 additions & 0 deletions pandas/core/arrays/datetimelike.py
@@ -20,6 +20,8 @@

import numpy as np

from pandas._config import using_string_dtype

from pandas._libs import (
algos,
lib,
@@ -1789,6 +1791,10 @@ def strftime(self, date_format: str) -> npt.NDArray[np.object_]:
dtype='object')
"""
result = self._format_native_types(date_format=date_format, na_rep=np.nan)
if using_string_dtype():
from pandas import StringDtype

return pd_array(result, dtype=StringDtype(na_value=np.nan)) # type: ignore[return-value]
return result.astype(object, copy=False)


17 changes: 17 additions & 0 deletions pandas/core/arrays/datetimes.py
@@ -14,6 +14,8 @@

import numpy as np

from pandas._config import using_string_dtype

from pandas._libs import (
lib,
tslib,
@@ -1306,6 +1308,13 @@ def month_name(self, locale=None) -> npt.NDArray[np.object_]:
values, "month_name", locale=locale, reso=self._creso
)
result = self._maybe_mask_results(result, fill_value=None)
if using_string_dtype():
from pandas import (
StringDtype,
array as pd_array,
)

return pd_array(result, dtype=StringDtype(na_value=np.nan)) # type: ignore[return-value]
return result

def day_name(self, locale=None) -> npt.NDArray[np.object_]:
@@ -1363,6 +1372,14 @@ def day_name(self, locale=None) -> npt.NDArray[np.object_]:
values, "day_name", locale=locale, reso=self._creso
)
result = self._maybe_mask_results(result, fill_value=None)
if using_string_dtype():
# TODO: no tests that check for dtype of result as of 2024-08-15
from pandas import (
StringDtype,
array as pd_array,
)

return pd_array(result, dtype=StringDtype(na_value=np.nan)) # type: ignore[return-value]
return result

@property
4 changes: 0 additions & 4 deletions pandas/core/arrays/numpy_.py
@@ -557,7 +557,3 @@ def _wrap_ndarray_result(self, result: np.ndarray):

return TimedeltaArray._simple_new(result, dtype=result.dtype)
return type(self)(result)

# ------------------------------------------------------------------------
# String methods interface
_str_na_value = np.nan