Commit 54a43b3

expand Missing value semantics section
1 parent 30c7b43 commit 54a43b3

1 file changed, +39 -14 lines changed

web/pandas/pdeps/0014-string-dtype.md

@@ -101,8 +101,9 @@ To be able to move forward with a string data type in pandas 3.0, this PDEP prop
 (but slower) version.
 2. This default "string" dtype will follow the same behaviour for missing values
 as our other default data types, and use `NaN` as the missing value sentinel.
-3. The version that is not backed by PyArrow can reuse (with minor code additions) the existing numpy
-object-dtype backed StringArray for its implementation.
+3. The version that is not backed by PyArrow can reuse (with minor code
+additions) the existing numpy object-dtype backed StringArray for its
+implementation.
 4. We update installation guidelines to clearly encourage users to install
 pyarrow for the default user experience.

@@ -111,8 +112,9 @@ experimental.

 ### Default inference of a string dtype

-By default, pandas will infer this new string dtype for string data (when
-creating pandas objects, such as in constructors or IO functions).
+By default, pandas will infer this new string dtype instead of object dtype for
+string data (when creating pandas objects, such as in constructors or IO
+functions).

 The existing `future.infer_string` option can be used to opt-in to the future
 default behaviour:
@@ -130,16 +132,39 @@ This option will be expanded to also work when PyArrow is not installed.

 ### Missing value semantics

-Given that all other default data types use NaN semantics for missing values,
-this proposal says that a new default string dtype should still use the same
-default semantics. Further, it should result in default data types when doing
-operations on the string column that result in a boolean or numeric data type
-(e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison
-operators like `==`, should result in default `int64` and `bool` data types).
+As mentioned in the background section, the original `StringDtype` has used
+the experimental `pd.NA` sentinel for missing values. In addition to using
+`pd.NA` as the scalar for a missing value, this essentially means
+that:
+
+- String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics)
+for missing values, where `NA` propagates in boolean operations such as
+comparisons or predicates.
+- Operations on the string column that give a numeric or boolean result use the
+nullable Integer/Float/Boolean data types (e.g. `ser.str.len()` returns the
+nullable `"Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
+dtype (or `float64` in case of missing values)).
+
+However, up to this date, all other default data types still use NaN semantics
+for missing values. Therefore, this proposal says that a new default string
+dtype should also still use the same default missing value semantics and return
+default data types when doing operations on the string column, to be consistent
+with the other default dtypes at this point.
+
+In practice, this means that the default `"string"` dtype will use `NaN` as
+the missing value sentinel, and:
+
+- String columns will follow NaN-semantics for missing values, where `NaN` gives
+False in boolean operations such as comparisons or predicates.
+- Operations on the string column that give a numeric or boolean result will use
+the default data types (i.e. numpy `int64`/`float64`/`bool`).

 Because the original `StringDtype` implementations already use `pd.NA` and
 return masked integer and boolean arrays in operations, a new variant of the
-existing dtypes that uses `NaN` and default data types is needed.
+existing dtypes that uses `NaN` and default data types is needed. The original
+variant of `StringDtype` using `pd.NA` will still be available for those who
+want to keep using it (see below in the "Naming" subsection for how to specify
+this).

 ### Object-dtype "fallback" implementation

@@ -196,7 +221,7 @@ However:

 ### Why not use the existing StringDtype with `pd.NA`?

-Wouldn't adding even more variants of the string dtype will make things only more
+Wouldn't adding even more variants of the string dtype make things only more
 confusing? Indeed, this proposal unfortunately introduces more variants of the
 string dtype. However, the reason for this is to ensure the actual default user
 experience is _less_ confusing, and the new string dtype fits better with the
@@ -210,8 +235,8 @@ bool, etc dtypes). This would lead to a very confusing default experience.

 With the proposed new variant of the StringDtype, this will ensure that for the
 _default_ experience, a user will see only 1 kind of integer dtype, only
-1 kind of bool dtype, etc. For now, a user should only get columns with an
-`ArrowDtype` and/or using `pd.NA` when explicitly opting into this.
+1 kind of bool dtype, etc. For now, a user should only get columns using `pd.NA`
+when explicitly opting into this.

 ## Backward compatibility

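The difference between the two missing value semantics described in the "Missing value semantics" hunk can be illustrated with a small sketch. It is not part of the commit. It compares the existing `pd.NA`-based `StringDtype` with a plain object-dtype Series holding `NaN`, which serves here only as an assumed stand-in for NaN semantics, since the proposed default string dtype may not be available in the installed pandas version. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Existing NA-variant: StringDtype with pd.NA as the missing value sentinel.
na_ser = pd.Series(["a", None, "abc"], dtype=pd.StringDtype())

# Object dtype with NaN, used only as a stand-in for NaN semantics.
# This is NOT the new default string dtype proposed by the PDEP.
nan_ser = pd.Series(["a", np.nan, "abc"], dtype=object)

# Comparisons: NA propagates under NA semantics, NaN compares as False.
na_eq = na_ser == "a"    # nullable "boolean" dtype, missing entry stays <NA>
nan_eq = nan_ser == "a"  # numpy bool dtype, missing entry becomes False

# Numeric results: nullable Int64 vs. numpy float64 when values are missing.
na_len = na_ser.str.len()
nan_len = nan_ser.str.len()

print(na_eq.dtype, nan_eq.dtype)
print(na_len.dtype, nan_len.dtype)
```

Under the proposal, the default string dtype would follow the second pattern: comparisons give a numpy `bool` result with `False` for missing entries, and `.str.len()` gives numpy `int64`, or `float64` when missing values are present.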