@@ -101,8 +101,9 @@ To be able to move forward with a string data type in pandas 3.0, this PDEP prop
 (but slower) version.
 2. This default "string" dtype will follow the same behaviour for missing values
 as our other default data types, and use `NaN` as the missing value sentinel.
-3. The version that is not backed by PyArrow can reuse (with minor code additions) the existing numpy
-object-dtype backed StringArray for its implementation.
+3. The version that is not backed by PyArrow can reuse (with minor code
+additions) the existing numpy object-dtype backed StringArray for its
+implementation.
 4. We update installation guidelines to clearly encourage users to install
 pyarrow for the default user experience.
 
@@ -111,8 +112,9 @@ experimental.
 
 ### Default inference of a string dtype
 
-By default, pandas will infer this new string dtype for string data (when
-creating pandas objects, such as in constructors or IO functions).
+By default, pandas will infer this new string dtype instead of object dtype for
+string data (when creating pandas objects, such as in constructors or IO
+functions).
 
 The existing `future.infer_string` option can be used to opt-in to the future
 default behaviour:
@@ -130,16 +132,39 @@ This option will be expanded to also work when PyArrow is not installed.
 
 ### Missing value semantics
 
-Given that all other default data types use NaN semantics for missing values,
-this proposal says that a new default string dtype should still use the same
-default semantics. Further, it should result in default data types when doing
-operations on the string column that result in a boolean or numeric data type
-(e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison
-operators like `==`, should result in default `int64` and `bool` data types).
+As mentioned in the background section, the original `StringDtype` has used
+the experimental `pd.NA` sentinel for missing values. In addition to using
+`pd.NA` as the scalar for a missing value, this essentially means
+that:
+
+- String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics)
+for missing values, where `NA` propagates in boolean operations such as
+comparisons or predicates.
+- Operations on the string column that give a numeric or boolean result use the
+nullable Integer/Float/Boolean data types (e.g. `ser.str.len()` returns the
+nullable `'Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
+dtype (or `float64` in case of missing values)).
+
+However, up to this date, all other default data types still use NaN semantics
+for missing values. Therefore, this proposal says that a new default string
+dtype should also still use the same default missing value semantics and return
+default data types when doing operations on the string column, to be consistent
+with the other default dtypes at this point.
+
+In practice, this means that the default `"string"` dtype will use `NaN` as
+the missing value sentinel, and:
+
+- String columns will follow NaN-semantics for missing values, where `NaN` gives
+False in boolean operations such as comparisons or predicates.
+- Operations on the string column that give a numeric or boolean result will use
+the default data types (i.e. numpy `int64`/`float64`/`bool`).
 
 Because the original `StringDtype` implementations already use `pd.NA` and
 return masked integer and boolean arrays in operations, a new variant of the
-existing dtypes that uses `NaN` and default data types is needed.
+existing dtypes that uses `NaN` and default data types is needed. The original
+variant of `StringDtype` using `pd.NA` will still be available for those who
+want to keep using it (see below in the "Naming" subsection for how to specify
+this).
 
 ### Object-dtype "fallback" implementation
 
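The two missing-value behaviours contrasted in the hunk above can be sketched with a small pure-Python toy model. It does not call pandas. The `NA` marker and every function name below are hypothetical and exist only for illustration; the sketch assumes nothing about how pandas implements either variant.

```python
import math

# Hypothetical stand-in for pd.NA in this toy model (not the pandas object).
NA = None


def is_missing(value):
    """Return True for the toy NA marker or for a float NaN."""
    return value is NA or (isinstance(value, float) and math.isnan(value))


def equals_na_semantics(values, other):
    """NA-semantics: a missing value propagates into the comparison result."""
    return [NA if is_missing(v) else v == other for v in values]


def equals_nan_semantics(values, other):
    """NaN-semantics: a missing value simply compares as False."""
    return [False if is_missing(v) else v == other for v in values]


def lengths_na_semantics(values):
    """NA-semantics: lengths stay integers, with NA marking the gaps."""
    return [NA if is_missing(v) else len(v) for v in values]


def lengths_nan_semantics(values):
    """NaN-semantics: integers cannot hold NaN, so gaps force a float result."""
    if not any(is_missing(v) for v in values):
        return [len(v) for v in values]
    return [math.nan if is_missing(v) else float(len(v)) for v in values]


na_mask = equals_na_semantics(["a", NA, "bc"], "a")
nan_mask = equals_nan_semantics(["a", math.nan, "bc"], "a")
na_lengths = lengths_na_semantics(["a", NA, "bc"])
nan_lengths = lengths_nan_semantics(["a", math.nan, "bc"])
```

In this model the NA-style comparison keeps a third "unknown" state that the caller must handle, while the NaN-style comparison always yields a plain boolean. That plain-boolean, plain-numeric outcome is the consistency with the other default dtypes that the proposal argues for.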
@@ -196,7 +221,7 @@ However:
 
 ### Why not use the existing StringDtype with `pd.NA`?
 
-Wouldn't adding even more variants of the string dtype will make things only more
+Wouldn't adding even more variants of the string dtype make things only more
 confusing? Indeed, this proposal unfortunately introduces more variants of the
 string dtype. However, the reason for this is to ensure the actual default user
 experience is _less_ confusing, and the new string dtype fits better with the
@@ -210,8 +235,8 @@ bool, etc dtypes). This would lead to a very confusing default experience.
 
 With the proposed new variant of the StringDtype, this will ensure that for the
 _default_ experience, a user will only see only 1 kind of integer dtype, only
-kind of 1 bool dtype, etc. For now, a user should only get columns with an
-`ArrowDtype` and/or using `pd.NA` when explicitly opting into this.
+kind of 1 bool dtype, etc. For now, a user should only get columns using `pd.NA`
+when explicitly opting into this.
 
 ## Backward compatibility
 