
Commit 7ae5b76

Merge branch 'master' into fix-17407
2 parents ccb3887 + d43aba8 commit 7ae5b76

74 files changed, +1597 -1131 lines changed

asv_bench/benchmarks/timeseries.py

+1 -1
@@ -56,7 +56,7 @@ def setup(self):
 self.no_freq = self.rng7[:50000].append(self.rng7[50002:])
 self.d_freq = self.rng7[:50000].append(self.rng7[50000:])

-self.rng8 = date_range(start='1/1/1700', freq='B', periods=100000)
+self.rng8 = date_range(start='1/1/1700', freq='B', periods=75000)
 self.b_freq = self.rng8[:50000].append(self.rng8[50000:])

 def time_add_timedelta(self):

doc/source/advanced.rst

+3 -1
@@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup

 .. ipython:: python

+from pandas.api.types import CategoricalDtype
+
 df = pd.DataFrame({'A': np.arange(6),
 'B': list('aabbca')})
-df['B'] = df['B'].astype('category', categories=list('cab'))
+df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
 df
 df.dtypes
 df.B.cat.categories
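
Note on the hunk above: the category order passed to ``CategoricalDtype`` is what later drives sorting once the column becomes an index. The following is a minimal sketch, not part of the commit, that reuses the frame from the hunk; the names ``df2`` and ``sorted_labels`` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': np.arange(6), 'B': list('aabbca')})
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))

# Using the categorical column as the index yields a CategoricalIndex.
df2 = df.set_index('B')

# sort_index follows the declared category order ('c', 'a', 'b'),
# not alphabetical order.
sorted_labels = list(df2.sort_index().index)
print(sorted_labels)
```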

doc/source/api.rst

+4 -1
@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
 Categorical
 ~~~~~~~~~~~

-If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
+.. autoclass:: api.types.CategoricalDtype
+:members: categories, ordered
+
+If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
 data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
 following usable methods and properties:

doc/source/categorical.rst

+95 -8
@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
 df["B"] = raw_cat
 df

-You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
+Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of
+
+1. categories are inferred from the data
+2. categories are unordered.
+
+To control those behaviors, instead of passing ``'category'``, use an instance
+of :class:`~pandas.api.types.CategoricalDtype`.

 .. ipython:: python

-s = pd.Series(["a","b","c","a"])
-s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
+from pandas.api.types import CategoricalDtype
+
+s = pd.Series(["a", "b", "c", "a"])
+cat_type = CategoricalDtype(categories=["b", "c", "d"],
+ordered=True)
+s_cat = s.astype(cat_type)
 s_cat

 Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -133,6 +143,75 @@ constructor to save the factorize step during normal constructor mode:
 splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
 s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

+.. _categorical.categoricaldtype:
+
+CategoricalDtype
+----------------
+
+.. versionchanged:: 0.21.0
+
+A categorical's type is fully described by
+
+1. ``categories``: a sequence of unique values and no missing values
+2. ``ordered``: a boolean
+
+This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
+The ``categories`` argument is optional, which implies that the actual categories
+should be inferred from whatever is present in the data when the
+:class:`pandas.Categorical` is created. The categories are assumed to be unordered
+by default.
+
+.. ipython:: python
+
+from pandas.api.types import CategoricalDtype
+
+CategoricalDtype(['a', 'b', 'c'])
+CategoricalDtype(['a', 'b', 'c'], ordered=True)
+CategoricalDtype()
+
+A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
+expects a `dtype`. For example :func:`pandas.read_csv`,
+:func:`pandas.DataFrame.astype`, or in the Series constructor.
+
+.. note::
+
+As a convenience, you can use the string ``'category'`` in place of a
+:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
+the categories being unordered, and equal to the set values present in the
+array. In other words, ``dtype='category'`` is equivalent to
+``dtype=CategoricalDtype()``.
+
+Equality Semantics
+~~~~~~~~~~~~~~~~~~
+
+Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
+whenever they have the same categories and orderedness. When comparing two
+unordered categoricals, the order of the ``categories`` is not considered
+
+.. ipython:: python
+
+c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
+
+# Equal, since order is not considered when ordered=False
+c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
+
+# Unequal, since the second CategoricalDtype is ordered
+c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
+
+All instances of ``CategoricalDtype`` compare equal to the string ``'category'``
+
+.. ipython:: python
+
+c1 == 'category'
+
+.. warning::
+
+Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
+and since all instances ``CategoricalDtype`` compare equal to ``'category'``,
+all instances of ``CategoricalDtype`` compare equal to a
+``CategoricalDtype(None, False)``, regardless of ``categories`` or
+``ordered``.
+
 Description
 -----------

@@ -184,7 +263,7 @@ It's also possible to pass in the categories in a specific order:

 .. ipython:: python

-s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
+s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
 s

 # categories
@@ -301,7 +380,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``

 s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
 s.sort_values(inplace=True)
-s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
+s = pd.Series(["a","b","c","a"]).astype(
+CategoricalDtype(ordered=True)
+)
 s.sort_values(inplace=True)
 s
 s.min(), s.max()
@@ -401,9 +482,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

 .. ipython:: python

-cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
-cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
-cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
+cat = pd.Series([1,2,3]).astype(
+CategoricalDtype([3, 2, 1], ordered=True)
+)
+cat_base = pd.Series([2,2,2]).astype(
+CategoricalDtype([3, 2, 1], ordered=True)
+)
+cat_base2 = pd.Series([2,2,2]).astype(
+CategoricalDtype(ordered=True)
+)

 cat
 cat_base
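
Note on the categorical.rst changes above: the new docs describe two behaviours that are easy to check by hand, namely that ``astype`` with an explicit ``CategoricalDtype`` turns values outside the declared categories into missing values, and that dtype equality ignores category order only for unordered dtypes. The sketch below is an illustration assembled from the examples in the diff, not code from the commit; the variable names holding the results are made up here.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
s_cat = s.astype(cat_type)

# "a" is not one of the declared categories, so it becomes missing.
missing_mask = s_cat.isna().tolist()
categories = list(s_cat.cat.categories)

c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)

# Category order is ignored when both dtypes are unordered.
same_unordered = c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)

# Same categories, but the ordered flag differs.
differs_in_ordered = c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)

# Every CategoricalDtype compares equal to the string 'category'.
matches_string = c1 == 'category'

print(missing_mask, categories)
print(same_unordered, differs_in_ordered, matches_string)
```

The comparison against ``CategoricalDtype(None, False)`` mentioned in the warning is deliberately left out of the sketch, since that behaviour is tied to the pandas version this commit targets.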

doc/source/groupby.rst

+2 -2
@@ -1060,7 +1060,7 @@ To select from a DataFrame or Series the nth item, use the nth method. This is a
 g.nth(-1)
 g.nth(1)

-If you want to select the nth not-null item, use the ``dropna`` kwarg. For a DataFrame this should be either ``'any'`` or ``'all'`` just like you would pass to dropna, for a Series this just needs to be truthy.
+If you want to select the nth not-null item, use the ``dropna`` kwarg. For a DataFrame this should be either ``'any'`` or ``'all'`` just like you would pass to dropna:

 .. ipython:: python

@@ -1072,7 +1072,7 @@ If you want to select the nth not-null item, use the ``dropna`` kwarg. For a Dat
 g.nth(-1, dropna='any') # NaNs denote group exhausted when using dropna
 g.last()

-g.B.nth(0, dropna=True)
+g.B.nth(0, dropna='all')

 As with other methods, passing ``as_index=False``, will achieve a filtration, which returns the grouped row.
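
Note on the groupby.rst change above: the doc example now passes ``dropna='all'`` on a grouped Series instead of a bare truthy value. A minimal sketch of that call follows, assuming a small hypothetical frame (it is not taken from this diff). The index of the result may differ between pandas versions, so only the selected values are inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group A=1 starts with a missing value in B.
df = pd.DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')

# Missing values are dropped before the nth row of each group is taken,
# so group A=1 contributes 4.0 rather than NaN.
first_valid = g.B.nth(0, dropna='all')
values = sorted(first_valid.tolist())
print(values)
```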

doc/source/merging.rst

+8 -3
@@ -830,8 +830,10 @@ The left frame.

 .. ipython:: python

+from pandas.api.types import CategoricalDtype
+
 X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
-X = X.astype('category', categories=['foo', 'bar'])
+X = X.astype(CategoricalDtype(categories=['foo', 'bar']))

 left = pd.DataFrame({'X': X,
 'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +844,11 @@ The right frame.

 .. ipython:: python

-right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
-'Z': [1, 2]})
+right = pd.DataFrame({
+'X': pd.Series(['foo', 'bar'],
+dtype=CategoricalDtype(['foo', 'bar'])),
+'Z': [1, 2]
+})
 right
 right.dtypes
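
Note on the merging.rst change above: both frames are built with the same ``CategoricalDtype`` so that the key column stays categorical after the merge. The sketch below illustrates this with fixed data instead of the random choices used in the docs; ``x_type`` and ``result`` are names introduced here, not names from the commit.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One shared dtype object guarantees both key columns have identical categories.
x_type = CategoricalDtype(categories=['foo', 'bar'])

left = pd.DataFrame({
    'X': pd.Series(['foo', 'bar', 'foo'], dtype=x_type),
    'Y': ['one', 'two', 'three'],
})
right = pd.DataFrame({
    'X': pd.Series(['foo', 'bar'], dtype=x_type),
    'Z': [1, 2],
})

# Because the key dtypes match exactly, the merged key keeps the category dtype.
result = pd.merge(left, right, on='X')
print(result.dtypes)
```

If the two key columns had different categories, the merged key would typically fall back to a non-categorical dtype, which is why the docs construct both sides explicitly.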

doc/source/whatsnew/v0.21.0.txt

+29
@@ -10,6 +10,8 @@ users upgrade to this version.
 Highlights include:

 - Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
+- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
+categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.

 Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.

@@ -89,6 +91,31 @@ This does not raise any obvious exceptions, but also does not create a new colum

 Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
+expanded to include the ``categories`` and ``ordered`` attributes. A
+``CategoricalDtype`` can be used to specify the set of categories and
+orderedness of an array, independent of the data themselves. This can be useful,
+e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
+:issue:`15078`, :issue:`16015`):
+
+.. ipython:: python
+
+from pandas.api.types import CategoricalDtype
+
+s = pd.Series(['a', 'b', 'c', 'a']) # strings
+dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
+s.astype(dtype)
+
+The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
+``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+
+See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.
+
 .. _whatsnew_0210.enhancements.other:

 Other Enhancements
@@ -498,6 +525,7 @@ Conversion
 - Bug in :func:`Series.fillna` returns frame when ``inplace=True`` and ``value`` is dict (:issue:`16156`)
 - Bug in :attr:`Timestamp.weekday_name` returning a UTC-based weekday name when localized to a timezone (:issue:`17354`)
 - Bug in ``Timestamp.replace`` when replacing ``tzinfo`` around DST changes (:issue:`15683`)
+- Bug in ``Timedelta`` construction and arithmetic that would not propagate the ``Overflow`` exception (:issue:`17367`)

 Indexing
 ^^^^^^^^
@@ -517,6 +545,7 @@ Indexing
 - Bug in ``CategoricalIndex`` reindexing in which specified indices containing duplicates were not being respected (:issue:`17323`)
 - Bug in intersection of ``RangeIndex`` with negative step (:issue:`17296`)
 - Bug in ``IntervalIndex`` where performing a scalar lookup fails for included right endpoints of non-overlapping monotonic decreasing indexes (:issue:`16417`, :issue:`17271`)
+- Bug in :meth:`DataFrame.first_valid_index` and :meth:`DataFrame.last_valid_index` when no valid entry (:issue:`17400`)
 - Bug in ``Series.rename`` when called with `str` alters name of series rather than index of series. (:issue:`17407`)

 I/O
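
Note on the ``CategoricalDtype`` whatsnew entry above: it states that ``.dtype`` on a ``Categorical``, ``CategoricalIndex`` or categorical ``Series`` returns a ``CategoricalDtype`` instance, and the categorical docs in this commit mention ``read_csv`` as another place that accepts one. A minimal sketch covering both follows; the CSV text and the column name ``col`` are made up for illustration.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)

# The same dtype object can be handed to each categorical container.
s = pd.Series(['a', 'b', 'c', 'a']).astype(dtype)
cat = pd.Categorical(['a', 'b'], dtype=dtype)
idx = pd.CategoricalIndex(['a', 'b'], dtype=dtype)

# read_csv accepts the dtype per column, so categories are fixed up front
# instead of being inferred from whatever values the file contains.
csv_text = "col\na\nb\nc\na\n"
frame = pd.read_csv(io.StringIO(csv_text), dtype={'col': dtype})

print(s.dtype)
print(frame['col'].cat.categories)
```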

pandas/_libs/groupby.pyx

-2
@@ -7,8 +7,6 @@ cimport cython

 cnp.import_array()

-cimport util
-
 from numpy cimport (ndarray,
 double_t,
 int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t,

pandas/_libs/join.pyx

-2
@@ -8,8 +8,6 @@ from cython cimport Py_ssize_t

 np.import_array()

-cimport util
-
 from numpy cimport (ndarray,
 int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t,
 uint32_t, uint64_t, float16_t, float32_t, float64_t)

pandas/_libs/parsers.pyx

+1 -1
@@ -255,7 +255,7 @@ cdef extern from "parser/tokenizer.h":

 # inline int to_complex(char *item, double *p_real,
 # double *p_imag, char sci, char decimal)
-inline int to_longlong(char *item, long long *p_value) nogil
+int to_longlong(char *item, long long *p_value) nogil
 # inline int to_longlong_thousands(char *item, long long *p_value,
 # char tsep)
 int to_boolean(const char *item, uint8_t *val) nogil
