
Commit c2a1e18

More flexible describe() via include/exclude type filtering
This enhances describe()'s output via new include/exclude list arguments, letting the user specify the dtypes to be summarized. This provides a simple way to override the automatic type filtering done by default; it is also convenient with groupby(). Also includes documentation and changelog entries.
1 parent 4802d0f commit c2a1e18

4 files changed, 221 additions and 59 deletions


doc/source/basics.rst

Lines changed: 18 additions & 0 deletions
@@ -490,6 +490,24 @@ number of unique values and most frequently occurring values:
    s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
    s.describe()
 
+Note that on a mixed-type DataFrame object, `describe` will restrict the summary to
+include only numerical columns or, if none are, only categorical columns:
+
+.. ipython:: python
+
+   frame = DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
+   frame.describe()
+
+This behaviour can be controlled by providing a list of types as ``include``/``exclude``
+arguments. The special value ``all`` can also be used:
+
+.. ipython:: python
+
+   frame.describe(include=['object'])
+   frame.describe(include=['number'])
+   frame.describe(include='all')
+
+That feature relies on :ref:`select_dtypes <basics.selectdtypes>`. Refer to there for details about accepted inputs.
 
 There also is a utility function, ``value_range`` which takes a DataFrame and
 returns a series with the minimum/maximum values in the DataFrame.
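The column-selection rule described in the basics.rst addition above can be sketched without pandas. This is a simplified illustration, not the pandas implementation: `select_columns` and the flat kind labels ("number", "bool", "object", "datetime") are hypothetical names introduced here, whereas the real code delegates to `select_dtypes`, which also understands the NumPy dtype hierarchy.

```python
# Hypothetical stand-in for the dtype filtering done by describe().
# Each column is labelled with a single flat "kind" string (an assumption
# made for this sketch; pandas works with real dtypes instead).
NUMERIC_KINDS = {"number", "bool"}


def select_columns(kinds, include=None, exclude=None):
    """Return the column names that would be summarized.

    kinds maps column name -> kind label, in column order.
    """
    if include is None and exclude is None:
        # Default: keep only numeric columns, or everything if none exist.
        numeric = [col for col, kind in kinds.items() if kind in NUMERIC_KINDS]
        return numeric if numeric else list(kinds)
    if include == "all":
        if exclude is not None:
            raise ValueError("exclude must be None when include is 'all'")
        return list(kinds)
    selected = []
    for col, kind in kinds.items():
        if include is not None and kind not in include:
            continue
        if exclude is not None and kind in exclude:
            continue
        selected.append(col)
    return selected


frame_kinds = {"a": "object", "b": "number"}
print(select_columns(frame_kinds))                      # default filtering
print(select_columns(frame_kinds, include=["object"]))  # explicit include
print(select_columns(frame_kinds, include="all"))       # every column
```

The three calls mirror the `frame.describe()` variants shown in the documentation hunk: the default keeps only the numeric column, an explicit list narrows to the requested kinds, and `'all'` keeps the input column set unchanged.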

doc/source/v0.15.0.txt

Lines changed: 18 additions & 0 deletions
@@ -56,6 +56,24 @@ users upgrade to this version.
 
 API changes
 ~~~~~~~~~~~
+- :func:`describe` on mixed-types DataFrames is more flexible. Type-based column filtering is now possible via the ``include``/``exclude`` arguments (:issue:`8164`).
+
+  .. ipython:: python
+
+    df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                    'catB': ['a', 'b', 'c', 'd'] * 6,
+                    'numC': np.arange(24),
+                    'numD': np.arange(24.) + .5})
+    df.describe(include=["object"])
+    df.describe(include=["number", "object"], exclude=["float"])
+
+  Requesting all columns is possible with the shorthand 'all'
+
+  .. ipython:: python
+
+    df.describe(include='all')
+
+  Without those arguments, 'describe` will behave as before, including only numerical columns or, if none are, only categorical columns. See also the :ref:`docs <basics.describe>`
 
 - Passing multiple levels to `DataFrame.stack()` will now work when multiple level
   numbers are passed (:issue:`7660`), and will raise a ``ValueError`` when the

pandas/core/generic.py

Lines changed: 73 additions & 52 deletions
@@ -3658,27 +3658,51 @@ def abs(self):
             The percentiles to include in the output. Should all
             be in the interval [0, 1]. By default `percentiles` is
             [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
+        include, exclude : list-like, 'all', or None (default)
+            Specify the form of the returned result. Either:
+
+            - None to both (default). The result will include only numeric-typed
+              columns or, if none are, only categorical columns.
+            - A list of dtypes or strings to be included/excluded.
+              To select all numeric types use numpy numpy.number. To select
+              categorical objects use type object. See also the select_dtypes
+              documentation. eg. df.describe(include=['O'])
+            - If include is the string 'all', the output column-set will
+              match the input one.
 
         Returns
         -------
         summary: %(klass)s of summary statistics
 
         Notes
         -----
-        For numeric dtypes the index includes: count, mean, std, min,
+        The output DataFrame index depends on the requested dtypes:
+
+        For numeric dtypes, it will include: count, mean, std, min,
         max, and lower, 50, and upper percentiles.
 
-        If self is of object dtypes (e.g. timestamps or strings), the output
+        For object dtypes (e.g. timestamps or strings), the index
         will include the count, unique, most common, and frequency of the
         most common. Timestamps also include the first and last items.
 
+        For mixed dtypes, the index will be the union of the corresponding
+        output types. Non-applicable entries will be filled with NaN.
+        Note that mixed-dtype outputs can only be returned from mixed-dtype
+        inputs and appropriate use of the include/exclude arguments.
+
         If multiple values have the highest count, then the
         `count` and `most common` pair will be arbitrarily chosen from
         among those with the highest count.
+
+        The include, exclude arguments are ignored for Series.
+
+        See also
+        --------
+        DataFrame.select_dtypes
         """
 
     @Appender(_shared_docs['describe'] % _shared_doc_kwargs)
-    def describe(self, percentile_width=None, percentiles=None):
+    def describe(self, percentile_width=None, percentiles=None, include=None, exclude=None ):
         if self.ndim >= 3:
             msg = "describe is not implemented on on Panel or PanelND objects."
             raise NotImplementedError(msg)
@@ -3715,16 +3739,6 @@ def describe(self, percentile_width=None, percentiles=None):
             uh = percentiles[percentiles > .5]
             percentiles = np.hstack([lh, 0.5, uh])
 
-        # dtypes: numeric only, numeric mixed, objects only
-        data = self._get_numeric_data()
-        if self.ndim > 1:
-            if len(data._info_axis) == 0:
-                is_object = True
-            else:
-                is_object = False
-        else:
-            is_object = not self._is_numeric_mixed_type
-
         def pretty_name(x):
             x *= 100
             if x == int(x):
@@ -3733,10 +3747,12 @@ def pretty_name(x):
                 return '%.1f%%' % x
 
         def describe_numeric_1d(series, percentiles):
-            return ([series.count(), series.mean(), series.std(),
-                     series.min()] +
-                    [series.quantile(x) for x in percentiles] +
-                    [series.max()])
+            stat_index = (['count', 'mean', 'std', 'min'] +
+                          [pretty_name(x) for x in percentiles] + ['max'])
+            d = ([series.count(), series.mean(), series.std(), series.min()] +
+                 [series.quantile(x) for x in percentiles] + [series.max()])
+            return pd.Series(d, index=stat_index, name=series.name)
+
 
         def describe_categorical_1d(data):
             names = ['count', 'unique']
@@ -3749,44 +3765,49 @@ def describe_categorical_1d(data):
                     names += ['top', 'freq']
                     result += [top, freq]
 
-                elif issubclass(data.dtype.type, np.datetime64):
+                elif com.is_datetime64_dtype(data):
                     asint = data.dropna().values.view('i8')
-                    names += ['first', 'last', 'top', 'freq']
-                    result += [lib.Timestamp(asint.min()),
-                               lib.Timestamp(asint.max()),
-                               lib.Timestamp(top), freq]
-
-            return pd.Series(result, index=names)
-
-        if is_object:
-            if data.ndim == 1:
-                return describe_categorical_1d(self)
+                    names += ['top', 'freq', 'first', 'last']
+                    result += [lib.Timestamp(top), freq,
+                               lib.Timestamp(asint.min()),
+                               lib.Timestamp(asint.max())]
+
+            return pd.Series(result, index=names, name=data.name)
+
+        def describe_1d(data, percentiles):
+            if com.is_numeric_dtype(data):
+                return describe_numeric_1d(data, percentiles)
+            elif com.is_timedelta64_dtype(data):
+                return describe_numeric_1d(data, percentiles)
             else:
-                result = pd.DataFrame(dict((k, describe_categorical_1d(v))
-                                           for k, v in compat.iteritems(self)),
-                                      columns=self._info_axis,
-                                      index=['count', 'unique', 'first', 'last',
-                                             'top', 'freq'])
-                # just objects, no datime
-                if pd.isnull(result.loc['first']).all():
-                    result = result.drop(['first', 'last'], axis=0)
-                return result
-        else:
-            stat_index = (['count', 'mean', 'std', 'min'] +
-                          [pretty_name(x) for x in percentiles] +
-                          ['max'])
-            if data.ndim == 1:
-                return pd.Series(describe_numeric_1d(data, percentiles),
-                                 index=stat_index)
+                return describe_categorical_1d(data)
+
+        if self.ndim == 1:
+            return describe_1d(self, percentiles)
+        elif (include is None) and (exclude is None):
+            if len(self._get_numeric_data()._info_axis) > 0:
+                # when some numerics are found, keep only numerics
+                data = self.select_dtypes(include=[np.number, np.bool])
             else:
-                destat = []
-                for i in range(len(data._info_axis)):  # BAD
-                    series = data.iloc[:, i]
-                    destat.append(describe_numeric_1d(series, percentiles))
-
-                return self._constructor(lmap(list, zip(*destat)),
-                                         index=stat_index,
-                                         columns=data._info_axis)
+                data = self
+        elif include == 'all':
+            if exclude != None:
+                msg = "exclude must be None when include is 'all'"
+                raise ValueError(msg)
+            data = self
+        else:
+            data = self.select_dtypes(include=include, exclude=exclude)
+
+        ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
+        # set a convenient order for rows
+        names = []
+        ldesc_indexes = sorted([x.index for x in ldesc], key=len)
+        for idxnames in ldesc_indexes:
+            for name in idxnames:
+                if name not in names:
+                    names.append(name)
+        d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
+        return d
 
     _shared_docs['pct_change'] = """
         Percent change over given number of periods.
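The "convenient order for rows" step at the end of the new `describe` body, together with the NaN filling mentioned in the docstring, can be illustrated in plain Python. This is a sketch under assumptions: `ordered_row_names` and `align_summaries` are hypothetical helpers, per-column summaries are modelled as dicts, and `None` stands in for the NaN that pandas would insert.

```python
def ordered_row_names(indexes):
    """Union of row labels, visiting the shortest index first.

    Mirrors the loop in the diff: sort the per-column indexes by length,
    then append each label the first time it is seen.
    """
    names = []
    for idx in sorted(indexes, key=len):
        for name in idx:
            if name not in names:
                names.append(name)
    return names


def align_summaries(summaries):
    """Align per-column summaries on the shared row order.

    summaries maps column name -> dict of statistic -> value.
    Missing statistics become None (pandas would use NaN here).
    """
    rows = ordered_row_names([list(stats) for stats in summaries.values()])
    table = {col: [stats.get(row) for row in rows]
             for col, stats in summaries.items()}
    return rows, table


# Made-up summary values, for illustration only.
text = {"count": 4, "unique": 2, "top": "Yes", "freq": 2}
numeric = {"count": 4, "mean": 1.5, "std": 1.29, "min": 0, "50%": 1.5, "max": 3}

rows, table = align_summaries({"a": text, "b": numeric})
print(rows)
print(table["a"])
print(table["b"])
```

Because the shorter index is visited first, the labels shared with object columns (`count`, `unique`, `top`, `freq`) lead the result and the numeric-only labels follow, so a mixed-dtype summary keeps each dtype's rows grouped together.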

pandas/tests/test_generic.py

Lines changed: 112 additions & 7 deletions
@@ -1005,18 +1005,17 @@ def test_describe_objects(self):
         df = DataFrame({"C1": pd.date_range('2010-01-01', periods=4, freq='D')})
         df.loc[4] = pd.Timestamp('2010-01-04')
         result = df.describe()
-        expected = DataFrame({"C1": [5, 4, pd.Timestamp('2010-01-01'),
-                                     pd.Timestamp('2010-01-04'),
-                                     pd.Timestamp('2010-01-04'), 2]},
-                             index=['count', 'unique', 'first', 'last', 'top',
-                                    'freq'])
+        expected = DataFrame({"C1": [5, 4, pd.Timestamp('2010-01-04'), 2,
+                                     pd.Timestamp('2010-01-01'),
+                                     pd.Timestamp('2010-01-04')]},
+                             index=['count', 'unique', 'top', 'freq',
+                                    'first', 'last'])
         assert_frame_equal(result, expected)
 
         # mix time and str
         df['C2'] = ['a', 'a', 'b', 'c', 'a']
         result = df.describe()
-        # when mix of dateimte / obj the index gets reordered.
-        expected['C2'] = [5, 3, np.nan, np.nan, 'a', 3]
+        expected['C2'] = [5, 3, 'a', 3, np.nan, np.nan]
         assert_frame_equal(result, expected)
 
         # just str
@@ -1036,6 +1035,112 @@ def test_describe_objects(self):
         assert_frame_equal(df[['C1', 'C3']].describe(), df[['C3']].describe())
         assert_frame_equal(df[['C2', 'C3']].describe(), df[['C3']].describe())
 
+    def test_describe_typefiltering(self):
+        df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                        'catB': ['a', 'b', 'c', 'd'] * 6,
+                        'numC': np.arange(24, dtype='int64'),
+                        'numD': np.arange(24.) + .5,
+                        'ts': tm.makeTimeSeries()[:24].index})
+
+        descN = df.describe()
+        expected_cols = ['numC', 'numD',]
+        expected = DataFrame(dict((k, df[k].describe())
+                                  for k in expected_cols),
+                             columns=expected_cols)
+        assert_frame_equal(descN, expected)
+
+        desc = df.describe(include=['number'])
+        assert_frame_equal(desc, descN)
+        desc = df.describe(exclude=['object', 'datetime'])
+        assert_frame_equal(desc, descN)
+        desc = df.describe(include=['float'])
+        assert_frame_equal(desc, descN.drop('numC',1))
+
+        descC = df.describe(include=['O'])
+        expected_cols = ['catA', 'catB']
+        expected = DataFrame(dict((k, df[k].describe())
+                                  for k in expected_cols),
+                             columns=expected_cols)
+        assert_frame_equal(descC, expected)
+
+        descD = df.describe(include=['datetime'])
+        assert_series_equal( descD.ts, df.ts.describe())
+
+        desc = df.describe(include=['object','number', 'datetime'])
+        assert_frame_equal(desc.loc[:,["numC","numD"]].dropna(), descN)
+        assert_frame_equal(desc.loc[:,["catA","catB"]].dropna(), descC)
+        descDs = descD.sort_index() # the index order change for mixed-types
+        assert_frame_equal(desc.loc[:,"ts":].dropna().sort_index(), descDs)
+
+        desc = df.loc[:,'catA':'catB'].describe(include='all')
+        assert_frame_equal(desc, descC)
+        desc = df.loc[:,'numC':'numD'].describe(include='all')
+        assert_frame_equal(desc, descN)
+
+        desc = df.describe(percentiles = [], include='all')
+        cnt = Series(data=[4,4,6,6,6], index=['catA','catB','numC','numD','ts'])
+        assert_series_equal( desc.count(), cnt)
+        self.assertTrue('count' in desc.index)
+        self.assertTrue('unique' in desc.index)
+        self.assertTrue('50%' in desc.index)
+        self.assertTrue('first' in desc.index)
+
+        desc = df.drop("ts", 1).describe(percentiles = [], include='all')
+        assert_series_equal( desc.count(), cnt.drop("ts"))
+        self.assertTrue('first' not in desc.index)
+        desc = df.drop(["numC","numD"], 1).describe(percentiles = [], include='all')
+        assert_series_equal( desc.count(), cnt.drop(["numC","numD"]))
+        self.assertTrue('50%' not in desc.index)
+
+    def test_describe_typefiltering_category_bool(self):
+        df = DataFrame({'A_cat': pd.Categorical(['foo', 'foo', 'bar'] * 8),
+                        'B_str': ['a', 'b', 'c', 'd'] * 6,
+                        'C_bool': [True] * 12 + [False] * 12,
+                        'D_num': np.arange(24.) + .5,
+                        'E_ts': tm.makeTimeSeries()[:24].index})
+
+        # bool is considered numeric in describe, although not an np.number
+        desc = df.describe()
+        expected_cols = ['C_bool', 'D_num']
+        expected = DataFrame(dict((k, df[k].describe())
+                                  for k in expected_cols),
+                             columns=expected_cols)
+        assert_frame_equal(desc, expected)
+
+        desc = df.describe(include=["category"])
+        self.assertTrue(desc.columns.tolist() == ["A_cat"])
+
+        # 'all' includes numpy-dtypes + category
+        desc1 = df.describe(include="all")
+        desc2 = df.describe(include=[np.generic, "category"])
+        assert_frame_equal(desc1, desc2)
+
+    def test_describe_timedelta(self):
+        df = DataFrame({"td": pd.to_timedelta(np.arange(24)%20,"D")})
+        self.assertTrue(df.describe().loc["mean"][0] == pd.to_timedelta("8d4h"))
+
+    def test_describe_typefiltering_dupcol(self):
+        df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                        'catB': ['a', 'b', 'c', 'd'] * 6,
+                        'numC': np.arange(24),
+                        'numD': np.arange(24.) + .5,
+                        'ts': tm.makeTimeSeries()[:24].index})
+        s = df.describe(include='all').shape[1]
+        df = pd.concat([df, df], axis=1)
+        s2 = df.describe(include='all').shape[1]
+        self.assertTrue(s2 == 2 * s)
+
+    def test_describe_typefiltering_groupby(self):
+        df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                        'catB': ['a', 'b', 'c', 'd'] * 6,
+                        'numC': np.arange(24),
+                        'numD': np.arange(24.) + .5,
+                        'ts': tm.makeTimeSeries()[:24].index})
+        G = df.groupby('catA')
+        self.assertTrue(G.describe(include=['number']).shape == (16, 2))
+        self.assertTrue(G.describe(include=['number', 'object']).shape == (22, 3))
+        self.assertTrue(G.describe(include='all').shape == (26, 4))
+
     def test_no_order(self):
         tm._skip_if_no_scipy()
         s = Series([0, 1, np.nan, 3])
