PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395

topper-123 · 2018-03-17T20:19:37Z

ORIGINAL: 13.8 ms
EDIT: After #21369 was merged the result of %timeit df2.loc['b'] has improved to 3.8 ms.
EDIT: After #21618 was merged the result of %timeit df2.loc['b'] has improved to 3.3 ms.
EDIT: After #21659 was merged the result of %timeit df2.loc['b'] has improved to 1.6 ms.
EDIT: After #23235 was merged the result of %timeit df2.loc['b'] has improved to 159 µs. Issue resolved.

Code Sample

>>> n = 100_000
>>> df1 = pd.DataFrame(dict(A=range(n*3)), index=list('a'*n + 'b'*n + 'c'*n))
>>> df1.index.is_monotonic_increasing
True
>>> df2 = df1.copy()
>>> df2.index = pd.CategoricalIndex(df2.index)
>>> df2.index.is_monotonic_increasing
True
>>> %timeit df1.loc['b']
125 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df2.loc['b']
13.8 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

Selecting on a CategoricalIndex is 100x slower than selecting on a normal Index.

I've tested this on master (a few days old) and on v0.22, with the same result for both versions. The lookup is even slower than a full column scan:

>>> df3 = df2.reset_index()
>>> %timeit df3[df3['index'] == 'b']
6.58 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My guess is that the binary search is bypassed and a full index scan is done instead, plus some extra work, which makes it even slower than a normal full column scan.
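To make the guess concrete, here is a minimal sketch of the two lookup strategies on sorted integer codes, using only the standard library. It is not how pandas implements this; the names `locate_by_bisect` and `locate_by_scan` are made up for illustration, and the `codes` list is just a stand-in for the integer codes behind a sorted CategoricalIndex.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for a sorted CategoricalIndex:
# a list of categories plus one integer code per row.
categories = ["a", "b", "c"]
n = 1_000
codes = [0] * n + [1] * n + [2] * n  # sorted, so binary search applies


def locate_by_bisect(codes, code):
    """Find the run of rows equal to `code` with two binary searches."""
    return slice(bisect_left(codes, code), bisect_right(codes, code))


def locate_by_scan(codes, code):
    """Find the matching rows by comparing every element (full scan)."""
    return [i for i, c in enumerate(codes) if c == code]


code_b = categories.index("b")  # label -> integer code
fast = locate_by_bisect(codes, code_b)
slow = locate_by_scan(codes, code_b)

print(fast)                      # slice covering the 'b' rows
print(slow[0], slow[-1] + 1)     # same bounds, found by scanning
```

Both functions find the same rows, but the binary search does O(log n) comparisons while the scan does O(n). That is the kind of gap that would show up if the monotonic fast path were skipped.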

Expected Output

The output is as expected, but the speed is very slow for CategoricalIndex.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: a7a7f8c
python: 3.6.3.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0.dev0+870.ga7a7f8c
pytest: 3.3.1
pip: 9.0.1
setuptools: 38.2.5
Cython: 0.26.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

topper-123 · 2018-05-09T23:24:07Z

I've retested this on RC2, and the issue is still there. Can anyone confirm it?

Is this a known limitation of using CategoricalIndex rather than Index?

david-liu-brattle-1 · 2018-05-18T02:54:26Z

It looks to me like CategoricalIndex is slower every step of the way, even with the improvements in #21022:

key = df1.loc.obj.index._engine.get_loc('b')
result = df1.loc.obj.iloc[key]

%timeit key = df1.loc.obj.index._engine.get_loc('b')

9.86 µs ± 635 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit result = df1.loc.obj.iloc[key]

73.9 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

codes = df2.loc.obj.index.categories.get_loc('b')
key = df2.loc.obj.index._engine.get_loc(codes)
result = df2.loc.obj.iloc[key]

%timeit codes = df2.loc.obj.index.categories.get_loc('b')
%timeit key = df2.loc.obj.index._engine.get_loc(codes)

2.15 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
857 µs ± 144 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit result = df2.loc.obj.iloc[key]

429 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
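For anyone who wants to repeat this breakdown outside IPython, here is a rough sketch that times the same steps with the standard `timeit` module. It uses the public `Index.get_loc` instead of the private `_engine` attribute, so the split is not identical to the one above. The helper `time_us` and the smaller `n` are my own choices, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

n = 10_000  # smaller than the issue's 100_000 to keep the run short
df1 = pd.DataFrame({"A": range(n * 3)}, index=list("a" * n + "b" * n + "c" * n))
df2 = df1.copy()
df2.index = pd.CategoricalIndex(df2.index)


def time_us(func, number=200):
    """Return the mean time per call in microseconds (hypothetical helper)."""
    return timeit.timeit(func, number=number) / number * 1e6


timings = {}
for name, df in [("Index", df1), ("CategoricalIndex", df2)]:
    key = df.index.get_loc("b")  # step 1: label -> row positions
    timings[name] = {
        "get_loc": time_us(lambda: df.index.get_loc("b")),
        "iloc": time_us(lambda: df.iloc[key]),  # step 2: positions -> rows
        "loc": time_us(lambda: df.loc["b"]),    # both steps through .loc
    }

for name, steps in timings.items():
    print(name, {step: f"{us:.1f} µs" for step, us in steps.items()})

# Sanity check: both index types should select the same rows.
same_rows = df1.loc["b"]["A"].tolist() == df2.loc["b"]["A"].tolist()
print("same rows:", same_rows)
```

Comparing the `get_loc` and `iloc` rows for the two index types shows which step dominates on a given pandas version.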

gfyoung added the Indexing and Performance labels Mar 18, 2018

topper-123 mentioned this issue May 13, 2018

BUG: CategoricalIndex.searchsorted doesn't return a scalar if input was scalar #21019

Closed

4 tasks

fjetter mentioned this issue May 13, 2018

PERF: __contains__ method for Categorical #21022

Closed

4 tasks

This was referenced Jun 6, 2018

PERF: Add __contains__ to CategoricalIndex #21342

Closed

PERF: Add __contains__ to CategoricalIndex #21369

Merged

topper-123 mentioned this issue Jun 15, 2018

PERF: avoid unnecessary recoding in CategoricalIndex._create_categorical #21506

Closed

4 tasks

topper-123 mentioned this issue Oct 19, 2018

PERF: speed up CategoricalIndex.get_loc #23235

Merged

4 tasks

jreback added this to the 0.24.0 milestone Oct 19, 2018

jreback closed this as completed in #23235 Oct 26, 2018
