PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395
Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed performance)
ORIGINAL: 13.8 ms

EDIT: After #21369 was merged, the result of %timeit df2.loc['b'] has improved to 3.8 ms.
EDIT: After #21618 was merged, the result of %timeit df2.loc['b'] has improved to 3.3 ms.
EDIT: After #21659 was merged, the result of %timeit df2.loc['b'] has improved to 1.6 ms.
EDIT: After #23235 was merged, the result of %timeit df2.loc['b'] has improved to 159 µs. Issue resolved.

Code Sample
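The original code sample is not reproduced above. The following is a minimal sketch of the kind of setup being timed; the DataFrame size, column, and index labels are assumed for illustration, not taken from the report.

```python
import numpy as np
import pandas as pd

# Assumed setup: two otherwise identical DataFrames, one with a plain
# Index and one with a CategoricalIndex built from the same labels.
n = 100_000
labels = np.random.choice(list('abcde'), n)

df = pd.DataFrame({'x': np.arange(n)}, index=pd.Index(labels)).sort_index()
df2 = df.copy()
df2.index = df2.index.astype('category')  # CategoricalIndex over the same labels

# In IPython:
# %timeit df.loc['b']   # lookup on the plain Index
# %timeit df2.loc['b']  # reported to be roughly 100x slower on the CategoricalIndex
```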
Problem description
Selecting on a CategoricalIndex is 100x slower than selecting on a normal Index. I've tested this on master (a few days old) and on v0.22, with the same result for both versions. The speed is even worse than that of a full column scan (see the sketch below).
A guess is that the binary search is bypassed and a full index scan is being done, plus some extra work, so it ends up even slower than a plain full column scan.
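A hedged sketch of such a comparison, reusing the assumed setup from above: the boolean-mask lookup has to scan every index value, yet the report says it still beats the label-based .loc lookup on the CategoricalIndex.

```python
import timeit
import numpy as np
import pandas as pd

# Same assumed setup as the sketch above.
n = 100_000
labels = np.random.choice(list('abcde'), n)
df2 = pd.DataFrame({'x': np.arange(n)}, index=pd.CategoricalIndex(labels)).sort_index()

# Full scan via a boolean mask over the index values (no sorted/hash lookup),
# compared against the label-based .loc lookup on the CategoricalIndex.
scan = timeit.timeit(lambda: df2[df2.index == 'b'], number=100)
loc = timeit.timeit(lambda: df2.loc['b'], number=100)
print(f"boolean-mask full scan:   {scan * 10:.2f} ms per call")
print(f".loc on CategoricalIndex: {loc * 10:.2f} ms per call")
```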
Expected Output
The output is as expected, but the speed is very slow for CategoricalIndex.

Output of pd.show_versions()
INSTALLED VERSIONS
commit: a7a7f8c
python: 3.6.3.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0.dev0+870.ga7a7f8c
pytest: 3.3.1
pip: 9.0.1
setuptools: 38.2.5
Cython: 0.26.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None