Performance of merge for categorical index vs category column #30513
Comments
Please check this on master. A PR to patch is welcome.
Thanks, I have just checked on master and the issue still exists: 0.68 seconds to merge on a categorical column vs 3.1 seconds to merge when that column is set as the index.
This looks a bit better on master. Could use a benchmark.
I can confirm this improved between versions 1.2.5 and 1.3.0 (more than a 20x speedup for the merge on a category index in the example above).
OK, let's see if we have an asv benchmark that covers this case; if so, we can close.
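A benchmark for this case could look roughly like the sketch below. It follows the usual asv convention of a class with a setup method and time_ methods; the class name, frame size, and column names are hypothetical and are not taken from the existing pandas asv suite.

```python
import numpy as np
import pandas as pd


class MergeCategoricalIndex:
    # Hypothetical benchmark; name and sizes are illustrative assumptions.
    params = [True, False]
    param_names = ["on_index"]

    def setup(self, on_index):
        n = 50_000
        # Unique keys stored as a categorical.
        keys = pd.Categorical([f"key_{i}" for i in range(n)])
        self.left = pd.DataFrame({"A": keys, "B": np.arange(n)})
        # Reverse the key order so the two indexes are not equal.
        self.right = pd.DataFrame({"A": keys[::-1], "C": np.arange(n)})
        if on_index:
            self.left = self.left.set_index("A")
            self.right = self.right.set_index("A")

    def time_merge(self, on_index):
        if on_index:
            self.left.merge(self.right, left_index=True, right_index=True)
        else:
            self.left.merge(self.right, on="A")
```

Parametrizing on on_index keeps both variants in one benchmark, so a regression in either path shows up side by side.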
take
Code Sample
Problem description
I noticed that when I perform a merge on a unique category dtype column, the merge gets significantly (x4) slower if I set this column as the index. This was unexpected for me as for all other dtypes I have tried (str, int, datetime etc.) the merge is significantly faster when using the indexed column.
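A minimal sketch of the comparison described above, assuming two frames that share a unique categorical column "A". The frame size, the key values, and the other column names are hypothetical, chosen only for illustration; actual timings depend heavily on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 20_000  # hypothetical size, for illustration only

# Unique string keys converted to a category dtype.
keys = pd.Series([f"key_{i}" for i in range(n)], dtype="category")
# Same keys in a different order, so the two indexes are not equal.
shuffled = keys.sample(frac=1, random_state=0).reset_index(drop=True)

left = pd.DataFrame({"A": keys, "B": np.arange(n)})
right = pd.DataFrame({"A": shuffled, "C": np.arange(n)})

# Same data, with the categorical column moved into the index.
left_idx = left.set_index("A")
right_idx = right.set_index("A")

on_column = left.merge(right, on="A")
on_index = left_idx.merge(right_idx, left_index=True, right_index=True)

t_column = timeit.timeit(lambda: left.merge(right, on="A"), number=3)
t_index = timeit.timeit(
    lambda: left_idx.merge(right_idx, left_index=True, right_index=True),
    number=3,
)
print(f"merge on categorical column: {t_column:.3f}s")
print(f"merge on categorical index:  {t_index:.3f}s")
```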
I investigated this a bit further. Even though the category column "A" in the above example is unique (no repeated categories), the merge method still calls the pandas.core.indexes.category.CategoricalIndex.get_indexer_non_unique method. This is because the check self.is_unique and self.equals(target) evaluates to False (pandas/pandas/core/indexes/category.py, line 662 in 67ee16a), since self.equals(target) is False (pandas/pandas/core/indexes/category.py, line 309 in 67ee16a).
Additionally, I am not sure why get_indexer_non_unique for a categorical index should be significantly slower than a merge on a categorical column.
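The two conditions in that check can be inspected directly on a small pair of indexes. This is only a sketch using the public CategoricalIndex methods; the index values are made up, and it does not reproduce the internal code path that merge takes.

```python
import pandas as pd

# Two unique categorical indexes with the same categories in a different order.
left = pd.CategoricalIndex(list("abc"))
right = pd.CategoricalIndex(list("cab"))

# The first half of the check holds: the index has no repeated values.
unique = left.is_unique
# The second half fails, because the values are ordered differently.
same = left.equals(right)
print(unique, same)

# For a unique index, get_indexer gives the position of each target value.
indexer = left.get_indexer(right)

# The non-unique variant returns the positions plus an array of the
# target positions that had no match.
indexer_nu, missing = left.get_indexer_non_unique(right)
print(indexer, indexer_nu, missing)
```

For a unique index both calls yield the same positions, which is why falling back to the non-unique variant changes the cost rather than the result.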
Let me know if you have any other insights into why the merge would be slower for a category type when using the index, or if I can provide any additional information / assistance. Thanks for your help!
Output of pd.show_versions()
pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None