Skip to content

Performance of merge for categorical index vs category column #30513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
timhunderwood opened this issue Dec 27, 2019 · 6 comments · Fixed by #45169
Closed

Performance of merge for categorical index vs category column #30513

timhunderwood opened this issue Dec 27, 2019 · 6 comments · Fixed by #45169
Assignees
Labels
Benchmark Performance (ASV) benchmarks Categorical Categorical Data Type good first issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@timhunderwood
Copy link
Contributor

timhunderwood commented Dec 27, 2019

Code Sample

import pandas
import numpy
import time
numpy.random.seed(11)
N=1_000_000

### 1. Merging on category index (3.85 seconds -slower)
# Create two (random order) dataframes with a column A with unique category dtype set as index
array_1 = numpy.arange(N)
numpy.random.shuffle(array_1)
df_1 = pandas.DataFrame(array_1, columns=["A"])
df_1["B"] = df_1["A"] * 2
df_1["A"] = df_1["A"].astype("category")
df_1 = df_1.set_index("A")

array_2 = array_1.copy()
numpy.random.shuffle(array_2)
df_2 = pandas.DataFrame(array_2, columns=["A"])
df_2["B"] = df_2["A"] * 2
df_2["A"] = df_2["A"].astype("category")
df_2 = df_2.set_index("A")

# merge two dataframes and time
start = time.time()
df_1.merge(df_2, how="left", left_index=True, right_index=True)
print(time.time()-start)

### 2. Merging on category column (0.85 seconds - faster)
# Create two (random order) dataframes with a column A with unique category dtype
array_1 = numpy.arange(N)
numpy.random.shuffle(array_1)
df_1 = pandas.DataFrame(array_1, columns=["A"])
df_1["B"] = df_1["A"] * 2
df_1["A"] = df_1["A"].astype("category")

array_2 = array_1.copy()
numpy.random.shuffle(array_2)
df_2 = pandas.DataFrame(array_2, columns=["A"])
df_2["B"] = df_2["A"] * 2
df_2["A"] = df_2["A"].astype("category")

# merge two dataframes and time
start = time.time()
df_1.merge(df_2, how="left", left_on="A", right_on="A")
print(time.time()-start)

Problem description

I noticed that when I perform a merge on a unique category dtype column, the merge gets significantly (x4) slower if I set this column as the index. This was unexpected for me as for all other dtypes I have tried (str, int, datetime etc.) the merge is significantly faster when using the indexed column.

I investigated this a bit further. Even though the category column "A" in the above example is unique (no repeated categories), the merge method still calls the pandas.core.indexes.category.CategoricalIndex.get_indexer_non_unique method.

This is because the check self.is_unique and self.equals(target) evaluates to False on

if self.is_unique and self.equals(target):
. This is because self.equals(target) is False
def equals(self, other):
. I am not sure if this should be evaluating as True or False? Should two categorical indexes be equal if they have the exact same categories in different orders?

Additionally, I am not sure why get_index_non_unique for a categorical index should be significantly slower than a merge on a categorical column.

Let me know if you have any other insights into why the merge would be slower for a category type when using the index, or if I can provide any additional information / assistance. Thanks for your help!

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.5.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

pls check this on master

PR to patch welcome

@timhunderwood
Copy link
Contributor Author

Thanks, I have just checked on master and the issue still exists:

0.68 seconds to merge on a categorical column vs 3.1 seconds to merge when that column is set as an index.

@jbrockmendel jbrockmendel added Categorical Categorical Data Type Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 7, 2020
@mroeschke
Copy link
Member

This looks a bit better on master. Could use a benchmark

   ...: df_2["B"] = df_2["A"] * 2
   ...: df_2["A"] = df_2["A"].astype("category")
   ...: df_2 = df_2.set_index("A")
   ...:
   ...: # merge two dataframes and time
   ...: start = time.time()
   ...: df_1.merge(df_2, how="left", left_index=True, right_index=True)
   ...: print(time.time()-start)
   ...:
   ...: ### 2. Merging on category column (0.85 seconds - faster)
   ...: # Create two (random order) dataframes with a column A with unique category dtype
   ...: array_1 = numpy.arange(N)
   ...: numpy.random.shuffle(array_1)
   ...: df_1 = pandas.DataFrame(array_1, columns=["A"])
   ...: df_1["B"] = df_1["A"] * 2
   ...: df_1["A"] = df_1["A"].astype("category")
   ...:
   ...: array_2 = array_1.copy()
   ...: numpy.random.shuffle(array_2)
   ...: df_2 = pandas.DataFrame(array_2, columns=["A"])
   ...: df_2["B"] = df_2["A"] * 2
   ...: df_2["A"] = df_2["A"].astype("category")
   ...:
   ...: # merge two dataframes and time
   ...: start = time.time()
   ...: df_1.merge(df_2, how="left", left_on="A", right_on="A")
   ...: print(time.time()-start)
0.07997989654541016
0.33429503440856934

@mroeschke mroeschke added Benchmark Performance (ASV) benchmarks good first issue and removed Performance Memory or execution speed performance labels Jul 25, 2021
@timhunderwood
Copy link
Contributor Author

I can confirm, this improved between version 1.2.5 and 1.3.0 (>x20 speed improvement for Merging on category index example above).

@jreback
Copy link
Contributor

jreback commented Jul 25, 2021

ok let's see if we have an asv which covers this case and can close

@lukemanley
Copy link
Member

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark Performance (ASV) benchmarks Categorical Categorical Data Type good first issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants