@ueshin
Collaborator

@ueshin ueshin commented Apr 30, 2020

pandas' DataFrame.reset_index() raises an error if the index name is the same as one of the column names, but allows it when drop=True.

>>> import pandas as pd
>>> import numpy as np
>>> pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=np.random.rand(3))
>>> pdf.index.name = "a"
>>> pdf.reset_index()
Traceback (most recent call last):
...
ValueError: cannot insert a, already exists
>>> pdf.reset_index(drop=True)
   a  b
0  1  4
1  2  5
2  3  6

whereas Koalas raises a different error in both cases:

>>> import databricks.koalas as ks
>>> ks.from_pandas(pdf).reset_index()
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: "Reference 'a' is ambiguous, could be: a, a.;"

>>> ks.from_pandas(pdf).reset_index(drop=True)
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: "Reference 'a' is ambiguous, could be: a, a.;"
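
The pandas rule described above can be sketched in plain Python, which may help when checking expected behavior. This is only an illustration under stated assumptions: the helper name columns_after_reset_index is hypothetical, it is not part of pandas or Koalas, it is not the change made in this PR, and it assumes every index level has a name.

```python
from typing import List, Sequence


def columns_after_reset_index(
    index_names: Sequence[str],
    columns: Sequence[str],
    drop: bool = False,
) -> List[str]:
    """Return the column labels a pandas-like reset_index would produce.

    Hypothetical helper for illustration only; assumes all index levels
    are named. Not part of pandas or Koalas.
    """
    if drop:
        # The index is discarded, so its name cannot clash with a column.
        return list(columns)

    result = list(columns)
    # Insert levels at the front, last level first, so the original
    # level order is kept in the resulting column list.
    for name in reversed(index_names):
        if name in result:
            # Same wording as the pandas error quoted above.
            raise ValueError("cannot insert {}, already exists".format(name))
        result.insert(0, name)
    return result
```

With the example from the description, columns_after_reset_index(["a"], ["a", "b"]) raises ValueError("cannot insert a, already exists"), while columns_after_reset_index(["a"], ["a", "b"], drop=True) returns ["a", "b"].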

@ueshin ueshin requested a review from HyukjinKwon April 30, 2020 00:57
@itholic
Contributor

itholic commented May 1, 2020

LGTM.

@ueshin
Collaborator Author

ueshin commented May 1, 2020

Thanks! merging.

@ueshin ueshin merged commit 414a6fb into databricks:master May 1, 2020
@ueshin ueshin deleted the reset_index branch May 1, 2020 17:34