pd.concat with copy=True doesn't copy columns index #29879

Closed
mwtoews opened this issue Nov 27, 2019 · 4 comments · Fixed by #31119
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@mwtoews
Contributor

mwtoews commented Nov 27, 2019

Code Sample

import pandas as pd
print(pd.__version__)  # '0.25.3'

# simple frame with a name attribute for columns
df = pd.DataFrame({'a': [1.1], 'b': [2.2]})
df.columns.name = 'x'
print(df)
# x    a    b
# 0  1.1  2.2

# tile rows to a new frame, apparently using the copy of the first frame
df2 = pd.concat([df] * 2, copy=True)
print(df2)
# x    a    b
# 0  1.1  2.2
# 0  1.1  2.2

# clear the column name for the first frame
df.columns.name = None

# and observe the name for the second
assert df2.columns.name is None
print(df2)
#      a    b
# 0  1.1  2.2
# 0  1.1  2.2

# aha, it's the same object
assert df.columns is df2.columns

Problem description

The expectation is that copy=True makes a copy of the columns index. However, the demonstration shows that df2.columns is the same object as df.columns, so a modification such as changing the name property on the source frame also changes the output frame created by pd.concat.

Expected Output

It is expected that df2.columns is a copy of the original df.columns. This would prevent (e.g.) changing the name property for both frames.

A workaround is to manually copy this property after pd.concat:

df2 = pd.concat([df] * 2, copy=True)
df2.columns = df2.columns.copy()
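
The workaround above covers only the columns. As a rough sketch, it could be wrapped in a small helper that re-copies both axes after concatenating. `concat_with_fresh_axes` is a hypothetical name used for illustration, not a pandas API.

```python
import pandas as pd


def concat_with_fresh_axes(frames, **kwargs):
    """Hypothetical helper: concat, then give the result its own Index objects."""
    result = pd.concat(frames, **kwargs)
    # Index.copy() returns a new Index object; the source frames are not modified
    result.index = result.index.copy()
    result.columns = result.columns.copy()
    return result


df = pd.DataFrame({'a': [1.1], 'b': [2.2]})
df.columns.name = 'x'

df2 = concat_with_fresh_axes([df] * 2)

# clearing the name on the source no longer affects the result
df.columns.name = None
print(df2.columns.name)
```

Extra keyword arguments such as axis=1 are passed straight through to pd.concat.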

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.1.post20191125
Cython : 0.29.14
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6

@jschendel
Member

Another example of this that hits a slightly different code path:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+958.g545d17529'

In [2]: df = pd.Series(list('abc'), name='foo').to_frame()

In [3]: df2 = pd.Series(list('xyz'), name='foo').to_frame()

In [4]: df.columns is df2.columns  # verifying not the same
Out[4]: False

In [5]: df3 = pd.concat([df, df2], copy=True)

In [6]: df.columns is df3.columns  # shouldn't be the same but are
Out[6]: True

In [7]: df2.columns is df3.columns  # this is okay
Out[7]: False

jschendel added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 27, 2019
jschendel added this to the Contributions Welcome milestone Nov 27, 2019
@jreback
Contributor

jreback commented Nov 27, 2019

This analysis is not correct. Indexes are immutable, thus they are shared; concat with copy=True does not guarantee a copy of an Index, as it's not necessary.

I would expect this

In [6]: df.columns is df3.columns  # shouldn't be the same but are
Out[6]: True

@mwtoews
Contributor Author

mwtoews commented Nov 27, 2019

Further testing shows that this issue is more general and not limited to columns. For instance, with the other axis:

df = pd.DataFrame({'a': [1.1], 'b': [2.2]})
df3 = pd.concat([df] * 2, copy=True, axis=1)
print(df3)
#      a    b    a    b
# 0  1.1  2.2  1.1  2.2
print(df3.index is df.index)  # True
print(df3.columns is df.columns)  # False
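
To check both axes at once on a given pandas version, a small diagnostic along these lines may help. `shared_axes` is a hypothetical helper, not part of pandas, and no expected output is listed because the result depends on the installed version.

```python
import pandas as pd


def shared_axes(source, result):
    """Hypothetical helper: report which axes of `result` are the very same
    Index objects as those of `source` (identity, not equality)."""
    return {
        "index": result.index is source.index,
        "columns": result.columns is source.columns,
    }


df = pd.DataFrame({'a': [1.1], 'b': [2.2]})

# copy=True was the default in 0.25.3, so the keyword is omitted here
for axis in (0, 1):
    print(axis, shared_axes(df, pd.concat([df] * 2, axis=axis)))
```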

@jorisvandenbossche
Member

Indexes are immutable, thus they are shared,

Shouldn't we share the underlying data, but not the actual Index object (i.e. make a new Index object from the same data)?
Only the data is considered immutable, not the full object (e.g. the name can change).
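
The distinction drawn here, a new Index object over the same underlying data, can be illustrated with `Index.copy()`, whose default is `deep=False`. This is only a sketch of the idea and not necessarily how pandas addressed the issue; the integer index and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

original = pd.Index([10, 20, 30], name='x')

# deep=False is the default: a new Index object wrapping the same data
shallow = original.copy()

# renaming the copy leaves the original's name untouched
shallow.name = 'y'

print(shallow is original)            # identity of the two Index objects
print(original.name, shallow.name)    # names are tracked per object
print(np.shares_memory(original.to_numpy(), shallow.to_numpy()))
```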

jreback modified the milestones: Contributions Welcome, 1.1 Jan 20, 2020