Skip to content

API / CoW: constructing DataFrame from DataFrame creates lazy copy #50499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from
4 changes: 4 additions & 0 deletions doc/source/whatsnew/v2.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -951,6 +951,10 @@ Indexing
- Bug in :class:`BusinessHour` would cause creation of :class:`DatetimeIndex` to fail when no opening hour was included in the index (:issue:`49835`)
-

Copy on write
^^^^^^^^^^^^^
- Bug in :class:`DataFrame` constructor not tracking reference if called with another :class:`DataFrame` (:issue:`50499`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the equivalent Series change I listed this in the "Copy-on-Write improvements" section instead of seeing it as a bug fix (it was a known gap left out of the initial implementation, and it also changes behaviour, so at this stage I wouldn't yet label it as a bug fix)


Missing
^^^^^^^
- Bug in :meth:`Index.equals` raising ``TypeError`` when :class:`Index` consists of tuples that contain ``NA`` (:issue:`48446`)
Expand Down
3 changes: 3 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -205,6 +205,7 @@
to_arrays,
treat_as_nested,
)
from pandas.core.internals.managers import using_copy_on_write
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import can now be removed (after a merge with main)

from pandas.core.reshape.melt import melt
from pandas.core.series import Series
from pandas.core.shared_docs import _shared_docs
Expand Down Expand Up @@ -637,6 +638,8 @@ def __init__(

if isinstance(data, DataFrame):
data = data._mgr
if not copy and using_copy_on_write():
data = data.copy(deep=False)

if isinstance(data, (BlockManager, ArrayManager)):
# first check if a Manager is passed without any other arguments
Expand Down
7 changes: 6 additions & 1 deletion pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -5257,7 +5257,12 @@ def _reindex_with_indexers(
# If we've made a copy once, no need to make another one
copy = False

if (copy or copy is None) and new_data is self._mgr:
if (
(copy or copy is None)
and new_data is self._mgr
or not copy
and using_copy_on_write()
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain a bit how this change is related? We are passing here a Manager object to the constructor, I suppose, and you only changed the constructor when it receives a DataFrame?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, this is unrelated to the constructor change, I'll make a separate pr.

Currently, when reindexing with copy=False we don't track references, causing problems.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have some skips in #50536 related to reindex, there are several things not yet fully working it seems

new_data = new_data.copy(deep=copy)

return self._constructor(new_data).__finalize__(self)
Expand Down
24 changes: 24 additions & 0 deletions pandas/tests/copy_view/test_constructors.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
import numpy as np
import pytest

from pandas import DataFrame
import pandas._testing as tm
from pandas.tests.copy_view.util import get_array


@pytest.mark.parametrize("columns", [None, ["a"]])
def test_dataframe_constructor_mgr(using_copy_on_write, columns):
df = DataFrame({"a": [1, 2, 3]})
df_orig = df.copy()

new_df = DataFrame(df)

assert np.shares_memory(get_array(df, "a"), get_array(new_df, "a"))
new_df.iloc[0] = 100

if using_copy_on_write:
assert not np.shares_memory(get_array(df, "a"), get_array(new_df, "a"))
tm.assert_frame_equal(df, df_orig)
else:
assert np.shares_memory(get_array(df, "a"), get_array(new_df, "a"))
tm.assert_frame_equal(df, new_df)
6 changes: 5 additions & 1 deletion pandas/tests/frame/methods/test_align.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
date_range,
)
import pandas._testing as tm
from pandas.core.internals.managers import using_copy_on_write


class TestDataFrameAlign:
Expand Down Expand Up @@ -45,7 +46,10 @@ def test_align_float(self, float_frame):
assert af._mgr is not float_frame._mgr

af, bf = float_frame.align(float_frame, copy=False)
assert af._mgr is float_frame._mgr
if using_copy_on_write():
assert af._mgr is not float_frame._mgr
else:
assert af._mgr is float_frame._mgr

# axis = 0
other = float_frame.iloc[:-5, :3]
Expand Down
6 changes: 4 additions & 2 deletions pandas/tests/indexing/test_iloc.py
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,9 @@ class TestiLocBaseIndependent:
],
)
@pytest.mark.parametrize("indexer", [tm.loc, tm.iloc])
def test_iloc_setitem_fullcol_categorical(self, indexer, key, using_array_manager):
def test_iloc_setitem_fullcol_categorical(
self, indexer, key, using_array_manager, using_copy_on_write
):
frame = DataFrame({0: range(3)}, dtype=object)

cat = Categorical(["alpha", "beta", "gamma"])
Expand All @@ -90,7 +92,7 @@ def test_iloc_setitem_fullcol_categorical(self, indexer, key, using_array_manage
indexer(df)[key, 0] = cat

expected = DataFrame({0: cat}).astype(object)
if not using_array_manager:
if not using_array_manager and not using_copy_on_write:
assert np.shares_memory(df[0].values, orig_vals)

tm.assert_frame_equal(df, expected)
Expand Down