BUG: Fix initialization of DataFrame from dict with NaN as key #18600

Merged: 3 commits, Apr 1, 2018
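
In short, the PR makes constructions like the following behave (a minimal sketch based on the tests added below; the exact frames are illustrative):

import numpy as np
import pandas as pd

# A dict with NaN as a key: the NaN-keyed entry previously got lost or
# misaligned during DataFrame construction (GH 18455).
data = {1: pd.Series([0, 3], index=['a', np.nan]),
        np.nan: pd.Series([1, 4], index=['a', np.nan]),
        3: pd.Series([2, 5], index=['a', np.nan])}
df = pd.DataFrame(data)  # NaN is now kept as a real column label

# A scalar value whose key is not in the passed columns: previously raised
# "If using all scalar values, you must pass an index" (GH 18600).
empty = pd.DataFrame({'a': 0.7}, columns=['b'])  # now an empty frame with column 'b'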
18 changes: 12 additions & 6 deletions asv_bench/benchmarks/frame_ctor.py
@@ -16,11 +16,11 @@ class FromDicts(object):

     def setup(self):
         N, K = 5000, 50
-        index = tm.makeStringIndex(N)
-        columns = tm.makeStringIndex(K)
-        frame = DataFrame(np.random.randn(N, K), index=index, columns=columns)
+        self.index = tm.makeStringIndex(N)
+        self.columns = tm.makeStringIndex(K)
+        frame = DataFrame(np.random.randn(N, K), index=self.index,
+                          columns=self.columns)
         self.data = frame.to_dict()
-        self.some_dict = list(self.data.values())[0]
         self.dict_list = frame.to_dict(orient='records')
         self.data2 = {i: {j: float(j) for j in range(100)}
                       for i in range(2000)}
@@ -31,8 +31,14 @@ def time_list_of_dict(self):
     def time_nested_dict(self):
         DataFrame(self.data)
 
-    def time_dict(self):
-        Series(self.some_dict)
+    def time_nested_dict_index(self):
+        DataFrame(self.data, index=self.index)
+
+    def time_nested_dict_columns(self):
+        DataFrame(self.data, columns=self.columns)
+
+    def time_nested_dict_index_columns(self):
+        DataFrame(self.data, index=self.index, columns=self.columns)
 
     def time_nested_dict_int64(self):
         # nested dict, integer indexes, regression described in #621
4 changes: 4 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -1135,6 +1135,10 @@ Reshaping
 - Bug in :func:`DataFrame.unstack` which casts int to float if ``columns`` is a ``MultiIndex`` with unused levels (:issue:`17845`)
 - Bug in :func:`DataFrame.unstack` which raises an error if ``index`` is a ``MultiIndex`` with unused labels on the unstacked level (:issue:`18562`)
 - Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
+- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)
+- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed ``columns`` (:issue:`18600`)
+
+- Fixed (changed from ``object`` to ``float64``) dtype of :class:`DataFrame` initialized with axes, no data, and ``dtype=int`` (:issue:`19646`)
 - Bug in :func:`Series.rank` where ``Series`` containing ``NaT`` modifies the ``Series`` inplace (:issue:`18521`)
 - Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`)
 - Bug in :func:`DataFrame.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
50 changes: 16 additions & 34 deletions pandas/core/frame.py
@@ -27,6 +27,7 @@
 from pandas.core.dtypes.cast import (
     maybe_upcast,
     cast_scalar_to_array,
+    construct_1d_arraylike_from_scalar,
     maybe_cast_to_datetime,
     maybe_infer_to_datetimelike,
     maybe_convert_platform,
@@ -429,44 +430,27 @@ def _init_dict(self, data, index, columns, dtype=None):
         Needs to handle a lot of exceptional cases.
         """
         if columns is not None:
-            columns = _ensure_index(columns)
+            arrays = Series(data, index=columns, dtype=object)
Contributor:

do we have an asv that actually hits this path here, e.g. not-none columns and a dict as input? I am concerned that this Series conversion to object is going to cause issues (and an asv or 2 will determine this)

Member Author:

Added some, see below

+            data_names = arrays.index
Contributor:

this will be a perf issue

Member Author:

Maybe... but right now it seems to be worse...

     [d163de70]       [f7447b3f]
-      47.9±0.3ms       43.5±0.4ms     0.91  frame_ctor.FromDicts.time_frame_ctor_nested_dict
-      31.0±0.1ms       28.1±0.3ms     0.91  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BusinessDay', 2)
-        31.3±1ms       28.2±0.2ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
-      31.8±0.3ms       28.0±0.4ms     0.88  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessDay', 2)
-      32.8±0.3ms       28.2±0.2ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Day', 1)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Member Author:

There does seem to be a performance loss on very small dfs. E.g. for pd.DataFrame(data) with data = {1: [2], 3: [4], 5: [6]} I get results around 530 µs per loop before and 570 µs after. So we are talking about a ~10% gain on large dfs vs. a ~7.5% loss on small dfs.

... or I can avoid that Series and sort manually, at the cost of a bit of added complexity, probably ~10 LoCs.
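
A hypothetical way to reproduce that micro-measurement (the script is illustrative, not part of the PR; run it on each commit and compare):

import timeit

setup = "import pandas as pd; data = {1: [2], 3: [4], 5: [6]}"
n = 10000
# Average construction time for a very small DataFrame, in microseconds.
per_loop = timeit.timeit("pd.DataFrame(data)", setup=setup, number=n) / n
print("{:.1f} us per loop".format(per_loop * 1e6))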

Member Author:

uhm... those asv results also seem pretty unstable:
      before           after         ratio
     [d163de70]       [f7447b3f]
+      30.7±0.2ms         41.1±3ms     1.34  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('QuarterBegin', 1)
+      29.4±0.1ms         39.1±4ms     1.33  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CBMonthBegin', 2)
+      30.7±0.7ms         40.4±3ms     1.32  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
+      31.1±0.7ms         39.1±5ms     1.26  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('SemiMonthEnd', 2)
+      30.4±0.1ms         38.0±4ms     1.25  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Hour', 2)
-        33.8±1ms       30.4±0.8ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Micro', 1)
-      48.3±0.6ms       42.6±0.8ms     0.88  frame_ctor.FromDicts.time_frame_ctor_nested_dict
-        23.8±1ms       20.5±0.2ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253Quarter_1', 2)
-      41.2±0.7ms         30.4±2ms     0.74  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessHour', 2)
-      8.35±0.9ms      6.05±0.01ms     0.72  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253_2', 2)
-        42.4±2ms       30.5±0.5ms     0.72  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BMonthEnd', 2)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I'll try to sort manually and see how it goes.

Contributor:

i actually doubt we have good benchmarks on this; you are measuring the same benchmark here

we need benchmarks that construct with different dtypes

and reducing code complexity is paramount here (though of course don’t want to sacrifice perf)


-            # GH10856
-            # raise ValueError if only scalars in dict
+            missing = arrays.isnull()
             if index is None:
-                extract_index(list(data.values()))
-
-            # prefilter if columns passed
-            data = {k: v for k, v in compat.iteritems(data) if k in columns}
-
-            if index is None:
-                index = extract_index(list(data.values()))
+                # GH10856
+                # raise ValueError if only scalars in dict
Contributor:

do you need the .tolist()?

Member Author:

(removed)

+                index = extract_index(arrays[~missing])
             else:
                 index = _ensure_index(index)

-            arrays = []
-            data_names = []
-            for k in columns:
-                if k not in data:
-                    # no obvious "empty" int column
-                    if dtype is not None and issubclass(dtype.type,
-                                                        np.integer):
-                        continue
-
-                    if dtype is None:
-                        # 1783
-                        v = np.empty(len(index), dtype=object)
-                    elif np.issubdtype(dtype, np.flexible):
-                        v = np.empty(len(index), dtype=object)
-                    else:
-                        v = np.empty(len(index), dtype=dtype)
-
-                    v.fill(np.nan)
-                else:
-                    v = data[k]
-                data_names.append(k)
-                arrays.append(v)
+            # no obvious "empty" int column
+            if missing.any() and not is_integer_dtype(dtype):
+                if dtype is None or np.issubdtype(dtype, np.flexible):
Contributor:

why is the flexible needed here? is this actually hit by a test?

Contributor:

i would appreciate an actual explanation. we do not check for this dtype anywhere else in the codebase. so at the very least this needs a comment

Member Author:

Sure, I would also appreciate an explanation (on that code @ajcr wrote and you committed).
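
For what it's worth, np.flexible is the abstract parent of NumPy's fixed-width string, bytes and void dtypes, for which a NaN fill would be lossy; hence the fallback to object. A quick illustration (standard NumPy behavior, not code from this PR):

import numpy as np

print(np.issubdtype(np.dtype('U5'), np.flexible))   # True: fixed-width str
print(np.issubdtype(np.dtype('S10'), np.flexible))  # True: fixed-width bytes
print(np.issubdtype(np.dtype('V8'), np.flexible))   # True: void / raw records
print(np.issubdtype(np.float64, np.flexible))       # False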

+                    # 1783
+                    nan_dtype = object
+                else:
+                    nan_dtype = dtype
+                v = construct_1d_arraylike_from_scalar(np.nan, len(index),
+                                                       nan_dtype)
+                arrays.loc[missing] = [v] * missing.sum()

         else:
             keys = com._dict_keys_to_ordered_list(data)
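
The heart of the new code path is using a Series with object dtype as an ordered, NaN-key-tolerant mapping from column labels to column values; a standalone sketch of the idea (hypothetical data, simplified from the PR):

import numpy as np
import pandas as pd

data = {'a': [1, 2], np.nan: [3, 4]}      # dict with NaN as a key
columns = pd.Index(['a', np.nan, 'c'])    # requested columns, one absent

# Aligning through a Series keeps NaN keys intact and preserves the
# requested column order; absent columns become missing entries that
# are later filled with NaN arrays.
arrays = pd.Series(data, index=columns, dtype=object)
missing = arrays.isnull()
print(list(arrays.index[missing]))        # ['c'] -> gets an all-NaN column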
@@ -7253,8 +7237,6 @@ def _arrays_to_mgr(arrays, arr_names, index, columns, dtype=None):
     # figure out the index, if necessary
     if index is None:
         index = extract_index(arrays)
-    else:
-        index = _ensure_index(index)
 
     # don't force copy because getting jammed in an ndarray anyway
     arrays = _homogenize(arrays, index, dtype)
1 change: 0 additions & 1 deletion pandas/core/generic.py
@@ -7341,7 +7341,6 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
                 if not is_bool_dtype(dt):
                     raise ValueError(msg.format(dtype=dt))
 
-        cond = cond.astype(bool, copy=False)
Contributor:

what caused you to change this?

Member Author:

It's useless (bool dtype is checked just above)... but it's admittedly unrelated to the rest of the PR (it just came out of debugging it).

         cond = -cond if inplace else cond
 
         # try to align with other
2 changes: 1 addition & 1 deletion pandas/core/internals.py
@@ -4841,7 +4841,7 @@ def form_blocks(arrays, names, axes):
     items_dict = defaultdict(list)
     extra_locs = []
 
-    names_idx = Index(names)
+    names_idx = _ensure_index(names)
     if names_idx.equals(axes[0]):
         names_indexer = np.arange(len(names_idx))
     else:
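The point of the swap: _ensure_index returns its argument unchanged when it is already an Index, skipping the re-wrapping (and possible inference) that Index(names) performs. A rough illustration (the import path is the 0.23-era internal location and may differ in other versions):

import pandas as pd
from pandas.core.indexes.base import _ensure_index

idx = pd.Index(['a', 'b', 'c'])
assert _ensure_index(idx) is idx                # already an Index: returned as-is
assert _ensure_index(['a', 'b']).equals(pd.Index(['a', 'b']))  # lists get wrapped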
13 changes: 11 additions & 2 deletions pandas/core/series.py
@@ -24,6 +24,7 @@
     is_extension_array_dtype,
     is_datetime64tz_dtype,
     is_timedelta64_dtype,
+    is_object_dtype,
     is_list_like,
     is_hashable,
     is_iterator,
@@ -38,7 +39,8 @@
     maybe_upcast, infer_dtype_from_scalar,
     maybe_convert_platform,
     maybe_cast_to_datetime, maybe_castable,
-    construct_1d_arraylike_from_scalar)
+    construct_1d_arraylike_from_scalar,
+    construct_1d_object_array_from_listlike)
 from pandas.core.dtypes.missing import isna, notna, remove_na_arraylike
 
 from pandas.core.index import (Index, MultiIndex, InvalidIndexError,
@@ -297,6 +299,7 @@ def _init_dict(self, data, index=None, dtype=None):
         # raises KeyError), so we iterate the entire dict, and align
         if data:
             keys, values = zip(*compat.iteritems(data))
+            values = list(values)
         else:
             keys, values = [], []

@@ -4042,7 +4045,13 @@ def _try_cast(arr, take_fast_path):

        try:
            subarr = maybe_cast_to_datetime(arr, dtype)
-            if not is_extension_type(subarr):
+            # Take care in creating object arrays (but iterators are not
+            # supported):
+            if is_object_dtype(dtype) and (is_list_like(subarr) and
Contributor:

this is pretty hard to read, but ok for now, see if can simplify in the future

Member Author:

Yes, for sure we will need some unified mechanism to process iterators

+                                           not (is_iterator(subarr) or
+                                                isinstance(subarr, np.ndarray))):
+                subarr = construct_1d_object_array_from_listlike(subarr)
+            elif not is_extension_type(subarr):
                 subarr = np.array(subarr, dtype=dtype, copy=copy)
         except (ValueError, TypeError):
             if is_categorical_dtype(dtype):
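The dedicated helper matters because np.array guesses dimensionality for nested list-likes, while preallocating a 1-D object array and assigning into it does not. A small sketch of the pitfall (standard NumPy semantics; the helper below is a simplified stand-in for the pandas internal):

import numpy as np

def construct_1d_object_array(values):
    # Always produce a 1-D object array with one slot per item of `values`,
    # regardless of whether the items are themselves list-like.
    result = np.empty(len(values), dtype=object)
    result[:] = values
    return result

pairs = [(1, 2), (3, 4)]
print(np.array(pairs, dtype=object).shape)     # (2, 2): inferred as 2-D
print(construct_1d_object_array(pairs).shape)  # (2,): one tuple per element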
50 changes: 46 additions & 4 deletions pandas/tests/frame/test_constructors.py
@@ -287,8 +287,50 @@ def test_constructor_dict(self):
         with tm.assert_raises_regex(ValueError, msg):
             DataFrame({'a': 0.7}, columns=['a'])
 
-        with tm.assert_raises_regex(ValueError, msg):
Contributor:

make a separate test (this change), with a comment

Member Author:

(done)

-            DataFrame({'a': 0.7}, columns=['b'])
+    @pytest.mark.parametrize("scalar", [2, np.nan, None, 'D'])
+    def test_constructor_invalid_items_unused(self, scalar):
+        # No error if invalid (scalar) value is in fact not used:
+        result = DataFrame({'a': scalar}, columns=['b'])
+        expected = DataFrame(columns=['b'])
+        tm.assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize("value", [2, np.nan, None, float('nan')])
+    def test_constructor_dict_nan_key(self, value):
+        # GH 18455
+        cols = [1, value, 3]
+        idx = ['a', value]
+        values = [[0, 3], [1, 4], [2, 5]]
+        data = {cols[c]: Series(values[c], index=idx) for c in range(3)}
+        result = DataFrame(data).sort_values(1).sort_values('a', axis=1)
+        expected = DataFrame(np.arange(6, dtype='int64').reshape(2, 3),
+                             index=idx, columns=cols)
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(data, index=idx).sort_values('a', axis=1)
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(data, index=idx, columns=cols)
+        tm.assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize("value", [np.nan, None, float('nan')])
+    def test_constructor_dict_nan_tuple_key(self, value):
+        # GH 18455
+        cols = Index([(11, 21), (value, 22), (13, value)])
+        idx = Index([('a', value), (value, 2)])
+        values = [[0, 3], [1, 4], [2, 5]]
+        data = {cols[c]: Series(values[c], index=idx) for c in range(3)}
+        result = (DataFrame(data)
+                  .sort_values((11, 21))
+                  .sort_values(('a', value), axis=1))
+        expected = DataFrame(np.arange(6, dtype='int64').reshape(2, 3),
+                             index=idx, columns=cols)
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(data, index=idx).sort_values(('a', value), axis=1)
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(data, index=idx, columns=cols)
+        tm.assert_frame_equal(result, expected)
+
     @pytest.mark.skipif(not PY36, reason='Insertion order for Python>=3.6')
     def test_constructor_dict_order_insertion(self):
@@ -753,15 +795,15 @@ def test_constructor_corner(self):

         # does not error but ends up float
         df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=int)
-        assert df.values.dtype == np.object_
+        assert df.values.dtype == np.dtype('float64')
Contributor:

why is this changing?

Member Author:

Because it was wrong: an int should not upcast to object (the passed dtype is currently not considered). issue? whatsnew?

Contributor:

hmm, yeah this looks suspect. I would make a new issue for this

         # #1783 empty dtype object
         df = DataFrame({}, columns=['foo', 'bar'])
         assert df.values.dtype == np.object_
 
         df = DataFrame({'b': 1}, index=lrange(10), columns=list('abc'),
                        dtype=int)
-        assert df.values.dtype == np.object_
+        assert df.values.dtype == np.dtype('float64')

     def test_constructor_scalar_inference(self):
         data = {'int': 1, 'bool': True,
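The float64 expectation follows from NaN having no integer representation: an all-missing frame requested with dtype=int must upcast, and float64 (not object) is the natural target. An illustration of the new behavior (mirrors the test above; not code from the PR):

import pandas as pd

# Axes but no data, dtype=int: the frame is filled with NaN, which cannot
# live in an integer array, so the values upcast to float64 rather than
# falling back to object as before.
df = pd.DataFrame(index=range(10), columns=['a', 'b'], dtype=int)
print(df.values.dtype)  # float64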
8 changes: 1 addition & 7 deletions pandas/tests/frame/test_reshape.py
@@ -603,13 +603,7 @@ def test_unstack_unused_levels(self):
         cols = pd.MultiIndex.from_product([[0, 1], col_level])
         expected = pd.DataFrame(exp_data.reshape(3, 6),
                                 index=idx_level, columns=cols)
-        # Broken (GH 18455):
-        # tm.assert_frame_equal(result, expected)
-        diff = result - expected
-        assert(diff.sum().sum() == 0)
-        assert((diff + 1).sum().sum() == 8)
-
-        assert((result.columns.levels[1] == idx.levels[level]).all())
+        tm.assert_frame_equal(result, expected)
 
     @pytest.mark.parametrize("cols", [['A', 'C'], slice(None)])
     def test_unstack_unused_level(self, cols):
2 changes: 1 addition & 1 deletion pandas/tests/io/test_excel.py
@@ -461,7 +461,7 @@ def test_read_one_empty_col_with_header(self, ext):
         )
         expected_header_none = DataFrame(pd.Series([0], dtype='int64'))
         tm.assert_frame_equal(actual_header_none, expected_header_none)
-        expected_header_zero = DataFrame(columns=[0], dtype='int64')
+        expected_header_zero = DataFrame(columns=[0])
Contributor:

why is this changing?

Member Author:

The test was wrong and worked by accident. The result is, and should be, of object dtype; the "expected" one only matched because the passed dtype wasn't being considered (see above).

Contributor:

ok again add this as an example in a new issue

Member Author:

Again #19646

         tm.assert_frame_equal(actual_header_zero, expected_header_zero)
 
     @td.skip_if_no('openpyxl')