BUG: Fix initialization of DataFrame from dict with NaN as key #18600
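The bug being addressed (GH 18455): building a DataFrame from a dict whose keys, or whose Series index labels, include NaN could raise or misalign data. Below is a minimal sketch of the post-fix behaviour, modelled on the new tests further down; the values and labels are illustrative only.

```python
import numpy as np
import pandas as pd

# Dict keys include NaN, and so does the shared index (GH 18455).
idx = ['a', np.nan]
data = {1: pd.Series([0, 3], index=idx),
        np.nan: pd.Series([1, 4], index=idx),
        3: pd.Series([2, 5], index=idx)}

# After this fix, the NaN-keyed column is kept and aligned like any other key.
df = pd.DataFrame(data, index=idx, columns=[1, np.nan, 3])
print(df)
```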
@@ -27,6 +27,7 @@
 from pandas.core.dtypes.cast import (
     maybe_upcast,
     cast_scalar_to_array,
+    construct_1d_arraylike_from_scalar,
     maybe_cast_to_datetime,
     maybe_infer_to_datetimelike,
     maybe_convert_platform,
@@ -429,44 +430,27 @@ def _init_dict(self, data, index, columns, dtype=None):
         Needs to handle a lot of exceptional cases.
         """
         if columns is not None:
             columns = _ensure_index(columns)
+            arrays = Series(data, index=columns, dtype=object)
+            data_names = arrays.index

> this will be a perf issue

> Maybe... but right now it seems to be worse...

> There does seem to be a performance loss on very small dfs. E.g. for ... or I can avoid that

> uhm... those asv results also seem pretty unstable:

> I'll try to sort manually and see how it goes.

> i actually doubt we have good benchmarks on this; you are measuring the same benchmark here. We need benchmarks that construct with different dtypes, and reducing code complexity is paramount here (though of course we don't want to sacrifice perf).
-            # GH10856
-            # raise ValueError if only scalars in dict
-            if index is None:
-                extract_index(list(data.values()))
-
-            # prefilter if columns passed
-            data = {k: v for k, v in compat.iteritems(data) if k in columns}
-
-            if index is None:
-                index = extract_index(list(data.values()))
+            missing = arrays.isnull()
+            if index is None:
+                # GH10856
+                # raise ValueError if only scalars in dict
+                index = extract_index(arrays[~missing])

> do you need the .tolist()?

> (removed)

             else:
                 index = _ensure_index(index)
-            arrays = []
-            data_names = []
-            for k in columns:
-                if k not in data:
-                    # no obvious "empty" int column
-                    if dtype is not None and issubclass(dtype.type,
-                                                        np.integer):
-                        continue
-
-                    if dtype is None:
-                        # 1783
-                        v = np.empty(len(index), dtype=object)
-                    elif np.issubdtype(dtype, np.flexible):
-                        v = np.empty(len(index), dtype=object)
-                    else:
-                        v = np.empty(len(index), dtype=dtype)
-
-                    v.fill(np.nan)
-                else:
-                    v = data[k]
-                data_names.append(k)
-                arrays.append(v)
+            # no obvious "empty" int column
+            if missing.any() and not is_integer_dtype(dtype):
+                if dtype is None or np.issubdtype(dtype, np.flexible):

> why is the flexible needed here? is this actually hit by a test?

> i would appreciate an actual explanation. we do not check for this dtype anywhere else in the codebase. so at the very least this needs a comment

> Sure, I would also appreciate an explanation (on that code @ajcr wrote and you committed).

+                    # 1783
+                    nan_dtype = object
+                else:
+                    nan_dtype = dtype
+                v = construct_1d_arraylike_from_scalar(np.nan, len(index),
+                                                       nan_dtype)
+                arrays.loc[missing] = [v] * missing.sum()

         else:
             keys = com._dict_keys_to_ordered_list(data)
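To make the new branch concrete: every requested column that is missing from the dict is filled with a freshly built NaN array of length `len(index)`; `None` and flexible (string/bytes) dtypes fall back to `object`, while integer dtypes never reach this point because of the `is_integer_dtype` guard. A rough standalone sketch of that fill logic, not the pandas internals themselves:

```python
import numpy as np

def nan_fill_for_missing_column(n, dtype=None):
    # Mirrors the branch above: dtype=None or a flexible (string/bytes) dtype
    # cannot hold NaN sensibly, so fall back to object (the GH 1783 case).
    # Integer dtypes are excluded upstream by `not is_integer_dtype(dtype)`.
    if dtype is None or np.issubdtype(dtype, np.flexible):
        nan_dtype = np.dtype(object)
    else:
        nan_dtype = dtype
    return np.full(n, np.nan, dtype=nan_dtype)

print(nan_fill_for_missing_column(3))                       # object NaNs
print(nan_fill_for_missing_column(3, np.dtype('float64')))  # float64 NaNs
```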
@@ -7253,8 +7237,6 @@ def _arrays_to_mgr(arrays, arr_names, index, columns, dtype=None):
     # figure out the index, if necessary
     if index is None:
         index = extract_index(arrays)
-    else:
-        index = _ensure_index(index)

     # don't force copy because getting jammed in an ndarray anyway
     arrays = _homogenize(arrays, index, dtype)
@@ -7341,7 +7341,6 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
                 if not is_bool_dtype(dt):
                     raise ValueError(msg.format(dtype=dt))

-            cond = cond.astype(bool, copy=False)

> what caused you to change this?

> It's useless (bool dtype is checked just above)... but it's admittedly unrelated to the rest of the PR (it just came out debugging it).

             cond = -cond if inplace else cond

             # try to align with other
@@ -24,6 +24,7 @@
     is_extension_array_dtype,
     is_datetime64tz_dtype,
     is_timedelta64_dtype,
+    is_object_dtype,
     is_list_like,
     is_hashable,
     is_iterator,
@@ -38,7 +39,8 @@
     maybe_upcast, infer_dtype_from_scalar,
     maybe_convert_platform,
     maybe_cast_to_datetime, maybe_castable,
-    construct_1d_arraylike_from_scalar)
+    construct_1d_arraylike_from_scalar,
+    construct_1d_object_array_from_listlike)
 from pandas.core.dtypes.missing import isna, notna, remove_na_arraylike

 from pandas.core.index import (Index, MultiIndex, InvalidIndexError,
@@ -297,6 +299,7 @@ def _init_dict(self, data, index=None, dtype=None):
         # raises KeyError), so we iterate the entire dict, and align
         if data:
             keys, values = zip(*compat.iteritems(data))
+            values = list(values)
         else:
             keys, values = [], []
@@ -4042,7 +4045,13 @@ def _try_cast(arr, take_fast_path):

         try:
             subarr = maybe_cast_to_datetime(arr, dtype)
-            if not is_extension_type(subarr):
+            # Take care in creating object arrays (but iterators are not
+            # supported):
+            if is_object_dtype(dtype) and (is_list_like(subarr) and
+                                           not (is_iterator(subarr) or
+                                                isinstance(subarr, np.ndarray))):
+                subarr = construct_1d_object_array_from_listlike(subarr)
+            elif not is_extension_type(subarr):
                 subarr = np.array(subarr, dtype=dtype, copy=copy)
         except (ValueError, TypeError):
             if is_categorical_dtype(dtype):

> this is pretty hard to read, but ok for now, see if can simplify in the future

> Yes, for sure we will need some unified mechanism to process iterators
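For context on why the list-like branch exists: handing a list of equal-length tuples straight to `np.array` produces a 2-D array rather than a 1-D array of tuple objects, which is what an `object`-dtype Series needs. A small illustration of the problem and of the pre-allocate-then-assign idiom the helper relies on; this is a rough approximation, not the pandas function itself.

```python
import numpy as np

data = [(1, 2), (3, 4), (5, 6)]

# Naive construction: NumPy interprets the nested tuples as a second
# dimension and returns a (3, 2) array.
naive = np.array(data, dtype=object)
print(naive.shape)  # (3, 2)

# Pre-allocating a 1-D object array and assigning into it keeps each
# element as a tuple, which is the behaviour wanted here.
result = np.empty(len(data), dtype=object)
result[:] = data
print(result.shape)  # (3,)
print(result[0])     # (1, 2)
```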
@@ -287,8 +287,50 @@ def test_constructor_dict(self):
         with tm.assert_raises_regex(ValueError, msg):
             DataFrame({'a': 0.7}, columns=['a'])

-        with tm.assert_raises_regex(ValueError, msg):
-            DataFrame({'a': 0.7}, columns=['b'])

> make a separate test (this change), with a comment

> (done)

+    @pytest.mark.parametrize("scalar", [2, np.nan, None, 'D'])
+    def test_constructor_invalid_items_unused(self, scalar):
+        # No error if invalid (scalar) value is in fact not used:
+        result = DataFrame({'a': scalar}, columns=['b'])
+        expected = DataFrame(columns=['b'])
+        tm.assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize("value", [2, np.nan, None, float('nan')])
+    def test_constructor_dict_nan_key(self, value):
+        # GH 18455
+        cols = [1, value, 3]
+        idx = ['a', value]
+        values = [[0, 3], [1, 4], [2, 5]]
+        data = {cols[c]: Series(values[c], index=idx) for c in range(3)}
+        result = DataFrame(data).sort_values(1).sort_values('a', axis=1)
+        expected = DataFrame(np.arange(6, dtype='int64').reshape(2, 3),
+                             index=idx, columns=cols)
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(data, index=idx).sort_values('a', axis=1)
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(data, index=idx, columns=cols)
+        tm.assert_frame_equal(result, expected)
+
+    @pytest.mark.parametrize("value", [np.nan, None, float('nan')])
+    def test_constructor_dict_nan_tuple_key(self, value):
+        # GH 18455
+        cols = Index([(11, 21), (value, 22), (13, value)])
+        idx = Index([('a', value), (value, 2)])
+        values = [[0, 3], [1, 4], [2, 5]]
+        data = {cols[c]: Series(values[c], index=idx) for c in range(3)}
+        result = (DataFrame(data)
+                  .sort_values((11, 21))
+                  .sort_values(('a', value), axis=1))
+        expected = DataFrame(np.arange(6, dtype='int64').reshape(2, 3),
+                             index=idx, columns=cols)
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(data, index=idx).sort_values(('a', value), axis=1)
+        tm.assert_frame_equal(result, expected)
+
+        result = DataFrame(data, index=idx, columns=cols)
+        tm.assert_frame_equal(result, expected)

     @pytest.mark.skipif(not PY36, reason='Insertion order for Python>=3.6')
     def test_constructor_dict_order_insertion(self):
@@ -753,15 +795,15 @@ def test_constructor_corner(self):

         # does not error but ends up float
         df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=int)
-        assert df.values.dtype == np.object_
+        assert df.values.dtype == np.dtype('float64')

> why is this changing?

> Because it was wrong: an `int` frame with no data has to hold NaN, which upcasts it to float64 rather than object.

> hmm, yeah this looks suspect. I would make a new issue for this

         # #1783 empty dtype object
         df = DataFrame({}, columns=['foo', 'bar'])
         assert df.values.dtype == np.object_

         df = DataFrame({'b': 1}, index=lrange(10), columns=list('abc'),
                        dtype=int)
-        assert df.values.dtype == np.object_
+        assert df.values.dtype == np.dtype('float64')

     def test_constructor_scalar_inference(self):
         data = {'int': 1, 'bool': True,
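The behaviour behind the updated assertions is the usual missing-data upcast: an integer array has no representation for NaN, so a frame requested as `int` but filled entirely with missing values comes out as `float64`. A small sketch of what the updated test asserts; exact dtypes can differ in other pandas versions.

```python
import numpy as np
import pandas as pd

# All values are missing, so the requested int dtype cannot be honoured and
# the columns are upcast to float64 (the assert expected object before this PR).
df = pd.DataFrame(index=range(10), columns=['a', 'b'], dtype=int)
print(df.values.dtype)  # float64

# The same upcast happens for NaN in a plain Series:
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64
```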
@@ -461,7 +461,7 @@ def test_read_one_empty_col_with_header(self, ext):
         )
         expected_header_none = DataFrame(pd.Series([0], dtype='int64'))
         tm.assert_frame_equal(actual_header_none, expected_header_none)
-        expected_header_zero = DataFrame(columns=[0], dtype='int64')
+        expected_header_zero = DataFrame(columns=[0])

> why is this changing?

> The test was wrong and worked by accident. The result is, and should be, of object dtype.

> ok again add this as an example in a new issue

> Again #19646

         tm.assert_frame_equal(actual_header_zero, expected_header_zero)

     @td.skip_if_no('openpyxl')
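To illustrate the point under discussion (behaviour of that pandas era, assumed here rather than taken from the diff): an empty column created through the constructor defaults to `object`, which is what the corrected expectation relies on.

```python
import pandas as pd

# A frame with a named column but no rows defaults to object dtype.
df = pd.DataFrame(columns=[0])
print(df.dtypes)
# 0    object
# dtype: object
```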
> do we have an asv that actually hits this path here, e.g. not-none columns and a dict as input? I am concerned that this Series conversion to object is going to cause issues (and an asv or 2 will determine this)

> Added some, see below
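For reference, a sketch of the kind of asv benchmark being requested: constructing a DataFrame from a dict with explicit (not-None) columns, parametrised over a few value dtypes. The class, method, and parameter names below are illustrative only; they are not the benchmarks that were actually added to the pandas asv suite.

```python
import numpy as np
import pandas as pd


class FrameFromDictWithColumns(object):
    # asv-style benchmark: dict input plus explicit `columns`, so the
    # alignment path in _init_dict (not-None columns) is exercised.
    params = [['int64', 'float64', 'object']]
    param_names = ['dtype']

    def setup(self, dtype):
        N, K = 10000, 20
        self.columns = ['col{}'.format(i) for i in range(K)]
        self.data = {c: np.arange(N).astype(dtype) for c in self.columns}

    def time_from_dict_with_columns(self, dtype):
        pd.DataFrame(self.data, columns=self.columns)
```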