Description
- I have checked that this issue has not already been reported. (at least I couldn't find one)
- I have confirmed this bug exists on the latest version of pandas. (1.1.0)
- (optional) I have confirmed this bug exists on the master branch of pandas. (934e9f840ebd2e8b5a5181b19a23e033bd3985a5)
Code Sample, a copy-pastable example
This is a high-level example that led to the investigation. It relies on rle-array (commit dfa79295a580d533ee9d2ea901e8808496dbcdc9 was used), because the pandas-provided DatetimeArray uses either a NumPy dtype or DatetimeTZDtype, and both of those cases somewhat work (see "Problem description").
import pandas as pd
from rle_array import RLEArray
array = RLEArray._from_sequence([], dtype="datetime64[ns]")
df = pd.DataFrame({"x": array})

Traceback (most recent call last):
  File "bug.py", line 5, in <module>
    pd.DataFrame({"x": array})
  File ".../lib/python3.8/site-packages/pandas/core/frame.py", line 467, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File ".../lib/python3.8/site-packages/pandas/core/internals/construction.py", line 283, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File ".../lib/python3.8/site-packages/pandas/core/internals/construction.py", line 93, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File ".../lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1650, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File ".../lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1703, in form_blocks
    block_type = get_block_type(v)
  File ".../lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 2672, in get_block_type
    assert not is_datetime64tz_dtype(values.dtype)
AssertionError
Problem description
See here:
pandas/pandas/core/internals/blocks.py
Lines 2647 to 2690 in 934e9f8
def get_block_type(values, dtype=None):
    """
    Find the appropriate Block subclass to use for the given values and dtype.

    Parameters
    ----------
    values : ndarray-like
    dtype : numpy or pandas dtype

    Returns
    -------
    cls : class, subclass of Block
    """
    dtype = dtype or values.dtype
    vtype = dtype.type

    if is_sparse(dtype):
        # Need this first(ish) so that Sparse[datetime] is sparse
        cls = ExtensionBlock
    elif is_categorical_dtype(values.dtype):
        cls = CategoricalBlock
    elif issubclass(vtype, np.datetime64):
        assert not is_datetime64tz_dtype(values.dtype)
        cls = DatetimeBlock
    elif is_datetime64tz_dtype(values.dtype):
        cls = DatetimeTZBlock
    elif is_interval_dtype(dtype) or is_period_dtype(dtype):
        cls = ObjectValuesExtensionBlock
    elif is_extension_array_dtype(values.dtype):
        cls = ExtensionBlock
    elif issubclass(vtype, np.floating):
        cls = FloatBlock
    elif issubclass(vtype, np.timedelta64):
        assert issubclass(vtype, np.integer)
        cls = TimeDeltaBlock
    elif issubclass(vtype, np.complexfloating):
        cls = ComplexBlock
    elif issubclass(vtype, np.integer):
        cls = IntBlock
    elif dtype == np.bool_:
        cls = BoolBlock
    else:
        cls = ObjectBlock
    return cls
Datetime (and also interval) types are checked BEFORE extension types, which means that extension datetime types never end up in ExtensionBlocks. An ExtensionBlock would be useful if:
- the datetime objects are not compatible with NumPy
- the data should not be converted to NumPy (e.g. due to compression, as in the rle-array case)
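To illustrate why the check order matters, here is a minimal sketch that does not use pandas at all. `ToyDtype` and both dispatch functions are hypothetical stand-ins invented for this illustration; the real `get_block_type` has many more branches, and in the rle-array case the datetime branch actually fails on its assertion rather than returning a block class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDtype:
    """Hypothetical dtype description used only for this illustration."""

    is_datetime: bool   # dtype.type is a subclass of np.datetime64
    is_extension: bool  # dtype is an extension dtype


def block_for_datetime_first(dtype):
    # Same ordering as the quoted get_block_type: the datetime check wins.
    if dtype.is_datetime:
        return "DatetimeBlock"
    if dtype.is_extension:
        return "ExtensionBlock"
    return "ObjectBlock"


def block_for_extension_first(dtype):
    # Assumed alternative ordering: extension dtypes are handled first.
    if dtype.is_extension:
        return "ExtensionBlock"
    if dtype.is_datetime:
        return "DatetimeBlock"
    return "ObjectBlock"


# An extension dtype that wraps datetime64 values, similar in spirit to rle-array.
rle_like = ToyDtype(is_datetime=True, is_extension=True)
# A plain NumPy datetime64 dtype.
plain = ToyDtype(is_datetime=True, is_extension=False)

print(block_for_datetime_first(rle_like))   # DatetimeBlock
print(block_for_extension_first(rle_like))  # ExtensionBlock
print(block_for_extension_first(plain))     # DatetimeBlock
```

With the extension check first, plain NumPy datetimes are unaffected, while the extension-backed datetime column keeps its own storage.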
Furthermore the invariant issubclass(vtype, np.datetime64) => not is_datetime64tz_dtype(values.dtype) does NOT hold for all extension dtypes, at least not under the current implementation of is_datetime64tz_dtype:
pandas/pandas/core/dtypes/common.py
Lines 415 to 421 in 934e9f8
    if isinstance(arr_or_dtype, ExtensionDtype):
        # GH#33400 fastpath for dtype object
        return arr_or_dtype.kind == "M"

    if arr_or_dtype is None:
        return False
    return DatetimeTZDtype.is_dtype(arr_or_dtype)
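The effect of this fastpath can be modelled in a few lines of plain Python. The classes below are simplified stand-ins, not the real pandas classes, and `NaiveDatetimeExtDtype` and `strict_is_datetime64tz` are hypothetical names introduced only for this sketch. It shows that a timezone-naive extension dtype with kind "M" is reported as timezone-aware by a kind-based check, which is what trips the assertion in get_block_type.

```python
class ExtensionDtype:
    """Simplified stand-in for pandas' ExtensionDtype (not the real class)."""

    kind = "O"


class DatetimeTZDtype(ExtensionDtype):
    """Simplified stand-in for the timezone-aware pandas dtype."""

    kind = "M"

    def __init__(self, tz):
        self.tz = tz


class NaiveDatetimeExtDtype(ExtensionDtype):
    """Hypothetical third-party dtype: datetime64-like, but timezone-naive."""

    kind = "M"


def fastpath_is_datetime64tz(dtype):
    # Mirrors the quoted fastpath: any extension dtype of kind "M" counts as tz-aware.
    if isinstance(dtype, ExtensionDtype):
        return dtype.kind == "M"
    return False


def strict_is_datetime64tz(dtype):
    # Hypothetical stricter check: only the tz-aware dtype class qualifies.
    return isinstance(dtype, DatetimeTZDtype)


naive = NaiveDatetimeExtDtype()
aware = DatetimeTZDtype("UTC")

print(fastpath_is_datetime64tz(naive), strict_is_datetime64tz(naive))  # True False
print(fastpath_is_datetime64tz(aware), strict_is_datetime64tz(aware))  # True True
```

Whether a stricter check like the second function is the right fix for pandas itself is a separate question; the sketch only demonstrates where the two notions of "timezone-aware" diverge.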
Expected Output
The code example works and df._data shows that the data ends up in an ExtensionBlock.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : d9fff2792bf16178d4e450fe7384244e50635733
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.0
Cython : None
pytest : 6.0.1
hypothesis : None
sphinx : 3.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.50.1