BUG: read_parquet wrongly returns empty index if asked to read empty column list #59028

Open
2 of 3 tasks
batterseapower opened this issue Jun 17, 2024 · 3 comments
Labels
Bug IO Parquet parquet, feather Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@batterseapower
Contributor

batterseapower commented Jun 17, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pd.DataFrame(index=['A', 'B'], columns=['C', 'D']).to_parquet('temp.parquet')

# All correctly print 2
print(len(pd.read_parquet('temp.parquet', columns=['C']).index))
print(len(pd.read_parquet('temp.parquet', columns=[]).index))
print(pq.read_table('temp.parquet', columns=[]).num_rows)
print(len(pq.read_table('temp.parquet', columns=[]).to_pandas().index))

pd.DataFrame(index=pd.RangeIndex(2), columns=['C', 'D']).to_parquet('temp.parquet')

# Correctly prints 2
print(len(pd.read_parquet('temp.parquet', columns=['C']).index))
# BUG: prints 0!
print(len(pd.read_parquet('temp.parquet', columns=[]).index))
# Correctly prints 2
print(pq.read_table('temp.parquet', columns=[]).num_rows)
print(len(pq.read_table('temp.parquet', columns=[]).to_pandas().index))

pd.DataFrame(index=pd.RangeIndex(2), columns=[]).to_parquet('temp.parquet')

# BUG: all incorrectly print 0
print(len(pd.read_parquet('temp.parquet', columns=[]).index))
print(len(pd.read_parquet('temp.parquet', columns=None).index))
print(pq.read_table('temp.parquet', columns=[]).num_rows)
print(len(pq.read_table('temp.parquet', columns=[]).to_pandas().index))

pq.write_table(pa.Table.from_pandas(pd.DataFrame(index=pd.RangeIndex(2), columns=[])), 'temp.parquet')

# BUG: all incorrectly print 0
print(len(pd.read_parquet('temp.parquet', columns=[]).index))
print(len(pd.read_parquet('temp.parquet', columns=None).index))
print(pq.read_table('temp.parquet', columns=[]).num_rows)
print(len(pq.read_table('temp.parquet', columns=[]).to_pandas().index))

Issue Description

Calling pd.read_parquet(path, columns=[]) should return a DataFrame with no columns but with the file's full index. However, when the stored index is a trivial RangeIndex, the result instead has an empty index: its length is 0 rather than the number of rows in the file.

Expected Behavior

If column 'c' exists in a file, then pd.read_parquet(path, columns=['c']).index.equals(pd.read_parquet(path, columns=[]).index) should be True.
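
The expected behaviour can be phrased as a small reusable check. This is only a sketch: `index_survives_empty_selection` is a hypothetical helper name, not part of pandas, and the reader is passed in as a parameter so the same check can be pointed at `pd.read_parquet` with either engine.

```python
from typing import Any, Callable


def index_survives_empty_selection(
    read: Callable[..., Any], path: str, column: str
) -> bool:
    """Return True if reading no columns yields the same index as reading one.

    `read` is any callable with a read_parquet-like signature: it accepts a
    path and a `columns=` keyword and returns an object with an `.index`.
    """
    with_column = read(path, columns=[column])
    without_columns = read(path, columns=[])
    return bool(with_column.index.equals(without_columns.index))
```

For example, `index_survives_empty_selection(pd.read_parquet, 'temp.parquet', 'C')` should return True once the bug is fixed. Going by the reproducible example, it would return False for the RangeIndex file with the pyarrow engine on the versions reported here.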

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2e
python : 3.11.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-348.20.1.el8_5.x86_64
Version : #1 SMP Thu Mar 10 20:59:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.2
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.13.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@batterseapower batterseapower added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 17, 2024
@batterseapower
Contributor Author

My guess, based on the observable behaviour, is that there are two separate bugs here:

  1. Pandas has a bug in the code for pd.read_parquet in the columns=[] case
  2. pq.write_table has a bug in the case where the Table it is asked to serialize has no columns, causing it to write a parquet file with 0 rows rather than the true row count. The similar bug in DataFrame.to_parquet is a consequence of this pyarrow bug.

@batterseapower
Contributor Author

batterseapower commented Jun 17, 2024

Actually, maybe the bug is purely in pyarrow, since engine='fastparquet' fixes both cases?

pd.DataFrame(index=pd.RangeIndex(2), columns=['C', 'D']).to_parquet('temp.parquet')

# Prints 2
print(len(pd.read_parquet('temp.parquet', columns=[], engine='fastparquet').index))

pd.DataFrame(index=pd.RangeIndex(2), columns=[]).to_parquet('temp.parquet', engine='fastparquet')

# Prints 2
print(pq.read_table('temp.parquet', columns=[]).num_rows)

@rhshadrach rhshadrach added the IO Parquet parquet, feather label Jun 19, 2024
@luke396
Contributor

luke396 commented Jun 29, 2024

The inconsistent behavior comes from pandas passing use_pandas_metadata=True to pyarrow, which was introduced by #34500:

kwargs["use_pandas_metadata"] = True

import fastparquet as fp
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame(index=pd.RangeIndex(2), columns=['C', 'D'])
df.to_parquet('temp.parquet')

# pyarrow with use_pandas_metadata=True: the index is lost
print(pq.read_table('temp.parquet', columns=[], use_pandas_metadata=True).to_pandas())
# Empty DataFrame
# Columns: []
# Index: []

# pyarrow without use_pandas_metadata: the index is preserved
print(pq.read_table('temp.parquet', columns=[]).to_pandas())
# Empty DataFrame
# Columns: []
# Index: [0, 1]

# fastparquet: the index is preserved
print(fp.ParquetFile('temp.parquet').to_pandas(columns=[]))
# Empty DataFrame
# Columns: []
# Index: [0, 1]
