Skip to content

COMPAT: reading generic PyTables Table format fails with sub-selection #26818

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jun 21, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
19dc304
TST: add TestReadPyTablesHDF5 test scaffold
jgehrcke Jun 12, 2019
e9c7c39
BUG: this fixes #11188
jgehrcke Jun 12, 2019
b7a082a
CLN: do not piggyback idx size on table obj
jgehrcke Jun 14, 2019
964cba1
DOC: update whatsnew for #11188
jgehrcke Jun 12, 2019
04b8423
DOC: improve changelog (squash later)
jgehrcke Jun 14, 2019
7d200a5
squash: remove kwargs, add a bit of docstring
jgehrcke Jun 14, 2019
70e78c9
squash: rework tests (use fixture, assert_frame_equal)
jgehrcke Jun 14, 2019
e7eb30a
squash: address line length errors
jgehrcke Jun 18, 2019
5b881d8
squash: pytables.py: remove one assignment
jgehrcke Jun 18, 2019
9641111
squash: tests: result/expected, Base, fixture scope
jgehrcke Jun 19, 2019
d2a9882
CLN: s/existing/exiting
jgehrcke Jun 19, 2019
aa83473
TST: use ensure_clean_path context manager
jgehrcke Jun 19, 2019
065054f
squash: do not use line cont backslash
jgehrcke Jun 19, 2019
9cc85ce
squash: pass start/stop explicitly into a.convert()
jgehrcke Jun 19, 2019
cd69c0b
DOC: add some docstring
jgehrcke Jun 19, 2019
a9c6f15
squash: flake8 error
jgehrcke Jun 19, 2019
8adf459
squash: change docstrings one more time
jgehrcke Jun 21, 2019
aed78ff
squash: tests: cleanup
jgehrcke Jun 21, 2019
53dba1a
TST: move tests to io/pytables/test_compat.py
jgehrcke Jun 21, 2019
dfec26e
TST: use pytest fixture instead of ensure_clean_path
jgehrcke Jun 21, 2019
b9421af
squash: address linter errors
jgehrcke Jun 21, 2019
9c0e96b
squash: pass path as text
jgehrcke Jun 21, 2019
79bed6a
squash: tests: improve code comment
jgehrcke Jun 21, 2019
bca6ee6
squash: use ensure_clean_path again
jgehrcke Jun 21, 2019
60d37e0
squash: tests: cleanup
jgehrcke Jun 21, 2019
1ce1a70
squash: tests: fix isort error
jgehrcke Jun 21, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -699,6 +699,7 @@ I/O
- Bug in :func:`read_csv` not properly interpreting the UTF8 encoded filenames on Windows on Python 3.6+ (:issue:`15086`)
- Improved performance in :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` when converting columns that have missing values (:issue:`25772`)
- Bug in :meth:`DataFrame.to_html` where header numbers would ignore display options when rounding (:issue:`17280`)
- Bug in :func:`read_hdf` where reading a table from an HDF5 file written directly with PyTables fails with a ``ValueError`` when using a sub-selection via the ``start`` or ``stop`` arguments (:issue:`11188`)
- Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
- Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (:issue:`25772`)
Expand Down
36 changes: 30 additions & 6 deletions pandas/io/pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -1624,7 +1624,8 @@ def infer(self, handler):
new_self.read_metadata(handler)
return new_self

def convert(self, values, nan_rep, encoding, errors):
def convert(self, values, nan_rep, encoding, errors, start=None,
stop=None):
""" set the values from this selection: take = take ownership """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

types are a bonus

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add start=None and stop=None instead of kwargs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add start=None and stop=None instead of kwargs

done, commit upcoming

can you add a doc-string here

I would love to add one! I have difficulties understanding the general purpose of the convert() method though and documenting values, nan_rep, encoding, and errors is pretty challenging given my lack of understanding. As far as I see the start and stop arguments are not really meaningful arguments here (in IndexCol.convert()), so I am also a bit helpless with documenting them! :-) How would you document start and stop here?


# values is a recarray
Expand Down Expand Up @@ -1813,10 +1814,29 @@ class GenericIndexCol(IndexCol):
def is_indexed(self):
return False

def convert(self, values, nan_rep, encoding, errors):
""" set the values from this selection: take = take ownership """
def convert(self, values, nan_rep, encoding, errors, start=None,
stop=None):
""" set the values from this selection: take = take ownership

Parameters
----------

values : np.ndarray
nan_rep : str
encoding : str
errors : str
start : int, optional
Table row number: the start of the sub-selection.
stop : int, optional
Table row number: the end of the sub-selection. Values larger than
the underlying table's row count are normalized to that.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least here I understand the meaning of start and stop and have tried to document it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, can you add all the parameters. See if you can put something down for them, some doc-string is better than none.

Copy link
Contributor Author

@jgehrcke jgehrcke Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Adding statements for values, nan_rep, encoding, and errors.

The biggest challenge for me still is that I do not feel like I understand the general purpose of the convert() method and its arguments.

"""

start = start if start is not None else 0
stop = (min(stop, self.table.nrows)
if stop is not None else self.table.nrows)
self.values = Int64Index(np.arange(stop - start))

self.values = Int64Index(np.arange(self.table.nrows))
return self

def get_attr(self):
Expand Down Expand Up @@ -2159,7 +2179,8 @@ def validate_attr(self, append):
raise ValueError("appended items dtype do not match existing "
"items dtype in table!")

def convert(self, values, nan_rep, encoding, errors):
def convert(self, values, nan_rep, encoding, errors, start=None,
stop=None):
"""set the data from this selection (and convert to the correct dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

if we can)
"""
Expand Down Expand Up @@ -3431,8 +3452,11 @@ def read_axes(self, where, **kwargs):
# convert the data
for a in self.axes:
a.set_info(self.info)
# `kwargs` may contain `start` and `stop` arguments if passed to
# `store.select()`. If set they determine the index size.
a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding,
errors=self.errors)
errors=self.errors, start=kwargs.get('start'),
stop=kwargs.get('stop'))

return True

Expand Down
76 changes: 76 additions & 0 deletions pandas/tests/io/pytables/test_compat.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
import pytest

import pandas as pd
from pandas.tests.io.test_pytables import ensure_clean_path
from pandas.util.testing import assert_frame_equal

tables = pytest.importorskip('tables')


@pytest.fixture
def pytables_hdf5_file():
"""Use PyTables to create a simple HDF5 file."""

table_schema = {
'c0': tables.Time64Col(pos=0),
'c1': tables.StringCol(5, pos=1),
'c2': tables.Int64Col(pos=2),
}

t0 = 1561105000.0

testsamples = [
{'c0': t0, 'c1': 'aaaaa', 'c2': 1},
{'c0': t0 + 1, 'c1': 'bbbbb', 'c2': 2},
{'c0': t0 + 2, 'c1': 'ccccc', 'c2': 10**5},
{'c0': t0 + 3, 'c1': 'ddddd', 'c2': 4294967295},
]

objname = 'pandas_test_timeseries'

with ensure_clean_path('written_with_pytables.h5') as path:
# The `ensure_clean_path` context mgr removes the temp file upon exit.
with tables.open_file(path, mode='w') as f:
t = f.create_table('/', name=objname, description=table_schema)
for sample in testsamples:
for key, value in sample.items():
t.row[key] = value
t.row.append()

yield path, objname, pd.DataFrame(testsamples)


class TestReadPyTablesHDF5:
"""
A group of tests which covers reading HDF5 files written by plain PyTables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number here in the comment

(not written by pandas).

Was introduced for regression-testing issue 11188.
"""

def test_read_complete(self, pytables_hdf5_file):
path, objname, df = pytables_hdf5_file
result = pd.read_hdf(path, key=objname)
expected = df
assert_frame_equal(result, expected)

def test_read_with_start(self, pytables_hdf5_file):
path, objname, df = pytables_hdf5_file
# This is a regression test for pandas-dev/pandas/issues/11188
result = pd.read_hdf(path, key=objname, start=1)
expected = df[1:].reset_index(drop=True)
assert_frame_equal(result, expected)

def test_read_with_stop(self, pytables_hdf5_file):
path, objname, df = pytables_hdf5_file
# This is a regression test for pandas-dev/pandas/issues/11188
result = pd.read_hdf(path, key=objname, stop=1)
expected = df[:1].reset_index(drop=True)
assert_frame_equal(result, expected)

def test_read_with_startstop(self, pytables_hdf5_file):
path, objname, df = pytables_hdf5_file
# This is a regression test for pandas-dev/pandas/issues/11188
result = pd.read_hdf(path, key=objname, start=1, stop=2)
expected = df[1:2].reset_index(drop=True)
assert_frame_equal(result, expected)
2 changes: 1 addition & 1 deletion pandas/tests/io/test_pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ def ensure_clean_store(path, mode='a', complevel=None, complib=None,
def ensure_clean_path(path):
"""
return essentially a named temporary file that is not opened
and deleted on existing; if path is a list, then create and
and deleted on exiting; if path is a list, then create and
return list of filenames
"""
try:
Expand Down