Skip to content

COMPAT: reading generic PyTables Table format fails with sub-selection #26818

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jun 21, 2019

Conversation

jgehrcke
Copy link
Contributor

@jgehrcke jgehrcke commented Jun 12, 2019

(not written by pandas).
"""

def _create_simple_hdf5_file_with_pytables(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary or could you just add a file in pandas\tests\io\data

Copy link
Contributor Author

@jgehrcke jgehrcke Jun 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary or could you just add a file in pandas\tests\io\data

Oh, that's some early feedback. Thank you! Much appreciated.

I think a proper test for the fix for #11188 should confirm that the minimal working example for reproducing the problem does not fail anymore.

I also think this is more readable (when the test itself shows how the data file was created, in contrast to committing a binary file of which it's not 100 % clear and discoverable of how it has been created).

Also, I'd like to think of this (using the PyTables API to create an HDF5 file) as an important aspect of this test: it effectively confirms a common way to use PyTables and pandas in tandem; PyTables for writing an HDF5 file, and pandas for reading it. These tests are not so much about the contents in the file, but about that a specific way to create the file matches a specific way to parse the file.

I'll try to make this a little more compact, though.

Well, let's see about the "fix", looking forward to your feedback there (commit upcoming). I think this will be more controversial.

@gfyoung gfyoung added Bug IO HDF5 read_hdf, HDFStore labels Jun 12, 2019
@pep8speaks
Copy link

pep8speaks commented Jun 12, 2019

Hello @jgehrcke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-21 13:33:33 UTC

@codecov
Copy link

codecov bot commented Jun 12, 2019

Codecov Report

Merging #26818 into master will decrease coverage by 50.76%.
The diff coverage is 75%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26818       +/-   ##
===========================================
- Coverage   91.86%    41.1%   -50.77%     
===========================================
  Files         179      179               
  Lines       50707    50710        +3     
===========================================
- Hits        46583    20842    -25741     
- Misses       4124    29868    +25744
Flag Coverage Δ
#multiple ?
#single 41.1% <75%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/io/pytables.py 89.3% <75%> (-1%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/plotting/_matplotlib/__init__.py 0% <0%> (-100%) ⬇️
pandas/io/gcs.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/io/s3.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
... and 133 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 634577e...7f21dbe. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 12, 2019

Codecov Report

Merging #26818 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26818      +/-   ##
==========================================
- Coverage   91.98%   91.98%   -0.01%     
==========================================
  Files         180      180              
  Lines       50772    50774       +2     
==========================================
- Hits        46704    46702       -2     
- Misses       4068     4072       +4
Flag Coverage Δ
#multiple 90.62% <100%> (+0.04%) ⬆️
#single 41.82% <100%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/io/pytables.py 90.3% <100%> (ø) ⬆️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.89% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2243629...1ce1a70. Read the comment docs.

@jgehrcke
Copy link
Contributor Author

Funny codecov bot :-). Well, let's wait for this round of CI to report back.

@jgehrcke jgehrcke changed the title Work in progress: fix issue #11188 Fix issue #11188 Jun 12, 2019
@jgehrcke
Copy link
Contributor Author

jgehrcke commented Jun 12, 2019

@gfyoung @simonjayhawkins @jreback I would appreciate your feedback, especially about the "fix" itself for starters. The workaround is not necessarily nice. All tests pass, but could it be that this approach (introducing table._nrows_to_read = stop - start dynamically, and using it dynamically instead of table.nrows) has negative side effects? I want to say that I didn't quite see a meaningful way to pass start and stop explicitly down to GenericIndexCol; by the moment we talk about GenericIndexCol these two parameters are more table parameters than index col parameters. WDYT?

# Piggy-back normalized `nrows` (considering start and stop) onto
# PyTables tables object so that the GenericIndexCol constructor
# knows how large the index should be.
self.s.table._nrows_to_read = stop - start
Copy link
Member

@gfyoung gfyoung Jun 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start and stop - why not make these instance attributes of self ?

Or even just compute nrows as an attribute?

Copy link
Contributor Author

@jgehrcke jgehrcke Jun 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huhu!

start and stop - why not make these instance attributes of self ?

Hm, what do you have in mind about doing with them after having set them as instance attributes of self? In the three lines below we're doing exactly that (we set self.start and self.stop; did you mean something else?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even just compute nrows as an attribute?

Interesting. I thought about this a bit but I am not sure what you mean. Where; on the PyTables table object?

That would be quite invasive. A PyTables table object might use its nrows property internally.

Adding a new private attribute to an object (guaranteed to not mess with the object's intrinsics) is not beautiful, but a little less invasive than overwriting an object property which might mess with that object's internal consistency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgehrcke : I'm talking about attaching a new attribute to the TableIterator instance itself.

Attaching attributes to the TableIterator's s.table is riskier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgehrcke : I'm talking about attaching a new attribute to the TableIterator instance itself.

Gotcha! I'll look into that. Thanks for the feedback, much appreciated.

Attaching attributes to the TableIterator's s.table is riskier.

Ack!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is however no obvious way for me to access the TableIterator instance from the GenericIndexCol instance. The latter inherits from IndexCol which gets its .table attribute set through a set_table() method, but as far as I can see there is no existing obvious relationship to reach back to the TableIterator... I think I had the problem before which is why I chose the table object to piggy-back the piece of information on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung so this is the context where we call into the convert() method of the GenericIndexCol:

a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding,

There are certainly other ways to pass along the _nrows_to_read concept but I didn't see a cleaner way yet, specifically I didn't see a way that does not add additional scaffold, passing arguments around etc.

Did you have something very specific in mind that I simply didn't see?

Copy link
Member

@gfyoung gfyoung Jun 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting...we might have to then propagate an nrows parameter throughout the IndexCol instantiations.

Attaching a parameter to an object we have no information about is definitely riskier and hackier (and is therefore a practice I would discourage) compared to passing arguments around.

@jreback : Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attaching a parameter to an object we have no information about is definitely riskier and hackier (and is therefore a practice I would discourage) compared to passing arguments around.

I do agree.

@jreback jreback changed the title Fix issue #11188 COMPAT: reading generic PyTables Table format fails with sub-selection Jun 13, 2019
@jgehrcke
Copy link
Contributor Author

@gfyoung @jreback after following the quite entangled code path I think I found a cleaner solution.

Key insight was the following:

In HDFStore.select() we define a callback func:

        # function to call on iteration
        def func(_start, _stop, _where):
            return s.read(start=_start, stop=_stop,
                          where=_where,
                          columns=columns)

        # create the iterator
        it = TableIterator(self, s, func, where=where, nrows=s.nrows,
                           start=start, stop=stop, iterator=iterator,
                           chunksize=chunksize, auto_close=auto_close)

        return it.get_result()

In the problematic code path s.read() actually translates to Table.read_axes() which is prepared to receive arbitrary keyword arguments:

    def read_axes(self, where, **kwargs):
        """create and return the axes sniffed from the table: return boolean
        for success
        """
        ...

As of the callback being s.read(start=_start, stop=_stop, ....) we know that the start and stop parameters are available as keyword arguments within Table.read_axes().

read_axes() calls into GenericIndexCol.convert() and so this is a decent opportunity to pass the start and stop parameters directly to convert(). I have added a corresponding commit. Please have another look, thanks!

@@ -656,6 +656,7 @@ I/O
- Bug in :func:`read_csv` not properly interpreting the UTF8 encoded filenames on Windows on Python 3.6+ (:issue:`15086`)
- Improved performance in :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` when converting columns that have missing values (:issue:`25772`)
- Bug in :meth:`DataFrame.to_html` where header numbers would ignore display options when rounding (:issue:`17280`)
- Bug in :func:`read_hdf` where the `start` and `stop` arguments would raise a ``ValueError`` indicating an index size mismatch (:issue:`11188`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not very informative of what you are actually addressing, can you re-word

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit upcoming

@@ -1627,7 +1627,7 @@ def infer(self, handler):
new_self.read_metadata(handler)
return new_self

def convert(self, values, nan_rep, encoding, errors):
def convert(self, values, nan_rep, encoding, errors, **kwargs):
""" set the values from this selection: take = take ownership """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

types are a bonus

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add start=None and stop=None instead of kwargs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add start=None and stop=None instead of kwargs

done, commit upcoming

can you add a doc-string here

I would love to add one! I have difficulties understanding the general purpose of the convert() method though and documenting values, nan_rep, encoding, and errors is pretty challenging given my lack of understanding. As far as I see the start and stop arguments are not really meaningful arguments here (in IndexCol.convert()), so I am also a bit helpless with documenting them! :-) How would you document start and stop here?

@@ -1816,10 +1816,14 @@ class GenericIndexCol(IndexCol):
def is_indexed(self):
return False

def convert(self, values, nan_rep, encoding, errors):
def convert(self, values, nan_rep, encoding, errors, **kwargs):
""" set the values from this selection: take = take ownership """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 adding a docstring explaining the start and stop` parameters, commit upcoming

@@ -2162,7 +2166,7 @@ def validate_attr(self, append):
raise ValueError("appended items dtype do not match existing "
"items dtype in table!")

def convert(self, values, nan_rep, encoding, errors):
def convert(self, values, nan_rep, encoding, errors, **kwargs):
"""set the data from this selection (and convert to the correct dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

]

# This returns a path and does not open the file.
tmpfilepath = create_tempfile(self.path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this should just be a fixture

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

I now turned the entire thing into a module-scoped fixture, providing the path to the HDF5 file and details about its contents.


def _compare(self, df, samples):
"""Compare the reference `samples` with the contents in DataFrame `df`.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't write a custom comparator, use assert_frame_equal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Interesting:

E       AssertionError: Attributes are different
E       
E       Attribute "dtype" are different
E       [left]:  uint32
E       [right]: int64

But that's probably expected, right? So, we expect the values in the UInt32Col in the PyTables file to end up being of type int64, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for the review, I have committed and pushed changes and would appreciate another round of feedback.

🚀

Table row number: the start of the sub-selection.
stop : int, optional
Table row number: the end of the sub-selection. Values larger than
the underlying table's row count are normalized to that.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least here I understand the meaning of start and stop and have tried to document it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, can you add all the parameters. See if you can put something down for them, some doc-string is better than none.

Copy link
Contributor Author

@jgehrcke jgehrcke Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Adding statements for values, nan_rep, encoding, and errors.

The biggest challenge for me still is that I do not feel like I understand the general purpose of the convert() method and its arguments.

@jgehrcke
Copy link
Contributor Author

jgehrcke commented Jun 15, 2019

Will polish (squash commits, address line length issues) after another round of comments.

Update: just force-pushed after rebase on current master:

  • removed the unrelated requirements-dev.txt patch
  • addressed linter errors (line length)

jgehrcke added a commit to jgehrcke/goeffel that referenced this pull request Jun 18, 2019
--start and --stop upon read_hdf() do not yet work as of
pandas-dev/pandas#26818
Table row number: the start of the sub-selection.
stop : int, optional
Table row number: the end of the sub-selection. Values larger than
the underlying table's row count are normalized to that.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, can you add all the parameters. See if you can put something down for them, some doc-string is better than none.

"""

start = start if start is not None else 0
stop = min(stop, self.table.nrows) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use perens rather than a line-continuation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit upcoming

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding,
errors=self.errors)
errors=self.errors, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this kwargs coming from? do you mean to pass ing start/stop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this kwargs coming from

I have tried to answer this question with the code comment added right above that line, and all I know is that kwargs may contain start and stop. I could not infer any guarantees.

Following the kwargs flow(s) is a challenging part of reading io/pytables.py. It's pretty non-obvious.

There are two calls into that function that pass **kwargs where it's non-obvious where it's coming from:

if not self.read_axes(where=where, **kwargs):

if not self.read_axes(where=where, **kwargs):

For following just the latter flow I am now looking for read() being called and this is where things get tricky...

$ grep -nr 'read(' pandas/io/pytables.py | grep -v 'def ' | grep -v validate_read
721:            return s.read(start=_start, stop=_stop,
837:            objs = [t.read(where=_where, columns=columns, start=_start,
1421:        return s.read(**kwargs)
2934:            sdict[c] = s.read()
4241:        s = super().read(columns=columns, **kwargs)
4342:        df = super().read(**kwargs)
4740:        return self.table.table.read(start=self.start, stop=self.stop)

From here it looks like there is more kwargs to follow. I know one definite call path that comes along line 837 and there we know that start and stop are being passed explicitly (that's the code path coming from read_hdf()). But I am not sure about other possible code paths.

Relying on the test suite to not break, but also relying on not relying on start and stop being passed explicitly.

I am now doing this instead:

-                      errors=self.errors, **kwargs)
+                      errors=self.errors, start=kwargs.get('start'),
+                      stop=kwargs.get('stop'))

So that things still don't break when start or stop are not in kwargs. Sound good?

@@ -5167,3 +5168,71 @@ def test_dst_transitions(self):
store.append('df', df)
result = store.select('df')
assert_frame_equal(result, df)


@pytest.fixture(scope='module')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not necessary, function scoped is fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 commit upcoming

{'c0': t0 + 2, 'c1': 'ccccc', 'c2': 10**5},
{'c0': t0 + 3, 'c1': 'ddddd', 'c2': 4294967295},
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ensure_clean_path here

Copy link
Contributor Author

@jgehrcke jgehrcke Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, commit upcoming.

Food for thought (I don't expect answers for this pull request here, it's probably a rabbit hole):

  • If a test fails, don't we want to retain the corresponding HDF5 file in the file system to enable later inspection? If the answer is "yes" then we should not use a context manager for (quite reliable) file removal but do it in the fixture teardown only upon test success.

  • What do we think about using a pytest-provided fixture for temporary file management?

@pytest.fixture()
def pytables_hdf5_file(tmp_path):
    [...]

In that case the file path would contain the test name, e.g. tmp_path = PosixPath('PYTEST_TMPDIR/test_create_file0') as documented here: https://docs.pytest.org/en/latest/tmpdir.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure_clean_path already does this, not really necessary to do anything else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure_clean_path already does this

The way I read the code it will always remove the file, also upon test failure.

Anyway :).

return tmpfilepath, objectname, pd.DataFrame(testsamples)


class TestReadPyTablesHDF5(Base):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to inherit from Base?

Copy link
Contributor Author

@jgehrcke jgehrcke Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to inherit from Base?

I do not really know. Not sure if tm.reset_testing_mode() is important at all in my context here, # Pytables 3.0.0 deprecates lots of things kind of sounds important though:


    @classmethod
    def setup_class(cls):

        # Pytables 3.0.0 deprecates lots of things
        tm.reset_testing_mode()

    @classmethod
    def teardown_class(cls):

        # Pytables 3.0.0 deprecates lots of things
        tm.set_testing_mode()

Given that you doubt that I need it I have added a commit and don't inherit from Base anymore. The tests still pass on my system.


def test_read_complete(self, pytables_hdf5_file):
path, objname, expected_df = pytables_hdf5_file
assert_frame_equal(pd.read_hdf(path, key=objname), expected_df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write the asserts like

result =
expected =
assert_frame_equal(result, expected)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks

@jgehrcke
Copy link
Contributor Author

@jreback I tried to address all points in your last round of feedback. I feel like we're getting closer. Would very much appreciate another look, thanks for your time 🕙 !

----------

values :
Expected to be passed but ignored in this implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you write that these are, e.g.

values: np.ndarray
nan_rep : str
encoding: str
errors : str (I think)

values :
Expected to be passed but ignored in this implementation.
nan_rep :
Expected to be passed but ignored in this implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need these comments

@@ -5,6 +5,7 @@
from io import BytesIO
import os
import tempfile
import time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you using this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used this for dynamically generating unix time stamps (time.time()) before writing them to the HDF5 file, but made this static now.

{'c0': t0 + 2, 'c1': 'ccccc', 'c2': 10**5},
{'c0': t0 + 3, 'c1': 'ddddd', 'c2': 4294967295},
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure_clean_path already does this, not really necessary to do anything else

@@ -5167,3 +5168,70 @@ def test_dst_transitions(self):
store.append('df', df)
result = store.select('df')
assert_frame_equal(result, df)


@pytest.fixture()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the ()

@@ -5167,3 +5168,70 @@ def test_dst_transitions(self):
store.append('df', df)
result = store.select('df')
assert_frame_equal(result, df)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move these to a separate (new) file: test_pytables_compat.py (even better to move to

pandas/tests/io/pytables/test_compat.py (you will need to add an init.py in the subdir)

we are starting to split the test_pytables up a bit

Copy link
Contributor Author

@jgehrcke jgehrcke Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(even better to move to pandas/tests/io/pytables/test_compat.py )

Doing so!

(you will need to add an init.py in the subdir)

Why?

It does not need to be a Python package for pytest to discover tests in the directory:

$ pytest --collect-only pandas/tests
[...]
<Module pandas/tests/io/pytables/test_compat.py>
  <Class TestReadPyTablesHDF5>
      <Function test_read_complete>
      <Function test_read_with_start>
      <Function test_read_with_stop>
      <Function test_read_with_startstop>
[...]

Any other reason?

@jgehrcke
Copy link
Contributor Author

jgehrcke commented Jun 21, 2019

@jreback thanks for your feedback once again.

I have

  • rebased on upstream/master (clean)
  • added four new commits to address your last comments

In particular, I have now moved the new tests to pandas/tests/io/pytables/test_compat.py as suggested. While doing so I have not created a pandas/tests/io/pytables/__init__.py file because I don't see the reason yet (it's not for test discovery, but is it for linters?).

I have also not used ensure_clean_path in the new test module (because then I would have needed to refactor it out of where it is right now). Instead I have used the pytest tmp_path fixture which I think provides everything we need with less code: it guarantees file system isolation between tests, it retains files for later inspection (as opposed to the ensure_clean_path context manager) and removes them in a rolling fashion so that storage space requirement stays predictable (does not grow indefinitely with the number of invocations). For instance, this is after invoking pytest 11 times (files from the last three invocations are retained, the rest got rotated away):

$ tree
.
├── pytest-10
│   ├── test_read_complete0
│   │   └── written_with_pytables.h5
│   ├── test_read_completecurrent -> /tmp/pytest-of-jp/pytest-10/test_read_complete0
│   ├── test_read_with_start0
│   │   └── written_with_pytables.h5
│   ├── test_read_with_startcurrent -> /tmp/pytest-of-jp/pytest-10/test_read_with_start0
│   ├── test_read_with_startstop0
│   │   └── written_with_pytables.h5
│   ├── test_read_with_startstopcurrent -> /tmp/pytest-of-jp/pytest-10/test_read_with_startstop0
│   ├── test_read_with_stop0
│   │   └── written_with_pytables.h5
│   └── test_read_with_stopcurrent -> /tmp/pytest-of-jp/pytest-10/test_read_with_stop0
├── pytest-8
│   ├── test_read_complete0
│   │   └── written_with_pytables.h5
│   ├── test_read_completecurrent -> /tmp/pytest-of-jp/pytest-8/test_read_complete0
│   ├── test_read_with_start0
│   │   └── written_with_pytables.h5
│   ├── test_read_with_startcurrent -> /tmp/pytest-of-jp/pytest-8/test_read_with_start0
│   ├── test_read_with_startstop0
│   │   └── written_with_pytables.h5
│   ├── test_read_with_startstopcurrent -> /tmp/pytest-of-jp/pytest-8/test_read_with_startstop0
│   ├── test_read_with_stop0
│   │   └── written_with_pytables.h5
│   └── test_read_with_stopcurrent -> /tmp/pytest-of-jp/pytest-8/test_read_with_stop0
├── pytest-9
│   ├── test_read_complete0
│   │   └── written_with_pytables.h5
│   ├── test_read_completecurrent -> /tmp/pytest-of-jp/pytest-9/test_read_complete0
│   ├── test_read_with_start0
│   │   └── written_with_pytables.h5
│   ├── test_read_with_startcurrent -> /tmp/pytest-of-jp/pytest-9/test_read_with_start0
│   ├── test_read_with_startstop0
│   │   └── written_with_pytables.h5
│   ├── test_read_with_startstopcurrent -> /tmp/pytest-of-jp/pytest-9/test_read_with_startstop0
│   ├── test_read_with_stop0
│   │   └── written_with_pytables.h5
│   └── test_read_with_stopcurrent -> /tmp/pytest-of-jp/pytest-9/test_read_with_stop0
└── pytest-current -> /tmp/pytest-of-jp/pytest-10


objname = 'pandas_test_timeseries'

# The `tmp_path` fixture provides a temporary directory unique to the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use ensure_clean_path; we don't use the pytest fixture at this time, though you can open an issue to change (many occurrences of this I suppose).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but need to clean up the test files, ping on green.

def pytables_hdf5_file(tmp_path):
"""Use PyTables to create a simple HDF5 file.

There is no need for temporary file cleanup: pytest's `tmp_path` fixture
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this comment as well. see my comment below. we don't want to introduce a different idiom for testing (but as I said you can open an issue for discussion).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can open an issue for discussion

#26984 :)



@pytest.fixture
def pytables_hdf5_file(tmp_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this fixture anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this fixture anymore

Thanks man. I hope one of the linters would have told me! :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i don’t think we have that check in linting :)


class TestReadPyTablesHDF5:
"""
A group of tests which covers reading HDF5 files written by plain PyTables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number here in the comment

@jgehrcke
Copy link
Contributor Author

@jreback it's green now. Wanna have another look before I squash some commits?

@jreback jreback added this to the 0.25.0 milestone Jun 21, 2019
@jreback jreback merged commit dda4c1a into pandas-dev:master Jun 21, 2019
@jreback
Copy link
Contributor

jreback commented Jun 21, 2019

thanks @jgehrcke nice patch. We squash on merge anyhow, so no need!

@jreback
Copy link
Contributor

jreback commented Jun 21, 2019

If interested would like to move the existing test_pytables test files (test_pytables.py and test_pytables_missing.py) to pandas/test/io/pytables (just a simple move). We want to split the main file, but that will be separate.

and I think we usually add an __init__.py file in the pytables/ dir as well

@jgehrcke
Copy link
Contributor Author

thanks @jgehrcke nice patch

Thank you for the nice collaboration!

@jgehrcke
Copy link
Contributor Author

jgehrcke commented Jun 21, 2019

If interested would like to move the existing test_pytables test files (test_pytables.py and test_pytables_missing.py) to pandas/test/io/pytables (just a simple move).

Sure, PR upcoming.

Update: #26986

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

COMPAT: reading generic PyTables Table format fails with sub-selection
5 participants