Skip to content

Add pathlib.Path support to open_(mf)dataset #1514

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Sep 1, 2017
Merged

Add pathlib.Path support to open_(mf)dataset #1514

merged 24 commits into from
Sep 1, 2017

Conversation

willirath
Copy link
Contributor

@willirath willirath commented Aug 21, 2017

  • Closes Support for pathlib.Path #799
  • Tests added / passed
  • Passes git diff upstream/master | flake8 --diff
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

This is meant to eventually make xarray.open_dataset and xarray.open_mfdataset work with pathlib.Path objects. I think this can be achieved as follows:

  1. In xarray.open_dataset, cast any pathlib.Path object to string

  2. In xarray.open_mfdataset, make sure to handle generators. This is necessary, because pathlib.Path('some-path').glob() returns generators.

Curently, tests with Python 2 are failing, because there is no explicit pathlib dependency yet.

With Python 3, everything seems to work. I am not happy with the tests I've added so far, though.

@max-sixty
Copy link
Collaborator

Are you sure pathlib exists in py2? I had thought you needed to install pathlib2 (I may be wrong on the specifics)

@max-sixty
Copy link
Collaborator

Curently, tests with Python 2 are failing, because there is no explicit pathlib dependency yet.

I missed this, apologies.

Can we add pathlib2 as an optional dependency and handle the case where it's not installed?

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great for a start. For consistency, it would be nice for this to also work with open_dataarray, Dataset.to_netcdf, DataArray.to_netcdfandsave_mfdataset`.

if isinstance(paths, GeneratorType):
paths = list(paths)
if isinstance(paths[0], Path):
paths = sorted(str(p) for p in paths)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should sort here. We don't do this for lists of strings because (I think) the order in which paths are provided can affect the order of data in the resulting Dataset.

if isinstance(paths, basestring):
paths = sorted(glob(paths))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than explicitly checking for GeneratorType above, let's add an else clause here:

else:
    # also converts iterables of Path objects
    paths = [str(p) if isinstance(p, Path) else p
             for path for paths]

# handle output of pathlib.Path.glob()
if isinstance(paths, GeneratorType):
paths = list(paths)
if isinstance(paths[0], Path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like only checking the first element here. There is lots of overhead for opening a file, so I don't think it should be a problem to check isinstance on every element of the paths argument.

@@ -570,6 +585,20 @@ def create_tmp_file(suffix='.nc', allow_cleanup_failure=False):


@contextlib.contextmanager
def create_tmp_file_pathlib(suffix='.nc', allow_cleanup_failure=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than repeating everything from create_tmp_file, can you simply wrap the output from create_tmp_file in Path?

autoclose=self.autoclose) as actual:
self.assertEqual(actual.foo.variable.data.chunks,
((3, 2, 3, 2),))
for _create_tmp_file in [create_tmp_file, create_tmp_file_pathlib]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a separate test_open_mfdataset_path test rather than squeezing this into test_open_mfdataset?

Some copy/paste is OK, but it doesn't need to all the check we have here around chunks and dask arrays. A simple self.assertDatasetAllClose(original, actual) would suffice.

@willirath
Copy link
Contributor Author

Thanks for the review @shoyer !

On my machine, I've already stared adapting the other backends (to_netcdf et al) to support pathlib as well. I'll include it here.

TomAugspurger and others added 6 commits August 25, 2017 15:58
* Added show_commit_url to asv.conf

This should setup the proper links from the published output to the commit on Github.

FYI the benchmarks should be running stably now, and posted to http://pandas.pydata.org/speed/xarray. http://pandas.pydata.org/speed/xarray/regressions.xml has an RSS feed to the regressions.

* Update asv.conf.json
* Clarify in docs that inferring DataArray dimensions is deprecated

* Fix DataArray docstring

* Clarify DataArray coords documentation
This follows
<jazzband/pathlib2#8 (comment)>
who argues for sticking to pathlib2.
setup.py Outdated
@@ -37,6 +37,9 @@

INSTALL_REQUIRES = ['numpy >= 1.7', 'pandas >= 0.15.0']
TESTS_REQUIRE = ['pytest >= 2.7.1']
if sys.version_info < (3, 0):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a check for Python < 3 to setup.py. This seems to work. I'm not sure if this really is an accepted way of creating conditional requirements.

@@ -6,6 +6,11 @@
from glob import glob
from io import BytesIO
from numbers import Number
try:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prefers pathlib2 and only falls back to pathlib if necessary. I'm following jazzband/pathlib2#8 (comment) here although preferring the stdlib module would feel better.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I would prefer stdlib module.

@willirath
Copy link
Contributor Author

  • Tests added / passed

Tests are already done on my personal Travis Account (skipped intermediate commits there).

  • Passes git diff upstream/master | flake8 --diff

Actually true only for git diff upstream/master xarray/ | flake8 --diff. But I guess that's what should pass the linter?

To me, aae32a8 looks like pathlib support is now present whereever it makes sense.

@willirath
Copy link
Contributor Author

One more thing:

Can we add pathlib2 as an optional dependency and handle the case where it's not installed?

Currently, I've set pathlib2 as a mandatory requirement for Python 2. How'd we achieve pathlib being optional? Test for pathlib upon import and wrap all the if isinstace(path, Path): path = str(path) in a function just passing unmodified path if pathlib's not present?

@shoyer
Copy link
Member

shoyer commented Aug 25, 2017

@willirath take a look at what we do for handling the optional dask dependency:

try:
# solely for isinstance checks
import dask.array
dask_array_type = (dask.array.Array,)
except ImportError: # pragma: no cover
dask_array_type = ()

@willirath
Copy link
Contributor Author

take a look at what we do for handling the optional dask dependency

Ah, that's nice!

And then I remove the pathlib2 dependency I've introduced in setup.py again?

@shoyer
Copy link
Member

shoyer commented Aug 25, 2017

And then I remove the pathlib2 dependency I've introduced in setup.py again?

Yes, pathlib2 should not be a required dependency.

@willirath willirath changed the title WIP: Add pathlib.Path support to open_(mf)dataset Add pathlib.Path support to open_(mf)dataset Aug 25, 2017
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good, just a few minor tweaks from my perspective.

from pathlib import Path
except ImportError:
from pathlib2 import Path

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move this to tests/__init__.py. We'll also want a requires_pathlib defined there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the requires_pathlib decorator but couldn't remove this nested import block from test_backends.py, because Path() is explicitly used to set up the pathlib tests.

@@ -1343,6 +1362,19 @@ def test_save_mfdataset_invalid(self):
with self.assertRaisesRegexp(ValueError, 'same length'):
save_mfdataset([ds, ds], ['only one path'])

def test_save_mfdataset_pathlib_roundtrip(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add @requires_pathlib decorator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1934,3 +1966,13 @@ def test_open_dataarray_options(self):
expected = data.drop('y')
with open_dataarray(tmp, drop_variables=['y']) as loaded:
self.assertDataArrayIdentical(expected, loaded)

def test_dataarray_to_netcdf_no_name_pathlib(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add @requires_pathlib decorator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

from pathlib2 import Path
path_type = (Path, )
except ImportError as e:
path_type = ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks right to me but we should put it in pycompat.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -36,7 +36,7 @@
"install_timeout": 600,

// the base URL to show a commit for the project.
// "show_commit_url": "http://github.com/owner/project/commit/",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already on master, can we roll this back?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rebased onto master some point. Reverted it now.

@@ -6,6 +6,11 @@
from glob import glob
from io import BytesIO
from numbers import Number
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I would prefer stdlib module.

(only netCDF3 supported).
filename_or_obj : str, Path, file or xarray.backends.*DataStore
Strings and Path objects are interpreted as a path to a netCDF file
oran OpenDAP URL and opened with python-netCDF4, unless the filename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oran -> or an

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@willirath
Copy link
Contributor Author

flake8 now fails because of the unused pathlib in xarray/tests/__init__.py.

@willirath
Copy link
Contributor Author

AppVeyor test failed with HTTP time-outs when trying to get repodata.json. Can somebody trigger the build again?

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another commit or @shoyer can trigger another build on Appveyor.

@@ -21,9 +21,38 @@ v0.9.7 (unreleased)
Enhancements
~~~~~~~~~~~~

- Support for `pathlib.Path` objects added to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we typically cite the issue number (e.g. :issue: 799:). Would be nice to include here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit to add this

@@ -496,6 +501,9 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
"""
if isinstance(paths, basestring):
paths = sorted(glob(paths))
else:
paths = [str(p) if isinstance(p, path_type) else p for p in paths]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may have already discussed this with @shoyer but can you remind me why we're not sorting in the same way we do for the glob path above? I guess we're assuming all the paths are expanded already?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We sort after glob() since the iteration order in arbitrary. But we don't sort in general, since the order of the provided filenames might be intentional.

Unfortunately, there isn't any way to detect a generator created by pathlib's glob() method, since it's just a Python generator.

@@ -21,9 +21,38 @@ v0.9.7 (unreleased)
Enhancements
~~~~~~~~~~~~

- Support for `pathlib.Path` objects added to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit to add this

<xarray.Dataset>
[...]

In [6]: all_files = data_dir.glob("dta_for_month_*.nc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this example from What's New for two reason:

  1. The section was getting a little longer than essential
  2. It's actually a bit of an anti-pattern, since the order of paths matters to xarray but is arbtirary from glob. The right way to write this is sorted(data_dir.glob(...)).

@willirath
Copy link
Contributor Author

willirath commented Sep 1, 2017 via email

@shoyer shoyer merged commit 4a15cfa into pydata:master Sep 1, 2017
@shoyer
Copy link
Member

shoyer commented Sep 1, 2017

CI failures are all the issue with dask distributed (#1540), so I'm going ahead and merging.

Thanks @willirath !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants