Add pathlib.Path support to open_(mf)dataset #1514
Are you sure |
I missed this, apologies. Can we add |
This is great for a start. For consistency, it would be nice for this to also work with `open_dataarray`, `Dataset.to_netcdf`, `DataArray.to_netcdf` and `save_mfdataset`.
xarray/backends/api.py
Outdated
if isinstance(paths, GeneratorType):
    paths = list(paths)
if isinstance(paths[0], Path):
    paths = sorted(str(p) for p in paths)
I don't think we should sort here. We don't do this for lists of strings because (I think) the order in which paths are provided can affect the order of data in the resulting Dataset.
xarray/backends/api.py
Outdated
if isinstance(paths, basestring):
    paths = sorted(glob(paths))
Rather than explicitly checking for `GeneratorType` above, let's add an `else` clause here:
else:
    # also converts iterables of Path objects
    paths = [str(p) if isinstance(p, Path) else p
             for p in paths]
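A minimal sketch of how the suggested `else` clause could behave, assuming Python 3 (so `str` stands in for `basestring`). The helper name `normalize_paths` is hypothetical and is not part of xarray:

```python
from glob import glob
from pathlib import Path


def normalize_paths(paths):
    """Hypothetical helper: turn the `paths` argument into a list of strings."""
    if isinstance(paths, str):
        # A single string is treated as a glob pattern. glob() returns
        # matches in arbitrary order, so they are sorted here.
        return sorted(glob(paths))
    # Any other iterable (list, tuple, generator) is consumed in the order
    # given. Path objects become strings; everything else passes through.
    return [str(p) if isinstance(p, Path) else p for p in paths]


# Order is preserved, and generators need no special handling.
print(normalize_paths([Path('b.nc'), 'a.nc']))
print(normalize_paths(p for p in [Path('x.nc')]))
```

Because the list comprehension accepts any iterable, no separate `GeneratorType` check is needed in this sketch.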
xarray/backends/api.py
Outdated
# handle output of pathlib.Path.glob()
if isinstance(paths, GeneratorType):
    paths = list(paths)
if isinstance(paths[0], Path):
I don't like only checking the first element here. There is lots of overhead for opening a file, so I don't think it should be a problem to check `isinstance` on every element of the `paths` argument.
xarray/tests/test_backends.py
Outdated
@@ -570,6 +585,20 @@ def create_tmp_file(suffix='.nc', allow_cleanup_failure=False):

@contextlib.contextmanager
def create_tmp_file_pathlib(suffix='.nc', allow_cleanup_failure=False):
Rather than repeating everything from `create_tmp_file`, can you simply wrap the output from `create_tmp_file` in `Path`?
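One way the wrapping could look, as a sketch only. The `create_tmp_file` below is a simplified stand-in for the real test helper (its body and its reduced signature are assumptions); the point is that the pathlib variant only changes the type of the yielded value:

```python
import contextlib
import os
import tempfile
from pathlib import Path


@contextlib.contextmanager
def create_tmp_file(suffix='.nc'):
    # Simplified stand-in for the existing helper: yields a str path
    # and removes the file afterwards.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


@contextlib.contextmanager
def create_tmp_file_pathlib(suffix='.nc'):
    # Reuse the string-based helper; creation and cleanup stay in one place.
    with create_tmp_file(suffix) as path:
        yield Path(path)
```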
xarray/tests/test_backends.py
Outdated
        autoclose=self.autoclose) as actual:
    self.assertEqual(actual.foo.variable.data.chunks,
                     ((3, 2, 3, 2),))
for _create_tmp_file in [create_tmp_file, create_tmp_file_pathlib]:
Can you make a separate `test_open_mfdataset_path` test rather than squeezing this into `test_open_mfdataset`?
Some copy/paste is OK, but it doesn't need to repeat all the checks we have here around chunks and dask arrays. A simple `self.assertDatasetAllClose(original, actual)` would suffice.
Thanks for the review @shoyer! On my machine, I've already started adapting the other backends ( |
* Added show_commit_url to asv.conf This should setup the proper links from the published output to the commit on Github. FYI the benchmarks should be running stably now, and posted to http://pandas.pydata.org/speed/xarray. http://pandas.pydata.org/speed/xarray/regressions.xml has an RSS feed to the regressions. * Update asv.conf.json
* Clarify in docs that inferring DataArray dimensions is deprecated * Fix DataArray docstring * Clarify DataArray coords documentation
This follows <jazzband/pathlib2#8 (comment)> who argues for sticking to pathlib2.
setup.py
Outdated
@@ -37,6 +37,9 @@

INSTALL_REQUIRES = ['numpy >= 1.7', 'pandas >= 0.15.0']
TESTS_REQUIRE = ['pytest >= 2.7.1']
if sys.version_info < (3, 0):
I've added a check for Python < 3 to `setup.py`. This seems to work. I'm not sure if this really is an accepted way of creating conditional requirements.
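For illustration, a version check of this kind might be sketched as below. The helper `conditional_requirements` is hypothetical, and adding `pathlib2` to `INSTALL_REQUIRES` (rather than to another list) is an assumption, since the diff above does not show which list the PR extends:

```python
import sys

INSTALL_REQUIRES = ['numpy >= 1.7', 'pandas >= 0.15.0']


def conditional_requirements(version_info=sys.version_info):
    """Hypothetical helper: extra requirements needed only on Python 2."""
    extra = []
    if version_info < (3, 0):
        # pathlib only entered the standard library in Python 3.4,
        # so Python 2 needs the pathlib2 backport.
        extra.append('pathlib2')
    return extra


# Assumption: the extras are appended to the install requirements.
INSTALL_REQUIRES = INSTALL_REQUIRES + conditional_requirements()
```

A declarative alternative, in setuptools versions that support PEP 508 environment markers, is a requirement string such as `'pathlib2; python_version < "3"'`, which avoids branching in `setup.py`.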
xarray/backends/api.py
Outdated
@@ -6,6 +6,11 @@
from glob import glob
from io import BytesIO
from numbers import Number
try:
This prefers `pathlib2` and only falls back to `pathlib` if necessary. I'm following jazzband/pathlib2#8 (comment) here although preferring the stdlib module would feel better.
What do you think?
I agree, I would prefer stdlib module.
Tests are already done on my personal Travis Account (skipped intermediate commits there).
Actually true only for To me, aae32a8 looks like pathlib support is now present wherever it makes sense. |
One more thing:
Currently, I've set |
@willirath take a look at what we do for handling the optional dask dependency: xarray/xarray/core/pycompat.py Lines 55 to 60 in bcd6081
Ah, that's nice! And then I remove the |
Yes, |
looks pretty good, just a few minor tweaks from my perspective.
xarray/tests/test_backends.py
Outdated
    from pathlib import Path
except ImportError:
    from pathlib2 import Path
Let's move this to `tests/__init__.py`. We'll also want a `requires_pathlib` defined there.
I added the `requires_pathlib` decorator but couldn't remove this nested import block from `test_backends.py`, because `Path()` is explicitly used to set up the pathlib tests.
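A `requires_pathlib` decorator could be sketched roughly as follows. This version uses `unittest.skipUnless` from the standard library, which is an assumption about the mechanism (the decorator actually added to xarray's test suite may be built differently), and `ExampleTest` is a made-up test case:

```python
import unittest

try:
    from pathlib import Path  # standard library on Python 3.4+
    has_pathlib = True
except ImportError:
    try:
        from pathlib2 import Path  # backport for older interpreters
        has_pathlib = True
    except ImportError:
        has_pathlib = False

# Skip decorated tests when neither pathlib nor pathlib2 can be imported.
requires_pathlib = unittest.skipUnless(
    has_pathlib, 'requires pathlib / pathlib2')


class ExampleTest(unittest.TestCase):
    @requires_pathlib
    def test_path_join(self):
        # Path is only referenced inside the guarded test body.
        self.assertEqual(Path('data') / 'file.nc', Path('data', 'file.nc'))
```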
xarray/tests/test_backends.py
Outdated
@@ -1343,6 +1362,19 @@ def test_save_mfdataset_invalid(self):
        with self.assertRaisesRegexp(ValueError, 'same length'):
            save_mfdataset([ds, ds], ['only one path'])

    def test_save_mfdataset_pathlib_roundtrip(self):
Add the `@requires_pathlib` decorator.
Done.
xarray/tests/test_backends.py
Outdated
@@ -1934,3 +1966,13 @@ def test_open_dataarray_options(self):
            expected = data.drop('y')
            with open_dataarray(tmp, drop_variables=['y']) as loaded:
                self.assertDataArrayIdentical(expected, loaded)

    def test_dataarray_to_netcdf_no_name_pathlib(self):
Add the `@requires_pathlib` decorator.
Done.
xarray/backends/api.py
Outdated
    from pathlib2 import Path
    path_type = (Path, )
except ImportError as e:
    path_type = ()
this looks right to me but we should put it in pycompat.py
Done.
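The `path_type` pattern from the diff above, written out as a sketch that prefers the standard-library module as favoured earlier in this thread. The helper `to_str_if_path` is hypothetical and only demonstrates how the tuple is used:

```python
try:
    from pathlib import Path  # preferred: standard library
    path_type = (Path,)
except ImportError:
    try:
        from pathlib2 import Path  # fallback: backport
        path_type = (Path,)
    except ImportError:
        # No pathlib available: an empty tuple makes isinstance() return
        # False for every object, so callers need no extra branching.
        path_type = ()


def to_str_if_path(obj):
    """Hypothetical helper: convert Path objects to str, leave others alone."""
    return str(obj) if isinstance(obj, path_type) else obj
```

Keeping `path_type` a tuple in every branch means that `isinstance(p, path_type)` is always a valid call, whether or not a pathlib implementation is installed.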
asv_bench/asv.conf.json
Outdated
@@ -36,7 +36,7 @@
    "install_timeout": 600,

    // the base URL to show a commit for the project.
    // "show_commit_url": "http://github.com/owner/project/commit/",
I think this is already on master, can we roll this back?
I rebased onto master at some point. Reverted it now.
xarray/backends/api.py
Outdated
    (only netCDF3 supported).
filename_or_obj : str, Path, file or xarray.backends.*DataStore
    Strings and Path objects are interpreted as a path to a netCDF file
    oran OpenDAP URL and opened with python-netCDF4, unless the filename
oran -> or an
Done.
AppVeyor test failed with HTTP time-outs when trying to get |
I think another commit or @shoyer can trigger another build on Appveyor.
@@ -21,9 +21,38 @@ v0.9.7 (unreleased)
Enhancements
~~~~~~~~~~~~

- Support for `pathlib.Path` objects added to
nit: we typically cite the issue number (e.g. :issue:`799`). Would be nice to include here.
I just pushed a commit to add this
@@ -496,6 +501,9 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
    """
    if isinstance(paths, basestring):
        paths = sorted(glob(paths))
    else:
        paths = [str(p) if isinstance(p, path_type) else p for p in paths]
You may have already discussed this with @shoyer but can you remind me why we're not sorting in the same way we do for the glob path above? I guess we're assuming all the paths are expanded already?
We sort after `glob()` since the iteration order is arbitrary. But we don't sort in general, since the order of the provided filenames might be intentional.
Unfortunately, there isn't any way to detect a generator created by `pathlib`'s `glob()` method, since it's just a Python generator.
doc/whats-new.rst
Outdated
<xarray.Dataset>
[...]

In [6]: all_files = data_dir.glob("dta_for_month_*.nc")
I removed this example from What's New for two reasons:
- The section was getting a little longer than essential
- It's actually a bit of an anti-pattern, since the order of paths matters to xarray but is arbitrary from glob. The right way to write this is `sorted(data_dir.glob(...))`.
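The recommended `sorted(...)` pattern can be tried without xarray. The sketch below creates empty placeholder files in a temporary directory (the file names follow the example from the diff; the directory and the creation order are made up) and lists them deterministically:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # Create the files out of order on purpose.
    for month in (3, 1, 2):
        (data_dir / 'dta_for_month_{:02d}.nc'.format(month)).touch()

    # Path.glob() yields matches in arbitrary order; sorting the Path
    # objects gives a reproducible sequence to hand to open_mfdataset.
    all_files = sorted(data_dir.glob('dta_for_month_*.nc'))
    names = [p.name for p in all_files]

print(names)
```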
Thanks @shoyer!
I like the shorter example. It shows the essence of why pathlib is such a
nice thing.
Is there anything more to do in this PR?
On 31 August 2017 at 17:48:24, Stephan Hoyer <[email protected]> wrote:
… shoyer approved this pull request.
> @@ -21,9 +21,38 @@ v0.9.7 (unreleased)
Enhancements
~~~~~~~~~~~~
+- Support for `pathlib.Path` objects added to
I just pushed a commit to add this
> + .. ipython::
+ :verbatim:
+ In [1]: import xarray as xr
+
+ In [2]: from pathlib import Path # In Python 2, use pathlib2!
+
+ In [3]: data_dir = Path("data/")
+
+ In [4]: one_file = data_dir / "dta_for_month_01.nc"
+
+ In [5]: print(xr.open_dataset(one_file))
+ Out[5]:
+ <xarray.Dataset>
+ [...]
+
+ In [6]: all_files = data_dir.glob("dta_for_month_*.nc")
I removed this example from What's New for two reasons:
1. The section was getting a little longer than essential
2. It's actually a bit of an anti-pattern, since the order of paths matters
to xarray but is arbitrary from glob. The right way to write this is
`sorted(data_dir.glob(...))`.
CI failures are all due to the issue with dask distributed (#1540), so I'm going ahead and merging. Thanks @willirath!
- `git diff upstream/master | flake8 --diff`
- `whats-new.rst` for all changes and `api.rst` for new API

This is meant to eventually make `xarray.open_dataset` and `xarray.open_mfdataset` work with `pathlib.Path` objects. I think this can be achieved as follows:

- In `xarray.open_dataset`, cast any `pathlib.Path` object to string.
- In `xarray.open_mfdataset`, make sure to handle generators. This is necessary because `pathlib.Path('some-path').glob()` returns generators.

Currently, tests with Python 2 are failing, because there is no explicit `pathlib` dependency yet.

With Python 3, everything seems to work. I am not happy with the tests I've added so far, though.