Some queries #1173
Comments
For (1), take a look at save_mfdataset for saving to multiple files.
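A minimal sketch of the pattern suggested above: split a dataset along time and write one netCDF file per year with `xarray.save_mfdataset`. The variable name, sizes, and file names are illustrative; this assumes xarray and a netCDF backend are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of daily data (2000 is a leap year: 366 + 364 = 730 days)
times = pd.date_range("2000-01-01", periods=730)
ds = xr.Dataset({"t2m": ("time", np.random.rand(730))},
                coords={"time": times})

# Group by year, then write each group to its own file in one call
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"t2m_{y}.nc" for y in years]
xr.save_mfdataset(datasets, paths)
```

The resulting files can later be reopened as a single dataset with `open_mfdataset`.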
Did not know about that, thanks!!
@shoyer: does this also work with dask.distributed? The docs seem to only mention a thread pool.
I don't know if anyone has tested writing netCDFs with dask.distributed yet. I suspect the only immediate issue would be errors from dask because …
Thanks. How big of an endeavour is this? I see some free time from the 2nd to 3rd week of January, and …
Some related issues: … Will write more shortly.
There have been some efforts and progress in using many NetCDF files on a distributed POSIX filesystem (NFS, Gluster, not HDFS), but there is still some pain here. We should probably circle back and figure out what still needs to be done (do you have a firm understanding of this, @shoyer?).

HDF5 on HDFS is, I suspect, sufficiently painful that I would be tempted either to avoid HDFS or to try other formats like Zarr, which I'm somewhat biased towards (cc @alimanfoo). However, my experience has been that most climate data lives on a POSIX filesystem, so experimentation here may not be a high priority.

@JoyMonteiro, if you have time, then the first thing to do is probably to start using things and report where they're broken. I'm confident that small things will present themselves quickly :)
Playing around with things sounds like much more fun :) I can see how this will be useful; I will start thinking of some test cases to code.
@JoyMonteiro and @shoyer, as I've been thinking about this more, and especially regarding #463, I was planning on building on opener from #1128 to essentially open, read, and then close a file each time a read …
…writing. Fixes pydata#1172. The serializable lock will be useful for dask.distributed or multiprocessing (xref pydata#798, pydata#1173, among others).
#1179 will make use of …
…ing (#1179)
* Switch to a shared Lock (SerializableLock if possible) for reading and writing. Fixes #1172. The serializable lock will be useful for dask.distributed or multiprocessing (xref #798, #1173, among others).
* Test serializable lock
* Use conda-forge for builds
* Remove broken/fragile .test_lock
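The point of the SerializableLock referenced above is that, unlike a plain `threading.Lock`, it can be pickled and shipped to dask.distributed workers. A short sketch, assuming dask is installed:

```python
import pickle
from dask.utils import SerializableLock

lock = SerializableLock()

# A plain threading.Lock cannot be pickled, but SerializableLock can;
# copies carrying the same token share one underlying lock per process.
roundtripped = pickle.loads(pickle.dumps(lock))

with roundtripped:  # usable as an ordinary context-manager lock
    pass
```

This is what lets xarray guard non-thread-safe HDF5/netCDF reads and writes even when the scheduler serializes tasks to remote workers.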
Closing this old issue. We've taken care of most of the issues discussed here through the various backend updates over the past two years.
Hello @shoyer @pwolfram @mrocklin @rabernat,
I was trying to write a design/requirements doc with reference to the Columbia meetup, and I had a few queries on which I wanted your inputs (basically to ask whether they make sense or not!).
… a single file, which is not really a good option if you want to eventually do distributed processing of the data. Things like HDFS/Lustre can split files, but that is not really what we want. How do you think this issue could be solved within the xarray + dask framework?
adding a new method that would split the DataArray (based on some user guidelines) into multiple files?
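A splitting method like the one suggested could start from a helper that partitions one dimension into roughly equal slices, each destined for its own output file. This is a hypothetical sketch in plain Python, not an existing xarray API:

```python
def split_indices(size, n_pieces):
    """Yield (start, stop) pairs that partition range(size) into
    n_pieces contiguous, nearly equal slices."""
    step, rem = divmod(size, n_pieces)
    start = 0
    for i in range(n_pieces):
        # The first `rem` pieces absorb one extra element each
        stop = start + step + (1 if i < rem else 0)
        yield start, stop
        start = stop

print(list(split_indices(10, 3)))  # → [(0, 4), (4, 7), (7, 10)]
```

Each (start, stop) pair could then drive an `isel` along the chosen dimension, with the resulting pieces handed to something like `save_mfdataset`.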