open_mfdataset -> to_netcdf() randomly leading to dead workers #4710
Comments
I've run into this and usually just call How are you setting up your dask cluster? Is |
We are not setting up a dask cluster at all.
OGGM is using python multiprocessing to distribute its work.
And the distributed workers then are calling xarray functions.
|
Perhaps this is related to #3961? Did you try to call open_mfdataset with lock=False? |
Hi, @fmaussion |
This is:
I'm not sure a github issue is the right place to report this, but I'm not sure where else, so here it is.
I just had two very long weeks of debugging stalled (i.e. "dead") OGGM jobs in a cluster environment. I finally nailed it down to `ds.to_netcdf(path)` in this situation:

`tmp_paths` are a few netcdf files (from 2 to about 60). The combined dataset is nothing close to big (a few hundred MB at most).

Most of the time, this command works just fine. But in 30% of the cases, this would just... stop and stall. One or more of the workers would simply stop working without coming back or erroring.
What I can give as additional information:
Changing `ds.to_netcdf(path)` to `ds.load().to_netcdf(path)` solves the problem.

Is this some kind of weird interaction between our own multiprocessing and dask? Is it more an IO problem that occurs only on the cluster? I don't know.
I know this is a crappy bug report, but the fact that I lost a lot of time on this recently has gotten on my nerves 😉 (I'm mostly angry at myself for taking so long to find out that these two lines were the problem).
In order to make a question out of this crappy report: how can I possibly debug this? I solved my problem now (with `ds.load()`), but this is not really satisfying. Any tip is appreciated!

cc @TimoRoth our cluster IT whom I annoyed a lot before finding out that the problem was in xarray/dask
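On the "how can I possibly debug this" question, one generic option is Python's standard-library `faulthandler` module, which can dump the tracebacks of all threads in a process that has stopped making progress. The sketch below is not something that was tried in this report. The helper names `enable_stall_diagnostics` and `disable_stall_diagnostics`, the 600 s timeout, and the idea of calling the helper at the start of each worker are all assumptions made for illustration.

```python
import faulthandler
import signal
import sys


def enable_stall_diagnostics(timeout_s=600, log_file=sys.stderr):
    """Hypothetical helper: call once at the start of each worker process.

    log_file must be a real file (it needs a fileno()) and must stay open
    for as long as the diagnostics are enabled.
    """
    # Dump tracebacks of all threads on fatal errors such as a segfault.
    faulthandler.enable(file=log_file, all_threads=True)
    # Every timeout_s seconds, write the tracebacks of all threads, so a
    # stalled worker shows which call it is blocked in.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    # On Unix, also allow an on-demand dump with `kill -USR1 <pid>`.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)


def disable_stall_diagnostics():
    """Undo enable_stall_diagnostics(), e.g. once the write has finished."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
    faulthandler.disable()
```

With this enabled in a worker, a stall around `ds.to_netcdf(path)` should leave a traceback in the log showing which call each thread is waiting in, which may help tell a lock problem from an IO problem.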