open_mfdataset -> to_netcdf() randomly leading to dead workers #4710

Closed
fmaussion opened this issue Dec 18, 2020 · 5 comments

Comments

@fmaussion
Member

fmaussion commented Dec 18, 2020

This is:

  • xarray: 0.16.2
  • dask: 2.30.0

I'm not sure a GitHub issue is the right place to report this, but I'm not sure where else to go, so here it is.

I just had two very long weeks of debugging stalled (i.e. "dead") OGGM jobs in a cluster environment. I finally nailed it down to ds.to_netcdf(path) in this situation:

with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id') as ds:
    ds.to_netcdf(path)

tmp_paths is a list of paths to a few NetCDF files (from 2 to about 60). The combined dataset is nothing close to big (a few hundred MB at most).

Most of the time, this command works just fine. But in about 30% of cases it would just stop and stall: one or more of the workers would simply stop working, without ever coming back or raising an error.

What I can give as additional information:

  • changing ds.to_netcdf(path) to ds.load().to_netcdf(path) solves the problem (see the sketch after this list)
  • the problem occurs more often as the number of variables in the files to concatenate increases (the final size of the concatenated file doesn't seem to matter at all; it also occurs with files < 1 MB)
  • I can't reproduce the problem locally. The files are here if someone's interested, but I don't think the files are the issue here.
  • the files use gzip compression
  • On the cluster we are dealing with 64-core nodes, which do a lot of work before reaching these two lines. We use Python multiprocessing ourselves before that (we create our own pool and use it, etc.), but by the time the job hits these two lines, no other job is running.
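
For reference, here is a minimal sketch contrasting the two variants (assuming tmp_paths and path are defined as above; the comments reflect what I observe, not a confirmed explanation):

import xarray as xr

# Lazy write from the dask-backed combined dataset: this is the
# variant that stalls in roughly 30% of runs on the cluster.
with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id') as ds:
    ds.to_netcdf(path)

# Workaround: pull everything into memory first, so to_netcdf()
# writes plain numpy arrays and the write itself no longer goes
# through dask. This has been reliable so far.
with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id') as ds:
    ds.load().to_netcdf(path)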

Is this some kind of weird interaction between our own multiprocessing and dask? Is it more of an IO problem that only occurs on the cluster? I don't know.

I know this is a crappy bug report, but losing so much time on this recently got on my nerves 😉 (I'm mostly angry at myself for taking so long to figure out that these two lines were the problem).

To turn this crappy report into a question: how can I possibly debug this? I have worked around the problem for now (with ds.load()), but that is not really satisfying. Any tip is appreciated!

cc @TimoRoth, our cluster IT, whom I annoyed a lot before finding out that the problem was in xarray/dask

@dcherian
Contributor

I've run into this and usually just call .load() or write to zarr instead :)

How are you setting up your dask cluster? Is distributed involved?
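
For concreteness, a minimal sketch of the zarr alternative (assuming the same tmp_paths as in the report; the output store name is just a placeholder):

import xarray as xr

# Write the combined dataset to a zarr store instead of netCDF.
# Each chunk goes to its own file in the store, so the parallel
# write doesn't funnel through a single netCDF/HDF5 file handle.
with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id') as ds:
    ds.to_zarr('combined.zarr', mode='w')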

@TimoRoth
Contributor

TimoRoth commented Dec 18, 2020 via email

@markelg
Contributor

markelg commented Dec 22, 2020

Perhaps this is related to #3961? Did you try to call open_mfdataset with lock=False?
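
A minimal sketch of that call, for reference (same arguments as in the report; lock=False disables the file lock xarray would otherwise choose for the active dask scheduler):

import xarray as xr

# Same pattern as in the report, but with file locking disabled,
# as suggested above.
with xr.open_mfdataset(tmp_paths, combine='nested', concat_dim='rgi_id',
                       lock=False) as ds:
    ds.to_netcdf(path)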

@fmaussion
Member Author

Thanks for the tip @markelg.

Yes, this does indeed seem very much related. Closing this in favor of #3961.

@millet5818

Hi, @fmaussion
I encountered the same problem and solved it in the same way (ds.load().to_netcdf()). However, when the file is very large, loading it into memory does not work very well. Do you have a better solution?
