occasional lockups during dask reads #61
Yeah, a timeout on the read would be reasonable. Then we could have that timeout trigger a retry via whatever logic we implement for #18.

@mukhery I'm curious if you have a reproducer for this, or have noticed cases/datasets/patterns that tend to cause it more often?

For now, you might try playing with the GDAL timeout and retry settings via `gdal_env`. Maybe something like:

```python
retry_env = stackstac.DEFAULT_GDAL_ENV.updated(dict(
    GDAL_HTTP_TIMEOUT=45,
    GDAL_HTTP_MAX_RETRY=5,
    GDAL_HTTP_RETRY_DELAY=0.5,
))
stackstac.stack(..., gdal_env=retry_env)
```
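(For context on those options: `GDAL_HTTP_TIMEOUT` is a per-request timeout in seconds, and `GDAL_HTTP_MAX_RETRY` / `GDAL_HTTP_RETRY_DELAY` ask GDAL's HTTP layer to retry failed requests on its own, so a stuck request should eventually error out or be retried rather than hang forever. The values above are just a starting point to experiment with.)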
I tried to come up with something to reproduce but haven't been able to. We've also been seeing several other network-related/comms issues, so it's possible that our specific workload and how we've implemented the processing is causing some of these issues. I ended up just adding timeouts to the task futures and then cancelling and/or restarting the cluster if needed to meet our current need. Feel free to close this issue if you'd like and I can reopen later if I'm able to reliably reproduce.
I'll keep it open, since I think it's a reasonable thing to implement.
Curious how you implemented this?
Sounds good, thanks! I did something like this:

```python
import dask.distributed

try:
    fut = cluster.client.compute(<task_involving_stackstac_data>)
    dask.distributed.wait(fut, timeout=600)
except dask.distributed.TimeoutError as curr_exception:
    error_text = f'{curr_exception}'[:100]  # sometimes the error messages are crazy long
    print(f'task failed with exception: {error_text}')
```
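Building on that, a minimal sketch of the "cancel and/or restart" step mentioned above might look like the following (the helper name and the `restart_on_timeout` flag are illustrative, not part of stackstac or dask):

```python
import dask.distributed


def run_with_timeout(cluster, task, timeout=600, restart_on_timeout=False):
    """Compute a Dask task, cancelling it (and optionally restarting the
    cluster) if it doesn't finish within `timeout` seconds."""
    fut = cluster.client.compute(task)
    try:
        dask.distributed.wait(fut, timeout=timeout)
        return fut.result()
    except dask.distributed.TimeoutError:
        # Cancel the stuck work; if cancelled tasks keep lingering on the
        # workers, bouncing the whole cluster is the heavier fallback.
        fut.cancel()
        if restart_on_timeout:
            cluster.client.restart()
        return None
```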
Nice! That makes sense.
It seems that stackstac will occasionally hang indefinitely while doing a dataset read:

call stack:
Is it possible to pass in a timeout parameter (or something similar), or would I be better off just cancelling the job entirely when something like this happens?