Retry and resume functionality for downloader #17
- Resume has been tested
- Retry has not been tested
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #17 +/- ##
=======================================
Coverage 39.45% 39.45%
=======================================
Files 10 10
Lines 185 185
=======================================
Hits 73 73
Misses 112 112
aritraghsh09 left a comment
The changes look good to me, and I have no objections. But I have two comments:
- The files that fail even after the specified number of attempts: can we keep a running list of those object ids somewhere and then dump them as a .npy array or something similar?
- Alternatively, I guess a separate check script could be written that verifies whether an object_id.fits exists for each object_id in the download table, and then makes an array of all the object_ids for which image files were not found.
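The check script suggested above could look roughly like the following. This is only a sketch: the function name `find_missing_object_ids`, the flat directory layout, and the `<object_id>.fits` naming are assumptions taken from the comment, not from the downloader's actual code.

```python
from pathlib import Path


def find_missing_object_ids(object_ids, image_dir):
    """Return the object ids that have no <object_id>.fits file in image_dir.

    Hypothetical helper: assumes one FITS file per object id, stored
    directly in image_dir and named "<object_id>.fits".
    """
    image_dir = Path(image_dir)
    return [
        object_id
        for object_id in object_ids
        if not (image_dir / f"{object_id}.fits").exists()
    ]
```

The returned list could then be written out with `numpy.save` if a .npy file is wanted.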
    """
    # Load resume data so we start at the appropriate chunk.
    if not os.path.exists(resume_data_filename):
        return 0
Was this a RuntimeError before? Seems like it makes sense for it to be a raised exception, but curious if there's a good reason for it to be return 0.
Yeah, the reason is that if we're called with resume=True but there is no resume data, we just download from the beginning.
This avoids needing a CLI flag (or other mechanism) in download.py or higher that has to know whether the user intends to resume.
Maybe resume should really be called resume_if_possible since that is what it really means.
Oh, ok, I see what you mean. I don't have a strong opinion here, other than perhaps cleaning up the docstring and leaving a comment like "can't find the file, so starting from index 0" or something.
drewoldag left a comment
This looks pretty good. Only one little comment about return 0 vs. raise Exception.
Implementation of retry and resume within downloadCutout.py.
Retry means that when a request fails we will try again (defaults to 3 attempts). This is intended to address connection drops and the like. We use configurable exponential backoff to avoid a thundering herd if load from our client is causing some backend failure.
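For illustration, retry with exponential backoff can be sketched as below. This is not the code in this PR: the function name `retry_with_backoff` and the parameters `base_delay`, `factor`, and `sleep` are hypothetical, and only the default of 3 attempts comes from the description above.

```python
import time


def retry_with_backoff(request, max_attempts=3, base_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call request() until it succeeds or max_attempts is exhausted.

    Hypothetical sketch: the wait before retry n (0-based) is
    base_delay * factor**n, so delays grow geometrically.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * factor**attempt)
```

Passing `sleep` as a parameter is just a convenience so the backoff schedule can be checked in tests without actually waiting.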
Resume describes the situation where the download fails for unrecoverable reasons (HSC infra goes down) or is terminated (e.g. downloads only occur at night). This generates a resume_download.toml file in the download directory, which allows the exact same download to resume from the chunk that was in progress when the interruption occurred. This resume functionality is off by default to preserve the download() interface used by downloadCutout.py's CLI, which does not support resume.