Optimize task creation from CS without manifest #9827
Motivation and context
Related: #9757
When a raw-images task is created from a CS (cloud storage) without a manifest attached, CVAT downloads image headers to determine the image resolution. This operation can be quite time-consuming for big tasks, but it can be optimized fairly simply.
The default chunk size used in the downloader is 64 KB. For most image formats the required information is available in the first 1 KB, while 64 KB (the previous value) can be the size of the whole file. It is tempting to lower it to, e.g., 1500 bytes (the default Ethernet v2 MTU), and this works fine, except for JPEGs that include an embedded thumbnail (preview) image in the header, which can be of essentially any size.
Probably, this could be handled with a more advanced JPEG parser. It doesn't seem reasonable to keep the reduced chunk size and download the whole image for such files, as such JPEGs appear to be quite common, but maybe it can be implemented as an exception just for the JPEG format. Now, multiple header sizes are attempted per file.

AWS connection limits are floating and depend on the object key names. In the worst case, we can expect about 100 connections per prefix, up to effectively unlimited in the best case (random prefixes). It's also possible to get throttled by AWS (e.g. 503 Slow Down); this should be handled by the boto library itself.
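To illustrate the "multiple header sizes per file" approach, here is a minimal sketch (not the actual CVAT implementation; the function names and the size ladder are hypothetical): it scans a JPEG prefix for a Start-of-Frame marker and, if the size info is pushed back by a large embedded thumbnail, retries with a larger prefix.

```python
import struct

SIZE_LADDER = (1024, 16 * 1024, 64 * 1024)  # hypothetical escalation steps

def jpeg_size(data):
    """Scan JPEG markers for a Start-of-Frame segment and return (width, height),
    or None if the size info is not within `data` (e.g. pushed back by a large
    embedded thumbnail)."""
    if data[:2] != b"\xff\xd8":  # SOI marker
        return None
    i = 2
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            i += 1
            continue
        marker = data[i + 1]
        if marker == 0xFF:  # fill byte
            i += 1
            continue
        if marker in (0x01, 0xD8, 0xD9) or 0xD0 <= marker <= 0xD7:
            i += 2  # standalone markers carry no length field
            continue
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):  # SOFn
            if i + 9 <= len(data):
                height, width = struct.unpack(">HH", data[i + 5:i + 9])
                return (width, height)
            return None  # SOF found but truncated by the prefix
        seg_len = struct.unpack(">H", data[i + 2:i + 4])[0]
        i += 2 + seg_len  # skip this segment
    return None

def probe_resolution(fetch_prefix, sizes=SIZE_LADDER):
    """Try progressively larger header prefixes; fetch_prefix(n) is assumed to
    return the first n bytes of the object (e.g. via an HTTP Range request)."""
    for n in sizes:
        dims = jpeg_size(fetch_prefix(n))
        if dims is not None:
            return dims
    return None  # caller falls back to downloading the whole file
```

With this scheme, an ordinary JPEG costs one small request, and only thumbnail-heavy files pay for the larger retries.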
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
https://stackoverflow.com/questions/37432285/maximum-no-of-connections-that-can-be-held-by-s3
https://repost.aws/knowledge-center/http-5xx-errors-s3
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html
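For reference, both the connection-pool size and the retry behavior discussed above are controlled through botocore's `Config` object; a sketch follows (the 64-connection pool mirrors the numbers below, but the exact settings used by CVAT are an assumption):

```python
# Sketch only: shows the botocore knobs relevant here; actual CVAT settings may differ.
from botocore.config import Config

s3_config = Config(
    max_pool_connections=64,  # default is 10; raised so parallel header fetches reuse connections
    retries={
        "mode": "adaptive",   # client-side rate limiting plus retries on throttling (503 Slow Down)
        "max_attempts": 5,
    },
)

# client = boto3.client("s3", config=s3_config)
```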
Details:
- test dataset: 26822 .jpg images
- before: 550s
- with queue: 320-350s
- with reduced chunk size (1 MTU): 220-280s
- with improved connection reuse for AWS: up to 82s (64 connections for 16 cores, up from the 10 kept alive with the default config)
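The "with queue" variant can be sketched as a plain thread pool over the per-file header fetches (the worker-count heuristic and helper names are hypothetical); header probing is network-bound, so threads keep many connections busy per core:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def probe_headers(keys, fetch_header, workers=None):
    """Run fetch_header(key) for each key concurrently and return {key: result}.

    fetch_header is assumed to download and parse one image header; since the
    work is I/O-bound, far more threads than cores is reasonable.
    """
    if workers is None:
        workers = 4 * (os.cpu_count() or 1)  # hypothetical sizing heuristic
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(keys, pool.map(fetch_header, keys)))
```

Combined with a larger `max_pool_connections`, this lets the pool actually reuse keep-alive connections instead of opening a new one per request.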
How has this been tested?
Sample script for testing
Checklist
- I submit my changes into the develop branch
- License: I submit my code changes under the same license that covers the project. Feel free to contact the maintainers if that's a concern.