Optimize task creation from CS without manifest #9827
Motivation and context
Related: #9757
When a raw-images task is created from a CS (cloud storage) without a manifest attached, CVAT downloads image headers to determine the image resolution. This operation can be quite time-consuming for big tasks, but it can be optimized fairly simply.
The default chunk size used in the downloader is 64 KB. For most image formats the required information is available in the first 1 KB, while 64 KB (the previous value) can be the size of the whole file. It is tempting to lower it to, e.g., 1500 bytes (the default Ethernet v2 MTU), and this works fine, except for JPEGs that include an embedded thumbnail (preview) image in the header, which can be of essentially any size.
Probably, this could be handled with a more advanced JPEG parser. It doesn't seem reasonable to keep the reduced chunk size and download the whole image for such files, as such JPEGs appear to be quite common, but maybe it can be implemented as an exception just for the JPEG format. Now, multiple header sizes are attempted per file.

AWS connection limits are floating and depend on the object key names. In the worst case, we can expect about 100 connections per prefix, up to effectively unlimited in the best case (random prefixes). It's also possible to get throttled by AWS (e.g. 503 Slow Down); this should be handled by the boto library itself.
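To illustrate the "multiple header sizes per file" approach, here is a minimal sketch (not the actual CVAT implementation; the function names and the size ladder are hypothetical): it scans a JPEG prefix for a Start-of-Frame marker and, if the size info is pushed back by a large embedded thumbnail, retries with a larger prefix.

```python
import struct

SIZE_LADDER = (1024, 16 * 1024, 64 * 1024)  # hypothetical escalation steps

def jpeg_size(data):
    """Scan JPEG markers for a Start-of-Frame segment and return (width, height),
    or None if the size info is not within `data` (e.g. pushed back by a large
    embedded thumbnail)."""
    if data[:2] != b"\xff\xd8":  # SOI marker
        return None
    i = 2
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            i += 1
            continue
        marker = data[i + 1]
        if marker == 0xFF:  # fill byte
            i += 1
            continue
        if marker in (0x01, 0xD8, 0xD9) or 0xD0 <= marker <= 0xD7:
            i += 2  # standalone markers carry no length field
            continue
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):  # SOFn
            if i + 9 <= len(data):
                height, width = struct.unpack(">HH", data[i + 5:i + 9])
                return (width, height)
            return None  # SOF found but truncated by the prefix
        seg_len = struct.unpack(">H", data[i + 2:i + 4])[0]
        i += 2 + seg_len  # skip this segment
    return None

def probe_resolution(fetch_prefix, sizes=SIZE_LADDER):
    """Try progressively larger header prefixes; fetch_prefix(n) is assumed to
    return the first n bytes of the object (e.g. via an HTTP Range request)."""
    for n in sizes:
        dims = jpeg_size(fetch_prefix(n))
        if dims is not None:
            return dims
    return None  # caller falls back to downloading the whole file
```

With this scheme, an ordinary JPEG costs one small request, and only thumbnail-heavy files pay for the larger retries.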
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
https://stackoverflow.com/questions/37432285/maximum-no-of-connections-that-can-be-held-by-s3
https://repost.aws/knowledge-center/http-5xx-errors-s3
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html
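For reference, both the connection-pool size and the retry behavior discussed above are controlled through botocore's `Config` object; a sketch follows (the 64-connection pool mirrors the numbers below, but the exact settings used by CVAT are an assumption):

```python
# Sketch only: shows the botocore knobs relevant here; actual CVAT settings may differ.
from botocore.config import Config

s3_config = Config(
    max_pool_connections=64,  # default is 10; raised so parallel header fetches reuse connections
    retries={
        "mode": "adaptive",   # client-side rate limiting plus retries on throttling (503 Slow Down)
        "max_attempts": 5,
    },
)

# client = boto3.client("s3", config=s3_config)
```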
Details:
- test dataset: 26822 .jpg images
- before: 550s
- with queue: 320-350s
- with reduced chunk size (1 MTU): 220-280s
- with improved connection reuse for AWS: up to 82s (64 connections for 16 cores, up from the 10 kept alive with the default config)
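The "with queue" variant can be sketched as a plain thread pool over the per-file header fetches (the worker-count heuristic and helper names are hypothetical); header probing is network-bound, so threads keep many connections busy per core:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def probe_headers(keys, fetch_header, workers=None):
    """Run fetch_header(key) for each key concurrently and return {key: result}.

    fetch_header is assumed to download and parse one image header; since the
    work is I/O-bound, far more threads than cores is reasonable.
    """
    if workers is None:
        workers = 4 * (os.cpu_count() or 1)  # hypothetical sizing heuristic
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(keys, pool.map(fetch_header, keys)))
```

Combined with a larger `max_pool_connections`, this lets the pool actually reuse keep-alive connections instead of opening a new one per request.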
How has this been tested?
Sample script for testing
Checklist
- I submit my changes into the develop branch
- License: I submit my code changes under the same license that covers the project. Feel free to contact the maintainers if that's a concern.