@zhiltsov-max zhiltsov-max commented Sep 15, 2025

Motivation and context

Related: #9757

When a task with raw images is created from a cloud storage (CS) without an attached manifest, CVAT downloads the image headers to get the image resolutions. This operation can be quite time-consuming for big tasks, but it can be optimized fairly simply.

  • Improved performance of CS image header downloading and manifest creation ~2-6x

The default chunk size used in the downloader is 64KB (the previous value). For most image formats, the required information is available in the first 1KB, while 64KB can be the size of the whole file. It is tempting to change it to a lower value, e.g. 1500 bytes (the default Ethernet v2 MTU size), and it works fine, except for JPEGs that include an embedded thumbnail (preview) image in the header, which can be of basically any size. This could probably be handled with a more advanced JPEG parser. It doesn't seem reasonable to use the reduced chunk size and then download the whole image for such files, as such JPEGs seem to be quite common, but maybe it could be implemented as an exception just for the JPEG format. Now, multiple header sizes are attempted per file.
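To illustrate the idea of trying several header sizes, here is a minimal sketch, not the code from this PR. It walks the JPEG marker segments in a downloaded prefix and reports the size only if the start-of-frame segment is already inside the prefix. The names `jpeg_size_from_prefix`, `probe_size`, `read_prefix` and the list of sizes are hypothetical and chosen just for this example.

```python
import struct
from typing import Callable, Optional, Tuple

# Start-of-frame markers that carry the image dimensions.
# 0xC4 (DHT), 0xC8 (JPG) and 0xCC (DAC) are not frame headers.
SOF_MARKERS = {
    0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
    0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF,
}


def jpeg_size_from_prefix(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height), or None if the prefix ends before the SOF segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("corrupt marker")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker in (0x01, 0xD8) or 0xD0 <= marker <= 0xD7:
            pos += 2  # standalone markers have no length field
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 9 > len(data):
                return None  # SOF started, but the dimensions are cut off
            # Layout: marker(2) length(2) precision(1) height(2) width(2)
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + length  # skip the segment, e.g. APP1 with a thumbnail
    return None


def probe_size(
    read_prefix: Callable[[int], bytes],
    sizes=(1500, 16 * 1024, 64 * 1024),  # assumed values, for illustration only
):
    """Try growing prefixes until the dimensions can be read."""
    for size in sizes:
        result = jpeg_size_from_prefix(read_prefix(size))
        if result is not None:
            return result, size
    return None, None
```

Here `read_prefix(n)` stands for any function that returns the first `n` bytes of the object, for example a ranged download. A file with a large embedded thumbnail returns `None` for the 1500-byte attempt and succeeds on a larger one, while most files finish on the first attempt.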

AWS connection limits are not fixed and depend on the data filenames. In the worst case, we can expect about 100 connections per prefix; in the best case (random prefixes), the number is effectively unlimited. It's also possible to get throttled by AWS (e.g. 503 Slow Down); this should be handled by the boto library itself.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
https://stackoverflow.com/questions/37432285/maximum-no-of-connections-that-can-be-held-by-s3
https://repost.aws/knowledge-center/http-5xx-errors-s3
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html

Details:
test dataset: 26822 .jpg images
before: 550s
with queue: 320 - 350s
with reduced chunk size (1 MTU): 220 - 280s
with improved connection reuse for AWS: up to 82s (64 connections for 16 cores, up from 10 kept alive with the default config)

How has this been tested?

Sample script for testing

from time import perf_counter
from tempfile import TemporaryDirectory

from tqdm import tqdm

from cvat.apps.engine import models, cloud_provider
from utils.dataset_manifest.core import ImageManifestManager


cloud_storage = models.CloudStorage.objects.get(id=yourcs)  # replace yourcs with your cloud storage id
storage_client = cloud_provider.db_storage_to_storage_instance(cloud_storage)

media = [v["name"] for v in storage_client.list_files(prefix="images/", _use_flat_listing=True)]

header_downloader = cloud_provider.HeaderFirstMediaDownloader.create(
    models.DimensionType.DIM_2D, client=storage_client
)
content_generator = (
    v
    for v in tqdm(
        storage_client.bulk_download_to_memory(media, object_downloader=header_downloader.download),
        total=len(media),
    )
)


with TemporaryDirectory() as tempdir:
    start_time = perf_counter()

    manifest = ImageManifestManager(tempdir, upload_dir=tempdir, create_index=False)
    manifest.link(
        sources=content_generator,
        stop=len(media) - 1,
        DIM_3D=False,
    )
    manifest.create()

    duration = perf_counter() - start_time
    print(
        f"Manifest for {len(media)} files created in",
        duration,
        "seconds",
        f"avg. {duration / (len(media) or 1)}s.",
    )

# run with:
# cat test_cs_downloading.py | python manage.py shell

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

@zhiltsov-max zhiltsov-max changed the title Use queue, reduce chunk size Optimize task creation from CS without manifest Sep 15, 2025
codecov-commenter commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.44%. Comparing base (5499aa9) to head (ee42567).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9827      +/-   ##
===========================================
+ Coverage    73.39%   75.44%   +2.04%     
===========================================
  Files          410      410              
  Lines        45663    45690      +27     
  Branches      4086     4086              
===========================================
+ Hits         33516    34472     +956     
+ Misses       12147    11218     -929     
Components Coverage Δ
cvat-ui 77.15% <ø> (ø)
cvat-server 74.00% <75.00%> (+3.78%) ⬆️

@zhiltsov-max zhiltsov-max marked this pull request as draft September 23, 2025 07:58
@zhiltsov-max zhiltsov-max marked this pull request as ready for review September 23, 2025 16:31