Multimodal datagen: configurable distinct-media pool + cross-request reuse policy #498

@Bslabe123

What

SyntheticMultimodalDatagenConfig has no way to express:

  1. A bounded pool of N distinct items per modality (image / video / audio).
  2. A sampling / reuse policy over that pool across requests.

Today each request renders fresh random bytes for every image and every audio clip (see MultimodalDataGenerator._build_spec in inference_perf/datagen/multimodal_datagen.py and the deterministic=False materialization path in ChatCompletionAPIData._materialize_multimodal_content in inference_perf/apis/chat.py). Video is the exception: inference_perf/mediagen/pool.py keeps a hidden VideoBytesPool of 4 items per (w, h, frames) profile, but its size is not user-configurable.

The only cross-request reuse path is SharedPrefixDataGenerator, which is prefix-side only and binary at the group level (prefix_cache_key).

Why this matters

Real VLM workloads draw from a finite media corpus and reuse items at non-uniform rates (a recently uploaded asset hit by many users, a hot product image, etc.). Without a pool and a reuse policy we can't measure server-side multimodal cache behavior, dedup, or encoder-cache reuse under realistic distributions.

A stakeholder hit this directly: there is currently no way to express

request 1: <text 1> <image1> <text 2> <image2>
request 2: <text 3> <image2>

i.e. image2 reused across two requests.

Proposed config shape (sketch, for discussion)

Reuse the existing Distribution type for the sampling policy rather than inventing a parallel one. Distribution already covers uniform / fixed; adding zipf to DistributionType covers the heavy-tail case that real content reuse follows. Explicit weights mirror the existing WeightedResolution / WeightedVideoProfile pattern.

multimodal:
  image:
    count: { ... }              # existing
    resolutions: [ ... ]        # existing
    pool:
      size: 100                 # total distinct images materialized once
      sampling:                 # Distribution over indices [0, size-1]
        type: zipf
        min: 0
        max: 99
        skew: 1.1               # Zipf exponent (reuse skew field)
      # or, for explicit weights:
      # weights: [ { index: 0, weight: 10 }, { index: 1, weight: 1 }, ... ]
  video:
    pool:
      size: 16
      sampling: { type: uniform, min: 0, max: 15 }
  audio:
    pool:
      size: 50
      sampling: { type: zipf, min: 0, max: 49, skew: 0.9 }

Omitting pool keeps today's behavior: fresh bytes per request.
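
To make the pool semantics above concrete, here is a minimal sketch of a bounded pool with uniform or Zipf sampling over indices. MediaPool, its render callback, and the seed parameter are hypothetical names for illustration, not existing inference_perf APIs; the Zipf weighting (weight of index i proportional to 1 / (i + 1) ** skew) is one plausible reading of the skew field, not a settled definition.

```python
import itertools
import random
from typing import Callable, List, Optional


class MediaPool:
    """Hypothetical bounded pool: `size` blobs rendered once, then sampled by index."""

    def __init__(
        self,
        size: int,
        render: Callable[[int], bytes],
        skew: Optional[float] = None,
        seed: int = 0,
    ) -> None:
        if size < 1:
            raise ValueError("pool size must be >= 1")
        # Materialize every item exactly once, up front.
        self._items: List[bytes] = [render(i) for i in range(size)]
        # skew=None means uniform; otherwise bounded Zipf with
        # weight(index i) = 1 / (i + 1) ** skew (assumed interpretation).
        weights = [1.0 if skew is None else 1.0 / (i + 1) ** skew for i in range(size)]
        self._cum_weights = list(itertools.accumulate(weights))
        self._rng = random.Random(seed)

    def sample_index(self) -> int:
        """Draw one pool index according to the configured weights."""
        indices = range(len(self._items))
        return self._rng.choices(indices, cum_weights=self._cum_weights, k=1)[0]

    def sample(self) -> bytes:
        """Return the already-rendered blob for a sampled index."""
        return self._items[self.sample_index()]
```

With size=100 and skew=1.1, low indices are drawn far more often than high ones, and no request can ever carry a blob outside the 100 rendered up front. Explicit weights would slot in by replacing the computed weights list.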

Acceptance

  • Per-modality pool.size honored: at most N distinct rendered blobs.
  • Per-modality pool.sampling honored across requests.
  • Pool is per-process (consistent with the VideoBytesPool model), so multi-worker loadgen doesn't need IPC.
  • pool and prefix_multimodal interact cleanly (prefix bytes still deterministic per group; payload bytes drawn from the pool).
  • Existing VideoBytesPool subsumed by the new mechanism (image / audio / video unified) so the hardcoded pool_size=4 is replaced by pool.size.
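
One consequence of a per-process pool is that W workers could materialize up to W × N distinct blobs unless each worker renders the same bytes for the same index. The sketch below shows one way to get that without IPC, assuming cross-worker reuse is actually desired (the issue does not say so). render_pool_item and its parameters are hypothetical, and the raw bytes stand in for a real encoded image / audio / video payload.

```python
import hashlib


def render_pool_item(seed: int, modality: str, index: int, n_bytes: int) -> bytes:
    """Hypothetical helper: pseudo-random bytes that depend only on the arguments."""
    key = f"{seed}:{modality}:{index}".encode("utf-8")
    # SHAKE-256 is an extendable-output hash, so it can emit any requested length.
    return hashlib.shake_256(key).digest(n_bytes)
```

Because the output depends only on (seed, modality, index), two workers configured with the same seed produce byte-identical pool items independently.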

Open questions

  • Distribution's mean / std_dev defaults (512 / 200) are meaningless for an index range. Should a validator require min / max whenever Distribution is used as a pool sampler, or should a thin IndexDistribution alias override the defaults?
  • Add ZIPF as a new DistributionType value, or store the exponent on skew only when type=zipf?
  • Subsume the existing VideoBytesPool (cleaner, default pool.size=4 to preserve behavior) vs. leave it alone and only add new knobs for image / audio (smaller blast radius)?
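
As a starting point for the first two questions, here is a rough sketch of the validator option using a plain dataclass. IndexDistribution here is a stand-in written for this discussion, not the existing Distribution model, and the rules it enforces (explicit min / max, range inside the pool, skew only for zipf) are assumptions to be debated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexDistribution:
    """Hypothetical stand-in for Distribution when it samples pool indices."""

    type: str                      # "uniform", "fixed", or the proposed "zipf"
    min: Optional[int] = None
    max: Optional[int] = None
    skew: Optional[float] = None   # only meaningful when type == "zipf"

    def validate_for_pool(self, pool_size: int) -> None:
        """Raise ValueError if this distribution cannot index a pool of `pool_size`."""
        if self.min is None or self.max is None:
            raise ValueError("pool sampling requires explicit min and max")
        if not 0 <= self.min <= self.max <= pool_size - 1:
            raise ValueError(
                f"index range [{self.min}, {self.max}] must lie within [0, {pool_size - 1}]"
            )
        if self.type == "zipf":
            if self.skew is None or self.skew <= 0:
                raise ValueError("zipf sampling requires skew > 0")
        elif self.skew is not None:
            raise ValueError("skew is only valid when type is 'zipf'")
```

For example, IndexDistribution(type="zipf", min=0, max=99, skew=1.1).validate_for_pool(100) passes, while the same distribution checked against a pool of 50 is rejected because max exceeds 49.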
