What
SyntheticMultimodalDatagenConfig has no way to express:
- A bounded pool of N distinct items per modality (image / video / audio).
- A sampling / reuse policy over that pool across requests.
Today each request renders fresh random bytes per image and per audio (see MultimodalDataGenerator._build_spec in inference_perf/datagen/multimodal_datagen.py and the deterministic=False materialization path in ChatCompletionAPIData._materialize_multimodal_content in inference_perf/apis/chat.py). Video has a hidden VideoBytesPool of 4 per (w, h, frames) profile in inference_perf/mediagen/pool.py, not user-configurable.
The only cross-request reuse path is SharedPrefixDataGenerator, which is prefix-side only and binary at the group level (prefix_cache_key).
Why this matters
Real VLM workloads draw from a finite media corpus and reuse items at non-uniform rates (recently-uploaded asset hit by many users, hot product image, etc.). Without a pool + reuse policy we can't measure server-side multimodal cache behavior, dedup, or encoder-cache reuse under realistic distributions.
A stakeholder hit this directly: there is currently no way to express
request 1: <text 1> <image1> <text 2> <image2>
request 2: <text 3> <image2>
i.e. image2 reused across two requests.
Proposed config shape (sketch, for discussion)
Reuse the existing Distribution type for the sampling policy rather than inventing a parallel one. Distribution already covers uniform / fixed; adding zipf to DistributionType covers the heavy-tail case that real content reuse follows. Explicit weights mirror the existing WeightedResolution / WeightedVideoProfile pattern.
multimodal:
  image:
    count: { ... }          # existing
    resolutions: [ ... ]    # existing
    pool:
      size: 100             # total distinct images materialized once
      sampling:             # Distribution over indices [0, size-1]
        type: zipf
        min: 0
        max: 99
        skew: 1.1           # Zipf exponent (reuse skew field)
        # or, for explicit weights:
        # weights: [ { index: 0, weight: 10 }, { index: 1, weight: 1 }, ... ]
  video:
    pool:
      size: 16
      sampling: { type: uniform, min: 0, max: 15 }
  audio:
    pool:
      size: 50
      sampling: { type: zipf, min: 0, max: 49, skew: 0.9 }
Omitting pool keeps today's behavior: fresh bytes per request.
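To make the sampling semantics concrete, here is a minimal sketch of a bounded Zipf sampler over pool indices. It is illustrative only: PoolIndexSampler and zipf_weights are hypothetical names, not existing inference_perf APIs, and the weighting (index k gets weight 1 / (k + 1) ** skew, normalized over the pool) is one possible reading of the skew field. A bounded form is assumed because it stays well defined for exponents at or below 1, such as the 0.9 in the audio example.

```python
import random
from itertools import accumulate
from typing import List, Optional


def zipf_weights(size: int, skew: float) -> List[float]:
    """Unnormalized bounded-Zipf weights: index k gets 1 / (k + 1) ** skew."""
    if size <= 0:
        raise ValueError("size must be positive")
    if skew < 0:
        raise ValueError("skew must be non-negative")
    return [1.0 / (k + 1) ** skew for k in range(size)]


class PoolIndexSampler:
    """Hypothetical helper: draws pool indices in [0, size - 1]."""

    def __init__(self, size: int, skew: float = 0.0, seed: Optional[int] = None) -> None:
        self._indices = range(size)
        # Cumulative weights are precomputed once so each draw is cheap.
        self._cum_weights = list(accumulate(zipf_weights(size, skew)))
        self._rng = random.Random(seed)

    def sample(self) -> int:
        # skew == 0 degenerates to a uniform draw over the pool.
        return self._rng.choices(self._indices, cum_weights=self._cum_weights, k=1)[0]


if __name__ == "__main__":
    sampler = PoolIndexSampler(size=100, skew=1.1, seed=0)
    draws = [sampler.sample() for _ in range(10_000)]
    print("hits on index 0:", draws.count(0), "hits on index 50:", draws.count(50))
```

With this weighting, low indices are the hot items, so a server-side cache should see most of its hits on a small head of the pool.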
Acceptance
- Per-modality pool.size honored: at most N distinct rendered blobs.
- Per-modality pool.sampling honored across requests.
- Pool is per-process (consistent with the VideoBytesPool model) so multi-worker loadgen doesn't need IPC.
- pool and prefix_multimodal interact cleanly (prefix bytes still deterministic per group; payload bytes drawn from the pool).
- Existing VideoBytesPool subsumed by the new mechanism (image / audio / video unified) so the hardcoded pool_size=4 is replaced by pool.size.
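The first and third criteria could be checked against something like the sketch below. It assumes a per-process, in-memory pool that renders each slot lazily on first use; MediaBytesPool and its render callback are hypothetical names, and whether the real pool materializes eagerly or lazily is left open here. os.urandom stands in for the actual image / audio / video renderer.

```python
import os
from typing import Callable, Dict


class MediaBytesPool:
    """Hypothetical per-process pool holding at most `size` distinct blobs."""

    def __init__(self, size: int, render: Callable[[int], bytes]) -> None:
        if size <= 0:
            raise ValueError("pool size must be positive")
        self._size = size
        self._render = render
        self._blobs: Dict[int, bytes] = {}

    def get(self, index: int) -> bytes:
        if not 0 <= index < self._size:
            raise IndexError(f"index {index} outside [0, {self._size - 1}]")
        # Render each slot once; later requests reuse the same bytes.
        if index not in self._blobs:
            self._blobs[index] = self._render(index)
        return self._blobs[index]

    @property
    def distinct_rendered(self) -> int:
        return len(self._blobs)


if __name__ == "__main__":
    # Stand-in renderer: random bytes instead of a real encoded image.
    pool = MediaBytesPool(size=4, render=lambda _i: os.urandom(64))
    blobs = [pool.get(i % 4) for i in range(100)]
    print("distinct blobs served:", len(set(blobs)))
```

Because the pool lives in process memory, each loadgen worker would hold its own copy, which matches the no-IPC criterion but means the number of distinct blobs seen by the server scales with the worker count unless rendering is seeded per index.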
Open questions
- Distribution's mean / std_dev defaults (512 / 200) are meaningless for an index range. Validator requiring min / max when Distribution is used as a pool sampler, or a thin IndexDistribution alias that overrides defaults?
- Add ZIPF as a new DistributionType value, or store the exponent on skew only when type=zipf?
- Subsume the existing VideoBytesPool (cleaner, default pool.size=4 to preserve behavior) vs. leave it alone and only add new knobs for image / audio (smaller blast radius)?
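For discussion, the validator option from the first question might look roughly like this. It is a standalone sketch using a dataclass rather than the project's actual Distribution model; the IndexDistribution fields, the validate_for_pool method, and the rule that skew is only accepted when type is zipf are all assumptions, not settled design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexDistribution:
    """Hypothetical pool-sampler config; not the project's Distribution class."""

    type: str
    min: Optional[int] = None
    max: Optional[int] = None
    skew: Optional[float] = None

    def validate_for_pool(self, size: int) -> None:
        """Raise ValueError unless this describes a sampler over [0, size - 1]."""
        if self.type not in ("uniform", "fixed", "zipf"):
            raise ValueError(f"unsupported sampler type: {self.type!r}")
        # No mean / std_dev fallbacks: an index range must be explicit.
        if self.min is None or self.max is None:
            raise ValueError("min and max are required for a pool sampler")
        if not 0 <= self.min <= self.max <= size - 1:
            raise ValueError(f"min / max must lie within [0, {size - 1}]")
        if self.type == "zipf":
            if self.skew is None or self.skew <= 0:
                raise ValueError("zipf requires a positive skew")
        elif self.skew is not None:
            raise ValueError("skew is only meaningful when type is zipf")


if __name__ == "__main__":
    IndexDistribution(type="zipf", min=0, max=99, skew=1.1).validate_for_pool(100)
    print("config accepted")
```

The same checks would carry over to a validator on the existing type; the alias variant mainly buys a place to drop the 512 / 200 defaults.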