[WIP] Add MMMU Dataset#478
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Bslabe123 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via: Supported
Alternatively, if the co-author should not be included, remove the Please update your commit message(s) by doing |
ShareGPT4Video is a gated HuggingFace video-captioning corpus (https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video, ~1.5 TB total, 15-21 GB per source zip). The loader streams captions + keyframe indices and pulls source MP4 zips into a local cache directory via a background thread, so the hot request path never blocks on the network. Init blocks only until the first zip has been extracted so there is always at least one usable video when load generation begins; the on-disk pool grows over time as background downloads complete. Each request decodes the configured keyframes via PyAV, resizes to ``target_resolution``, re-encodes as PNG or JPEG, and emits the frames as ``PreEncodedFramesVideoSpec`` (one ``image_url`` block per frame at a single insertion point) — compatible with VLMs that don't accept ``video_url`` natively. MP4 representation is explicitly rejected for v1. ``GatedHFDataset`` mixin in ``inference_perf/datagen/gated_hf_dataset.py`` resolves the HF access token (config field → ``HF_TOKEN`` → ``HUGGING_FACE_HUB_TOKEN``) and wraps ``load_dataset`` so other gated loaders (MMMU) can reuse the pattern. This commit also folds in two adjacent changes that the loader and its parent issue rely on: - ``config.py``: ``SessionReplayConfig`` base extracted from ``OTelTraceReplayConfig``, picking up ``inject_random_session_id``, ``duplicate_sessions_target``, ``max_wait_ms``, ``hf_dataset_path`` on OTel and ``max_model_len`` on ``ConversationReplayConfig``. - ``chat.py``: tool-call wire support (``tool_calls`` / ``tool_call_id`` on ``ChatMessage``, ``tool_definitions`` → ``tools`` on the request payload, JSON-Schema normalisation for vLLM/xgrammar compatibility). Addresses kubernetes-sigs#475.
- ``tests/optional/manual/sharegpt4video/cases/{png,jpeg}_frames/``
— Kubernetes manual e2e cases. Each case deploys a Qwen3-VL vLLM
server and runs an inference-perf Job that hits ShareGPT4Video
through the gated HuggingFace dataset. ``hf-secret`` is expected
in the ``default`` namespace; the orchestrator copies it into the
per-case namespace before applying the workload.
- ``tests/optional/manual/sharegpt4video/e2e_test.sh`` — orchestrator
that iterates every case folder, runs each in its own namespace,
extracts the lifecycle reports, and verifies non-zero successes /
zero failures.
- ``e2e/tests/test_multimodal_sim.py::test_sharegpt4video_against_sim``
— Python e2e test against llm-d-inference-sim. Skipped by default;
requires ``SHAREGPT4VIDEO_CACHE_DIR`` to be pre-populated (the
loader does not auto-download in this test) and ``HF_TOKEN``.
Pinned to small resolution / 4 frames so the run stays fast.
Real-dataset loader for the gated HuggingFace MMMU corpus (https://huggingface.co/datasets/MMMU/MMMU), a college-level VLM evaluation benchmark spanning 30 subjects across 6 disciplines. Streams examples, extracts embedded PIL images, re-encodes them as PNG or JPEG, and emits chat-completion requests with question + options text plus 1-7 image_url blocks per request. Document- and diagram-heavy workload shape that the synthetic path doesn't exercise well. Wires PreEncodedImageSpec into the chat.py materializer (image loop now isinstance-dispatches on the discriminated union the same way the video loop already did). MMMU is small enough to fit in memory, so init loads all configured subjects synchronously and load_lazy_data is O(1) — no background download thread. Addresses kubernetes-sigs#476. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the sharegpt4video pattern from the parent commit:
- ``tests/optional/manual/mmmu/cases/{png,jpeg}/`` — Kubernetes manual e2e
cases. Each case deploys a Qwen3-VL vLLM server and runs an inference-perf
Job that hits MMMU through the gated HuggingFace dataset. ``hf-secret`` is
expected in the ``default`` namespace; the orchestrator copies it into the
per-case namespace before applying the workload.
- ``tests/optional/manual/mmmu/e2e_test.sh`` — orchestrator that iterates
every case folder, runs each in its own namespace, extracts the lifecycle
reports, and verifies non-zero successes / zero failures.
- ``e2e/tests/test_multimodal_sim.py::test_mmmu_against_sim`` — Python e2e
test against llm-d-inference-sim. Skipped by default; requires HF_TOKEN
(no cache pre-population needed — MMMU is small enough that the HF
datasets cache handles it transparently). Pinned to one subject
(Computer_Science) and aggressively resized so init stays fast.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
DO NOT REVIEW
Addresses: #476