
[Question] Clarification on Inference Strategy for VQA Benchmarks (Standard Sampling vs. Streaming KV Cache) #15

@lern-to-write

Description

Hello, thank you for your impressive work on StreamingVLM!

I have been diving into the codebase to understand how the model handles different tasks, and I have a question regarding the inference strategy used for the VQA benchmarks reported in the paper (e.g., Video-MME, MVBench, LongVideoBench).

Observation
I noticed a distinction in the implementation between the streaming demo and the VQA evaluation:

Streaming Inference: In streaming_vlm/inference/inference.py, the process_past_kv function handles KV cache reuse, text sinks, and a sliding window with eviction, so that infinite streams can be processed without running out of memory (see the eviction sketch after this list).
VQA Evaluation: In streaming_vlm/eval/VLMEvalKit, the configuration in video_dataset_config.py suggests that VQA tasks use standard frame sampling strategies (e.g., nframe=64 or fps=1.0, which are mutually exclusive options; see the sampling sketch below). The VQA evaluation appears to follow a standard model.generate pipeline, without the KV cache compression/eviction logic used in streaming mode.
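
For reference, here is a minimal sketch of my mental model of the sink-plus-sliding-window eviction. The function name, tensor shapes, and defaults below are my own illustrative assumptions, not the actual process_past_kv implementation:

```python
# Minimal sketch of sink + sliding-window KV eviction (my mental model).
# NOT the actual process_past_kv implementation; names, shapes, and
# defaults are illustrative assumptions on my part.
import torch

def evict_kv(past_kv, num_sink_tokens=4, window_size=1024):
    """Keep the first `num_sink_tokens` KV entries (attention sinks) plus
    the most recent `window_size` entries; drop everything in between.

    `past_kv` is a list of (key, value) tuples, one per layer, each of
    shape [batch, num_heads, seq_len, head_dim].
    """
    new_past_kv = []
    for key, value in past_kv:
        seq_len = key.shape[2]
        if seq_len <= num_sink_tokens + window_size:
            new_past_kv.append((key, value))  # nothing to evict yet
            continue
        # Concatenate the sink prefix with the recent window along seq dim.
        k = torch.cat([key[:, :, :num_sink_tokens], key[:, :, -window_size:]], dim=2)
        v = torch.cat([value[:, :, :num_sink_tokens], value[:, :, -window_size:]], dim=2)
        new_past_kv.append((k, v))
    return new_past_kv

# Toy usage: one layer, batch 1, 2 heads, 1500 cached tokens, head_dim 8.
past = [(torch.randn(1, 2, 1500, 8), torch.randn(1, 2, 1500, 8))]
trimmed = evict_kv(past, num_sink_tokens=4, window_size=1024)
print(trimmed[0][0].shape)  # torch.Size([1, 2, 1028, 8])
```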

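And here is how I read the mutually exclusive nframe/fps sampling options; the function below is a hypothetical illustration of that reading, not VLMEvalKit's actual API:

```python
# Hypothetical illustration of mutually exclusive nframe/fps sampling;
# the function name and signature are my own, not VLMEvalKit's API.
def sample_frame_indices(total_frames, video_fps, nframe=None, fps=None):
    """Return frame indices using either a fixed count (nframe) or a fixed
    sampling rate (fps), but never both."""
    if (nframe is None) == (fps is None):
        raise ValueError("Exactly one of nframe or fps must be set.")
    if nframe is not None:
        # Uniformly pick nframe frames across the whole video.
        step = total_frames / nframe
        return [int(step * i + step / 2) for i in range(nframe)]
    # Sample one frame every (video_fps / fps) source frames.
    stride = max(1, round(video_fps / fps))
    return list(range(0, total_frames, stride))

print(len(sample_frame_indices(3000, 30.0, nframe=64)))  # 64
print(len(sample_frame_indices(3000, 30.0, fps=1.0)))    # 100
```
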
Questions
Could you please clarify the following:

1. For the VQA results reported in the paper (Table 3 and Table 6), was the model run in "Streaming Mode" (with the KV cache eviction/reuse mechanisms enabled) or in "Standard Mode" (processing sampled frames with full attention and standard generation)?
2. If the latter (Standard Mode), is it correct to conclude that the performance gains on the VQA benchmarks come primarily from the SFT strategy and dataset improving the model's weights/capabilities, rather than from the streaming inference architecture (KV reuse/sinks) itself being applied during these specific evaluations?
Thank you for your time, and thanks in advance for any clarification!
