Description
Hello, thank you for your impressive work on StreamingVLM!
I have been diving into the codebase to understand how the model handles different tasks, and I have a question regarding the inference strategy used for the VQA benchmarks reported in the paper (e.g., Video-MME, MVBench, LongVideoBench).
Observation
I noticed a distinction in the implementation between the streaming demo and the VQA evaluation:
Streaming Inference: In streaming_vlm/inference/inference.py, the code implements the process_past_kv function to handle KV cache reuse, text sinks, and sliding-window eviction, so that arbitrarily long streams can be processed without running out of memory (see the sketch after this list for my rough mental model).
VQA Evaluation: In streaming_vlm/eval/VLMEvalKit, the configuration in video_dataset_config.py suggests that the VQA tasks use standard frame sampling strategies (e.g., nframe=64 or fps=1.0, which are mutually exclusive options). It appears that the VQA evaluation follows a standard model.generate pipeline, without the KV cache compression/eviction logic used in streaming mode.
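For reference, here is a minimal sketch of how I currently understand the streaming-mode eviction, i.e., keeping a few sink tokens plus a recent sliding window of each layer's KV cache. The function name evict_past_kv, the sink_tokens/window_tokens sizes, and the legacy tuple-of-tensors cache layout are placeholders I made up for illustration, not the actual process_past_kv implementation:

```python
import torch

def evict_past_kv(past_key_values, sink_tokens=32, window_tokens=2048):
    """Keep the first `sink_tokens` entries (attention/text sinks) and the most
    recent `window_tokens` entries of each layer's KV cache, dropping the middle.
    Sizes and cache layout here are placeholders, not the repo's actual defaults."""
    new_past = []
    for key, value in past_key_values:            # one (K, V) pair per layer
        seq_len = key.shape[2]                    # [batch, heads, seq_len, head_dim]
        if seq_len <= sink_tokens + window_tokens:
            new_past.append((key, value))         # nothing to evict yet
            continue
        kept_key = torch.cat(
            [key[:, :, :sink_tokens], key[:, :, -window_tokens:]], dim=2)
        kept_value = torch.cat(
            [value[:, :, :sink_tokens], value[:, :, -window_tokens:]], dim=2)
        new_past.append((kept_key, kept_value))
    return tuple(new_past)
```

My question below is essentially whether anything like this was active during the VQA evaluations, or whether those runs used the plain full-attention generation path.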
Questions
Could you please clarify the following:
1. For the VQA results reported in the paper (Table 3 and Table 6), was the model run in "Streaming Mode" (with the KV cache eviction/reuse mechanisms enabled) or in "Standard Mode" (processing the sampled frames with full attention and standard generation)?
2. If it is the latter (Standard Mode), is it correct to conclude that the performance gains on the VQA benchmarks are primarily attributable to the SFT strategy and dataset improving the model's weights/capabilities, rather than to the streaming inference architecture (KV reuse/sinks) being applied during these specific evaluations?
Thank you for your time and clarification!