Description
Hello, thank you for your impressive work on StreamingVLM!
I have been diving into the codebase to understand how the model handles different tasks, and I have a question regarding the inference strategy used for the VQA benchmarks reported in the paper (e.g., Video-MME, MVBench, LongVideoBench).
Observation
I noticed a distinction in the implementation between the streaming demo and the VQA evaluation:
Streaming Inference: In streaming_vlm/inference/inference.py, the code implements the process_past_kv function to handle KV cache reuse, text sinks, and sliding-window eviction, so that arbitrarily long streams can be processed without running out of memory (see the sketch after this list for my rough mental model).
VQA Evaluation: In streaming_vlm/eval/VLMEvalKit, the configuration in video_dataset_config.py suggests that the VQA tasks use standard frame sampling strategies (e.g., nframe=64 or fps=1.0, which are mutually exclusive options). It appears that the VQA evaluation follows a standard model.generate pipeline, without the KV cache compression/eviction logic used in streaming mode.
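For reference, here is a minimal sketch of how I currently understand the streaming-mode eviction, i.e., keeping a few sink tokens plus a recent sliding window of each layer's KV cache. The function name evict_past_kv, the sink_tokens/window_tokens sizes, and the legacy tuple-of-tensors cache layout are placeholders I made up for illustration, not the actual process_past_kv implementation:

```python
import torch

def evict_past_kv(past_key_values, sink_tokens=32, window_tokens=2048):
    """Keep the first `sink_tokens` entries (attention/text sinks) and the most
    recent `window_tokens` entries of each layer's KV cache, dropping the middle.
    Sizes and cache layout here are placeholders, not the repo's actual defaults."""
    new_past = []
    for key, value in past_key_values:            # one (K, V) pair per layer
        seq_len = key.shape[2]                    # [batch, heads, seq_len, head_dim]
        if seq_len <= sink_tokens + window_tokens:
            new_past.append((key, value))         # nothing to evict yet
            continue
        kept_key = torch.cat(
            [key[:, :, :sink_tokens], key[:, :, -window_tokens:]], dim=2)
        kept_value = torch.cat(
            [value[:, :, :sink_tokens], value[:, :, -window_tokens:]], dim=2)
        new_past.append((kept_key, kept_value))
    return tuple(new_past)
```

My question below is essentially whether anything like this was active during the VQA evaluations, or whether those runs used the plain full-attention generation path.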
Questions
Could you please clarify the following:
1. For the VQA results reported in the paper (Table 3 and Table 6), was the model run in "Streaming Mode" (with the KV cache eviction/reuse mechanisms enabled) or in "Standard Mode" (processing the sampled frames with full attention and standard generation)?
2. If it is the latter (Standard Mode), is it correct to conclude that the performance gains on the VQA benchmarks are primarily attributable to the SFT strategy and dataset improving the model's weights/capabilities, rather than to the streaming inference architecture (KV reuse/sinks) being applied during these specific evaluations?
Thank you for your time and clarification!