Replies: 1 comment
Thank you. Our current streaming implementation focuses on demonstrating the algorithm's computation process, and because speech produces so few tokens it is difficult even to reach the minimum block_size. For that reason we did not include kv-cache reuse in our open-source code. This is still far from an industrial implementation; we will keep following this area and work with the vLLM community to build it together. :)
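To make the block_size point concrete: prefix caching in vLLM-style engines shares KV-cache only at full-block granularity, so a prefix shorter than one block contributes nothing reusable. A minimal sketch (the function name and numbers are illustrative, not vLLM's actual API):

```python
# Illustrative sketch: why short speech-token prefixes defeat
# block-level kv-cache reuse. `reusable_full_blocks` is a
# hypothetical helper, not part of vLLM's API.

def reusable_full_blocks(num_prefix_tokens: int, block_size: int = 16) -> int:
    """Prefix caching operates on full blocks only: any partial
    trailing block (and any prefix shorter than block_size) is
    recomputed rather than shared."""
    return num_prefix_tokens // block_size

# A short speech segment may yield only a handful of tokens,
# so it never fills even one block:
print(reusable_full_blocks(10))   # 0 -- below the minimum block_size
print(reusable_full_blocks(35))   # 2 -- only the full blocks are shareable
```

With a default block_size of 16, a 10-token speech chunk yields zero cacheable blocks, which is exactly the situation described above.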
Congratulations on the exciting model release!
We recently added more native support in vLLM for streaming-input-based requests (see vllm-project/vllm#28973). It might be worth looking at how the streaming transcription APIs could exploit that. We would appreciate any feedback on the functionality, and on any adjustments needed to work with the model.