[Model][VLM] Add Kimi-VL model support #16387
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add 🚀 …
Hey @courage17340 thanks for the contribution! Before I review the PR just one quick question:
Are you still seeing this issue on the main branch? This should have been fixed, but it would be great if you could verify it.
I saw this yesterday, let me rebase and try again.
Please also try specifying …
After specifying …
Could you elaborate on the memory leak when caching is enabled? Does the memory usage grow without bound, or does it stabilize after a while?
It grows without bound, and the machine runs out of memory very soon.
Can you show how to reproduce this issue?
I served Kimi-VL and used VLMEvalKit to test it. Maybe simple cases like sending random images can also reproduce it; let me try that.
Force-pushed from 2f522fb to ead0fbc
```bash
python3 -m vllm.entrypoints.openai.api_server --port 8888 --served-model-name kimi-vl --trust-remote-code --model moonshotai/Kimi-VL-A3B-Instruct --tensor-parallel-size 1 --max-num-batched-tokens 131072 --max-model-len 131072 --max-num-seqs 512 --limit-mm-per-prompt image=256
```

`python3 smoke.py`:

```python
import base64
import io
from multiprocessing import Pool

import click
import openai
from PIL import Image
import numpy as np


def make_message(num_images_per_prompt: int = 1):
    images = []
    for i in range(num_images_per_prompt):
        random_array = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
        image = Image.fromarray(random_array)
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        images.append(base64.b64encode(buffer.getvalue()).decode())
    return [{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{image}",
            },
        } for image in images]
    }]


def make_request(args):
    baseurl, i = args
    client = openai.OpenAI(
        base_url=baseurl,
        api_key="xx",
    )
    response = client.chat.completions.create(
        model="kimi-vl",
        messages=make_message(32),
        stream=False,
        temperature=0,
        max_tokens=1,
    )
    return response.choices[0].message.content


@click.command(context_settings={"show_default": True})
@click.option("--endpoint",
              "-e",
              type=str,
              help="service endpoint",
              default="http://localhost:8888/v1")
@click.option("--num-reqs", "-n", type=int, help="num of requests", default=1024)
@click.option("--concurrency", "-c", type=int, help="concurrency", default=64)
def main(
    endpoint: str,
    num_reqs: int,
    concurrency: int,
) -> None:
    with Pool(processes=concurrency) as pool:
        result = pool.map(make_request, [(endpoint, i) for i in range(num_reqs)])


if __name__ == "__main__":
    main()
```
Can you try …
Meanwhile I'm going to try reproducing this, thanks for the MRE!
Back to the PR, be sure to update the tests as mentioned here: https://docs.vllm.ai/en/latest/contributing/model/tests.html And don't forget to update the supported models page! Thanks for your help in implementing the model!
Hmm, testing this with a different model (and a smaller image count: …
It seems that the cache takes around 2 runs of the script to be filled completely, after which the memory usage remains stable. Let me move my setup so I can actually run the Kimi-VL model using your settings. Edit: What GPU are you running this on? I get OOM even on an A800 (80 GB).
I'm running this on an H800 (80 GB too). It seems that you are getting GPU OOM. In that case, you should check whether flash-attn is installed (vllm-flash-attn is not compatible, see the first known issue), because otherwise MoonViT will use a fallback attention implementation, which consumes a lot of GPU memory.
Thanks for reminding me of this, yeah it solved the OOM for me. |
Regarding the issue about memory usage, can you try out #16432 and see if the problem is solved? |
It's much better now, and becomes stable later (~13G in total). Compared with ~100G before, I think the problem is solved. |
Setting dtype to half results in a data type mismatch between the input and bias tensors in: x = self.proj(x).view(x.size(0), -1)
The problem looks like it is caused by …
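For context, a common way to avoid this kind of mismatch is to cast the incoming tensor to the projection layer's parameter dtype before applying it. The sketch below only illustrates that pattern with hypothetical module and attribute names; it is not the actual fix in this PR:

```python
import torch
import torch.nn as nn


class PatchProjection(nn.Module):
    """Toy stand-in for a patch-embedding projection (hypothetical names)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.proj = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cast the incoming pixel features to the projection's own dtype so a
        # float32 input does not collide with half-precision weights/bias.
        x = x.to(self.proj.weight.dtype)
        return self.proj(x).view(x.size(0), -1)


# Example: half-precision module, float32 input, no dtype mismatch error.
module = PatchProjection(16, 32).half()
out = module(torch.randn(4, 16))
print(out.dtype)  # torch.float16
```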
Are you planning to add a reasoning parser for online inference? |
I can do that when I have time next week. |
I found that the output of the model gets abruptly cut off compared to HF. This occurs for both Instruct and Thinking variants. This is the prompt I'm using:
HF Result

Script:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

PROMPT = ...  # Copy from above

if __name__ == "__main__":
    model_path = "moonshotai/Kimi-VL-A3B-Thinking"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

    image_path = "./demo.png"  # Note: This is taken from the Kimi-VL-A3B-Instruct repo
    image = Image.open(image_path)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": PROMPT},
            ],
        },
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=16384)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    response = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    print(response)
```

Output:
vLLM Result

Script:

```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM

PROMPT = ...  # Copy from above

if __name__ == "__main__":
    model_path = "moonshotai/Kimi-VL-A3B-Thinking"
    llm = LLM(
        model_path,
        max_model_len=16384,
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

    image_path = "./demo.png"  # Note: This is taken from the Kimi-VL-A3B-Instruct repo
    images = Image.open(image_path)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": PROMPT},
            ],
        },
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": images}}])
    print("-" * 50)
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
        print("-" * 50)
```

Output:
Thanks for the clarification! I hit the OOM case, too. In this case, could you please let me know how to properly configure the model to use flash-attn instead of vllm-flash-attn during inference?
Hi, if you have installed flash-attn, the vision tower will use it by default. See vllm/vllm/model_executor/models/moonvit.py, line 419 (as of 205d84a).
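For anyone verifying their environment, here is a quick, generic check (not vLLM-specific) that the flash_attn package is importable, which is the condition that lets the vision tower avoid the memory-hungry fallback:

```python
# Quick environment check: MoonViT only takes the flash-attn path when the
# `flash_attn` package can be imported; otherwise it falls back to a more
# memory-hungry attention implementation.
import importlib.util

if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn is NOT importable in this environment")
else:
    print("flash-attn is available; the flash-attn path can be used")
```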
Thanks for your feedback. I have reproduced the unexpected truncated output, and we will try to fix it ASAP.
Hi @DarkLight1337, I just noticed that the default value of max_tokens in SamplingParams is 16. If no value is set explicitly, the model will only output 16 tokens, which is consistent with the number of tokens in the truncated output.
Oh, nice catch. My bad then. Let me try increasing the number of tokens...
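For reference, a minimal sketch of what raising the limit looks like in the offline script above (reusing its llm, text, and images variables; the 16384 budget is simply chosen to match max_new_tokens in the HF script):

```python
from vllm import SamplingParams

# Explicit sampling parameters so generation is not capped at the default
# max_tokens=16.
sampling_params = SamplingParams(
    temperature=0.0,   # greedy decoding for a closer comparison with HF
    max_tokens=16384,  # generous budget matching max_new_tokens in the HF script
)
outputs = llm.generate(
    [{"prompt": text, "multi_modal_data": {"image": images}}],
    sampling_params=sampling_params,
)
```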
After increasing …
The output is a bit different, which is expected because of the temperature setting. It's still reasonable. Sorry for the false alarm! P.S.: I suggest adding …
Nice. Thanks for your suggestion. I will add it tomorrow. |
FYI, I have added the …
So how do I solve the Kimi-VL-A3B-Instruct OOM error with 80G*2 VRAM?
Please refer to "Known Issues" in the top post.
Any updates regarding the reasoning parser and structured output for this newly supported model?
Do you guys plan to implement a reasoning parser and structured output support for this new Kimi model?
Hi @nicoeiris11, it is recommended to use regular expression matching to remove the content inside …
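As a rough sketch of that suggestion: strip the reasoning block with a regular expression before post-processing. The ◁think▷ / ◁/think▷ delimiters below are an assumption about the Thinking variant's output format, so adjust them to whatever markers the model actually emits:

```python
import re

# Hypothetical delimiters for the reasoning block; verify against the actual
# model output before relying on them.
THINK_PATTERN = re.compile(r"◁think▷.*?◁/think▷", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove the reasoning block and return only the final answer."""
    return THINK_PATTERN.sub("", text).strip()


# Example:
raw = "◁think▷The image shows a cat on a sofa...◁/think▷It is a cat."
print(strip_reasoning(raw))  # -> "It is a cat."
```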
CLOSES #16387

Feature

Known Issues

Install flash-attn for the vision tower (vllm-flash-attn is not compatible); otherwise MoonViT falls back to an attention implementation that consumes much more GPU memory.
Pass `--disable-mm-preprocessor-cache` to avoid the memory leak.

Example Serving Command
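The command below mirrors the serving setup used in the reproduction earlier in this thread, with the cache-disabling flag from the known issue added; the port, batching, and length flags are illustrative settings rather than requirements:

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --port 8888 \
    --served-model-name kimi-vl \
    --model moonshotai/Kimi-VL-A3B-Instruct \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --max-num-batched-tokens 131072 \
    --max-num-seqs 512 \
    --limit-mm-per-prompt image=256 \
    --disable-mm-preprocessor-cache
```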