[Model][VLM] Add Kimi-VL model support #16387
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add 🚀 …
Hey @courage17340 thanks for the contribution! Before I review the PR just one quick question:
Are you still seeing this issue on the main branch? This should have been fixed, but it would be great if you could verify it.
I saw this yesterday, let me rebase and try again.
Please also try specifying …
After specifying …
Could you elaborate on the memory leak when caching is enabled? Does the memory usage grow without bound, or does it stabilize after a while?
It grows without bound, and the machine runs out of memory very soon.
Can you show how to reproduce this issue?
I served Kimi-VL and used VLMEvalKit to test it. Maybe simple cases like sending random images can also reproduce it; let me try that.
Force-pushed from 2f522fb to ead0fbc
```bash
python3 -m vllm.entrypoints.openai.api_server --port 8888 --served-model-name kimi-vl --trust-remote-code --model moonshotai/Kimi-VL-A3B-Instruct --tensor-parallel-size 1 --max-num-batched-tokens 131072 --max-model-len 131072 --max-num-seqs 512 --limit-mm-per-prompt image=256
```

`python3 smoke.py`:

```python
import base64
import io
from multiprocessing import Pool

import click
import openai
from PIL import Image
import numpy as np


def make_message(num_images_per_prompt: int = 1):
    images = []
    for i in range(num_images_per_prompt):
        random_array = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
        image = Image.fromarray(random_array)
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        images.append(base64.b64encode(buffer.getvalue()).decode())
    return [{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{image}",
            },
        } for image in images]
    }]


def make_request(args):
    baseurl, i = args
    client = openai.OpenAI(
        base_url=baseurl,
        api_key="xx",
    )
    response = client.chat.completions.create(
        model="kimi-vl",
        messages=make_message(32),
        stream=False,
        temperature=0,
        max_tokens=1,
    )
    return response.choices[0].message.content


@click.command(context_settings={"show_default": True})
@click.option("--endpoint",
              "-e",
              type=str,
              help="service endpoint",
              default="http://localhost:8888/v1")
@click.option("--num-reqs", "-n", type=int, help="num of requests", default=1024)
@click.option("--concurrency", "-c", type=int, help="concurrency", default=64)
def main(
    endpoint: str,
    num_reqs: int,
    concurrency: int,
) -> None:
    with Pool(processes=concurrency) as pool:
        result = pool.map(make_request, [(endpoint, i) for i in range(num_reqs)])


if __name__ == "__main__":
    main()
```
Can you try …
Meanwhile I'm going to try reproducing this, thanks for the MRE!
Back to the PR, be sure to update the tests as mentioned here: https://docs.vllm.ai/en/latest/contributing/model/tests.html And don't forget to update the supported models page! Thanks for your help in implementing the model!
Hmm, testing this with a different model (and a smaller image count: …
It seems that the cache takes around 2 runs of the script to be filled completely, after which the memory usage remains stable. Let me move my setup so I can actually run the Kimi-VL model using your settings. Edit: What GPU are you running this on? I get OOM even on an A800 (80 GB).
I'm running this on an H800 (80 GB too). It seems that you are getting GPU OOM. In that case, you should check whether flash-attn is installed (vllm-flash-attn is not compatible, see the first known issue), because otherwise MoonViT will use a fallback attention implementation, which consumes a lot of GPU memory.
Thanks for reminding me of this, yeah it solved the OOM for me. |
Regarding the issue about memory usage, can you try out #16432 and see if the problem is solved? |
It's much better now, and becomes stable later (~13G in total). Compared with ~100G before, I think the problem is solved. |
Setting dtype to half results in a data type mismatch between the input and bias tensors in: x = self.proj(x).view(x.size(0), -1)
The problem looks like it is caused by …
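For context, a common way to avoid this kind of mismatch is to cast the incoming tensor to the projection layer's parameter dtype before applying it. The sketch below only illustrates that pattern with hypothetical module and attribute names; it is not the actual fix in this PR:

```python
import torch
import torch.nn as nn


class PatchProjection(nn.Module):
    """Toy stand-in for a patch-embedding projection (hypothetical names)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.proj = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cast the incoming pixel features to the projection's own dtype so a
        # float32 input does not collide with half-precision weights/bias.
        x = x.to(self.proj.weight.dtype)
        return self.proj(x).view(x.size(0), -1)


# Example: half-precision module, float32 input, no dtype mismatch error.
module = PatchProjection(16, 32).half()
out = module(torch.randn(4, 16))
print(out.dtype)  # torch.float16
```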
Are you planning to add a reasoning parser for online inference? |
I can do that when I have time next week. |
I found that the output of the model gets abruptly cut off compared to HF. This occurs for both Instruct and Thinking variants. This is the prompt I'm using:
HF Result

Script:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

PROMPT = ...  # Copy from above

if __name__ == "__main__":
    model_path = "moonshotai/Kimi-VL-A3B-Thinking"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

    image_path = "./demo.png"  # Note: This is taken from the Kimi-VL-A3B-Instruct repo
    image = Image.open(image_path)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": PROMPT},
            ],
        },
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=16384)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    response = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    print(response)
```

Output:
vLLM Result

Script:

```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM

PROMPT = ...  # Copy from above

if __name__ == "__main__":
    model_path = "moonshotai/Kimi-VL-A3B-Thinking"
    llm = LLM(
        model_path,
        max_model_len=16384,
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

    image_path = "./demo.png"  # Note: This is taken from the Kimi-VL-A3B-Instruct repo
    images = Image.open(image_path)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": PROMPT},
            ],
        },
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": images}}])
    print("-" * 50)
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
        print("-" * 50)
```

Output:
Thanks for the clarification! I hit the OOM case, too. In this case, could you please let me know how to properly configure the model to use flash-attn instead of vllm-flash-attn during inference?
Hi, if you have installed flash-attn, the vision tower will use it by default. See vllm/vllm/model_executor/models/moonvit.py, line 419 (as of 205d84a).
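For anyone verifying their environment, here is a quick, generic check (not vLLM-specific) that the flash_attn package is importable, which is the condition that lets the vision tower avoid the memory-hungry fallback:

```python
# Quick environment check: MoonViT only takes the flash-attn path when the
# `flash_attn` package can be imported; otherwise it falls back to a more
# memory-hungry attention implementation.
import importlib.util

if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn is NOT importable in this environment")
else:
    print("flash-attn is available; the flash-attn path can be used")
```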
Thanks for your feedback. I have reproduced the unexpected truncated output, and we will try to fix it ASAP.
Hi @DarkLight1337, I just noticed that the default value of max_tokens in SamplingParams is 16. If no value is set explicitly, the model will only output 16 tokens, which is consistent with the number of tokens in the truncated output.
Oh, nice catch. My bad then. Let me try increasing the number of tokens...
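For reference, a minimal sketch of what raising the limit looks like in the offline script above (reusing its llm, text, and images variables; the 16384 budget is simply chosen to match max_new_tokens in the HF script):

```python
from vllm import SamplingParams

# Explicit sampling parameters so generation is not capped at the default
# max_tokens=16.
sampling_params = SamplingParams(
    temperature=0.0,   # greedy decoding for a closer comparison with HF
    max_tokens=16384,  # generous budget matching max_new_tokens in the HF script
)
outputs = llm.generate(
    [{"prompt": text, "multi_modal_data": {"image": images}}],
    sampling_params=sampling_params,
)
```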
After increasing …
The output is a bit different, which is expected because of the temperature setting. It's still reasonable. Sorry for the false alarm! P.S.: I suggest adding …
Nice. Thanks for your suggestion. I will add it tomorrow. |
FYI, I have added the …
So how do I solve the Kimi-VL-A3B-Instruct OOM error with 80G*2 VRAM?
Please refer to "Known Issues" in the top post.
Any updates regarding the reasoning parser and structured output for this newly supported model?
Do you guys plan to implement a reasoning parser and structured output support for this new Kimi model?
Hi @nicoeiris11, it is recommended to use regular expression matching to remove the content inside …
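As a rough sketch of that suggestion: strip the reasoning block with a regular expression before post-processing. The ◁think▷ / ◁/think▷ delimiters below are an assumption about the Thinking variant's output format, so adjust them to whatever markers the model actually emits:

```python
import re

# Hypothetical delimiters for the reasoning block; verify against the actual
# model output before relying on them.
THINK_PATTERN = re.compile(r"◁think▷.*?◁/think▷", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove the reasoning block and return only the final answer."""
    return THINK_PATTERN.sub("", text).strip()


# Example:
raw = "◁think▷The image shows a cat on a sofa...◁/think▷It is a cat."
print(strip_reasoning(raw))  # -> "It is a cat."
```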
CLOSES #16387

Feature

Known Issues

Install flash-attn for the vision tower (vllm-flash-attn is not compatible); otherwise MoonViT falls back to an attention implementation that consumes much more GPU memory.
Pass `--disable-mm-preprocessor-cache` to avoid the memory leak.

Example Serving Command
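The command below mirrors the serving setup used in the reproduction earlier in this thread, with the cache-disabling flag from the known issue added; the port, batching, and length flags are illustrative settings rather than requirements:

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --port 8888 \
    --served-model-name kimi-vl \
    --model moonshotai/Kimi-VL-A3B-Instruct \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --max-num-batched-tokens 131072 \
    --max-num-seqs 512 \
    --limit-mm-per-prompt image=256 \
    --disable-mm-preprocessor-cache
```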