
[Model][VLM] Add Kimi-VL model support #16387

Merged: 3 commits merged into vllm-project:main on Apr 14, 2025

Conversation

@courage17340 (Contributor) commented Apr 10, 2025

CLOSES #16387

Feature

Known Issues

  • MoonViT has a special head_size of 72, which is not compatible with vllm-flash-attn
    • It's recommended to install flash-attn to work around this
    • If flash-attn is not installed, MoonViT falls back to an attention implementation that consumes a lot of GPU memory, and you may hit GPU OOM
  • Recent versions of vLLM appear to have a CPU memory leak with multimodal models
    • You may need to specify --disable-mm-preprocessor-cache to avoid the leak

Example Serving Command

python3 -m vllm.entrypoints.openai.api_server --port 8888 --served-model-name kimi-vl --trust-remote-code --model moonshotai/Kimi-VL-A3B-Instruct --tensor-parallel-size 1 --max-num-batched-tokens 131072 --max-model-len 131072 --max-num-seqs 512 --limit-mm-per-prompt image=256
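
For reference, a minimal client-side sketch for exercising this server with the OpenAI Python client (illustration only, not part of the PR; it assumes the command above is running locally on port 8888 and that demo.jpg is any local image file):

```python
import base64

import openai

client = openai.OpenAI(base_url="http://localhost:8888/v1", api_key="EMPTY")

# Encode a local image as a base64 data URL, as expected by the chat completions API.
with open("demo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="kimi-vl",  # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    temperature=0,
    max_tokens=512,
)
print(response.choices[0].message.content)
```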


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@ywang96 (Member) commented Apr 10, 2025

Hey @courage17340, thanks for the contribution! Before I review the PR, just one quick question:

Recent versions of vLLM appear to have a CPU memory leak with multimodal models

Are you still seeing this issue on the main branch? It should have been fixed, but it would be great if you could verify.

ywang96 self-assigned this on Apr 10, 2025
@courage17340 (Contributor, Author)

I saw this yesterday, let me rebase and try again

@ywang96 (Member) commented Apr 10, 2025

Please also try specifying --disable-mm-preprocessor-cache and see whether the memory leak persists. Thanks!

@courage17340 (Contributor, Author)

After specifying --disable-mm-preprocessor-cache, the memory leak is gone, thanks!

@DarkLight1337 (Member)

Could you elaborate on the memory leak when caching is enabled? Does the memory usage grow without bound or does it stabilize after a while?

@courage17340 (Contributor, Author)

It grows without bound, and the machine runs out of memory very soon.

@DarkLight1337 (Member)

Can you show how to reproduce this issue?

@courage17340 (Contributor, Author)

I served kimi-vl and used VLMEvalKit to test it. Maybe simple cases like sending random images can also reproduce it; let me try that.

@courage17340 (Contributor, Author)

  • serve
python3 -m vllm.entrypoints.openai.api_server --port 8888 --served-model-name kimi-vl --trust-remote-code --model moonshotai/Kimi-VL-A3B-Instruct --tensor-parallel-size 1 --max-num-batched-tokens 131072 --max-model-len 131072 --max-num-seqs 512 --limit-mm-per-prompt image=256
  • test (smoke.py, shown below)
python3 smoke.py
import base64
import io
from multiprocessing import Pool

import click
import openai
from PIL import Image
import numpy as np


def make_message(num_images_per_prompt: int = 1):
    images = []
    for i in range(num_images_per_prompt):
        random_array = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
        image = Image.fromarray(random_array)
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        images.append(base64.b64encode(buffer.getvalue()).decode())
    return [{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{image}",
            },
        } for image in images]
    }]


def make_request(args):
    baseurl, i = args
    client = openai.OpenAI(
        base_url=baseurl,
        api_key="xx",
    )
    response = client.chat.completions.create(
        model="kimi-vl",
        messages=make_message(32),
        stream=False,
        temperature=0,
        max_tokens=1,
    )
    return response.choices[0].message.content


@click.command(context_settings={"show_default": True})
@click.option("--endpoint",
              "-e",
              type=str,
              help="service endpoint",
              default="http://localhost:8888/v1")
@click.option("--num-reqs", "-n", type=int, help="num of requests", default=1024)
@click.option("--concurrency", "-c", type=int, help="concurrency", default=64)
def main(
    endpoint: str,
    num_reqs: int,
    concurrency: int,
) -> None:
    with Pool(processes=concurrency) as pool:
        result = pool.map(make_request, [(endpoint, i) for i in range(num_reqs)])

if __name__ == "__main__":
    main()
  • I used ps auxf to check memory usage (the RSS column, in KB); a small monitoring sketch follows below
    • vLLM has 3 processes in total, and I picked the largest one
    • after serving: 3017216 (~3GiB)
    • after the first execution of the smoke test: 57436432 (~53GiB)
    • after the second execution of the smoke test: 101605816 (~94GiB)
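
For convenience, a small self-contained monitoring sketch (not part of the PR; it assumes psutil is installed) that prints the RSS of every process whose command line mentions vllm, which is how the numbers above can be tracked over time:

```python
import time

import psutil  # assumption: installed via `pip install psutil`

# Print the RSS (in GiB) of every process whose command line mentions "vllm",
# once per minute; the largest one is the engine process referenced above.
while True:
    for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        mem = proc.info["memory_info"]
        if "vllm" in cmdline and mem is not None:
            print(f"pid={proc.info['pid']} rss={mem.rss / 1024**3:.1f} GiB")
    print("-" * 30)
    time.sleep(60)
```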

@DarkLight1337 (Member)

Can you try VLLM_ENABLE_V1_MULTIPROCESSING=0 and see if it resolves the memory issue?

@DarkLight1337 (Member) commented Apr 10, 2025

Meanwhile I'm going to try reproducing this, thanks for the MRE!

@DarkLight1337 (Member) commented Apr 10, 2025

Back to the PR, be sure to update the tests as mentioned here: https://docs.vllm.ai/en/latest/contributing/model/tests.html

And don't forget to update the supported models page! Thanks for your help in implementing the model!

@DarkLight1337 (Member) commented Apr 10, 2025

Hmm, testing this with a different model (and a smaller image count):

memray run -m vllm.entrypoints.openai.api_server --model HuggingFaceTB/SmolVLM2-2.2B-Instruct --limit-mm-per-prompt image=4
# Run 3 times
python3 smoke.py

(memray memory usage screenshot)

It seems that the cache takes around 2 runs of the script to be filled completely, after which the memory usage remains stable.

Let me move my setup so I can actually run the Kimi-VL model using your settings.

Edit: What GPU are you running this on? I get OOM even on an A800 (80 GB).

@courage17340 (Contributor, Author)

I'm running this on an H800 (80 GB too). It seems you are getting GPU OOM? In that case, you should check whether flash-attn is installed (vllm-flash-attn is not compatible; see the first known issue), because otherwise MoonViT will use a fallback attention implementation, which consumes a lot of GPU memory.

mergify bot added the documentation label on Apr 11, 2025
@DarkLight1337 (Member)

Thanks for reminding me of this; yes, it solved the OOM for me.

DarkLight1337 enabled auto-merge (squash) on April 14, 2025 12:54
DarkLight1337 merged commit b1308b8 into vllm-project:main on Apr 14, 2025
69 checks passed
courage17340 deleted the support-kimi-vl branch on April 15, 2025 02:59
@DarkLight1337 (Member)

Regarding the issue about memory usage, can you try out #16432 and see if the problem is solved?

@courage17340 (Contributor, Author)

It's much better now, and it becomes stable later (~13 GB in total). Compared with ~100 GB before, I think the problem is solved.

@Husamx commented Apr 16, 2025

Setting dtype to half results in a data type mismatch between the input and bias tensors in ../vllm/model_executor/models/moonvit.py, line 257:

 x = self.proj(x).view(x.size(0), -1)
RuntimeError: Input type (c10::BFloat16) and bias type (c10::Half) should be the same

@courage17340 (Contributor, Author)

This looks like it is caused by pixel_values = pixel_values.to(torch.bfloat16), which can be fixed. However, the model is trained in bfloat16, and we recommend running inference in bfloat16; if you use half for inference, you may get unexpected inf/nan values.
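
For illustration, a standalone sketch (not the actual moonvit.py code) of the kind of fix described above: cast the pixel values to the projection layer's own parameter dtype instead of hard-coding torch.bfloat16.

```python
import torch
import torch.nn as nn


def cast_to_module_dtype(pixel_values: torch.Tensor, module: nn.Module) -> torch.Tensor:
    # Match the input dtype to whatever dtype the module's weights were loaded in,
    # so both bfloat16 and half serving dtypes avoid the mismatch error above.
    return pixel_values.to(next(module.parameters()).dtype)


# Stand-in for MoonViT's patch projection; the real layer and shapes differ.
proj = nn.Linear(1176, 1024).to(torch.bfloat16)
pixel_values = torch.randn(4, 1176, dtype=torch.float16)  # inputs arriving in half

out = proj(cast_to_module_dtype(pixel_values, proj))
print(out.dtype)  # torch.bfloat16
```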

@DarkLight1337 (Member)

Are you planning to add a reasoning parser for online inference?

@courage17340 (Contributor, Author)

I can do that when I have time next week.

@DarkLight1337 (Member) commented Apr 18, 2025

I found that the output of the model gets abruptly cut off compared to HF. This occurs for both Instruct and Thinking variants.

This is the prompt I'm using:

Your task is to list out the objects in this image.

For each object, follow these steps:

- Step 1: Locate this object in the image and output its bounding box in JSON, with the fields: `min_x`, `min_y`, `max_x` and `max_y`. These coordinates should be in terms of pixels.
- Step 2: Generate a caption of this object.
- Step 3: Summarize your answer in JSON format, with the fields: `bounding_box` (Step 1), `caption` (Step 2)

Afterwards, summarize your response in JSON format. The JSON should be a list where each item has the fields described in Step 3.

HF Result

Script:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

PROMPT= ...  # Copy from above


if __name__ == "__main__":
    model_path = "moonshotai/Kimi-VL-A3B-Thinking"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

    image_path = "./demo.png"  # Note: This is taken from the Kimi-VL-A3B-Instruct repo
    image = Image.open(image_path)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": PROMPT},
            ],
        },
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=16384)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    response = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    print(response)

Output:

◁think▷I need to list out the objects in the provided image and follow specific steps for each object. Let me analyze the image and plan my response.

The image shows a cityscape with a prominent tower in the background. The tower appears to be the CN Tower in Toronto, Canada. There are also many buildings, roads, vehicles, and possibly some other structures visible.

For each object in the image, I need to:
1. Locate the object and provide its bounding box in JSON format (min_x, min_y, max_x, max_y in pixels)
2. Generate a caption for the object
3. Summarize my answer in JSON format with bounding_box and caption fields

Let me identify the main objects in the image:

1. The CN Tower (the tall tower on the right side of the image)
2. The city buildings (numerous buildings throughout the image)
3. The roads and vehicles (the busy roads in the foreground)
4. The sky and clouds (the sky in the background)

Now, I'll create bounding boxes for each of these objects:

For the CN Tower:
- Bounding box: [min_x, min_y, max_x, max_y] would need to be determined by examining the image. The tower spans from approximately (left: 0, top: 0) to (right: 500, bottom: 500) in this wide-angle image.

For the city buildings:
- Bounding box: The buildings span from approximately (left: 0, top: 0) to (right: 1000, bottom: 1000) in this wide-angle image.

For the roads and vehicles:
- Bounding box: The roads span from approximately (left: 200, top: 300) to (right: 800, bottom: 700) in this wide-angle image.

For the sky and clouds:
- Bounding box: The sky spans from approximately (left: 0, top: 0) to (right: 1000, bottom: 1000) in this wide-angle image.

Now, I'll generate captions for each object:

1. CN Tower: "The CN Tower is a famous landmark in Toronto, Canada, known for its distinctive design and observation decks."
2. City buildings: "Numerous buildings make up the Toronto skyline, showcasing modern architecture and urban density."
3. Roads and vehicles: "Busy roads with moving vehicles indicate the bustling transportation network of the city."
4. Sky and clouds: "The sky displays a beautiful sunset with clouds, adding a dramatic backdrop to the cityscape."

Finally, I'll summarize my answer in JSON format with the bounding_box and caption fields for each object.

Let me now format my complete response:◁/think▷```json
[
  {
    "bounding_box": {
      "min_x": 0,
      "min_y": 0,
      "max_x": 500,
      "max_y": 500
    },
    "caption": "The CN Tower is a famous landmark in Toronto, Canada, known for its distinctive design and observation decks."
  },
  {
    "bounding_box": {
      "min_x": 0,
      "min_y": 0,
      "max_x": 1000,
      "max_y": 1000
    },
    "caption": "Numerous buildings make up the Toronto skyline, showcasing modern architecture and urban density."
  },
  {
    "bounding_box": {
      "min_x": 200,
      "min_y": 300,
      "max_x": 800,
      "max_y": 700
    },
    "caption": "Busy roads with moving vehicles indicate the bustling transportation network of the city."
  },
  {
    "bounding_box": {
      "min_x": 0,
      "min_y": 0,
      "max_x": 1000,
      "max_y": 1000
    },
    "caption": "The sky displays a beautiful sunset with clouds, adding a dramatic backdrop to the cityscape."
  }
]
```

vLLM Result

Script:

from PIL import Image
from transformers import AutoProcessor
from vllm import LLM

PROMPT= ... # Copy from above


if __name__ == "__main__":
    model_path = "moonshotai/Kimi-VL-A3B-Thinking"
    llm = LLM(
        model_path,
        max_model_len=16384,
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

    image_path = "./demo.png"  # Note: This is taken from the Kimi-VL-A3B-Instruct repo
    images = Image.open(image_path)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": PROMPT},
            ],
        },
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": images}}])

    print("-" * 50)
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
        print("-" * 50)

Output:

--------------------------------------------------
◁think▷I need to list out the objects in the image and
--------------------------------------------------

@yushuiwx

MoonViT has a special head_size of 72, which is not compatible with vllm-flash-attn

  • It's recommended to install flash-attn to work around this

Thanks for the clarification!

I hit the OOM case too. Could you please let me know how to properly configure the model to use flash-attn instead of vllm-flash-attn during inference?

@zhouzaida (Contributor)

Hi, if you have installed flash-attn, the vision tower will use it by default. See the if is_flash_attn_2_available(): check in vllm/model_executor/models/moonvit.py for details.
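
If you are unsure whether your environment will take the flash-attn path, a quick check is sketched below (it relies on the same transformers utility referenced above):

```python
from transformers.utils import is_flash_attn_2_available

# If this prints False, install flash-attn (e.g. `pip install flash-attn --no-build-isolation`)
# so MoonViT can use it instead of the memory-hungry fallback attention.
print("flash-attn available:", is_flash_attn_2_available())
```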

@zhouzaida (Contributor)

I found that the output of the model gets abruptly cut off compared to HF. This occurs for both Instruct and Thinking variants.

Thanks for your feedback. I have reproduced the unexpected truncated output, and we will try to fix it ASAP.

@zhouzaida (Contributor) commented Apr 19, 2025

Hi @DarkLight1337, I just noticed that the default value of max_tokens in SamplingParams is 16. Without an explicit setting, the model will only output 16 tokens, which matches the number of tokens in the truncated output.
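
For reference, a sketch of the change this implies for the vLLM script above (it assumes llm, text, and images are defined exactly as in that script):

```python
from vllm import SamplingParams

# SamplingParams defaults to max_tokens=16, which is why the output above stops
# after 16 tokens; pass an explicit limit so the model can finish its answer.
sampling_params = SamplingParams(max_tokens=8192)
outputs = llm.generate(
    [{"prompt": text, "multi_modal_data": {"image": images}}],
    sampling_params,
)
```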

@DarkLight1337 (Member)

Oh, nice catch. My bad then. Let me try increasing the number of tokens...

@DarkLight1337 (Member) commented Apr 19, 2025

After increasing max_tokens to 8192, I got this output, which looks fine:

◁think▷Let me break down the task:

1. I need to identify objects in the provided image.
2. For each object:
   - Step 1: Identify its location and provide bounding box coordinates in JSON format (min_x, min_y, max_x, max_y, all in pixels)
   - Step 2: Create a caption describing the object
   - Step 3: Summarize my findings in a JSON list where each item contains the bounding box and caption

I need to carefully examine the image and identify distinct objects that can be bounded. Some potential objects I can identify:

1. The CN Tower (a tall tower in Toronto)
2. Skyscrapers and buildings (multiple high-rises in the cityscape)
3. Cars on the roads (traffic on the highways)
4. Canadian flags (possibly some flags visible)
5. The domed building (possibly the Rogers Centre)

Now, let me go through each step for one of these objects to demonstrate the process:

Object: CN Tower

Step 1: Bounding box in pixels
Looking at the image, the CN Tower is prominently visible in the top right. I'll estimate its bounding box:
- Min_x: Approximately 430 (left edge of tower)
- Min_y: Approximately 50 (top of tower)
- Max_x: Approximately 800 (right edge of tower)
- Max_y: Approximately 500 (bottom of tower)

Step 2: Caption
"CN Tower, a distinctive tower in Toronto, Ontario, Canada."

Step 3: Summary in JSON
```json
{
  "bounding_box": {
    "min_x": 430,
    "min_y": 50,
    "max_x": 800,
    "max_y": 500
  },
  "caption": "CN Tower, a distinctive tower in Toronto, Ontario, Canada."
}
```

Now, I need to repeat this process for all relevant objects in the image. Let me identify all possible objects and their bounding boxes:

1. CN Tower:
  Bounding box: min_x=430, min_y=50, max_x=800, max_y=500
  Caption: "CN Tower, a distinctive tower in Toronto, Ontario, Canada."

2. Skyscrapers:
  Bounding box: min_x=90, min_y=200, max_x=870, max_y=850
  Caption: "Multiple mid-to-high rise buildings typical of Toronto"

3. Cars:
  Bounding box: min_x=420, min_y=380, max_x=680, max_y=520
  Caption: "Traffic on the highways surrounding the CN Tower"

4. Canadian flags:
  Bounding box: min_x=700, min_y=300, max_x=730, max_y=320
  Caption: "Canadian flags visible on some buildings"

5. Rogers Centre:
  Bounding box: min_x=550, min_y=300, max_x=830, max_y=380
  Caption: "Domed stadium, likely the Rogers Centre in Toronto"

Now I need to summarize my findings in JSON format as a list of items following the specified structure.

Let me create the final JSON response:◁/think▷```json
[
  {
    "bounding_box": {
      "min_x": 430,
      "min_y": 50,
      "max_x": 800,
      "max_y": 500
    },
    "caption": "CN Tower, a distinctive tower in Toronto, Ontario, Canada."
  },
  {
    "bounding_box": {
      "min_x": 90,
      "min_y": 200,
      "max_x": 870,
      "max_y": 850
    },
    "caption": "Multiple mid-to-high rise buildings typical of Toronto"
  },
  {
    "bounding_box": {
      "min_x": 420,
      "min_y": 380,
      "max_x": 680,
      "max_y": 520
    },
    "caption": "Traffic on the highways surrounding the CN Tower"
  },
  {
    "bounding_box": {
      "min_x": 700,
      "min_y": 300,
      "max_x": 730,
      "max_y": 320
    },
    "caption": "Canadian flags visible on some buildings"
  },
  {
    "bounding_box": {
      "min_x": 550,
      "min_y": 300,
      "max_x": 830,
      "max_y": 380
    },
    "caption": "Domed stadium, likely the Rogers Centre in Toronto"
  }
]
```

The output is a bit different, which is expected because of the temperature setting. It's still reasonable. Sorry for the false alarm!

P.S. I suggest adding a generation_config.json to your HF Hub repo to set the default temperature according to what's on your model card.
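
For example, a sketch of how such a file can be produced with transformers (the sampling values below are placeholders; the real defaults should come from the model card):

```python
from transformers import GenerationConfig

# Placeholder values: take the actual recommended sampling defaults from the
# Kimi-VL model card before committing the file.
gen_config = GenerationConfig(do_sample=True, temperature=0.6, top_p=0.95)
gen_config.save_pretrained("path/to/Kimi-VL-A3B-Thinking")  # writes generation_config.json
```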

@zhouzaida (Contributor) commented Apr 19, 2025

Nice. Thanks for your suggestion. I will add it tomorrow.

@zhouzaida (Contributor)

FYI, I have added the generation_config.json.

yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
@guihonghao

So how do I solve the Kimi-VL-A3B-Instruct OOM error with 2x80 GB of VRAM?

@DarkLight1337 (Member)

Please refer to "Known Issues" in the top post

@nicoeiris11

Are you planning to add a reasoning parser for online inference?

Any updates regarding the reasoning parser and structured output for this newly supported model?

@nicoeiris11 commented Jun 17, 2025

Do you plan to implement a reasoning parser and structured output support for this new Kimi model?
Or should I build a simple regex to strip the ◁think▷ output from the model? @zhouzaida

@zhouzaida (Contributor)

Hi @nicoeiris11, it is recommended to use regular expression matching to remove the content inside the ◁think▷ block.
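
For example, a minimal sketch of such a regex (assuming the reasoning block is delimited by ◁think▷ and ◁/think▷, as in the outputs above):

```python
import re

THINK_BLOCK = re.compile(r"◁think▷.*?◁/think▷", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    # Drop everything between the ◁think▷ ... ◁/think▷ markers, keeping the final answer.
    return THINK_BLOCK.sub("", text).strip()


print(strip_reasoning("◁think▷internal reasoning...◁/think▷The answer is 42."))
# -> "The answer is 42."
```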

Labels: ci/build, documentation, frontend, multi-modality, ready, v1

9 participants