clip : fix pixtral on some GPU backends #13097


Merged: 5 commits, Apr 25, 2025


ngxson (Collaborator) commented Apr 24, 2025

Working well on CUDA

But on Metal it still has a problem with the F16 mmproj file: #13065 (comment)

ggerganov (Member) commented:

Btw, I don't think there is a problem with Metal specifically. A CPU-only build (i.e. cmake -DGGML_METAL=OFF) also produces seemingly incorrect results with some images.

n_dim/2, n_head, n_pos,
ggml_row_size(cur->type, n_dim),
ggml_row_size(cur->type, n_dim*n_head),
n_dim/2 * ggml_element_size(cur));
tmp = ggml_rope_ext_inplace(
second = ggml_cont(ctx0, second); // copy, because ggml_rope don't play well with non-contiguous tensors
Member

If you suspect that ggml_rope is not implemented correctly for non-contiguous tensors, please add a test to test-backend-ops that shows the problem.

ngxson (Collaborator, Author) commented Apr 24, 2025

@ggerganov Hmm yeah, you're right. I think we currently have 2 separate problems:

The first problem: the lenna.png 512x512 image is perceived as 2 repeated images. I tested more, and it turns out this is not even a precision problem (both F16 and Q8_0 now output the same thing after this fix). If we remove --top-k 1, the model sometimes answers correctly.

I also double-checked the preprocessing and found no problem. Unfortunately, for now I also don't have the ability to run the same test using the original model, so I can't confirm exactly what the root cause is. But I think we can look into that later on; the more important part is to make it runnable on Vulkan/ROCm/CUDA.

The second problem: ggml_rope_ext_inplace doesn't work correctly on the CUDA backend. This fix confirms that this is the case. I will add a test as @slaren suggested, but maybe in a follow-up PR (are you ok with this?). Sorry, I ended up spending too much time on debugging this today.

How could I confirm that ggml_rope_ext_inplace was the problem?

  • If I comment out that part of the code on the CPU backend, it outputs exactly the same result as CUDA does with that part of the code
  • The current fix simply makes the tensor contiguous (essentially copying it), then uses ggml_concat to join it back later, which works
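
The copy-then-concat pattern described in the second bullet can be illustrated without ggml. The following is only a toy sketch in plain C++: `copy_columns` and `process_halves` are hypothetical helper names, the "ops" stand in for the rope calls, and nothing here reproduces the actual ggml_cont / ggml_concat code in the PR.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Op = std::function<void(std::vector<float> &)>;

// Copy columns [offset, offset + count) of every row of a row-major matrix
// into a new dense buffer. This plays the role of "take a strided view,
// then make it contiguous".
std::vector<float> copy_columns(const std::vector<float> & src, std::size_t row_len,
                                std::size_t offset, std::size_t count) {
    std::vector<float> out;
    out.reserve(src.size() / row_len * count);
    for (std::size_t r = 0; r < src.size() / row_len; ++r) {
        for (std::size_t c = 0; c < count; ++c) {
            out.push_back(src[r * row_len + offset + c]);
        }
    }
    return out;
}

// Split each row into two halves, process each half as a dense buffer,
// then join the halves back row by row (the "concat" step).
std::vector<float> process_halves(const std::vector<float> & src, std::size_t row_len,
                                  const Op & op_first, const Op & op_second) {
    const std::size_t half = row_len / 2;
    const std::size_t rest = row_len - half;
    std::vector<float> first  = copy_columns(src, row_len, 0,    half);
    std::vector<float> second = copy_columns(src, row_len, half, rest);
    op_first(first);   // operates on contiguous memory only
    op_second(second); // operates on contiguous memory only

    std::vector<float> out;
    out.reserve(src.size());
    for (std::size_t r = 0; r < src.size() / row_len; ++r) {
        for (std::size_t c = 0; c < half; ++c) out.push_back(first[r * half + c]);
        for (std::size_t c = 0; c < rest; ++c) out.push_back(second[r * rest + c]);
    }
    return out;
}
```

The extra copies cost memory and bandwidth, which is presumably why the in-place version on views was preferred originally; the sketch only shows that each operation sees dense data.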

ngxson marked this pull request as ready for review April 24, 2025 14:32
ngxson requested a review from ggerganov April 24, 2025 14:32
ngxson (Collaborator, Author) commented Apr 24, 2025

Please also note that I had a look at ggml_rope_multi, but unfortunately it's not what I was looking for. Looking closely at rope.cu, rope_multi does:

    dst[idst + 0]        = x0*cos_theta - x1*sin_theta;
    dst[idst + n_dims/2] = x0*sin_theta + x1*cos_theta;

While what I want is:

    dst[idst + 0] = x0*cos_theta - x1*sin_theta;
    dst[idst + 1] = x0*sin_theta + x1*cos_theta;
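
To make the difference between the two pairings concrete, here is a minimal standalone sketch. It is not the rope.cu kernel: `rope_neox` and `rope_normal` are hypothetical names, a single angle `theta` is used for every pair (real RoPE uses a different frequency per pair), and the input length is assumed to be even.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// NeoX-style ordering: element i is rotated together with element i + n/2.
std::vector<float> rope_neox(const std::vector<float> & x, float theta) {
    const std::size_t half = x.size() / 2;
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    std::vector<float> dst(x.size());
    for (std::size_t i = 0; i < half; ++i) {
        const float x0 = x[i];
        const float x1 = x[i + half];
        dst[i]        = x0*c - x1*s;
        dst[i + half] = x0*s + x1*c;
    }
    return dst;
}

// Normal ordering: element 2*i is rotated together with its neighbour 2*i + 1.
std::vector<float> rope_normal(const std::vector<float> & x, float theta) {
    const std::size_t half = x.size() / 2;
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    std::vector<float> dst(x.size());
    for (std::size_t i = 0; i < half; ++i) {
        const float x0 = x[2*i];
        const float x1 = x[2*i + 1];
        dst[2*i]     = x0*c - x1*s;
        dst[2*i + 1] = x0*s + x1*c;
    }
    return dst;
}
```

For the same input and angle the two functions generally give different vectors, because they rotate different pairs of elements; the outputs are not simply reordered versions of each other.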

stduhpf (Contributor) commented Apr 24, 2025

Works on Vulkan too (as well as on CPU, at least).

I can definitely confirm the repetition problem. Seems to happen whenever the resolution is not 1024x1024. Sometimes it claims to see two repetitions (ex: 512x512), sometimes three (ex: 768x1024), often "multiple" repetitions.

ngxson (Collaborator, Author) commented Apr 24, 2025

@stduhpf Thanks. Interesting results. Can you also ask the model:

  • Does the image appear to be a square?
  • Is the repetition horizontal or vertical?

stduhpf (Contributor) commented Apr 24, 2025

Prompt:

How many faces in this image? Does the image appear to be a square? If there is a repetition, is it horizontally or vertically?

Parsed outputs:

| Resolution | Input image | Number of faces | Perceived shape | Repetition type | Description |
|---|---|---|---|---|---|
| 1024x1024 | batch0_1024x1024_img | 1 | Square | None | Single square-shaped portrait of a young woman, plain background. |
| 768x1024 | batch1_768x1024_img | 5 | Rectangular | Horizontal | Rectangular painting with a row of 5 faces, horizontal repetition. |
| 512x1024 | batch2_512x1024_img | 2 | Rectangular | Horizontal | Two faces side by side, horizontally elongated, horizontal repetition. |
| 1024x768 | batch3_1024x768_img | 1 | Rectangular | None | Rectangular portrait of a young woman, no repetition. |
| 768x768 | batch4_768x768_img | 4+ | Rectangular | Horizontal & Vertical | Rectangular grid-like artwork with at least 4 faces, horizontal & vertical repetition. |
| 512x768 | batch5_512x768_img | 2 | Rectangular | Horizontal | Two portraits of a young girl side by side, horizontal repetition. |
| 1024x512 | batch6_1024x512_img | 1 | Rectangular | None | Rectangular portrait of a young woman with braids, no repetition. |
| 768x512 | batch7_768x512_img | Multiple (exact count unclear) | Rectangular | Vertical | Digital collage with overlapping faces, vertical repetition. |
| 512x512 | batch8_512x512_img | 2 | Rectangular | Horizontal | Two identical veiled faces side by side, horizontal repetition. |

| Resolution | Input image | Number of faces | Perceived shape | Repetition type | Description |
|---|---|---|---|---|---|
| 1152x896 | batchXL1_1152x896_img | 1 | Rectangular | None | Classical portrait of a young woman with a simple gradient background. |
| 896x1152 | batchXL2_896x1152_img | Several (exact count unclear due to overlapping) | Rectangular | Horizontal | Abstract/impressionistic painting with multiple overlapping faces. |
| 1216x832 | batchXL3_1216x832_img | 1 | Rectangular | None | Portrait of a young woman with a dark background. |
| 832x1216 | batchXL4_832x1216_img | Multiple (exact count unclear due to overlapping) | Rectangular | Both horizontal & vertical | Abstract painting with overlapping faces in muted earthy tones. |
| 1344x768 | batchXL5_1344x768_img | 1 | Rectangular | None | Classical portrait of a young woman with a dark background. |
| 768x1344 | batchXL6_768x1344_img | Multiple (exact count unclear due to abstraction) | Rectangular (elongated horizontally) | Unclear (abstract repetition) | Abstract/distorted representation of multiple blurred faces. |
| 1536x640 | batchXL7_1536x640_img | 1 | Rectangular | None | Classical single-face portrait in historical attire. |
| 640x1536 | batchXL8_640x1536_img | Several (horizontal repetition) | Rectangular | Horizontal | Painting/digital art with horizontally repeated overlapping faces. |

(I used an LLM to turn the raw descriptions into a table)

Raw Pixtral outputs

1024x1024:

The image appears to be a single, square-shaped portrait of a young woman. The subject is centrally positioned, and the composition is balanced. There are no other faces or objects visible in the image. The background is plain and does not contain any additional elements.

To summarize:

  • The number of faces: 1
  • The shape of the image: Square
  • Repetition: None

768x1024:

The image you provided appears to be a painting featuring a row of faces. The painting seems to have a rectangular shape, and it does not appear to be a square. The faces are arranged horizontally, suggesting that the repetition is horizontal.

Here are the details based on the image:

  • Number of faces: 5
  • Shape of the image: Rectangular
  • Repetition: Horizontal

512x1024:

The image shows two faces side by side. Let's address your questions one by one:

  1. How many faces in the image?

    • There are two faces in the image.
  2. Does the image appear to be a square?

    • The image does not appear to be a square. It is horizontally elongated, showing two faces side by side.
  3. If there is a repetition, is it horizontally or vertically?

    • The repetition is horizontal. The two faces are placed side by side in a horizontal arrangement.

So, to summarize:

  • There are two faces.
  • The image is not square.
  • The repetition is horizontal.

1024x768:

The image appears to be a portrait painting of a young woman. The painting has a single face, which is the central focus of the image. The background is relatively simple, with a dark color on one side and a lighter, golden hue on the other.

To address your specific questions:

  1. How many faces in this image?

    • There is one face in the image.
  2. Does the image appear to be a square?

    • The image does not appear to be a square. It looks more rectangular due to its aspect ratio.
  3. If there is a repetition, is it horizontally or vertically?

    • There is no repetition in the image. It is a single portrait without any repeated elements.

The painting is a classic portrait, likely from the 18th or 19th century, characterized by its detailed and realistic style.

768x768:

The image you provided appears to be a painting or a digital artwork featuring multiple faces. The faces are arranged in a grid-like pattern, suggesting a structured composition. Here are the observations based on the image:

  1. Number of Faces: There are multiple faces in the image. From what I can see, there are at least four faces visible.
  2. Shape of the Image: The image appears to be rectangular rather than square. The faces are aligned horizontally and vertically, suggesting a rectangular grid.
  3. Repetition: The faces are repeated in a pattern, which is common in grid-based artworks. The repetition is both horizontal and vertical.

512x768:

The image appears to be a painting featuring two portraits of a young girl side by side. Let's break down the elements you asked about:

  1. Number of Faces: There are two faces in the image.
  2. Shape of the Image: The image appears to be rectangular rather than square. The two portraits are aligned horizontally.
  3. Repetition: The repetition is horizontal, as the two portraits are placed side by side.

So, to summarize:

  • There are two faces.
  • The image is rectangular, not square.
  • The repetition is horizontal.

1024x512:

The image appears to be a portrait painting of a young woman with dark hair styled in braids. The background is a simple, dark color, possibly black or dark brown, which contrasts with the subject's light skin and hair. The painting is framed by a narrow border, which is visible on the left side of the image.

To address your specific questions:

  1. How many faces in the image?

    • There is one face in the image.
  2. Does the image appear to be a square?

    • The image does not appear to be a square. It looks more rectangular, with a longer horizontal dimension compared to the vertical dimension.
  3. If there is a repetition, is it horizontally or vertically?

    • There is no repetition in the image, either horizontally or vertically. The image is a single portrait without any repeated elements.

The overall composition and style suggest that the painting is likely a classical portrait, possibly from the 19th century or earlier, given the subject's attire and hairstyle.

768x512:

The image you provided appears to be a digital manipulation or collage of a portrait, likely of a historical or classical figure. The image shows multiple overlapping faces, creating a sense of depth and repetition.

Here are the answers to your questions:

  1. Number of Faces: The image contains multiple faces, but it's difficult to give an exact count due to the overlapping and repetition.
  2. Shape of the Image: The image appears to be rectangular rather than square, given the elongated shape and the vertical repetition.
  3. Repetition: The repetition is vertical, as the same face or similar faces are stacked on top of each other.

This type of image manipulation is often used in digital art to create a sense of depth, movement, or to emphasize certain features of the subject.

512x512:

The image shows a painting with two identical faces of a woman, each wearing a black veil and white headband. The painting appears to be rectangular rather than square. The repetition of the face is horizontal, as the two faces are placed side by side.

To summarize:

  • There are two faces in the image.
  • The image does not appear to be a square.
  • The repetition is horizontal.

1152x896:

The image you provided appears to be a portrait painting of a young woman. The painting is rendered in a classical style, with a focus on the subject's face and upper body. The background is relatively simple, with a gradient transitioning from dark to light.

Analysis of the Image:

  1. Faces: There is one face in the image, which is the central focus.
  2. Shape: The image appears to be rectangular rather than square.
  3. Repetition: There is no repetition of elements in the image. The composition is singular and focused on the portrait.

Summary:

  • Faces: 1
  • Shape: Rectangular
  • Repetition: None

The painting is a detailed and realistic portrayal of a young woman, likely intended to capture her likeness and convey a sense of her personality. The use of light and shadow helps to highlight her features and create depth in the image.

896x1152:

The image you provided appears to be an abstract or impressionistic painting featuring multiple faces. The faces are rendered in a loose, painterly style with visible brushstrokes, giving the image a textured appearance.

  1. Number of Faces: It is difficult to determine the exact number of faces due to the overlapping and abstract nature of the painting. However, there appear to be several faces visible, with some partially obscured by others.

  2. Shape of the Image: The image appears to be rectangular rather than square. This is evident from the aspect ratio, which is longer horizontally than it is vertically.

  3. Repetition: There is a sense of repetition in the image, with multiple faces arranged closely together. The repetition seems to be more horizontal, as the faces are aligned side by side.

Overall, the painting has a dynamic and expressive quality, with the repetition of faces creating a sense of unity and continuity.

1216x832:

The image you provided appears to be a portrait of a young woman. The portrait is a painting with a focus on the woman's face and upper body. The background is dark, which helps to highlight the subject.

Here are the answers to your questions:

  1. Number of Faces: There is one face in the image.
  2. Square Image: The image does not appear to be square. It looks more rectangular due to the vertical orientation of the subject's face and upper body.
  3. Repetition: There is no repetition in the image. The subject is the sole focus, and there are no repeated elements or patterns.

The portrait is a classic example of a traditional painting style, with attention to detail and a focus on the subject's features.

832x1216:

The image you provided appears to be an abstract or impressionistic painting with a focus on facial features. The painting is characterized by a series of overlapping faces rendered in a loose, painterly style. The colors are muted and earthy, with shades of brown, beige, and pink dominating the palette.

Analysis:

  1. Number of Faces:

    • The image contains multiple faces, though it's difficult to give an exact count due to the overlapping and abstract nature of the painting.
  2. Shape of the Image:

    • The image appears to be rectangular rather than square. This is inferred from the elongated horizontal and vertical proportions visible in the painting.
  3. Repetition:

    • There is a clear repetition of facial features, which suggests a pattern or motif. The repetition appears to be both horizontally and vertically, creating a sense of continuity and rhythm across the canvas.

Summary:

  • Faces: Multiple, overlapping facial features.
  • Shape: Rectangular.
  • Repetition: Both horizontal and vertical.

This type of painting might be exploring themes of identity, humanity, or emotional expression through the repetition and abstraction of facial features.

1344x768:

The image appears to be a portrait of a young woman, painted in a classical style. The subject is looking directly at the viewer with a neutral expression. The background is a simple, dark color, which contrasts with the subject's light skin and hair.

To answer your questions:

  1. How many faces in this image? There is one face in the image.
  2. Does the image appear to be a square? It is not possible to determine the aspect ratio of the image from the provided description, but the image does not appear to be square.
  3. If there is a repetition, is it horizontally or vertically? There is no repetition in the image, so this question does not apply.

The image is a detailed and realistic portrayal of a young woman, likely intended to capture her likeness and convey a sense of her personality.

768x1344:

The image appears to be a distorted or abstract representation of multiple faces. The faces are not clearly defined, and the image has a blurred and painterly quality. Here are the observations based on the image:

  1. Number of Faces: It is difficult to determine the exact number of faces due to the abstraction, but there appear to be multiple faces in the image.

  2. Shape of the Image: The image does not have a clear rectangular or square shape. It appears to be elongated horizontally, making it more rectangular than square.

  3. Repetition: There is a repetition of facial features, but it is not clear whether the repetition is horizontal or vertical due to the abstract nature of the image.

Overall, the image is abstract and does not provide clear details about the number of faces or the shape of the image. The repetition of facial features is present but not distinctly aligned horizontally or vertically.

1536x640:

The image you provided appears to be a classical portrait of a woman, likely painted in the style of a traditional oil painting. The portrait is a single face, and it does not contain multiple faces or any other objects.

To address your questions:

  1. How many faces in this image?

    • There is one face in the image.
  2. Does the image appear to be a square?

    • The image does not appear to be a square. It looks more rectangular, with a horizontal orientation.
  3. If there is a repetition, is it horizontally or vertically?

    • There is no repetition in the image. It is a single portrait without any repeated elements.

The portrait is a detailed representation of a woman with a serene expression, dressed in what appears to be historical attire. The background is simple and dark, which helps to emphasize the subject's face.

640x1536:

The image appears to be a painting or digital art piece featuring multiple faces. The faces are arranged in a repetitive pattern, creating a sense of depth and continuity. Here are the observations based on the image:

  1. Number of Faces: It is difficult to determine the exact number of faces due to the overlapping and repetitive nature of the image, but there are several faces visible.

  2. Shape of the Image: The image appears to be rectangular rather than square. This is inferred from the elongated horizontal arrangement of the faces.

  3. Repetition: The faces are repeated in a horizontal pattern, creating a sense of continuity and depth.

Overall, the image is a visually intriguing piece that uses repetition and overlapping to create a unique aesthetic.

ngxson (Collaborator, Author) commented Apr 24, 2025

@stduhpf Thanks, that's very helpful. I pushed a test branch to force the preprocessed image to be 1024x1024: https://github.com/ngxson/llama.cpp/tree/xsn/test_pixtral_fixed_size

Could you re-run some tests with this?

stduhpf (Contributor) commented Apr 24, 2025

> @stduhpf Thanks, that's very helpful. I pushed a test branch to force the preprocessed image to be 1024x1024: https://github.com/ngxson/llama.cpp/tree/xsn/test_pixtral_fixed_size
>
> Could you re-run some tests with this?

It looks like this does fix the repetition issue:

640x1536:

The image appears to be a portrait painting of a young woman. The painting is in a vertical orientation and has a rectangular shape. Here are the details based on the questions:

  1. Number of Faces: There is one face visible in the image.
  2. Square Appearance: The image does not appear to be a square; it is more rectangular in shape.
  3. Repetition: There is no visible repetition in the image, either horizontally or vertically.

The portrait is detailed and captures the subject's facial features and expression. The woman is dressed in period attire, with a dark dress and a white collar or scarf. The background is relatively plain, focusing attention on the subject.

512x512:

The image appears to be a portrait painting of a woman. Let's address the questions one by one:

  1. How many faces in the image?

    • There is one face in the image, which belongs to the woman in the portrait.
  2. Does the image appear to be a square?

    • Yes, the image appears to be a square. This is evident from the equal length and width of the image.
  3. If there is a repetition, is it horizontally or vertically?

    • There is no repetition in the image. The portrait is unique and does not show any horizontal or vertical repetition.

So, the image is a square portrait of a woman with no repetitions.

ggerganov (Member) left a comment

The results provided above indicate some incorrect frequency computation, likely a missing scaling of the input positions. Just from looking at our code, without knowing how the reference implementation works, I suspect that this part might need some adjustment:

    else if (ctx->proj_type == PROJECTOR_TYPE_PIXTRAL) {
        // set the 2D positions
        int n_patches_per_col = image_size_width / patch_size;
        std::vector<int> pos_data(num_positions);
        struct ggml_tensor * pos;
        // dimension H
        pos = ggml_graph_get_tensor(gf, "pos_h");
        for (int i = 0; i < num_positions; i++) {
            pos_data[i] = i / n_patches_per_col;
        }
        ggml_backend_tensor_set(pos, pos_data.data(), 0, ggml_nbytes(pos));
        // dimension W
        pos = ggml_graph_get_tensor(gf, "pos_w");
        for (int i = 0; i < num_positions; i++) {
            pos_data[i] = i % n_patches_per_col;
        }
        ggml_backend_tensor_set(pos, pos_data.data(), 0, ggml_nbytes(pos));
    }

With the current implementation, the input positions grow unbounded with the image size:

0 1 2 3 4 ...

While I would have expected them to be in a fixed range:

0/nx 1/nx 2/nx 3/nx ... 1

Anyway, we can resolve this later.
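
As a rough illustration of the two position schemes contrasted above, the sketch below computes both for a small patch grid. The names `Pos2D` and `make_positions` are hypothetical, and dividing by (n - 1) so that the last position is exactly 1 is only one possible reading of "0/nx 1/nx ... 1". The thread does not confirm that the reference implementation normalizes positions at all.

```cpp
#include <vector>

struct Pos2D {
    std::vector<int>   h, w;           // raw integer positions, as in the snippet above
    std::vector<float> h_norm, w_norm; // assumed normalized variant in [0, 1]
};

// Build per-patch 2D positions for a grid of n_patches_x by n_patches_y patches,
// with patches stored row by row.
Pos2D make_positions(int n_patches_x, int n_patches_y) {
    Pos2D p;
    const int n = n_patches_x * n_patches_y;
    for (int i = 0; i < n; ++i) {
        const int row = i / n_patches_x;
        const int col = i % n_patches_x;
        p.h.push_back(row);
        p.w.push_back(col);
        // Guard against division by zero for single-row or single-column grids.
        p.h_norm.push_back(n_patches_y > 1 ? float(row) / float(n_patches_y - 1) : 0.0f);
        p.w_norm.push_back(n_patches_x > 1 ? float(col) / float(n_patches_x - 1) : 0.0f);
    }
    return p;
}
```

With the raw scheme the largest position grows with the grid size, while the normalized values stay in the same range for any resolution.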

Maybe add a TODO comment to restore the previous approach of inplace ropes + views once the test-backend-ops test is added and the CUDA implementation is updated.

Regarding the multi-rope comment that I think I saw somewhere earlier - in the future it should be expanded to support the NeoX-ordering type, through the mode parameter of ggml_rope_ext.

ngxson (Collaborator, Author) commented Apr 25, 2025

Thanks @ggerganov for pointing me to the correct code. Turns out, the problem was due to 1 line of code at the beginning of clip_encode that I never had time to read 😂

Because the original llava model uses a fixed input size, the code was written in such a way that it only cares about the max image size (a square image), and only allows a dynamic size if specified. It's ugly logic tbh, I will have to refactor it later on.
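
A small sketch can show why falling back to the maximum square size matters for a non-maximal input. Everything here is hypothetical and not the actual clip_encode logic: `effective_size`, `n_patches`, and `patch_row` are made-up helpers, and the values used in the note below (maximum size 1024, patch size 16) are assumptions chosen for illustration.

```cpp
struct ImageSize {
    int width;
    int height;
};

// Hypothetical helper: choose the size the encoder works with.
// Without dynamic sizing, the actual size is ignored in favour of the square maximum.
ImageSize effective_size(ImageSize actual, int max_image_size, bool dynamic_size) {
    if (dynamic_size) {
        return actual;
    }
    return ImageSize{max_image_size, max_image_size};
}

// Total number of patches for a given image size.
int n_patches(ImageSize s, int patch_size) {
    return (s.width / patch_size) * (s.height / patch_size);
}

// Row index (H position) assigned to patch i when patches are stored row by row.
int patch_row(int i, ImageSize s, int patch_size) {
    return i / (s.width / patch_size);
}
```

Under these assumptions, a 512x512 input has 32 patches per row, so patch 32 starts the second row. A grid derived from the 1024 maximum has 64 patches per row and would place the same patch on row 0. Whether the bug manifested exactly this way is not stated in the thread.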

I'm adding some TODOs in the next commit.

In the meantime, @stduhpf could you rerun the test with 2461682 ? Thanks.

> Regarding the multi-rope comment that I think I saw somewhere earlier - in the future it should be expanded to support the NeoX-ordering type, through the mode parameter of ggml_rope_ext.

Small correction: multi-rope currently uses the NeoX style, and what we need is support for the normal ordering.

stduhpf (Contributor) commented Apr 25, 2025

No issues so far with 2461682.

github-actions bot added the testing label Apr 25, 2025
ngxson merged commit edb18b6 into ggml-org:master Apr 25, 2025
89 of 90 checks passed
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
Labels: examples, testing