clip : fix pixtral on some GPU backends #13097
Conversation
Btw, I don't think there is a problem with Metal specifically. With CPU-only (i.e.
```cpp
            n_dim/2, n_head, n_pos,
            ggml_row_size(cur->type, n_dim),
            ggml_row_size(cur->type, n_dim*n_head),
            n_dim/2 * ggml_element_size(cur));
        tmp = ggml_rope_ext_inplace(
        second = ggml_cont(ctx0, second); // copy, because ggml_rope don't play well with non-contiguous tensors
```
If you suspect that `ggml_rope` is not implemented correctly for non-contiguous tensors, please add a test to `test-backend-ops` that shows the problem.
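For illustration, a minimal sketch of the kind of graph such a test case might build: it applies rope to a non-contiguous half-width view of a [n_dim, n_head, n_pos] tensor, mirroring the view above. The test-backend-ops harness itself (CPU vs. backend comparison) is assumed and not shown, and the function name and constants below are placeholders:

```cpp
#include "ggml.h"

// Placeholder sketch: build a graph that ropes a non-contiguous view.
// A backend whose rope kernel implicitly assumes contiguous input would
// disagree with the CPU reference on the result of this graph.
// Example sizes: n_dim = 64, n_head = 16, n_pos = 1024 (n_dim/2 must be even).
static ggml_tensor * build_noncontig_rope(ggml_context * ctx,
                                          int n_dim, int n_head, int n_pos) {
    ggml_tensor * cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_dim, n_head, n_pos);
    ggml_tensor * pos = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_pos);

    // second half of each row: element offset of n_dim/2 -> not contiguous
    ggml_tensor * second = ggml_view_3d(ctx, cur,
        n_dim/2, n_head, n_pos,
        ggml_row_size(cur->type, n_dim),
        ggml_row_size(cur->type, n_dim*n_head),
        n_dim/2 * ggml_element_size(cur));

    // rope the view directly, without ggml_cont
    return ggml_rope_ext(ctx, second, pos, nullptr,
        n_dim/2, GGML_ROPE_TYPE_NEOX, 0,
        10000.0f, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f);
}
```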
@ggerganov Hmm yeah you're right, I think we currently have two separate problems.

The first problem is the repeated output. I also double-checked the preprocessing and found no problem. Unfortunately, for now, I don't have the ability to run the same test using the original model, so I can't confirm exactly what the root cause is. But I think we can look into that later on; the more important part is to make it runnable on Vulkan/ROCm/CUDA.

The second problem is about `ggml_rope` with non-contiguous tensors, as discussed above.
Please also note that I had a look at the existing rope kernel. What it currently computes is:

```cpp
dst[idst + 0] = x0*cos_theta - x1*sin_theta;
dst[idst + n_dims/2] = x0*sin_theta + x1*cos_theta;
```

While what I want is:

```cpp
dst[idst + 0] = x0*cos_theta - x1*sin_theta;
dst[idst + 1] = x0*sin_theta + x1*cos_theta;
```
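For reference, here is a small standalone sketch (not the actual ggml/CUDA kernels) contrasting the two output layouts, assuming the usual RoPE frequency schedule theta_i = pos * freq_base^(-2i/n_dims); the function names are made up for illustration:

```cpp
#include <cmath>

// neox-style: the rotated pair is written to slots i and i + n_dims/2
static void rope_row_neox(float * dst, const float * x, int n_dims, int pos, float freq_base) {
    for (int i = 0; i < n_dims/2; i++) {
        const float theta = pos * std::pow(freq_base, -2.0f*i/n_dims);
        const float x0 = x[i], x1 = x[i + n_dims/2];
        dst[i]            = x0*std::cos(theta) - x1*std::sin(theta);
        dst[i + n_dims/2] = x0*std::sin(theta) + x1*std::cos(theta);
    }
}

// normal ordering: the rotated pair is written to adjacent slots 2*i and 2*i + 1
static void rope_row_norm(float * dst, const float * x, int n_dims, int pos, float freq_base) {
    for (int i = 0; i < n_dims/2; i++) {
        const float theta = pos * std::pow(freq_base, -2.0f*i/n_dims);
        const float x0 = x[2*i + 0], x1 = x[2*i + 1];
        dst[2*i + 0] = x0*std::cos(theta) - x1*std::sin(theta);
        dst[2*i + 1] = x0*std::sin(theta) + x1*std::cos(theta);
    }
}
```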
Works on Vulkan too (as well as on CPU, at least). I can definitely confirm the repetition problem. It seems to happen whenever the resolution is not 1024x1024. Sometimes it claims to see two repetitions (ex: 512x512), sometimes three (ex: 768x1024), often "multiple" repetitions.
@stduhpf Thanks. Interesting results. Can you also ask the model:
@stduhpf Thanks, that's very helpful. I pushed a test branch to force the preprocessed image to be 1024x1024: https://github.com/ngxson/llama.cpp/tree/xsn/test_pixtral_fixed_size Could you re-run some tests with this?
It looks like this does fix the repetition issue:

640x1536:

The image appears to be a portrait painting of a young woman. The painting is in a vertical orientation and has a rectangular shape. Here are the details based on the questions:

The portrait is detailed and captures the subject's facial features and expression. The woman is dressed in period attire, with a dark dress and a white collar or scarf. The background is relatively plain, focusing attention on the subject.

512x512:

The image appears to be a portrait painting of a woman. Let's address the questions one by one:

So, the image is a square portrait of a woman with no repetitions.
The provided results above indicate some incorrect frequency computation - likely missing scaling of the input positions. Judging only from our code, without being aware of how the reference implementation works, I suspect that this part might need some adjustment:
llama.cpp/examples/llava/clip.cpp, lines 2938 to 2955 in 13be08d:

```cpp
else if (ctx->proj_type == PROJECTOR_TYPE_PIXTRAL) {
    // set the 2D positions
    int n_patches_per_col = image_size_width / patch_size;
    std::vector<int> pos_data(num_positions);
    struct ggml_tensor * pos;
    // dimension H
    pos = ggml_graph_get_tensor(gf, "pos_h");
    for (int i = 0; i < num_positions; i++) {
        pos_data[i] = i / n_patches_per_col;
    }
    ggml_backend_tensor_set(pos, pos_data.data(), 0, ggml_nbytes(pos));
    // dimension W
    pos = ggml_graph_get_tensor(gf, "pos_w");
    for (int i = 0; i < num_positions; i++) {
        pos_data[i] = i % n_patches_per_col;
    }
    ggml_backend_tensor_set(pos, pos_data.data(), 0, ggml_nbytes(pos));
}
```
With the current implementation, the input positions grow unbounded with the image size:
0 1 2 3 4 ...
While I would have expected them to be in a fixed range:
0/nx 1/nx 2/nx 3/nx ... 1
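To make the contrast concrete, here is a small hypothetical sketch (not the clip.cpp code; the function names are made up) of the two conventions for the "H" dimension:

```cpp
#include <algorithm>
#include <vector>

// Raw 2D row positions, as in the excerpt above: the values grow with the image height.
static std::vector<int> raw_pos_h(int n_patches_h, int n_patches_w) {
    std::vector<int> pos(n_patches_h * n_patches_w);
    for (int i = 0; i < (int) pos.size(); i++) {
        pos[i] = i / n_patches_w; // 0,0,...,0, 1,1,...,1, ..., n_patches_h - 1
    }
    return pos;
}

// Positions normalized to a fixed [0, 1] range, as suggested above. Note that
// ggml's rope takes I32 positions, so in practice such a factor would more likely
// be folded into the frequency scaling rather than into the positions themselves.
static std::vector<float> normalized_pos_h(int n_patches_h, int n_patches_w) {
    const float denom = (float) std::max(1, n_patches_h - 1);
    std::vector<float> pos(n_patches_h * n_patches_w);
    for (int i = 0; i < (int) pos.size(); i++) {
        pos[i] = (float) (i / n_patches_w) / denom;
    }
    return pos;
}
```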
Anyway, we can resolve this later.
Maybe add a TODO comment to restore the previous approach of inplace ropes + views once the `test-backend-ops` test is added and the CUDA implementation is updated.
Regarding the multi-rope comment that I think I saw somewhere earlier - in the future it should be expanded to support the NeoX-ordering type, through the `mode` parameter of `ggml_rope_ext`.
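For reference, a minimal sketch (assumed usage, not a change made in this PR) of how the ordering would be selected through that parameter when building the graph:

```cpp
#include "ggml.h"

// Sketch: the rotation ordering is selected via the `mode` argument of ggml_rope_ext.
// mode 0 rotates adjacent pairs (2i, 2i+1); GGML_ROPE_TYPE_NEOX pairs element i
// with element i + n_dims/2. `cur` is [n_dims, n_head, n_pos], `pos` is an I32 vector.
static ggml_tensor * rope_with_mode(ggml_context * ctx, ggml_tensor * cur,
                                    ggml_tensor * pos, int n_dims, int mode) {
    return ggml_rope_ext(ctx, cur, pos, nullptr,
                         n_dims, mode, 0,
                         10000.0f, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f);
}
```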
Thanks @ggerganov for pointing me to the correct code. Turns out, the problem was due to one line of code at the beginning of the image preprocessing. Because the original llava model uses a fixed input size, the code was written in such a way that it only cares about the max image size (a square image), and only allows a dynamic size if specified. It's ugly logic tbh, I will have to refactor it later on. I'm adding some TODOs in a next commit.

In the meantime, @stduhpf could you rerun the test with 2461682? Thanks.
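As an illustration of the kind of logic being described, here is a hypothetical sketch (not the actual clip.cpp code; all names and values below are made up):

```cpp
#include <algorithm>
#include <utility>

// Hypothetical: the preprocessing picks a fixed square target size unless dynamic
// sizing is explicitly requested, which is wrong for a pixtral-style model that
// should keep the native aspect ratio.
struct preprocess_params {
    int  max_image_size;     // e.g. 1024 for the square default
    int  patch_size;         // e.g. 16
    bool allow_dynamic_size; // must be true for pixtral-style models
};

static std::pair<int, int> target_size(const preprocess_params & p, int w, int h) {
    if (!p.allow_dynamic_size) {
        // legacy llava path: always resize to the fixed square
        return {p.max_image_size, p.max_image_size};
    }
    // dynamic path: keep the aspect ratio, clamp the longer side, round to patch multiples
    const float scale = std::min(1.0f, (float) p.max_image_size / (float) std::max(w, h));
    auto round_to_patch = [&](int x) {
        return std::max(p.patch_size, (int) (x * scale) / p.patch_size * p.patch_size);
    };
    return {round_to_patch(w), round_to_patch(h)};
}
```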
Small correction: multi-rope is currently using neox-style, and what we need is to support the normal ordering.
No issues so far with 2461682.
* clip : fix pixtral on some GPU backends
* refactor inp_raw set
* rm outdated comment
* fix dynamic size
* add TODO
Working well on CUDA
But on Metal it still has a problem with the F16 mmproj file: #13065 (comment)