clip : fix pixtral on some GPU backends #13097
Conversation
Btw, I don't think there is a problem with Metal specifically. With CPU-only (i.e.
```cpp
            n_dim/2, n_head, n_pos,
            ggml_row_size(cur->type, n_dim),
            ggml_row_size(cur->type, n_dim*n_head),
            n_dim/2 * ggml_element_size(cur));
        tmp = ggml_rope_ext_inplace(
        second = ggml_cont(ctx0, second); // copy, because ggml_rope don't play well with non-contiguous tensors
```
If you suspect that `ggml_rope` is not implemented correctly for non-contiguous tensors, please add a test to `test-backend-ops` that shows the problem.
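For illustration, a minimal sketch of the kind of graph such a test case might build: it applies rope to a non-contiguous half-width view of a [n_dim, n_head, n_pos] tensor, mirroring the view above. The test-backend-ops harness itself (CPU vs. backend comparison) is assumed and not shown, and the function name and constants below are placeholders:

```cpp
#include "ggml.h"

// Placeholder sketch: build a graph that ropes a non-contiguous view.
// A backend whose rope kernel implicitly assumes contiguous input would
// disagree with the CPU reference on the result of this graph.
// Example sizes: n_dim = 64, n_head = 16, n_pos = 1024 (n_dim/2 must be even).
static ggml_tensor * build_noncontig_rope(ggml_context * ctx,
                                          int n_dim, int n_head, int n_pos) {
    ggml_tensor * cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_dim, n_head, n_pos);
    ggml_tensor * pos = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_pos);

    // second half of each row: element offset of n_dim/2 -> not contiguous
    ggml_tensor * second = ggml_view_3d(ctx, cur,
        n_dim/2, n_head, n_pos,
        ggml_row_size(cur->type, n_dim),
        ggml_row_size(cur->type, n_dim*n_head),
        n_dim/2 * ggml_element_size(cur));

    // rope the view directly, without ggml_cont
    return ggml_rope_ext(ctx, second, pos, nullptr,
        n_dim/2, GGML_ROPE_TYPE_NEOX, 0,
        10000.0f, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f);
}
```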
@ggerganov Hmm yeah you're right, I think we currently have two separate problems.

The first problem is the repeated output. I also double-checked the preprocessing and found no problem. Unfortunately, for now, I don't have the ability to run the same test using the original model, so I can't confirm exactly what the root cause is. But I think we can look into that later on; the more important part is to make it runnable on Vulkan/ROCm/CUDA.

The second problem is about `ggml_rope` with non-contiguous tensors, as discussed above.
Please also note that I had a look at the existing rope kernel. What it currently computes is:

```cpp
dst[idst + 0] = x0*cos_theta - x1*sin_theta;
dst[idst + n_dims/2] = x0*sin_theta + x1*cos_theta;
```

While what I want is:

```cpp
dst[idst + 0] = x0*cos_theta - x1*sin_theta;
dst[idst + 1] = x0*sin_theta + x1*cos_theta;
```
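For reference, here is a small standalone sketch (not the actual ggml/CUDA kernels) contrasting the two output layouts, assuming the usual RoPE frequency schedule theta_i = pos * freq_base^(-2i/n_dims); the function names are made up for illustration:

```cpp
#include <cmath>

// neox-style: the rotated pair is written to slots i and i + n_dims/2
static void rope_row_neox(float * dst, const float * x, int n_dims, int pos, float freq_base) {
    for (int i = 0; i < n_dims/2; i++) {
        const float theta = pos * std::pow(freq_base, -2.0f*i/n_dims);
        const float x0 = x[i], x1 = x[i + n_dims/2];
        dst[i]            = x0*std::cos(theta) - x1*std::sin(theta);
        dst[i + n_dims/2] = x0*std::sin(theta) + x1*std::cos(theta);
    }
}

// normal ordering: the rotated pair is written to adjacent slots 2*i and 2*i + 1
static void rope_row_norm(float * dst, const float * x, int n_dims, int pos, float freq_base) {
    for (int i = 0; i < n_dims/2; i++) {
        const float theta = pos * std::pow(freq_base, -2.0f*i/n_dims);
        const float x0 = x[2*i + 0], x1 = x[2*i + 1];
        dst[2*i + 0] = x0*std::cos(theta) - x1*std::sin(theta);
        dst[2*i + 1] = x0*std::sin(theta) + x1*std::cos(theta);
    }
}
```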
Works on Vulkan too (as well as on CPU, at least). I can definitely confirm the repetition problem. It seems to happen whenever the resolution is not 1024x1024. Sometimes it claims to see two repetitions (ex: 512x512), sometimes three (ex: 768x1024), often "multiple" repetitions.
@stduhpf Thanks. Interesting results. Can you also ask the model:
@stduhpf Thanks, that's very helpful. I pushed a test branch to force the preprocessed image to be 1024x1024: https://github.com/ngxson/llama.cpp/tree/xsn/test_pixtral_fixed_size Could you re-run some tests with this?
It looks like this does fix the repetition issue:

640x1536:

The image appears to be a portrait painting of a young woman. The painting is in a vertical orientation and has a rectangular shape. Here are the details based on the questions:

The portrait is detailed and captures the subject's facial features and expression. The woman is dressed in period attire, with a dark dress and a white collar or scarf. The background is relatively plain, focusing attention on the subject.

512x512:

The image appears to be a portrait painting of a woman. Let's address the questions one by one:

So, the image is a square portrait of a woman with no repetitions.
The provided results above indicate some incorrect frequency computation - likely missing scaling of the input positions. Judging only from our code, without being aware of how the reference implementation works, I suspect that this part might need some adjustment:
llama.cpp/examples/llava/clip.cpp, lines 2938 to 2955 in 13be08d:

```cpp
else if (ctx->proj_type == PROJECTOR_TYPE_PIXTRAL) {
    // set the 2D positions
    int n_patches_per_col = image_size_width / patch_size;
    std::vector<int> pos_data(num_positions);
    struct ggml_tensor * pos;
    // dimension H
    pos = ggml_graph_get_tensor(gf, "pos_h");
    for (int i = 0; i < num_positions; i++) {
        pos_data[i] = i / n_patches_per_col;
    }
    ggml_backend_tensor_set(pos, pos_data.data(), 0, ggml_nbytes(pos));
    // dimension W
    pos = ggml_graph_get_tensor(gf, "pos_w");
    for (int i = 0; i < num_positions; i++) {
        pos_data[i] = i % n_patches_per_col;
    }
    ggml_backend_tensor_set(pos, pos_data.data(), 0, ggml_nbytes(pos));
}
```
With the current implementation, the input positions grow unbounded with the image size:
0 1 2 3 4 ...
While I would have expected them to be in a fixed range:
0/nx 1/nx 2/nx 3/nx ... 1
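To make the contrast concrete, here is a small hypothetical sketch (not the clip.cpp code; the function names are made up) of the two conventions for the "H" dimension:

```cpp
#include <algorithm>
#include <vector>

// Raw 2D row positions, as in the excerpt above: the values grow with the image height.
static std::vector<int> raw_pos_h(int n_patches_h, int n_patches_w) {
    std::vector<int> pos(n_patches_h * n_patches_w);
    for (int i = 0; i < (int) pos.size(); i++) {
        pos[i] = i / n_patches_w; // 0,0,...,0, 1,1,...,1, ..., n_patches_h - 1
    }
    return pos;
}

// Positions normalized to a fixed [0, 1] range, as suggested above. Note that
// ggml's rope takes I32 positions, so in practice such a factor would more likely
// be folded into the frequency scaling rather than into the positions themselves.
static std::vector<float> normalized_pos_h(int n_patches_h, int n_patches_w) {
    const float denom = (float) std::max(1, n_patches_h - 1);
    std::vector<float> pos(n_patches_h * n_patches_w);
    for (int i = 0; i < (int) pos.size(); i++) {
        pos[i] = (float) (i / n_patches_w) / denom;
    }
    return pos;
}
```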
Anyway, we can resolve this later.
Maybe add a TODO comment to restore the previous approach of inplace ropes + views once the `test-backend-ops` test is added and the CUDA implementation is updated.
Regarding the multi-rope comment that I think I saw somewhere earlier - in the future it should be expanded to support the NeoX-ordering type, through the `mode` parameter of `ggml_rope_ext`.
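For reference, a minimal sketch (assumed usage, not a change made in this PR) of how the ordering would be selected through that parameter when building the graph:

```cpp
#include "ggml.h"

// Sketch: the rotation ordering is selected via the `mode` argument of ggml_rope_ext.
// mode 0 rotates adjacent pairs (2i, 2i+1); GGML_ROPE_TYPE_NEOX pairs element i
// with element i + n_dims/2. `cur` is [n_dims, n_head, n_pos], `pos` is an I32 vector.
static ggml_tensor * rope_with_mode(ggml_context * ctx, ggml_tensor * cur,
                                    ggml_tensor * pos, int n_dims, int mode) {
    return ggml_rope_ext(ctx, cur, pos, nullptr,
                         n_dims, mode, 0,
                         10000.0f, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f);
}
```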
Thanks @ggerganov for pointing me to the correct code. Turns out, the problem was due to one line of code at the beginning of the image preprocessing. Because the original llava model uses a fixed input size, the code was written in such a way that it only cares about the max image size (a square image), and only allows a dynamic size if specified. It's ugly logic tbh, I will have to refactor it later on. I'm adding some TODOs in a next commit.

In the meantime, @stduhpf could you rerun the test with 2461682? Thanks.
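As an illustration of the kind of logic being described, here is a hypothetical sketch (not the actual clip.cpp code; all names and values below are made up):

```cpp
#include <algorithm>
#include <utility>

// Hypothetical: the preprocessing picks a fixed square target size unless dynamic
// sizing is explicitly requested, which is wrong for a pixtral-style model that
// should keep the native aspect ratio.
struct preprocess_params {
    int  max_image_size;     // e.g. 1024 for the square default
    int  patch_size;         // e.g. 16
    bool allow_dynamic_size; // must be true for pixtral-style models
};

static std::pair<int, int> target_size(const preprocess_params & p, int w, int h) {
    if (!p.allow_dynamic_size) {
        // legacy llava path: always resize to the fixed square
        return {p.max_image_size, p.max_image_size};
    }
    // dynamic path: keep the aspect ratio, clamp the longer side, round to patch multiples
    const float scale = std::min(1.0f, (float) p.max_image_size / (float) std::max(w, h));
    auto round_to_patch = [&](int x) {
        return std::max(p.patch_size, (int) (x * scale) / p.patch_size * p.patch_size);
    };
    return {round_to_patch(w), round_to_patch(h)};
}
```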
Small correction: multi-rope is currently using neox-style, and what we need is to support the normal ordering.
No issues so far with 2461682.
* clip : fix pixtral on some GPU backends
* refactor inp_raw set
* rm outdated comment
* fix dynamic size
* add TODO
Working well on CUDA
But on Metal it still has a problem with the F16 mmproj file: #13065 (comment)