Skip to content

mtmd : add methods to access mtmd_image_tokens #12906

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 18, 2025

Conversation

ngxson
Copy link
Collaborator

@ngxson ngxson commented Apr 11, 2025

Part of this PR is extracted from #12898 , for easier review

Note: we're doing hash of the post-processed image (f32 image), but probably it's better to clip_image_preprocess doesn't expose the internal downscaled image (u8 image) - this can be implemented in a future version of clip.cpp

@ngxson ngxson requested review from ggerganov and slaren April 11, 2025 20:20
@ngxson ngxson force-pushed the xsn/mtmd_image_api branch from bfbabea to c3587e2 Compare April 12, 2025 08:56
@ngxson ngxson force-pushed the xsn/mtmd_image_api branch from c3587e2 to d3c3e20 Compare April 12, 2025 09:03
@ngxson ngxson changed the title mtmd : add methods to access mtmd_image_tokens, add ability to get image hash mtmd : add methods to access mtmd_image_tokens Apr 12, 2025
@@ -123,14 +130,14 @@ mtmd_input_chunks * mtmd_tokenize(mtmd_context * ctx,
std::move(tokens),
{},
};
output->emplace_back(std::move(chunk));
output.emplace_back(std::move(chunk));

if (&parts.back() != &part) {
Copy link
Member

@ggerganov ggerganov Apr 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure, but I think this logic does not handle the case where the text ends with an image marker:

<some_text><image_marker>

For Gemma3 this will not happen because we wrap the image marker with text on both sides, but maybe for other models it could happen? If it cannot happen for sure, then this if should become assert.

Copy link
Collaborator Author

@ngxson ngxson Apr 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current logic will produce an empty text chunk in the end if the image marker is placed at the end on the input prompt. This is an effect from string_split_str

For example, this code:

auto test = string_split_str("123aa456aa", "aa");
for (auto & p : test) printf("'%s'\n", p.c_str());

Will output:

'123'
'456'
''

I think having an empty chunk in the end is expected for now, but I should document it better.

If we don't want this empty chunk, the proper way is to stop using string_split_str and to write our own code to do string matching / splitting.

In reality, this will almost never happen because user always input a prompt with a generation prefix, something like <s>user\nwhat do you see?<image></s><s>assistant\n

Copy link
Collaborator Author

@ngxson ngxson Apr 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed one line of code:

auto tokens = mtmd_tokenize_text_internal(...);
if (tokens.empty()) {
    continue;
}

So that means there is no empty chunk being added, the case where image marker placed in the end is correctly handled. This also handles the case where 2 image markers are place one next to the other.

(This piece of code was firstly introduced from my first attempt to refactor vision API, so yeah it's quite hacky)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will refactor this function later on. Could you review the rest of this PR? Thanks!!

@ngxson ngxson merged commit b9154ec into ggml-org:master Apr 18, 2025
51 checks passed
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* fix prompt_modified

* rm redundant data member
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* fix prompt_modified

* rm redundant data member
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants