-
Notifications
You must be signed in to change notification settings - Fork 11.9k
mtmd : add methods to access mtmd_image_tokens
#12906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bfbabea
to
c3587e2
Compare
c3587e2
to
d3c3e20
Compare
mtmd_image_tokens
, add ability to get image hashmtmd_image_tokens
@@ -123,14 +130,14 @@ mtmd_input_chunks * mtmd_tokenize(mtmd_context * ctx, | |||
std::move(tokens), | |||
{}, | |||
}; | |||
output->emplace_back(std::move(chunk)); | |||
output.emplace_back(std::move(chunk)); | |||
|
|||
if (&parts.back() != &part) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% sure, but I think this logic does not handle the case where the text ends with an image marker:
<some_text><image_marker>
For Gemma3 this will not happen because we wrap the image marker with text on both sides, but maybe for other models it could happen? If it cannot happen for sure, then this if
should become assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current logic will produce an empty text chunk in the end if the image marker is placed at the end on the input prompt. This is an effect from string_split_str
For example, this code:
auto test = string_split_str("123aa456aa", "aa");
for (auto & p : test) printf("'%s'\n", p.c_str());
Will output:
'123'
'456'
''
I think having an empty chunk in the end is expected for now, but I should document it better.
If we don't want this empty chunk, the proper way is to stop using string_split_str
and to write our own code to do string matching / splitting.
In reality, this will almost never happen because user always input a prompt with a generation prefix, something like <s>user\nwhat do you see?<image></s><s>assistant\n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed one line of code:
auto tokens = mtmd_tokenize_text_internal(...);
if (tokens.empty()) {
continue;
}
So that means there is no empty chunk being added, the case where image marker placed in the end is correctly handled. This also handles the case where 2 image markers are place one next to the other.
(This piece of code was firstly introduced from my first attempt to refactor vision API, so yeah it's quite hacky)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will refactor this function later on. Could you review the rest of this PR? Thanks!!
* mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * fix prompt_modified * rm redundant data member
* mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * fix prompt_modified * rm redundant data member
Part of this PR is extracted from #12898 , for easier review
Note: we're doing hash of the post-processed image (f32 image), but probably it's better to
clip_image_preprocess
doesn't expose the internal downscaled image (u8 image) - this can be implemented in a future version ofclip.cpp