tokenize : add --no-parse-special option #8423

Merged
merged 1 commit into from
Jul 11, 2024

compilade
Collaborator

@compilade compilade commented Jul 10, 2024

This should make it easier to explain how parse_special affects tokenization.

I felt the need for this while working on #8228: the tokenizer tests use parse_special = true, but with parse_special = false some tokenizers have problems that were otherwise hard to visualize.

For example, with OLMo (which uses a tokenizer very similar to GPT-NeoX), there's a problem with tokenization of consecutive spaces:

$ ./build/bin/llama-tokenize --log-disable -m models/ggml-vocab-olmo.gguf -p "Hello   world of    spaces"
 12092 -> 'Hello'
 50275 -> '   '
 10186 -> 'world'
   273 -> ' of'
 50274 -> '    '
 31748 -> 'spaces'

$ ./build/bin/llama-tokenize --log-disable -m models/ggml-vocab-olmo.gguf -p "Hello   world of    spaces" --no-parse-special
 12092 -> 'Hello'
   245 -> '  '
  1533 -> ' world'
   273 -> ' of'
   341 -> '   '
  8470 -> ' spaces'

Notice that with parse_special = false the spaces are not tokenized correctly: the following words pick up space prefixes, and the space runs map to entirely different token ids. This happens because the user-defined multi-space tokens no longer have priority in the pre-tokenization (but they should!).
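To make the priority issue concrete, here is a minimal toy sketch in Python. It is not llama.cpp's implementation: the regex is a simplified GPT-2-style pre-tokenizer pattern chosen for illustration, the `USER_DEFINED` list and the function names are hypothetical, and token ids are left out entirely. It only shows the mechanism described above: when user-defined tokens are matched before the pre-tokenizer regex runs, multi-space runs stay intact, and when that step is skipped, the regex leaves one space attached to the following word.

```python
import re

# Hypothetical user-defined tokens (assumption): multi-space runs only.
USER_DEFINED = ["  ", "   ", "    "]

# Simplified GPT-2-style pre-tokenizer (assumption, not the real OLMo regex):
# an optional leading space plus letters, or a whitespace run that stops
# one character short of the next word, or any remaining whitespace.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+|\s+(?!\S)|\s+")


def split_on_user_defined(text):
    """Partition text into (fragment, is_user_defined) pairs, longest token first."""
    ordered = sorted(USER_DEFINED, key=len, reverse=True)
    pattern = "|".join(re.escape(tok) for tok in ordered)
    parts, pos = [], 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            parts.append((text[pos:match.start()], False))
        parts.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        parts.append((text[pos:], False))
    return parts


def toy_tokenize(text, parse_special):
    """Return the text pieces a toy tokenizer would produce."""
    if not parse_special:
        # User-defined tokens get no priority: the regex sees the raw text.
        return PRETOKENIZE.findall(text)
    pieces = []
    for fragment, is_token in split_on_user_defined(text):
        if is_token:
            pieces.append(fragment)  # kept whole, never re-split
        else:
            pieces.extend(PRETOKENIZE.findall(fragment))
    return pieces


if __name__ == "__main__":
    sample = "Hello   world of    spaces"
    print(toy_tokenize(sample, parse_special=True))
    print(toy_tokenize(sample, parse_special=False))
```

With the sample prompt, the first call yields the pieces 'Hello', '   ', 'world', ' of', '    ', 'spaces', and the second yields 'Hello', '  ', ' world', ' of', '   ', ' spaces', which mirrors the shape of the two llama-tokenize runs above.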

This is one of the problems fixed in #8228.

@compilade added the enhancement, Review Complexity : Low, and examples labels and removed the examples label on Jul 10, 2024
@ggerganov ggerganov merged commit 9a55ffe into master Jul 11, 2024
54 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 12, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
Labels
enhancement, examples, Review Complexity : Low