tokenize : add --no-parse-special option #8423

compilade · 2024-07-10T22:20:47Z

This should make it easier to explain how parse_special affects tokenization.

I felt the need for this while working on #8228: the tokenizer tests use parse_special = true, but some tokenizers have problems when parse_special = false, and those problems were otherwise hard to visualize.

For example, with OLMo (which uses a tokenizer very similar to GPT-NeoX), there's a problem with tokenization of consecutive spaces:

$ ./build/bin/llama-tokenize --log-disable -m models/ggml-vocab-olmo.gguf -p "Hello   world of    spaces"
 12092 -> 'Hello'
 50275 -> '   '
 10186 -> 'world'
   273 -> ' of'
 50274 -> '    '
 31748 -> 'spaces'

$ ./build/bin/llama-tokenize --log-disable -m models/ggml-vocab-olmo.gguf -p "Hello   world of    spaces" --no-parse-special
 12092 -> 'Hello'
   245 -> '  '
  1533 -> ' world'
   273 -> ' of'
   341 -> '   '
  8470 -> ' spaces'

Notice that when parse_special = false, the spaces are not tokenized correctly: the words after each run of spaces gain a space prefix, and the runs of spaces map to entirely different token ids. This happens because the user-defined multi-space tokens no longer have priority in the pre-tokenization (but they should!).

This is one of the problems fixed in #8228.
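
To make the priority issue concrete, here is a toy model in Python. It is not llama.cpp's implementation: the USER_DEFINED list, the simplified GPT-2-style regex, and the toy_tokenize helper are all assumptions made for illustration, and the sketch only produces text pieces, not token ids. It assumes that with parse_special enabled the user-defined tokens are split out of the text before the pre-tokenizer regex runs, and that with it disabled the regex sees the whole string.

```python
import re

# Hypothetical user-defined tokens, longest first (toy stand-ins for
# the multi-space tokens in the OLMo vocabulary).
USER_DEFINED = ["    ", "   ", "  "]

# Simplified GPT-2-style pre-tokenizer (an assumption, not the real regex):
# a word with an optional leading space, or a run of whitespace that
# leaves the last space to be attached to the following word.
PRETOK = re.compile(r" ?[A-Za-z]+|\s+(?!\S)|\s+")


def pretokenize(text):
    """Split text with the simplified pre-tokenizer regex."""
    return PRETOK.findall(text)


def toy_tokenize(text, parse_special):
    """Return text pieces; user-defined tokens get priority only if parse_special."""
    if not parse_special:
        # The regex sees the whole string, so it decides how spaces are split.
        return pretokenize(text)

    # Partition on user-defined tokens first, then pre-tokenize the rest.
    pattern = "|".join(re.escape(tok) for tok in USER_DEFINED)
    pieces = []
    for part in re.split(f"({pattern})", text):
        if not part:
            continue
        if part in USER_DEFINED:
            pieces.append(part)
        else:
            pieces.extend(pretokenize(part))
    return pieces


if __name__ == "__main__":
    text = "Hello   world of    spaces"
    print(toy_tokenize(text, parse_special=True))
    print(toy_tokenize(text, parse_special=False))
```

In this toy model, the pieces for both settings follow the same pattern as the two llama-tokenize runs above: with priority, the full runs of spaces are kept as single pieces; without it, the regex peels one space off each run and attaches it to the next word.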

I have read the contributing guidelines
Self-reported review complexity:
- Low

tokenize : add --no-parse-special option

ba06b2d

This should allow more easily explaining how parse_special affects tokenization.

github-actions bot added the examples label Jul 10, 2024

compilade added the enhancement, Review Complexity : Low, and examples labels and removed the examples label Jul 10, 2024

ggerganov approved these changes Jul 11, 2024

ggerganov merged commit 9a55ffe into master Jul 11, 2024
54 checks passed

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 12, 2024

tokenize : add --no-parse-special option (ggml-org#8423)

9b227fd

This should allow more easily explaining how parse_special affects tokenization.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 12, 2024

tokenize : add --no-parse-special option (ggml-org#8423)

d2d9d7b

This should allow more easily explaining how parse_special affects tokenization.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024

tokenize : add --no-parse-special option (ggml-org#8423)

671f343

This should allow more easily explaining how parse_special affects tokenization.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024

tokenize : add --no-parse-special option (ggml-org#8423)

4e4205a

This should allow more easily explaining how parse_special affects tokenization.
