Skip to content

CodeGen Tokenizer Deletes Newline Symbols #21161

Closed
@hellodanylo

Description

@hellodanylo

System Info

  • transformers version: 4.25.1
  • Platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
  • Python version: 3.9.13
  • Huggingface_hub version: 0.10.0
  • PyTorch version (GPU?): 1.12.1.post201 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): 0.6.3 (cpu)
  • Jax version: 0.4.1
  • JaxLib version: 0.4.1
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@rooa @patrickvonplaten @patil-suraj

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The CodeGen tokenizer seems to remove the newline symbol in certain scenarios.
In particular, decode(encode(text)) does not always equal the original text.
The following is the smallest example that reproduces the error but other text examples will have this issue as well.

from transformers import CodeGenTokenizer

# other checkpoints in the CodeGen series have the same issue
tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-multi")

# new line (10), space (32), space (32)
text = "\n  "
print([ord(c) for c in text])
# output: [10, 32, 32]

encoded = tokenizer.encode(text)
print(encoded)
# output: [50286]

decoded = tokenizer.decode(encoded)
print([ord(c) for c in decoded])
# actual: [32, 32]
# expected: [10, 32, 32]

Expected behavior

Expected: the decoded string is equal to the original string.
Actual: the decoded string is missing the leading new line symbol.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions