Description
System Info
- `transformers` version: 4.25.1
- Platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
- Python version: 3.9.13
- Huggingface_hub version: 0.10.0
- PyTorch version (GPU?): 1.12.1.post201 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): 0.6.3 (cpu)
- Jax version: 0.4.1
- JaxLib version: 0.4.1
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@rooa @patrickvonplaten @patil-suraj
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The CodeGen tokenizer seems to remove the newline character in certain scenarios. In particular, `decode(encode(text))` does not always equal the original `text`.

The following is a minimal example that reproduces the error; other texts trigger the issue as well.
```python
from transformers import CodeGenTokenizer

# other checkpoints in the CodeGen series have the same issue
tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-multi")

# newline (10), space (32), space (32)
text = "\n  "
print([ord(c) for c in text])
# output: [10, 32, 32]

encoded = tokenizer.encode(text)
print(encoded)
# output: [50286]

decoded = tokenizer.decode(encoded)
print([ord(c) for c in decoded])
# actual: [32, 32]
# expected: [10, 32, 32]
```
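A generic round-trip check can surface inputs that a tokenizer fails to reconstruct. The sketch below is self-contained: instead of the real CodeGen tokenizer it uses a hypothetical stub `encode`/`decode` pair that mimics the reported behavior (the single token for `"\n  "` decodes back without its leading newline), so the helper itself is the only part meant to be reusable.

```python
def find_lossy_inputs(encode, decode, samples):
    """Return the samples whose encode -> decode round trip is not lossless."""
    return [text for text in samples if decode(encode(text)) != text]

# Stub pair mimicking the reported bug: token 50286 stands for "\n  "
# but decodes back without its leading newline.
VOCAB = {"\n  ": 50286, "  ": 100}
DECODE_TABLE = {50286: "  ", 100: "  "}  # buggy entry: 50286 should map to "\n  "

def stub_encode(text):
    return [VOCAB[text]]

def stub_decode(ids):
    return "".join(DECODE_TABLE[i] for i in ids)

print(find_lossy_inputs(stub_encode, stub_decode, ["\n  ", "  "]))
# output: ['\n  ']  -- only the newline-prefixed sample fails the round trip
```

Running the same helper against the real `tokenizer.encode`/`tokenizer.decode` over a corpus of code snippets would enumerate all affected whitespace patterns.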
Expected behavior
Expected: the decoded string equals the original string.
Actual: the decoded string is missing the leading newline character.