CodeGen Tokenizer Deletes Newline Symbols

### System Info

- `transformers` version: 4.25.1
- Platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
- Python version: 3.9.13
- Huggingface_hub version: 0.10.0
- PyTorch version (GPU?): 1.12.1.post201 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): 0.6.3 (cpu)
- Jax version: 0.4.1
- JaxLib version: 0.4.1
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

### Who can help?

@rooa @patrickvonplaten @patil-suraj

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

The CodeGen tokenizer seems to remove the newline symbol in certain scenarios.
In particular, `decode(encode(text))` does not always equal the original `text`.
The following is the smallest example that reproduces the error but other text examples will have this issue as well.

```python
from transformers import CodeGenTokenizer

# other checkpoints in the CodeGen series have the same issue
tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-multi")

# new line (10), space (32), space (32)
text = "\n  "
print([ord(c) for c in text])
# output: [10, 32, 32]

encoded = tokenizer.encode(text)
print(encoded)
# output: [50286]

decoded = tokenizer.decode(encoded)
print([ord(c) for c in decoded])
# actual: [32, 32]
# expected: [10, 32, 32]
```

### Expected behavior

Expected: the decoded string is equal to the original string.
Actual: the decoded string is missing the leading new line symbol.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CodeGen Tokenizer Deletes Newline Symbols #21161

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

CodeGen Tokenizer Deletes Newline Symbols #21161

Description

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions