
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 47 #30

@c6s0

Hey,

while trying to run sp-encode [...] sp-model.model [...]/encode, I'm running into this error:
Reading corpora: ['author']
author\train: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.03it/s]
author\valid: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\test\anaconda3\envs\transformertest\Scripts\sp-encode-script.py", line 33, in <module>
    sys.exit(load_entry_point('lm', 'console_scripts', 'sp-encode')())
  File "c:\users\test\documents\transformer-lm\lm\data.py", line 100, in sp_encode
    for line in f:
  File "C:\Users\test\anaconda3\envs\transformertest\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 47: invalid continuation byte

I saved all my .txt files with UTF-8 encoding for the previous step.
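
For anyone debugging the same error: byte 0xe4 is "ä" in Latin-1 / Windows-1252, so a likely (but unconfirmed) cause is that at least one file, here apparently one under author\valid, was not actually written as UTF-8. Below is a minimal sketch for checking that. The helper names `find_non_utf8` and `reencode_to_utf8` are hypothetical and not part of transformer-lm, and cp1252 as the source encoding is an assumption that should be verified against the actual files. Note that the reported offset is relative to the whole file, so it will not necessarily match the position in the traceback, which refers to the decoder's current buffer.

```python
from pathlib import Path


def find_non_utf8(root, pattern="*.txt"):
    """Return (path, byte offset, offending byte) for each file under root
    that cannot be decoded as UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob(pattern)):
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte in the file
            bad.append((path, exc.start, data[exc.start]))
    return bad


def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite a single file as UTF-8.

    Assumption: the file is currently encoded in source_encoding.
    Keep a backup, since a wrong guess silently produces wrong characters.
    """
    path = Path(path)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
```

Running `find_non_utf8("author")` on the corpus root from the log above would list the files to inspect. Re-saving them from the editor with an explicit UTF-8 setting is an alternative to `reencode_to_utf8`.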
