file encoding detection bug #14434
Comments
It's amazing. Maybe it's upstream related? https://github.com/gogs/chardet
I added an upstream test case to detector_test.go in https://github.com/gogs/chardet. With the UTF-8 file from above, the encoding is detected correctly. When I truncate the file to 1024 bytes, not only chardet gets fooled, but the Linux file command as well. When I use the first 1025 bytes, all is fine. I have no real idea how to solve this.
Maybe we can increase the maximum content size used for detection from 1024 to 2048 bytes.
I think the bug has been fixed (with #19743, by #19773). Feel free to re-open. https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml
Description
I noticed an encoding detection bug:
https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml
This is UTF-8, but it is detected as Content-Type: text/plain; charset=iso-8859-1, which results in incorrect rendering of a special character:
<name>FranÃ§oise Tomasetti</name>
should be
<name>Françoise Tomasetti</name>
The encoding detection is done on a buffer consisting of the first 1024 bytes of the file.
The UTF-8 ç character consists of 2 bytes: https://www.fileformat.info/info/unicode/char/00e7/index.htm.
By coincidence, in this file the first byte of that character happens to be the 1024th byte in the file, causing the encoding detection to not recognize this byte buffer as valid UTF-8.
As a test I removed a section from the start of the file, and then it works fine:
https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/ok.xml