file encoding detection bug #14434

jpraet · 2021-01-22T18:13:34Z

Gitea version (or commit ref): 1.13.1
Can you reproduce the bug at https://try.gitea.io:
- Yes (provide example URL)

Description

I noticed an encoding detection bug:

https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml
This is UTF-8, but it is detected as Content-Type: text/plain; charset=iso-8859-1, which results in incorrect rendering of a special character: <name>FranÃ§oise Tomasetti</name> should be <name>Françoise Tomasetti</name>.

The encoding detection is done on a buffer consisting of the first 1024 bytes of the file.
The UTF-8 ç character consists of 2 bytes: https://www.fileformat.info/info/unicode/char/00e7/index.htm.
By coincidence, in this file the first byte of that character happens to be the 1024th byte of the file, so the buffer ends in the middle of the character and the encoding detection does not recognize it as valid UTF-8.

As a test I removed a section from the start of the file, and then it works fine:
https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/ok.xml


lunny · 2021-01-23T02:48:57Z

It's amazing. Maybe it's upstream related? https://github.com/gogs/chardet

ulrikian · 2021-03-21T10:44:23Z

I added a test case upstream in https://github.com/gogs/chardet (in detector_test.go) using the UTF-8 file from above, and it gets detected correctly.

When I truncate the file to 1024 bytes

dd if=nok.xml of=nok1024.xml bs=1 count=1024

not only chardet gets fooled, but the Linux file command as well:

 ➜ file nok1024.xml 
nok1024.xml: XML 1.0 document, ISO-8859 text

When I use the first 1025 bytes, all is fine.

I have no real idea on how to solve this.
ServeData in routers/repo/download.go uses an arbitrary number of bytes to detect the encoding. As long as the encoding is not determined from the full content of the file, detection can always go wrong. However, files might be quite large, so reading them entirely into RAM to detect the encoding would not be wise.

lunny · 2021-04-01T06:37:20Z

Maybe we can increase the maximum size of the content used for detection from 1024 to 2048 bytes.

wxiaoguang · 2022-09-24T11:56:58Z

I think the bug has been fixed (with #19743, by #19773). Feel free to re-open.

https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml

6543 added the type/bug label Jan 22, 2021

sIspravnikov mentioned this issue May 20, 2022

Wrong display of cyrillic symbols in UTF-8 file #19743

Closed

wxiaoguang closed this as completed Sep 24, 2022

go-gitea locked and limited conversation to collaborators May 3, 2023
