
file encoding detection bug #14434


Closed
1 task done
jpraet opened this issue Jan 22, 2021 · 4 comments

@jpraet
Member

jpraet commented Jan 22, 2021

  • Gitea version (or commit ref): 1.13.1
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)

Description

I noticed an encoding detection bug:

https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml
This is UTF-8, but it is detected as Content-Type: text/plain; charset=iso-8859-1, which results in incorrect rendering of a special character: <name>FranÃ§oise Tomasetti</name> should be <name>Françoise Tomasetti</name>.

The encoding detection is done on a buffer consisting of the first 1024 bytes of the file.
The UTF-8 ç character consists of 2 bytes: https://www.fileformat.info/info/unicode/char/00e7/index.htm.
By coincidence, in this file the first byte of that character happens to be the 1024th byte in the file. The buffer therefore ends with an incomplete 2-byte sequence, so the encoding detection does not recognize it as valid UTF-8.
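The boundary effect can be reproduced with the Go standard library alone. The following is a minimal sketch, not Gitea's code: `sample` and `describe` are hypothetical helpers, and the sample data only imitates nok.xml by placing the first byte of "ç" at position 1024.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// describe reports the length and UTF-8 validity of buf.
func describe(buf []byte) string {
	return fmt.Sprintf("len=%d validUTF8=%t", len(buf), utf8.Valid(buf))
}

// sample builds hypothetical file content in which the first byte of "ç"
// (0xC3 0xA7 in UTF-8) is the 1024th byte, as described for nok.xml.
func sample() []byte {
	return []byte(strings.Repeat("a", 1023) + "ç" + "oise")
}

func main() {
	data := sample()
	fmt.Println(describe(data))        // len=1029 validUTF8=true
	fmt.Println(describe(data[:1024])) // len=1024 validUTF8=false (cut between 0xC3 and 0xA7)
	fmt.Println(describe(data[:1025])) // len=1025 validUTF8=true
}
```

Only the 1024-byte prefix is invalid UTF-8, which matches the observation that removing a section from the start of the file makes the problem disappear.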

As a test I removed a section from the start of the file, and then it works fine:
https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/ok.xml

@6543 6543 added the type/bug label Jan 22, 2021
@lunny
Member

lunny commented Jan 23, 2021

It's amazing. Maybe it's upstream related? https://github.com/gogs/chardet

@ulrikian

I added an upstream test case to detector_test.go in https://github.com/gogs/chardet. With the complete UTF-8 file from above, the encoding is detected correctly.

When I truncate the file to 1024 bytes

dd if=nok.xml of=nok1024.xml bs=1 count=1024

not only is chardet fooled, but the Linux file command as well.

 ➜ file nok1024.xml 
nok1024.xml: XML 1.0 document, ISO-8859 text

When I use the first 1025 bytes, all is fine.

I have no real idea on how to solve this.
ServeData in routers/repo/download.go uses an arbitrary number of bytes to detect the encoding. As long as the encoding is not determined from the full content of the file, detection can always go wrong. However, files might be quite large, so reading them entirely into RAM to detect the encoding would not be wise.
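One possible mitigation that keeps the bounded read is to drop a trailing, partially read multi-byte sequence before running the detector. The sketch below is an assumption for illustration, not the change made in Gitea or chardet: `trimIncompleteRune`, `readPrefix`, and `sampleFile` are hypothetical names, and the result would still have to be passed to the real detector.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// trimIncompleteRune drops a trailing, partially read UTF-8 sequence.
// Hypothetical helper, not part of Gitea or chardet.
func trimIncompleteRune(buf []byte) []byte {
	// A UTF-8 sequence is at most utf8.UTFMax (4) bytes long, so only the
	// last three bytes can belong to a sequence that was cut off.
	for i := len(buf) - 1; i >= 0 && i >= len(buf)-(utf8.UTFMax-1); i-- {
		if utf8.RuneStart(buf[i]) {
			// buf[i] starts the last sequence; drop it if it is incomplete.
			if !utf8.FullRune(buf[i:]) {
				return buf[:i]
			}
			break
		}
	}
	return buf
}

// readPrefix reads at most n bytes from r for encoding detection.
func readPrefix(r io.Reader, n int) ([]byte, error) {
	buf := make([]byte, n)
	read, err := io.ReadFull(r, buf)
	// A file shorter than n bytes is not an error here.
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, fmt.Errorf("read prefix: %w", err)
	}
	buf = buf[:read]
	if read == n {
		// Only a full buffer can end in the middle of a character.
		buf = trimIncompleteRune(buf)
	}
	return buf, nil
}

// sampleFile imitates nok.xml: the first byte of "ç" is the 1024th byte.
func sampleFile() io.Reader {
	return strings.NewReader(strings.Repeat("a", 1023) + "çoise")
}

func main() {
	prefix, err := readPrefix(sampleFile(), 1024)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(prefix), utf8.Valid(prefix)) // 1023 true
}
```

The trade-off is that at most three bytes of the prefix are discarded. For a file that really is ISO-8859-1, a final byte that looks like a UTF-8 lead byte is dropped as well, which should not matter much for detection over roughly 1 KB of data.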

@lunny
Member

lunny commented Apr 1, 2021

Maybe we can increase the maximum content size used for detection from 1024 to 2048 bytes.

@wxiaoguang
Contributor

wxiaoguang commented Sep 24, 2022

I think the bug has been fixed (with #19743, by #19773). Feel free to re-open.

https://try.gitea.io/jpraet/detect-encoding/raw/branch/master/nok.xml


@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023