Detect truncated utf-8 characters at the end of content as still representing utf-8 #19773
…esenting utf-8 Our character detection algorithm can incorrectly detect utf-8 content as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to accept a truncated utf-8 character at the end of the buffer as still representing utf-8. Fix go-gitea#19743 Signed-off-by: Andrew Thornton <art27@cantab.net>
Strangely I'm having some difficulty creating a test that replicates this issue from within the charset module. I'm not certain as to what's going on that means that I can't replicate this. I've been able to add a testcase.
See my comment in #19743; there is a test case there.
Signed-off-by: Andrew Thornton <art27@cantab.net>
Codecov Report
@@ Coverage Diff @@
## main #19773 +/- ##
=======================================
Coverage ? 47.29%
=======================================
Files ? 957
Lines ? 133317
Branches ? 0
=======================================
Hits ? 63058
Misses ? 62599
Partials ? 7660
It's better to include this test case.
I've already included a specific test case but I can add that if you really want it.
Signed-off-by: Andrew Thornton <art27@cantab.net>
…ection' into fix-19743-improve-encoding-detection
…esenting utf-8 (go-gitea#19773) Backport go-gitea#19773 Our character detection algorithm can incorrectly detect utf-8 content as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to accept a truncated utf-8 character at the end of the buffer as still representing utf-8. Fix go-gitea#19743 Signed-off-by: Andrew Thornton <art27@cantab.net>
…esenting utf-8 (#19773) (#19774) Backport #19773 Our character detection algorithm can incorrectly detect utf-8 content as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to accept a truncated utf-8 character at the end of the buffer as still representing utf-8. Fix #19743 Signed-off-by: Andrew Thornton <art27@cantab.net>
* giteaofficial/main:
  Prevent NPE when cache service is disabled (go-gitea#19703)
  Detect truncated utf-8 characters at the end of content as still representing utf-8 (go-gitea#19773)
  Add silentcodeg to MAINTAINERS (go-gitea#19771)
  Allows repo search to match against "owner/repo" pattern strings (go-gitea#19754)
  Update JS dependencies (go-gitea#19767)
  Nuke the incorrect permission report on /api/v1/notifications (go-gitea#19761)
## [1.16.9](https://github.com/go-gitea/gitea/releases/tag/1.16.9) - 2022-06-20

* BUGFIXES
  * Fix permission check for delete tag (go-gitea#19985) (go-gitea#20001)
  * Only log non ErrNotExist errors in git.GetNote (go-gitea#19884) (go-gitea#19905)
  * Use exact search instead of fuzzy search for branch filter dropdown (go-gitea#19885) (go-gitea#19893)
  * Set Setpgid on child git processes (go-gitea#19865) (go-gitea#19881)
  * Import git from alpine 3.16 repository as 2.30.4 is needed for `safe.directory = '*'` to work but alpine 3.13 has 2.30.3 (go-gitea#19876)
  * Ensure responses are context.ResponseWriters (go-gitea#19843) (go-gitea#19859)
  * Fix count bug (go-gitea#19850)
  * Fix raw endpoint PDF file headers (go-gitea#19825) (go-gitea#19826)
  * Make WIP prefixes case insensitive, e.g. allow `Draft` as a WIP prefix (go-gitea#19780) (go-gitea#19811)
  * Fix NotificationUnreadCount (go-gitea#19802)
  * Prevent NPE when cache service is disabled (go-gitea#19703) (go-gitea#19783)
  * Detect truncated utf-8 characters at the end of content as still representing utf-8 (go-gitea#19773) (go-gitea#19774)
  * Fix doctor pq: syntax error at or near "." quote user table name (go-gitea#19765) (go-gitea#19770)
  * Fix bug (go-gitea#19757)

Signed-off-by: Andrew Thornton <art27@cantab.net>
Our character detection algorithm can incorrectly detect utf-8 content as iso-8859-x
if there is a truncated character at the end of the partially read file.
This PR changes the detection algorithm to accept a truncated utf-8 character at the
end of the buffer as still representing utf-8.
Fix #19743
Signed-off-by: Andrew Thornton <art27@cantab.net>
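
The idea described above can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the code merged in this PR: `trimTruncatedTail` and `isProbablyUTF8` are hypothetical names, and the sketch relies only on the standard `unicode/utf8` package. It drops an incomplete multi-byte sequence from the end of the buffer and then validates what remains.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trimTruncatedTail is a hypothetical helper (not Gitea's actual code).
// It drops an incomplete multi-byte utf-8 sequence from the end of buf, if any.
func trimTruncatedTail(buf []byte) []byte {
	// A utf-8 character is at most utf8.UTFMax (4) bytes long, so only the
	// last utf8.UTFMax-1 bytes can belong to an incomplete sequence.
	for i := len(buf) - 1; i >= 0 && i >= len(buf)-(utf8.UTFMax-1); i-- {
		if utf8.RuneStart(buf[i]) {
			// buf[i] is ASCII or a lead byte. If the bytes from i onward
			// do not form a full rune, the character was cut off.
			if !utf8.FullRune(buf[i:]) {
				return buf[:i]
			}
			return buf
		}
	}
	return buf
}

// isProbablyUTF8 (hypothetical) reports whether buf is valid utf-8 once a
// truncated trailing character has been ignored.
func isProbablyUTF8(buf []byte) bool {
	return utf8.Valid(trimTruncatedTail(buf))
}

func main() {
	full := []byte("héllo wörld €")
	cut := full[:len(full)-1] // cuts the 3-byte "€" after its second byte

	fmt.Println("plain validation of cut buffer:", utf8.Valid(cut))
	fmt.Println("validation ignoring truncated tail:", isProbablyUTF8(cut))

	// An iso-8859-1 style buffer with a high byte in the middle still fails.
	latin1 := []byte{'c', 'a', 'f', 0xE9, ' ', 'o', 'k'}
	fmt.Println("latin-1 style buffer:", isProbablyUTF8(latin1))
}
```

One trade-off of this sketch: a single-byte-encoded buffer whose only non-ASCII byte is the very last one would also pass, because that byte looks like the start of a truncated utf-8 sequence.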