Fix UTF-8 position encoding#564
Conversation
86e4bad to
3e76806
Compare
This reveals some incorrectness in the UTF-8 position codec.
The emoji in question is encoded into utf-8 as 4 bytes, not 3, as the current test assumes (and the current code implements).
2-byte and 3-byte encodings were previously not taken into account.
3e76806 to
2ea79a1
Compare
|
I also wrote lheckemann@48521d3 which refactors a little harder and avoids conditionals checking the encoding type on every character, but is also a little broken. Would appreciate if a maintainer took a glance at that and said if the approach is worth pursuing, or they would prefer to stick to this less-invasive approach. |
|
Thanks! I can also confirm this issue from my side and would be thankful for this being merged and released. |
|
This looks great, I'm happy to merge as is. Thanks for the diligent work. You're other approach seems good too, although I haven't looked close enough to note the big differences. Bigger refactors are certainly welcome. I don't think the CPU cycles checking every character are a performance issue because it only iterates all the characters on a line, not on the whole document right? Also we can merge this now so that others can benefit and then you can work on another approach in the meantime. Though bear in mind that this PR is against |
|
Alright, sounds good, thanks for the quick review! Is there a 1.x release branch I should also open it against for a backport/patch release? I don't see an immediately obvious one. |
|
There's no 1.x branch yet, but we can easily make one from the v1.3.1 tag. What version are you and @wolfskaempf currently using? The v2 pre-releases have actually been public for nearly a year now, I only just noticed!https://pypi.org/project/pygls/#history So it should be mature enough for use. But no stress, to either you or the deadline for releasing v2. We'll support whatever is most useful to our users. |
|
Ah yeah I was actually already using 2.0.0a2, that would explain why going straight from main for a drop-in replacement "just worked". It looks like the bug won't actually be triggered (except by clients that don't support UTF-16 and thus don't implement LSP correctly) before 0d51815, which only came after 1.3.1 and thus probably won't affect any consumers of 1.x. So I don't think a backport is really necessary. |
|
Okay great, I've merged it. Thank you. So even if this is needed on v1 it's needed in v2 too. So let's publish then right!? |
|
My understanding is that it can only be triggered in v1 by spec-violating clients, so probably not even needed. But yes, a new prerelease for v2 would be great :) |
|
Follow #565 for updates about the release. |
|
Released https://pypi.org/project/pygls/2.0.0a5 🥳 |
Description (e.g. "Related to ...", etc.)
didChangeevents coming from a client which prefers UTF-8 position encoding (Helix in my case) and includes multi-byte codepoints previously resulted in inconsistency between the client's view of the document and a pygls-based server's view of it as maintained by theLanguageServerProtocol.lsp_text_document__did_change->Workspace.update_text_document->TextDocument.apply_change->TextDocument._apply_incremental_change.For example, opening a document with the contents
öbcand inserting anato make it intoöabcwould result in pygls thinking the contents areöbac.This PR introduces tests that show the issue, fixes them, and adjusts some incorrect tests that are broken by the fix.
Code review checklist (for code reviewer to complete)
Automated linters
You can run the lints that are run on CI locally with: