Skip to content

Fix UTF-8 position encoding#564

Merged
tombh merged 5 commits into
openlawlibrary:mainfrom
lheckemann:push-kqxylwqwllwx
Jul 11, 2025
Merged

Fix UTF-8 position encoding#564
tombh merged 5 commits into
openlawlibrary:mainfrom
lheckemann:push-kqxylwqwllwx

Conversation

@lheckemann

Copy link
Copy Markdown
Contributor

Description (e.g. "Related to ...", etc.)

didChange events coming from a client which prefers UTF-8 position encoding (Helix in my case) and includes multi-byte codepoints previously resulted in inconsistency between the client's view of the document and a pygls-based server's view of it as maintained by the LanguageServerProtocol.lsp_text_document__did_change -> Workspace.update_text_document -> TextDocument.apply_change -> TextDocument._apply_incremental_change.

For example, opening a document with the contents öbc and inserting an a to make it into öabc would result in pygls thinking the contents are öbac.

This PR introduces tests that show the issue, fixes them, and adjusts some incorrect tests that are broken by the fix.

Code review checklist (for code reviewer to complete)

  • Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR)
  • Title summarizes what is changing
  • Commit messages are meaningful (see this for details)
  • Tests have been included and/or updated, as appropriate
  • Docstrings have been included and/or updated, as appropriate
  • Standalone docs have been updated accordingly

Automated linters

You can run the lints that are run on CI locally with:

uv run --all-extras poe lint

This reveals some incorrectness in the UTF-8 position codec.
The emoji in question is encoded into utf-8 as 4 bytes, not 3, as the
current test assumes (and the current code implements).
2-byte and 3-byte encodings were previously not taken into account.
@lheckemann

Copy link
Copy Markdown
Contributor Author

I also wrote lheckemann@48521d3 which refactors a little harder and avoids conditionals checking the encoding type on every character, but is also a little broken. Would appreciate if a maintainer took a glance at that and said if the approach is worth pursuing, or they would prefer to stick to this less-invasive approach.

@wolfskaempf

Copy link
Copy Markdown
Contributor

Thanks! I can also confirm this issue from my side and would be thankful for this being merged and released.

@tombh

tombh commented Jul 10, 2025

Copy link
Copy Markdown
Collaborator

This looks great, I'm happy to merge as is. Thanks for the diligent work. You're other approach seems good too, although I haven't looked close enough to note the big differences. Bigger refactors are certainly welcome. I don't think the CPU cycles checking every character are a performance issue because it only iterates all the characters on a line, not on the whole document right?

Also we can merge this now so that others can benefit and then you can work on another approach in the meantime. Though bear in mind that this PR is against main which is currently a v2 release candidate.

@lheckemann

Copy link
Copy Markdown
Contributor Author

Alright, sounds good, thanks for the quick review! Is there a 1.x release branch I should also open it against for a backport/patch release? I don't see an immediately obvious one.

@tombh

tombh commented Jul 10, 2025

Copy link
Copy Markdown
Collaborator

There's no 1.x branch yet, but we can easily make one from the v1.3.1 tag. What version are you and @wolfskaempf currently using?

The v2 pre-releases have actually been public for nearly a year now, I only just noticed!https://pypi.org/project/pygls/#history So it should be mature enough for use.

But no stress, to either you or the deadline for releasing v2. We'll support whatever is most useful to our users.

@lheckemann

Copy link
Copy Markdown
Contributor Author

Ah yeah I was actually already using 2.0.0a2, that would explain why going straight from main for a drop-in replacement "just worked".

It looks like the bug won't actually be triggered (except by clients that don't support UTF-16 and thus don't implement LSP correctly) before 0d51815, which only came after 1.3.1 and thus probably won't affect any consumers of 1.x. So I don't think a backport is really necessary.

@tombh tombh merged commit 558a897 into openlawlibrary:main Jul 11, 2025
18 checks passed
@tombh

tombh commented Jul 11, 2025

Copy link
Copy Markdown
Collaborator

Okay great, I've merged it. Thank you. So even if this is needed on v1 it's needed in v2 too. So let's publish then right!?

@lheckemann lheckemann deleted the push-kqxylwqwllwx branch July 11, 2025 18:06
@lheckemann

Copy link
Copy Markdown
Contributor Author

My understanding is that it can only be triggered in v1 by spec-violating clients, so probably not even needed. But yes, a new prerelease for v2 would be great :)

@tombh tombh mentioned this pull request Jul 13, 2025
@tombh

tombh commented Jul 13, 2025

Copy link
Copy Markdown
Collaborator

Follow #565 for updates about the release.

@tombh

tombh commented Jul 14, 2025

Copy link
Copy Markdown
Collaborator

Released https://pypi.org/project/pygls/2.0.0a5 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants