Fix UTF-8 position encoding by lheckemann · Pull Request #564 · openlawlibrary/pygls

lheckemann · 2025-07-10T13:09:12Z

Description (e.g. "Related to ...", etc.)

didChange events coming from a client which prefers UTF-8 position encoding (Helix in my case) and includes multi-byte codepoints previously resulted in inconsistency between the client's view of the document and a pygls-based server's view of it as maintained by the LanguageServerProtocol.lsp_text_document__did_change -> Workspace.update_text_document -> TextDocument.apply_change -> TextDocument._apply_incremental_change.

For example, opening a document with the contents öbc and inserting an a to make it into öabc would result in pygls thinking the contents are öbac.

This PR introduces tests that show the issue, fixes them, and adjusts some incorrect tests that are broken by the fix.

Code review checklist (for code reviewer to complete)

Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR)
Title summarizes what is changing
Commit messages are meaningful (see this for details)
Tests have been included and/or updated, as appropriate
Docstrings have been included and/or updated, as appropriate
Standalone docs have been updated accordingly

Automated linters

You can run the lints that are run on CI locally with:

uv run --all-extras poe lint

This reveals some incorrectness in the UTF-8 position codec.

The emoji in question is encoded into utf-8 as 4 bytes, not 3, as the current test assumes (and the current code implements).

2-byte and 3-byte encodings were previously not taken into account.

…t count

lheckemann · 2025-07-10T13:21:26Z

I also wrote lheckemann@48521d3 which refactors a little harder and avoids conditionals checking the encoding type on every character, but is also a little broken. Would appreciate if a maintainer took a glance at that and said if the approach is worth pursuing, or they would prefer to stick to this less-invasive approach.

wolfskaempf · 2025-07-10T13:36:48Z

Thanks! I can also confirm this issue from my side and would be thankful for this being merged and released.

tombh · 2025-07-10T20:46:00Z

This looks great, I'm happy to merge as is. Thanks for the diligent work. You're other approach seems good too, although I haven't looked close enough to note the big differences. Bigger refactors are certainly welcome. I don't think the CPU cycles checking every character are a performance issue because it only iterates all the characters on a line, not on the whole document right?

Also we can merge this now so that others can benefit and then you can work on another approach in the meantime. Though bear in mind that this PR is against main which is currently a v2 release candidate.

lheckemann · 2025-07-10T21:48:58Z

Alright, sounds good, thanks for the quick review! Is there a 1.x release branch I should also open it against for a backport/patch release? I don't see an immediately obvious one.

tombh · 2025-07-10T22:28:31Z

There's no 1.x branch yet, but we can easily make one from the v1.3.1 tag. What version are you and @wolfskaempf currently using?

The v2 pre-releases have actually been public for nearly a year now, I only just noticed!https://pypi.org/project/pygls/#history So it should be mature enough for use.

But no stress, to either you or the deadline for releasing v2. We'll support whatever is most useful to our users.

lheckemann · 2025-07-11T09:16:25Z

Ah yeah I was actually already using 2.0.0a2, that would explain why going straight from main for a drop-in replacement "just worked".

It looks like the bug won't actually be triggered (except by clients that don't support UTF-16 and thus don't implement LSP correctly) before 0d51815, which only came after 1.3.1 and thus probably won't affect any consumers of 1.x. So I don't think a backport is really necessary.

tombh · 2025-07-11T15:31:48Z

Okay great, I've merged it. Thank you. So even if this is needed on v1 it's needed in v2 too. So let's publish then right!?

lheckemann · 2025-07-11T18:07:34Z

My understanding is that it can only be triggered in v1 by spec-violating clients, so probably not even needed. But yes, a new prerelease for v2 would be great :)

tombh · 2025-07-13T16:45:17Z

Follow #565 for updates about the release.

tombh · 2025-07-14T14:54:58Z

Released https://pypi.org/project/pygls/2.0.0a5 🥳

lheckemann force-pushed the push-kqxylwqwllwx branch from 86e4bad to 3e76806 Compare July 10, 2025 13:12

lheckemann added 5 commits July 10, 2025 15:14

test(document): test position codecs against actual string encoding

a94c042

This reveals some incorrectness in the UTF-8 position codec.

test(document): fix some incorrect utf-8 offsets

2af62d5

The emoji in question is encoded into utf-8 as 4 bytes, not 3, as the current test assumes (and the current code implements).

feat(PositionCodec): provide encoding hint in __repr__

98f0a6d

fix(positioncodec): utf8 encodings

d067052

2-byte and 3-byte encodings were previously not taken into account.

refactor(position_codec): use clearer logic for accumulating code uni…

2ea79a1

…t count

lheckemann force-pushed the push-kqxylwqwllwx branch from 3e76806 to 2ea79a1 Compare July 10, 2025 13:15

tombh approved these changes Jul 11, 2025

View reviewed changes

tombh merged commit 558a897 into openlawlibrary:main Jul 11, 2025
18 checks passed

lheckemann deleted the push-kqxylwqwllwx branch July 11, 2025 18:06

tombh mentioned this pull request Jul 13, 2025

build: v2.0.0a5 #565

Merged

edvilme mentioned this pull request Feb 20, 2026

Bump to Python 3.10 (and Pygls 2.0) Automatically microsoft/vscode-black-formatter#608

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix UTF-8 position encoding#564

Fix UTF-8 position encoding#564
tombh merged 5 commits into
openlawlibrary:mainfrom
lheckemann:push-kqxylwqwllwx

lheckemann commented Jul 10, 2025

Uh oh!

lheckemann commented Jul 10, 2025

Uh oh!

wolfskaempf commented Jul 10, 2025

Uh oh!

tombh commented Jul 10, 2025

Uh oh!

lheckemann commented Jul 10, 2025

Uh oh!

tombh commented Jul 10, 2025

Uh oh!

lheckemann commented Jul 11, 2025

Uh oh!

Uh oh!

tombh commented Jul 11, 2025

Uh oh!

lheckemann commented Jul 11, 2025

Uh oh!

tombh commented Jul 13, 2025

Uh oh!

tombh commented Jul 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

lheckemann commented Jul 10, 2025

Description (e.g. "Related to ...", etc.)

Code review checklist (for code reviewer to complete)

Automated linters

Uh oh!

lheckemann commented Jul 10, 2025

Uh oh!

wolfskaempf commented Jul 10, 2025

Uh oh!

tombh commented Jul 10, 2025

Uh oh!

lheckemann commented Jul 10, 2025

Uh oh!

tombh commented Jul 10, 2025

Uh oh!

lheckemann commented Jul 11, 2025

Uh oh!

Uh oh!

tombh commented Jul 11, 2025

Uh oh!

lheckemann commented Jul 11, 2025

Uh oh!

tombh commented Jul 13, 2025

Uh oh!

tombh commented Jul 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants