-
-
Notifications
You must be signed in to change notification settings - Fork 498
Recognize supplementary characters #913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
I ran Unit Test in my VS 2026 with .NET 6/8/9 locally and all test cases succeeded. |
|
Thanks for the work! LGTM. Quite heavy with the Rune polyfill but we don't have much choices. @MihaZupan anything other feedback? |
|
Rune is based on the latest snapshot of https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Text/Rune.cs,a0cdde85f676b935. |
|
If we adopt C# 14, we can use Extension Members instead of creating the |
src/Markdig/Helpers/StringSlice.cs
Outdated
| /// <param name="offset">The offset.</param> | ||
| /// <returns>The rune at the specified offset, returns default if none.</returns> | ||
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| internal readonly Rune PeekRuneExtra(int offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the behavior of sometimes looking further than the specified offset odd for this generic helper.
Since this is only used to look at the previous character, can we change this to something like Rune PreviousRune() instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
find the behavior of sometimes looking further than the specified offset odd
I think it's a natural behavior if you replace the offending supplementary character (the valid surrogate pair) with a BMP character. We don't know which of a BMP or supplementary character the character at the given position is.
-2 -1 0 1 2
[𠮷|| ]<we are here>[𠮷|| ]
[吉]<we are here>[吉]
-----------------------------
High surrogate code unit
↓
[𠮷|| ]
↑
Low surrogate code unit
Note
𠮷 (U+20BB7) is a supplementary character and 吉 (U+5409) is a BMP character.
What we need is just a character placed at the position. If the character occupies 2 UTF-16 code units, we should fetch the remaining outside half.
Since this is only used to look at the previous character,
CJK Friendly Emphasis (#890) occasionally requires the previous character of the previous character of the current position. i.e. may need to call PeekRuneExtra(-2) or PeekRuneExtra(-3) in the future. It's not so bad idea to split that method into 2 (looking backward and forward).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why "Outdated" is displayed is simply because I changed the visibility of this method.
src/Markdig/Helpers/StringSlice.cs
Outdated
| var currentLowSurrogate = Text[start++]; | ||
| if (!char.IsLowSurrogate(currentLowSurrogate)) | ||
| { | ||
| start--; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any test that's covering this?
Needing to backtrack in NextRune is odd and feels like a usage error, not something this method should be hiding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String.Normalize used by TestParser.Compact throws ArgumentException for strings containing isolated surrogate code units. I don't think we need to prepare one. Inputs whose outputs are accepted by String.Normalize don't go through this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all tests have to go through TestParser, we can test the StringSlice directly.
It's easy to make off by one errors here, or forget to move offsets correctly in edge cases, I think we should add the corresponding test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an example, RuneAt(int index) is inconsistently validating the End right now. It's being checked for the second surrogate, but not the first character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should check Text.Length instead to align with the indexer. I added tests.
Could you get by with making the existing |
I moved the new members to the existing |
I don't think it's needed |
|
I see. Let me know if we should move it from |
|
@MihaZupan I have finished correcting all sections where your additional responses were not required. |
MihaZupan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
| int start = Start; | ||
| if (start > End) return default; | ||
| var first = Text[start]; | ||
| // BMP character | ||
| if (Rune.TryCreate(first, out var rune)) return rune; | ||
| if (start + 1 > End) return default; | ||
| var second = Text[start + 1]; | ||
| // Supplementary character | ||
| return Rune.TryCreate(first, second, out rune) | ||
| ? rune | ||
| : default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| int start = Start; | |
| if (start > End) return default; | |
| var first = Text[start]; | |
| // BMP character | |
| if (Rune.TryCreate(first, out var rune)) return rune; | |
| if (start + 1 > End) return default; | |
| var second = Text[start + 1]; | |
| // Supplementary character | |
| return Rune.TryCreate(first, second, out rune) | |
| ? rune | |
| : default; | |
| int start = Start; | |
| if (start > End) return default; | |
| char first = Text[start]; | |
| if (!Rune.TryCreate(first, out Rune rune) && start + 1 <= End) | |
| { | |
| // The first character is a surrogate, check if we have a valid pair | |
| Rune.TryCreate(first, Text[start + 1], out rune); | |
| } | |
| return rune; |
| var first = Text[index]; | ||
| // BMP character | ||
| if (Rune.TryCreate(first, out var rune)) | ||
| return rune; | ||
| if (index + 1 < Text.Length) | ||
| { | ||
| var second = Text[index + 1]; | ||
| return Rune.TryCreate(first, second, out rune) | ||
| ? rune | ||
| : default; | ||
| } | ||
| return default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| var first = Text[index]; | |
| // BMP character | |
| if (Rune.TryCreate(first, out var rune)) | |
| return rune; | |
| if (index + 1 < Text.Length) | |
| { | |
| var second = Text[index + 1]; | |
| return Rune.TryCreate(first, second, out rune) | |
| ? rune | |
| : default; | |
| } | |
| return default; | |
| string text = Text; | |
| char first = text[index]; | |
| if (!Rune.TryCreate(first, out Rune rune) && (uint)(index + 1) < (uint)text.Length) | |
| { | |
| // The first character is a surrogate, check if we have a valid pair | |
| Rune.TryCreate(first, text[index + 1], out rune); | |
| } | |
| return rune; |
| Rune NextRune() | ||
| { | ||
| int start = Start; | ||
| if (start >= End) | ||
| { | ||
| Start = End + 1; | ||
| return default; | ||
| } | ||
| var currentBmpOrHighSurrogate = Text[start++]; | ||
| if (char.IsHighSurrogate(currentBmpOrHighSurrogate)) | ||
| { | ||
| var currentLowSurrogate = Text[start]; | ||
| if (char.IsLowSurrogate(currentLowSurrogate)) | ||
| { | ||
| // Supplementary character that occupies 2 code units. | ||
| start++; | ||
| } | ||
| } | ||
| Start = start; | ||
| var first = Text[start]; | ||
| // BMP character | ||
| if (Rune.TryCreate(first, out var rune)) | ||
| return rune; | ||
| if (start + 1 > End) | ||
| return default; | ||
| var second = Text[start + 1]; | ||
| // Supplementary character | ||
| return Rune.TryCreate(first, second, out rune) | ||
| ? rune | ||
| : default; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Rune NextRune() | |
| { | |
| int start = Start; | |
| if (start >= End) | |
| { | |
| Start = End + 1; | |
| return default; | |
| } | |
| var currentBmpOrHighSurrogate = Text[start++]; | |
| if (char.IsHighSurrogate(currentBmpOrHighSurrogate)) | |
| { | |
| var currentLowSurrogate = Text[start]; | |
| if (char.IsLowSurrogate(currentLowSurrogate)) | |
| { | |
| // Supplementary character that occupies 2 code units. | |
| start++; | |
| } | |
| } | |
| Start = start; | |
| var first = Text[start]; | |
| // BMP character | |
| if (Rune.TryCreate(first, out var rune)) | |
| return rune; | |
| if (start + 1 > End) | |
| return default; | |
| var second = Text[start + 1]; | |
| // Supplementary character | |
| return Rune.TryCreate(first, second, out rune) | |
| ? rune | |
| : default; | |
| } | |
| Rune NextRune() | |
| { | |
| int start = Start; | |
| if (start >= End) | |
| { | |
| Start = End + 1; | |
| return default; | |
| } | |
| char first = Text[start++]; | |
| if (!Rune.TryCreate(first, out Rune rune) && start <= End) | |
| { | |
| // The first character is a surrogate, check if we have a valid pair | |
| if (Rune.TryCreate(first, Text[start], out rune)) | |
| { | |
| // Valid surrogate pair | |
| start++; | |
| } | |
| } | |
| Start = start; | |
| return rune; | |
| } |
| readonly Rune PeekRuneExtra(int offset) | ||
| { // Supplementary character | ||
| var index = Start + offset; | ||
| var text = Text; | ||
| if ((uint)index >= (uint)text.Length) | ||
| { | ||
| return default; | ||
| } | ||
| var bmpResultOrNearerSurrogate = text[index]; | ||
| // BMP character | ||
| if (Rune.TryCreate(bmpResultOrNearerSurrogate, out var rune)) | ||
| return rune; | ||
| // Supplementary character | ||
| if (offset < 0) | ||
| { | ||
| // The code unit at `index` should be a low surrogate | ||
| // The scalar value (rune) of a supplementary character should start at `index - 1`, which should be a high surrogate | ||
| if (index < 1) | ||
| { | ||
| return default; | ||
| } | ||
| var highSurrogate = text[index - 1]; | ||
| return Rune.TryCreate(highSurrogate, bmpResultOrNearerSurrogate, out rune) | ||
| ? rune | ||
| : default; | ||
| } | ||
| // The code unit at `index` should be a high surrogate and the start of a scalar value (rune) of a supplementary character | ||
| if (index + 1 >= text.Length) | ||
| { | ||
| return default; | ||
| } | ||
| var lowSurrogate = text[index + 1]; | ||
| return Rune.TryCreate(bmpResultOrNearerSurrogate, lowSurrogate, out rune) | ||
| ? rune | ||
| : default; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| readonly Rune PeekRuneExtra(int offset) | |
| { // Supplementary character | |
| var index = Start + offset; | |
| var text = Text; | |
| if ((uint)index >= (uint)text.Length) | |
| { | |
| return default; | |
| } | |
| var bmpResultOrNearerSurrogate = text[index]; | |
| // BMP character | |
| if (Rune.TryCreate(bmpResultOrNearerSurrogate, out var rune)) | |
| return rune; | |
| // Supplementary character | |
| if (offset < 0) | |
| { | |
| // The code unit at `index` should be a low surrogate | |
| // The scalar value (rune) of a supplementary character should start at `index - 1`, which should be a high surrogate | |
| if (index < 1) | |
| { | |
| return default; | |
| } | |
| var highSurrogate = text[index - 1]; | |
| return Rune.TryCreate(highSurrogate, bmpResultOrNearerSurrogate, out rune) | |
| ? rune | |
| : default; | |
| } | |
| // The code unit at `index` should be a high surrogate and the start of a scalar value (rune) of a supplementary character | |
| if (index + 1 >= text.Length) | |
| { | |
| return default; | |
| } | |
| var lowSurrogate = text[index + 1]; | |
| return Rune.TryCreate(bmpResultOrNearerSurrogate, lowSurrogate, out rune) | |
| ? rune | |
| : default; | |
| } | |
| readonly Rune PeekRuneExtra(int offset) | |
| { | |
| int index = Start + offset; | |
| string text = Text; | |
| if ((uint)index >= (uint)text.Length) | |
| { | |
| return default; | |
| } | |
| char first = text[index]; | |
| if (Rune.TryCreate(first, out var rune)) | |
| { | |
| // BMP | |
| return rune; | |
| } | |
| // Check if we have a valid surrogate pair | |
| if (offset < 0) | |
| { | |
| // The code unit at `index` should be a low surrogate | |
| // The scalar value (rune) of a supplementary character should start at `index - 1`, which should be a high surrogate | |
| if ((uint)(index - 1) < (uint)text.Length) | |
| { | |
| Rune.TryCreate(text[index - 1], first, out rune); | |
| } | |
| } | |
| else | |
| { | |
| // The code unit at `index` should be a high surrogate and the start of a scalar value (rune) of a supplementary character | |
| if ((uint)(index + 1) < (uint)text.Length) | |
| { | |
| Rune.TryCreate(first, text[index + 1], out rune); | |
| } | |
| } | |
| return rune; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ((uint)(index - 1) < (uint)text.Length)
index is never be 0 here (IndexOutOfRangeException has already been thrown at text[index]), but could you tell me why you prefer the cast into uint when compare indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uint casts are a trick that allows the JIT to avoid inserting additional bounds checks for the length. It checks for "index - 1 is not negative" and "index - 1 is in range of the string Length" as a single comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Could you add an additional comment (e.g. // index is never be 0 here (IndexOutOfRangeException would have already been thrown at text[index]) above) above the if statement of the suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't it be 0? Wouldn't you hit it with Start = 1, offset = -1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the wrong impression. (uint)(index - 1) will become ~4.3 billion (uint.MaxValue) when index is 0 due to underflow. It should be (uint)index < (uint)text.Length + 1u instead.
Co-authored-by: Miha Zupan <[email protected]>
Fixes #860