Skip to content

Conversation

@tats-u
Copy link

@tats-u tats-u commented Nov 13, 2025

Fixes #860

@tats-u tats-u marked this pull request as draft December 28, 2025 10:00
@tats-u tats-u marked this pull request as ready for review December 31, 2025 10:33
@tats-u tats-u requested a review from MihaZupan December 31, 2025 10:35
@tats-u
Copy link
Author

tats-u commented Dec 31, 2025

I ran Unit Test in my VS 2026 with .NET 6/8/9 locally and all test cases succeeded.
Sorry for having submitted the half-baked PR.

@xoofx
Copy link
Owner

xoofx commented Dec 31, 2025

Thanks for the work! LGTM. Quite heavy with the Rune polyfill but we don't have much choices.

@MihaZupan anything other feedback?

@tats-u
Copy link
Author

tats-u commented Dec 31, 2025

@tats-u
Copy link
Author

tats-u commented Dec 31, 2025

If we adopt C# 14, we can use Extension Members instead of creating the UnicodeUtilitySupplement class.

/// <param name="offset">The offset.</param>
/// <returns>The rune at the specified offset, returns default if none.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal readonly Rune PeekRuneExtra(int offset)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the behavior of sometimes looking further than the specified offset odd for this generic helper.

Since this is only used to look at the previous character, can we change this to something like Rune PreviousRune() instead?

Copy link
Author

@tats-u tats-u Jan 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find the behavior of sometimes looking further than the specified offset odd

I think it's a natural behavior if you replace the offending supplementary character (the valid surrogate pair) with a BMP character. We don't know which of a BMP or supplementary character the character at the given position is.

 -2  -1       0        1   2
[𠮷||  ]<we are here>[𠮷||  ]
    [吉]<we are here>[吉]

-----------------------------

High surrogate code unit
 ↓
[𠮷||  ]
     ↑
Low surrogate code unit

Note

𠮷 (U+20BB7) is a supplementary character and 吉 (U+5409) is a BMP character.

What we need is just a character placed at the position. If the character occupies 2 UTF-16 code units, we should fetch the remaining outside half.

Since this is only used to look at the previous character,

CJK Friendly Emphasis (#890) occasionally requires the previous character of the previous character of the current position. i.e. may need to call PeekRuneExtra(-2) or PeekRuneExtra(-3) in the future. It's not so bad idea to split that method into 2 (looking backward and forward).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why "Outdated" is displayed is simply because I changed the visibility of this method.

var currentLowSurrogate = Text[start++];
if (!char.IsLowSurrogate(currentLowSurrogate))
{
start--;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any test that's covering this?
Needing to backtrack in NextRune is odd and feels like a usage error, not something this method should be hiding.

Copy link
Author

@tats-u tats-u Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.Normalize used by TestParser.Compact throws ArgumentException for strings containing isolated surrogate code units. I don't think we need to prepare one. Inputs whose outputs are accepted by String.Normalize don't go through this line.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all tests have to go through TestParser, we can test the StringSlice directly.
It's easy to make off by one errors here, or forget to move offsets correctly in edge cases, I think we should add the corresponding test coverage.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an example, RuneAt(int index) is inconsistently validating the End right now. It's being checked for the second surrogate, but not the first character.

Copy link
Author

@tats-u tats-u Jan 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should check Text.Length instead to align with the indexer. I added tests.

@MihaZupan
Copy link
Collaborator

If we adopt C# 14, we can use Extension Members instead of creating the UnicodeUtilitySupplement class.

Could you get by with making the existing UnicodeUtlity file partial instead?

@tats-u
Copy link
Author

tats-u commented Jan 2, 2026

Could you get by with making the existing UnicodeUtlity file partial instead?

I moved the new members to the existing UnicodeUtility.cs. Should UnicodeUtility.cs also exist in the Polyfills directory with partial added to both class definitions?

@MihaZupan
Copy link
Collaborator

Should UnicodeUtility.cs also exist in the Polyfills directory with partial added to both class definitions?

I don't think it's needed

@tats-u
Copy link
Author

tats-u commented Jan 2, 2026

I see. Let me know if we should move it from Helpers to the Polyfills directory.

@tats-u
Copy link
Author

tats-u commented Jan 4, 2026

@MihaZupan I have finished correcting all sections where your additional responses were not required.

@tats-u tats-u requested a review from MihaZupan January 4, 2026 05:46
@tats-u tats-u requested a review from MihaZupan January 4, 2026 15:28
Copy link
Collaborator

@MihaZupan MihaZupan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

Comment on lines +142 to +152
int start = Start;
if (start > End) return default;
var first = Text[start];
// BMP character
if (Rune.TryCreate(first, out var rune)) return rune;
if (start + 1 > End) return default;
var second = Text[start + 1];
// Supplementary character
return Rune.TryCreate(first, second, out rune)
? rune
: default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int start = Start;
if (start > End) return default;
var first = Text[start];
// BMP character
if (Rune.TryCreate(first, out var rune)) return rune;
if (start + 1 > End) return default;
var second = Text[start + 1];
// Supplementary character
return Rune.TryCreate(first, second, out rune)
? rune
: default;
int start = Start;
if (start > End) return default;
char first = Text[start];
if (!Rune.TryCreate(first, out Rune rune) && start + 1 <= End)
{
// The first character is a surrogate, check if we have a valid pair
Rune.TryCreate(first, Text[start + 1], out rune);
}
return rune;

Comment on lines +190 to +201
var first = Text[index];
// BMP character
if (Rune.TryCreate(first, out var rune))
return rune;
if (index + 1 < Text.Length)
{
var second = Text[index + 1];
return Rune.TryCreate(first, second, out rune)
? rune
: default;
}
return default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var first = Text[index];
// BMP character
if (Rune.TryCreate(first, out var rune))
return rune;
if (index + 1 < Text.Length)
{
var second = Text[index + 1];
return Rune.TryCreate(first, second, out rune)
? rune
: default;
}
return default;
string text = Text;
char first = text[index];
if (!Rune.TryCreate(first, out Rune rune) && (uint)(index + 1) < (uint)text.Length)
{
// The first character is a surrogate, check if we have a valid pair
Rune.TryCreate(first, text[index + 1], out rune);
}
return rune;

Comment on lines +237 to +267
Rune NextRune()
{
int start = Start;
if (start >= End)
{
Start = End + 1;
return default;
}
var currentBmpOrHighSurrogate = Text[start++];
if (char.IsHighSurrogate(currentBmpOrHighSurrogate))
{
var currentLowSurrogate = Text[start];
if (char.IsLowSurrogate(currentLowSurrogate))
{
// Supplementary character that occupies 2 code units.
start++;
}
}
Start = start;
var first = Text[start];
// BMP character
if (Rune.TryCreate(first, out var rune))
return rune;
if (start + 1 > End)
return default;
var second = Text[start + 1];
// Supplementary character
return Rune.TryCreate(first, second, out rune)
? rune
: default;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Rune NextRune()
{
int start = Start;
if (start >= End)
{
Start = End + 1;
return default;
}
var currentBmpOrHighSurrogate = Text[start++];
if (char.IsHighSurrogate(currentBmpOrHighSurrogate))
{
var currentLowSurrogate = Text[start];
if (char.IsLowSurrogate(currentLowSurrogate))
{
// Supplementary character that occupies 2 code units.
start++;
}
}
Start = start;
var first = Text[start];
// BMP character
if (Rune.TryCreate(first, out var rune))
return rune;
if (start + 1 > End)
return default;
var second = Text[start + 1];
// Supplementary character
return Rune.TryCreate(first, second, out rune)
? rune
: default;
}
Rune NextRune()
{
int start = Start;
if (start >= End)
{
Start = End + 1;
return default;
}
char first = Text[start++];
if (!Rune.TryCreate(first, out Rune rune) && start <= End)
{
// The first character is a surrogate, check if we have a valid pair
if (Rune.TryCreate(first, Text[start], out rune))
{
// Valid surrogate pair
start++;
}
}
Start = start;
return rune;
}

Comment on lines +361 to +396
readonly Rune PeekRuneExtra(int offset)
{ // Supplementary character
var index = Start + offset;
var text = Text;
if ((uint)index >= (uint)text.Length)
{
return default;
}
var bmpResultOrNearerSurrogate = text[index];
// BMP character
if (Rune.TryCreate(bmpResultOrNearerSurrogate, out var rune))
return rune;
// Supplementary character
if (offset < 0)
{
// The code unit at `index` should be a low surrogate
// The scalar value (rune) of a supplementary character should start at `index - 1`, which should be a high surrogate
if (index < 1)
{
return default;
}
var highSurrogate = text[index - 1];
return Rune.TryCreate(highSurrogate, bmpResultOrNearerSurrogate, out rune)
? rune
: default;
}
// The code unit at `index` should be a high surrogate and the start of a scalar value (rune) of a supplementary character
if (index + 1 >= text.Length)
{
return default;
}
var lowSurrogate = text[index + 1];
return Rune.TryCreate(bmpResultOrNearerSurrogate, lowSurrogate, out rune)
? rune
: default;
}
Copy link
Collaborator

@MihaZupan MihaZupan Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
readonly Rune PeekRuneExtra(int offset)
{ // Supplementary character
var index = Start + offset;
var text = Text;
if ((uint)index >= (uint)text.Length)
{
return default;
}
var bmpResultOrNearerSurrogate = text[index];
// BMP character
if (Rune.TryCreate(bmpResultOrNearerSurrogate, out var rune))
return rune;
// Supplementary character
if (offset < 0)
{
// The code unit at `index` should be a low surrogate
// The scalar value (rune) of a supplementary character should start at `index - 1`, which should be a high surrogate
if (index < 1)
{
return default;
}
var highSurrogate = text[index - 1];
return Rune.TryCreate(highSurrogate, bmpResultOrNearerSurrogate, out rune)
? rune
: default;
}
// The code unit at `index` should be a high surrogate and the start of a scalar value (rune) of a supplementary character
if (index + 1 >= text.Length)
{
return default;
}
var lowSurrogate = text[index + 1];
return Rune.TryCreate(bmpResultOrNearerSurrogate, lowSurrogate, out rune)
? rune
: default;
}
readonly Rune PeekRuneExtra(int offset)
{
int index = Start + offset;
string text = Text;
if ((uint)index >= (uint)text.Length)
{
return default;
}
char first = text[index];
if (Rune.TryCreate(first, out var rune))
{
// BMP
return rune;
}
// Check if we have a valid surrogate pair
if (offset < 0)
{
// The code unit at `index` should be a low surrogate
// The scalar value (rune) of a supplementary character should start at `index - 1`, which should be a high surrogate
if ((uint)(index - 1) < (uint)text.Length)
{
Rune.TryCreate(text[index - 1], first, out rune);
}
}
else
{
// The code unit at `index` should be a high surrogate and the start of a scalar value (rune) of a supplementary character
if ((uint)(index + 1) < (uint)text.Length)
{
Rune.TryCreate(first, text[index + 1], out rune);
}
}
return rune;
}

Copy link
Author

@tats-u tats-u Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if ((uint)(index - 1) < (uint)text.Length)

index is never be 0 here (IndexOutOfRangeException has already been thrown at text[index]), but could you tell me why you prefer the cast into uint when compare indexes?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uint casts are a trick that allows the JIT to avoid inserting additional bounds checks for the length. It checks for "index - 1 is not negative" and "index - 1 is in range of the string Length" as a single comparison.

E.g. compare
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAZgAJiAmcgFRlwwGEAbPXAWACgBvb8gZQrEAjEnIBLAHYZyAMQgQAFKIAM5XGkkztAExgIAlP0F8ugi5IBm5JdP0JycciPIAeDQDoAMjCkBzDAALQ0oAdg0AbXsDJxcAXQBuE0twl2TzQQBfFPJcskoxbVkAIWwoFRF1TWK9A2NMgTNUgQkbJSUAV2kMQzspBziRUI8unsNcHz9AkLTcaIHY5xEk3ItiCJEMixyuLKA=

Copy link
Author

@tats-u tats-u Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Could you add an additional comment (e.g. // index is never be 0 here (IndexOutOfRangeException would have already been thrown at text[index]) above) above the if statement of the suggestion?

Copy link
Collaborator

@MihaZupan MihaZupan Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't it be 0? Wouldn't you hit it with Start = 1, offset = -1?

Copy link
Author

@tats-u tats-u Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the wrong impression. (uint)(index - 1) will become ~4.3 billion (uint.MaxValue) when index is 0 due to underflow. It should be (uint)index < (uint)text.Length + 1u instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Recognize supplementary (non-BMP) punctuations around emphasis delimiter runs

3 participants