Recognize supplementary characters #913

tats-u · 2025-11-13T10:50:25Z

Fixes #860

src/Markdig/Helpers/CharHelper.cs

src/Markdig/Helpers/StringSlice.cs

src/Markdig/Markdig.targets

tats-u · 2025-12-31T10:35:40Z

I ran the unit tests locally in VS 2026 against .NET 6/8/9, and all test cases passed.
Sorry for having submitted a half-baked PR.

xoofx · 2025-12-31T12:12:15Z

Thanks for the work! LGTM. The Rune polyfill is quite heavy, but we don't have much choice.

@MihaZupan any other feedback?

tats-u · 2025-12-31T13:01:29Z

Rune is based on the latest snapshot of https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Text/Rune.cs,a0cdde85f676b935.

tats-u · 2025-12-31T13:04:25Z

If we adopt C# 14, we can use Extension Members instead of creating the UnicodeUtilitySupplement class.

src/Markdig/Extensions/SmartyPants/SmartyPantsInlineParser.cs

src/Markdig/Helpers/CharHelper.cs

src/Markdig/Parsers/Inlines/EmphasisInlineParser.cs

MihaZupan · 2026-01-01T21:31:07Z

src/Markdig/Helpers/StringSlice.cs

+    /// <param name="offset">The offset.</param>
+    /// <returns>The rune at the specified offset, returns default if none.</returns>
+    [MethodImpl(MethodImplOptions.AggressiveInlining)]
+    internal readonly Rune PeekRuneExtra(int offset)


I find the behavior of sometimes looking further than the specified offset odd for this generic helper.

Since this is only used to look at the previous character, can we change this to something like Rune PreviousRune() instead?

> I find the behavior of sometimes looking further than the specified offset odd

I think the behavior looks natural if you imagine replacing the supplementary character (a valid surrogate pair) with a BMP character. We don't know in advance whether the character at the given position is a BMP character or a supplementary one.

 -2  -1   0   1   2
[𠮷||  ]<we are here>[𠮷||  ]
    [吉]<we are here>[吉]
-----------------------------
 High surrogate code unit
 ↓
[𠮷||  ]
     ↑
     Low surrogate code unit

Note: 𠮷 (U+20BB7) is a supplementary character and 吉 (U+5409) is a BMP character.

All we need is the character located at the given position. If that character occupies two UTF-16 code units, we should also fetch the other half, even though it lies outside the requested offset.
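To make the point above concrete, here is a minimal sketch of "return the character that covers this position". It works on a plain string rather than on StringSlice, and the helper name `RuneCovering` and its `lookBackward` parameter are hypothetical, not part of Markdig. It assumes a runtime where System.Text.Rune is available.

```csharp
using System;
using System.Text;

// Hypothetical helper (not Markdig's API): returns the rune that covers `index`.
// When the code unit at `index` is a surrogate, `lookBackward` decides which
// neighbouring code unit is used to complete the pair.
static Rune RuneCovering(string text, int index, bool lookBackward)
{
    if ((uint)index >= (uint)text.Length) return default;

    char unit = text[index];
    if (Rune.TryCreate(unit, out Rune rune)) return rune; // BMP character

    int other = lookBackward ? index - 1 : index + 1;
    if ((uint)other < (uint)text.Length)
    {
        // On failure TryCreate leaves `rune` as default.
        if (lookBackward) Rune.TryCreate(text[other], unit, out rune);
        else Rune.TryCreate(unit, text[other], out rune);
    }
    return rune; // default (U+0000) for an unpaired surrogate
}

// U+20BB7 is encoded as the pair D842 DFB7; U+5409 is a single code unit.
string sample = "\uD842\uDFB7x\u5409";
Console.WriteLine(RuneCovering(sample, 1, lookBackward: true).Value.ToString("X"));
Console.WriteLine(RuneCovering(sample, 3, lookBackward: true).Value.ToString("X"));
```

Looking backward from index 1 lands on the low surrogate, so the helper has to read index 0 as well. For the BMP character at index 3 no extra read happens.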

> Since this is only used to look at the previous character,

CJK Friendly Emphasis (#890) occasionally requires the character before the previous character, i.e., we may need to call PeekRuneExtra(-2) or PeekRuneExtra(-3) in the future. That said, splitting the method in two (one looking backward, one looking forward) is not a bad idea.

The reason why "Outdated" is displayed is simply because I changed the visibility of this method.

MihaZupan · 2026-01-01T21:33:21Z

src/Markdig/Helpers/StringSlice.cs

+            var currentLowSurrogate = Text[start++];
+            if (!char.IsLowSurrogate(currentLowSurrogate))
+            {
+                start--;


Is there any test that's covering this?
Needing to backtrack in NextRune is odd and feels like a usage error, not something this method should be hiding.

String.Normalize used by TestParser.Compact throws ArgumentException for strings containing isolated surrogate code units. I don't think we need to prepare one. Inputs whose outputs are accepted by String.Normalize don't go through this line.

Not all tests have to go through TestParser, we can test the StringSlice directly.
It's easy to make off by one errors here, or forget to move offsets correctly in edge cases, I think we should add the corresponding test coverage.
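As an illustration of the edge cases worth covering, the sketch below checks how far a cursor moves for a valid pair, an unpaired high surrogate, and an unpaired low surrogate. `ReadRune` is a hypothetical stand-in that operates on a plain string; it is not the StringSlice method and does not reproduce its Start/End semantics.

```csharp
using System;
using System.Text;

// Hypothetical reader (not Markdig's API): decodes the rune at `pos` and
// advances `pos` by the number of code units actually consumed.
static Rune ReadRune(string text, ref int pos)
{
    if ((uint)pos >= (uint)text.Length) return default;

    char first = text[pos++];
    if (!Rune.TryCreate(first, out Rune rune)
        && pos < text.Length
        && Rune.TryCreate(first, text[pos], out rune))
    {
        pos++; // valid surrogate pair: consume the low surrogate too
    }
    return rune; // default (U+0000) when `first` is an unpaired surrogate
}

// An unpaired high surrogate followed by a BMP character:
// the reader must advance by exactly one code unit, not two.
string broken = "\uD842x";
int cursor = 0;
Rune r1 = ReadRune(broken, ref cursor);
Console.WriteLine($"{r1.Value:X} cursor={cursor}");
Rune r2 = ReadRune(broken, ref cursor);
Console.WriteLine($"{r2.Value:X} cursor={cursor}");
```

Because the isolated surrogate is consumed on its own, the following `x` is still returned by the next call, which is the kind of offset behavior a direct test can pin down without going through String.Normalize.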

As an example, RuneAt(int index) is inconsistently validating the End right now. It's being checked for the second surrogate, but not the first character.

It should check Text.Length instead to align with the indexer. I added tests.

MihaZupan · 2026-01-01T21:41:29Z

> If we adopt C# 14, we can use Extension Members instead of creating the UnicodeUtilitySupplement class.

Could you get by with making the existing UnicodeUtility file partial instead?

tats-u · 2026-01-02T13:01:47Z

> Could you get by with making the existing UnicodeUtility file partial instead?

I moved the new members to the existing UnicodeUtility.cs. Should UnicodeUtility.cs also exist in the Polyfills directory with partial added to both class definitions?

MihaZupan · 2026-01-02T14:24:28Z

> Should UnicodeUtility.cs also exist in the Polyfills directory with partial added to both class definitions?

I don't think it's needed

tats-u · 2026-01-02T15:01:29Z

I see. Let me know if we should move it from Helpers to the Polyfills directory.

tats-u · 2026-01-04T05:46:02Z

@MihaZupan I have finished addressing all the review comments that did not need a further response from you.

src/Markdig/Parsers/Inlines/EmphasisInlineParser.cs

src/Markdig/Helpers/StringSlice.cs

MihaZupan

Thank you

MihaZupan · 2026-01-05T10:50:08Z

src/Markdig/Helpers/StringSlice.cs

+            int start = Start;
+            if (start > End) return default;
+            var first = Text[start];
+            // BMP character
+            if (Rune.TryCreate(first, out var rune)) return rune;
+            if (start + 1 > End) return default;
+            var second = Text[start + 1];
+            // Supplementary character
+            return Rune.TryCreate(first, second, out rune)
+                ? rune
+                : default;


Suggested change

    int start = Start;
    if (start > End) return default;
    char first = Text[start];
    if (!Rune.TryCreate(first, out Rune rune) && start + 1 <= End)
    {
        // The first character is a surrogate, check if we have a valid pair
        Rune.TryCreate(first, Text[start + 1], out rune);
    }
    return rune;

MihaZupan · 2026-01-05T10:54:33Z

src/Markdig/Helpers/StringSlice.cs

+        var first = Text[index];
+        // BMP character
+        if (Rune.TryCreate(first, out var rune))
+            return rune;
+        if (index + 1 < Text.Length)
+        {
+            var second = Text[index + 1];
+            return Rune.TryCreate(first, second, out rune)
+                ? rune
+                : default;
+        }
+        return default;


Suggested change

    string text = Text;
    char first = text[index];
    if (!Rune.TryCreate(first, out Rune rune) && (uint)(index + 1) < (uint)text.Length)
    {
        // The first character is a surrogate, check if we have a valid pair
        Rune.TryCreate(first, text[index + 1], out rune);
    }
    return rune;

MihaZupan · 2026-01-05T11:03:39Z

src/Markdig/Helpers/StringSlice.cs

+    Rune NextRune()
+    {
+        int start = Start;
+        if (start >= End)
+        {
+            Start = End + 1;
+            return default;
+        }
+        var currentBmpOrHighSurrogate = Text[start++];
+        if (char.IsHighSurrogate(currentBmpOrHighSurrogate))
+        {
+            var currentLowSurrogate = Text[start];
+            if (char.IsLowSurrogate(currentLowSurrogate))
+            {
+                // Supplementary character that occupies 2 code units.
+                start++;
+            }
+        }
+        Start = start;
+        var first = Text[start];
+        // BMP character
+        if (Rune.TryCreate(first, out var rune))
+            return rune;
+        if (start + 1 > End)
+            return default;
+        var second = Text[start + 1];
+        // Supplementary character
+        return Rune.TryCreate(first, second, out rune)
+            ? rune
+            : default;
+    }


Suggested change

    Rune NextRune()
    {
        int start = Start;
        if (start >= End)
        {
            Start = End + 1;
            return default;
        }
        char first = Text[start++];
        if (!Rune.TryCreate(first, out Rune rune) && start <= End)
        {
            // The first character is a surrogate, check if we have a valid pair
            if (Rune.TryCreate(first, Text[start], out rune))
            {
                // Valid surrogate pair
                start++;
            }
        }
        Start = start;
        return rune;
    }

src/Markdig/Helpers/StringSlice.cs

MihaZupan · 2026-01-05T11:16:36Z

src/Markdig/Helpers/StringSlice.cs

+    readonly Rune PeekRuneExtra(int offset)
+    {    // Supplementary character
+        var index = Start + offset;
+        var text = Text;
+        if ((uint)index >= (uint)text.Length)
+        {
+            return default;
+        }
+        var bmpResultOrNearerSurrogate = text[index];
+        // BMP character
+        if (Rune.TryCreate(bmpResultOrNearerSurrogate, out var rune))
+            return rune;
+        // Supplementary character
+        if (offset < 0)
+        {
+            // The code unit at `index` should be a low surrogate
+            // The scalar value (rune) of a supplementary character should start at `index - 1`, which should be a high surrogate
+            if (index < 1)
+            {
+                return default;
+            }
+            var highSurrogate = text[index - 1];
+            return Rune.TryCreate(highSurrogate, bmpResultOrNearerSurrogate, out rune)
+                ? rune
+                : default;
+        }
+        // The code unit at `index` should be a high surrogate and the start of a scalar value (rune) of a supplementary character
+        if (index + 1 >= text.Length)
+        {
+            return default;
+        }
+        var lowSurrogate = text[index + 1];
+        return Rune.TryCreate(bmpResultOrNearerSurrogate, lowSurrogate, out rune)
+            ? rune
+            : default;
+    }


Suggested change

    readonly Rune PeekRuneExtra(int offset)
    {
        int index = Start + offset;
        string text = Text;
        if ((uint)index >= (uint)text.Length)
        {
            return default;
        }
        char first = text[index];
        if (Rune.TryCreate(first, out var rune))
        {
            // BMP
            return rune;
        }
        // Check if we have a valid surrogate pair
        if (offset < 0)
        {
            // The code unit at `index` should be a low surrogate
            // The scalar value (rune) of a supplementary character should start at `index - 1`, which should be a high surrogate
            if ((uint)(index - 1) < (uint)text.Length)
            {
                Rune.TryCreate(text[index - 1], first, out rune);
            }
        }
        else
        {
            // The code unit at `index` should be a high surrogate and the start of a scalar value (rune) of a supplementary character
            if ((uint)(index + 1) < (uint)text.Length)
            {
                Rune.TryCreate(first, text[index + 1], out rune);
            }
        }
        return rune;
    }

> if ((uint)(index - 1) < (uint)text.Length)

index can never be 0 here (an IndexOutOfRangeException would already have been thrown at text[index]), but could you tell me why you prefer casting to uint when comparing indexes?

The uint casts are a trick that allows the JIT to avoid inserting additional bounds checks for the length. It checks for "index - 1 is not negative" and "index - 1 is in range of the string Length" as a single comparison.

E.g. compare
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAZgAJiAmcgFRlwwGEAbPXAWACgBvb8gZQrEAjEnIBLAHYZyAMQgQAFKIAM5XGkkztAExgIAlP0F8ugi5IBm5JdP0JycciPIAeDQDoAMjCkBzDAALQ0oAdg0AbXsDJxcAXQBuE0twl2TzQQBfFPJcskoxbVkAIWwoFRF1TWK9A2NMgTNUgQkbJSUAV2kMQzspBziRUI8unsNcHz9AkLTcaIHY5xEk3ItiCJEMixyuLKA=
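For readers who cannot open the link, the following is a small sketch of the idiom itself. The two helper names are hypothetical and only serve the comparison; the sketch shows that the single unsigned comparison gives the same answer as the two signed ones, and says nothing about what the JIT emits.

```csharp
using System;

// Two comparisons: reject negative indexes, then indexes past the end.
static bool InRangeTwoChecks(int index, int length) => index >= 0 && index < length;

// One comparison: a negative int reinterpreted as uint becomes a value of at
// least 2^31, which can never be below a non-negative length.
static bool InRangeOneCheck(int index, int length) => (uint)index < (uint)length;

// Start = 1, offset = -1 gives index 0, so `index - 1` is -1.
int start = 1, offset = -1;
int idx = start + offset;
Console.WriteLine(InRangeOneCheck(idx, 3));     // index 0 is accepted
Console.WriteLine(InRangeOneCheck(idx - 1, 3)); // index - 1 == -1 is rejected
```

In the default unchecked context the cast of -1 wraps around to uint.MaxValue, so the comparison fails and the element before the start of the string is never read.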

I see. Could you add a comment above the if statement in the suggestion (e.g. // index can never be 0 here (an IndexOutOfRangeException would already have been thrown at text[index]))?

Why can't it be 0? Wouldn't you hit it with Start = 1, offset = -1?

I was under the wrong impression. (uint)(index - 1) will become ~4.3 billion (uint.MaxValue) when index is 0 due to underflow. It should be (uint)index < (uint)text.Length + 1u instead.

src/Markdig.Tests/TestStringSlice.cs

Co-authored-by: Miha Zupan <[email protected]>

Recognize supplementary characters

f43d988

MihaZupan reviewed Nov 13, 2025

src/Markdig/Helpers/CharHelper.cs (outdated)

src/Markdig/Helpers/CharHelper.cs (outdated)

src/Markdig/Helpers/StringSlice.cs (outdated)

src/Markdig/Markdig.targets (outdated)

xoofx mentioned this pull request Nov 25, 2025

German Umlaut handling in urilize #910

Closed

Internatize Rune

2a851a2

tats-u marked this pull request as draft December 28, 2025 10:00

Fix failing tests

de5d3fe

tats-u marked this pull request as ready for review December 31, 2025 10:33

tats-u requested a review from MihaZupan December 31, 2025 10:35

Fix extra comment error

095f0ed

MihaZupan reviewed Jan 1, 2026


Remove extra local variable c

948bf66

Reorganize classes around Rune

e968e52

tats-u added 5 commits January 3, 2026 00:07

Prepare both Rune and char variants / make Rune variant public for .NET

bbffa33

Make APIs in StringSlice.cs public only in modern .NET

3a65e4b

Throw exception if cannot obtain first Rune

0f928a2

Add comments

a4c9146

Add comment on PeekRuneExtra

9839b99

tats-u requested a review from MihaZupan January 4, 2026 05:46

MihaZupan reviewed Jan 4, 2026

src/Markdig/Parsers/Inlines/EmphasisInlineParser.cs (outdated)

MihaZupan reviewed Jan 4, 2026

src/Markdig/Helpers/StringSlice.cs (outdated)

Use Rune.TryCreate

3ba8a3c

tats-u added 7 commits January 4, 2026 19:34

Remove backtrack

8ab6542

Fix parameter name in XML comment

f6d6916

Don't throw when error in Rune.DecodeFromUtf16

03822ac

Fix RuneAt

b9d9e09

Add tests of Rune-related methods of StringSlice

476fb63

Make comment more tolerant of changes

4cb6895

Tweak comment

e1e58cb

tats-u requested a review from MihaZupan January 4, 2026 15:28

Fix comment

b302cbc

MihaZupan approved these changes Jan 5, 2026


tats-u and others added 2 commits January 5, 2026 23:29

Add readonly

a0d08bf

Co-authored-by: Miha Zupan <[email protected]>

Move namespace of polyfilled Rune out of System.Text

31f48ac
