Update Utf8 and Rune docs #8128

GrabYourPitchforks · 2022-06-03T01:43:18Z

Summary

Significantly fleshes out the Remarks section for the System.Text.Unicode.Utf8 APIs, complete with sample code. Also improves some docs for System.Text.Rune.

Feedback requested: What do folks think of the Utf8.FromUtf16 docs here? I want to update ToUtf16 to largely follow the same pattern. Figured it'd be best to solicit some feedback first before dedicating time to it.

ghost · 2022-06-03T01:43:24Z

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	GrabYourPitchforks
Assignees:	-
Labels:	`area-System.Text.Encoding`
Milestone:	-

tarekgh · 2022-06-03T02:06:58Z

xml/System.Text.Unicode/Utf8.xml


-If 'replaceInvalidSequences' is `true`, the method never returns <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>. If 'isFinalBlock' is `true`, the method never returns <xref:System.Buffers.OperationStatus.NeedMoreData?displayProperty=nameWithType>.
+Span<byte> utf8DestinationBytes = new byte[64];
+string utf16InputChars = "¿Cómo estás?"; // "How are you?" in Spanish


"How are you?" in Spanish

we should have this comment with the voice :-) ...just kidding.

I'll attach a wav file. :)

In all seriousness, if there's any other sample text that's preferred here, I'm open to suggestions. The best sample for this particular scenario is text which is mostly-ASCII but with a handful of non-ASCII characters thrown in.

It is a reasonable sample text. It is good enough to demonstrate the idea.

opbld31 · 2022-06-03T02:12:44Z

Docs Build status updates of commit d2f34dc:

⚠️ Validation status: warnings

File	Status	Preview URL	Details
xml/System.Text/Rune.xml	⚠️Warning	View	Details
xml/System.Text.Unicode/Utf8.xml	✅Succeeded	View

xml/System.Text/Rune.xml

Line 0, Column 0: [Warning: xref-not-found] Cross reference not found: 'System.Text.Rune.DecodeFromUtf8'.
Line 0, Column 0: [Warning: xref-not-found] Cross reference not found: 'System.Text.Rune.DecodeLastFromUtf8'.

For more details, please refer to the build report.

tarekgh

LGTM. Thanks for writing it. It is a valuable addition for the users who are not that familiar with the UTF-8 encoding or not paying much attention to corner cases.

gfoidl

Just a nit.

It's a very clear read with good examples 👍🏻
I guess that even people that only know the name UTF-8 will be able to use these APIs correctly (if they read the docs).

gfoidl · 2022-06-03T11:47:23Z

xml/System.Text.Unicode/Utf8.xml

+MemoryStream outputStream = new MemoryStream();
+string stringToWrite = "Hello world!";
+await WriteStringToStreamAsync(stringToWrite, outputStream);


For the MemoryStream, async doesn't make much sense. I understand the idea behind using async here, but should there be a comment indicating this?
Or maybe

Suggested change

-MemoryStream outputStream = new MemoryStream();
-string stringToWrite = "Hello world!";
-await WriteStringToStreamAsync(stringToWrite, outputStream);
+Stream outputStream = GetOutputStream(); // get the stream from somewhere
+string stringToWrite = "Hello world!";
+await WriteStringToStreamAsync(stringToWrite, outputStream);

?

gfoidl · 2022-06-03T11:47:36Z

xml/System.Text.Unicode/Utf8.xml

+        Debug.Assert(opStatus == OperationStatus.Done || opStatus == OperationStatus.DestinationTooSmall);
+        Debug.Assert(bytesWritten > 0, "Scratch buffer is too small for loop to make forward progress.");


👍🏻 (perfect for pushing more people towards using Debug.Asserts)
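
For readers following the thread without the full diff, here is a rough sketch of the kind of loop these asserts guard. It is not the PR's actual sample: the method name and signature, the 1024-byte scratch buffer, and the two asserts come from the hunks quoted in this conversation, while the loop body, the sample string, and the variable names (`scratchBuffer`, `remainingData`) are assumptions made for illustration.

```csharp
using System;
using System.Buffers;
using System.Diagnostics;
using System.IO;
using System.Text.Unicode;
using System.Threading.Tasks;

MemoryStream outputStream = new MemoryStream();
await WriteStringToStreamAsync("¿Cómo estás?", outputStream);
Console.WriteLine(outputStream.Length); // 12 chars, 3 of them non-ASCII (2 bytes each)

static async Task WriteStringToStreamAsync(string dataToWrite, Stream outputStream)
{
    // Assumed layout: a plain array as scratch space, so nothing span-typed
    // has to live across an await.
    byte[] scratchBuffer = new byte[1024];
    ReadOnlyMemory<char> remainingData = dataToWrite.AsMemory();

    while (!remainingData.IsEmpty)
    {
        // Defaults are replaceInvalidSequences: true and isFinalBlock: true,
        // so InvalidData and NeedMoreData are never returned here.
        OperationStatus opStatus = Utf8.FromUtf16(
            remainingData.Span, scratchBuffer, out int charsRead, out int bytesWritten);

        Debug.Assert(opStatus == OperationStatus.Done || opStatus == OperationStatus.DestinationTooSmall);
        Debug.Assert(bytesWritten > 0, "Scratch buffer is too small for loop to make forward progress.");

        // Flush what was transcoded, then advance past the chars already consumed.
        await outputStream.WriteAsync(scratchBuffer.AsMemory(0, bytesWritten));
        remainingData = remainingData.Slice(charsRead);
    }
}
```

The second assert matters because a scratch buffer smaller than one encoded character would make `FromUtf16` return `DestinationTooSmall` with nothing written, and the loop would never advance.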

gewarren

That was a lot - thanks Levi!

gewarren · 2022-06-06T18:35:06Z

xml/System.Text.Unicode/Utf8.xml

+
+async Task WriteStringToStreamAsync(string dataToWrite, Stream outputStream)
+{
+    // For this example we'll use a 1024-byte scratch buffer, but you can


Suggested change

-    // For this example we'll use a 1024-byte scratch buffer, but you can
+    // This example uses a 1024-byte scratch buffer, but you can

gewarren · 2022-06-06T18:35:25Z

xml/System.Text.Unicode/Utf8.xml

+async Task WriteStringToStreamAsync(string dataToWrite, Stream outputStream)
+{
+    // For this example we'll use a 1024-byte scratch buffer, but you can
+    // use pooled arrays or a differently-sized buffer depending on your


Suggested change

-    // use pooled arrays or a differently-sized buffer depending on your
+    // use pooled arrays or a different-sized buffer depending on your

gewarren · 2022-06-06T18:39:55Z

xml/System.Text.Unicode/Utf8.xml

+
+In the output, the leading `"AB"` is successfully transcoded into its UTF-8 representation `[ 41 42 ]`. However, the standalone high surrogate char `'\ud800'` cannot be represented in UTF-8, so the replacement character sequence `[ EF BF BD ]` is written to the destination instead. Finally, the trailing `"YZ"` does transcode successfully to `[ 59 5A ]` and is written to the destination.
+
+If you set `replaceInvalidSequences` to `false`, substitution of ill-formed input data not take place. Instead, the `ToUtf8` method will stop processing input immediately upon seeing ill-formed input data and return <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>, as shown in the following example.


Suggested change

-If you set `replaceInvalidSequences` to `false`, substitution of ill-formed input data not take place. Instead, the `ToUtf8` method will stop processing input immediately upon seeing ill-formed input data and return <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>, as shown in the following example.
+If you set `replaceInvalidSequences` to `false`, ill-formed input data is not substituted. Instead, the `ToUtf8` method stops processing input and returns <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType> as soon as it finds ill-formed data, as shown in the following example.
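
The behavior described in this hunk can be checked with a small sketch. It assumes the method being documented is `Utf8.FromUtf16` (the API named in the PR summary; the hunk itself calls it `ToUtf8`), and the input string and variable names are illustrative rather than taken from the PR's sample.

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

// Illustrative input: "AB", a standalone high surrogate, then "YZ".
string input = "AB\ud800YZ";
byte[] destination = new byte[16];

// replaceInvalidSequences: true substitutes U+FFFD ([ EF BF BD ]) for the lone surrogate.
OperationStatus replacedStatus = Utf8.FromUtf16(
    input, destination, out int replacedCharsRead, out int replacedBytesWritten,
    replaceInvalidSequences: true);
string replacedHex = BitConverter.ToString(destination, 0, replacedBytesWritten);
Console.WriteLine($"{replacedStatus}: {replacedHex}");
// Done: 41-42-EF-BF-BD-59-5A

// replaceInvalidSequences: false stops at the lone surrogate and reports InvalidData.
OperationStatus strictStatus = Utf8.FromUtf16(
    input, destination, out int strictCharsRead, out int strictBytesWritten,
    replaceInvalidSequences: false);
Console.WriteLine($"{strictStatus}: read {strictCharsRead} chars, wrote {strictBytesWritten} bytes");
// InvalidData: read 2 chars, wrote 2 bytes
```

In the strict case, `charsRead` tells the caller where the ill-formed data begins, so the well-formed prefix `"AB"` is still usable.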

gewarren · 2022-06-06T18:58:26Z

xml/System.Text/Rune.xml

+> [!CAUTION]
+> When calling this method in a loop and slicing the `source` span, use the returned `charsConsumed` value instead of the returned `result`'s <xref:System.Text.Rune.Utf16SequenceLength> property.
+>
+> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8> for more information.


Suggested change

-> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8> for more information.
+> While these two values will be identical for UTF-16 scenarios, they aren't guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications that initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. For more information, see the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8%2A>.
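
A minimal sketch of the UTF-8 pitfall this caution refers to, using `Rune.DecodeFromUtf8`. The three-byte input is made up for illustration: `0xFF` can never appear in well-formed UTF-8, so it decodes to the replacement character while consuming only one byte.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text;

// Illustrative input: 'A', one invalid byte, then 'B'.
byte[] utf8Bytes = { 0x41, 0xFF, 0x42 };
List<int> decodedValues = new List<int>();

int offset = 0;
while (offset < utf8Bytes.Length)
{
    OperationStatus status = Rune.DecodeFromUtf8(
        utf8Bytes.AsSpan(offset), out Rune rune, out int bytesConsumed);
    decodedValues.Add(rune.Value);

    Console.WriteLine(
        $"{status}: U+{rune.Value:X4}, bytesConsumed={bytesConsumed}, Utf8SequenceLength={rune.Utf8SequenceLength}");

    // Advance by bytesConsumed. For the invalid byte, rune is U+FFFD, whose
    // Utf8SequenceLength is 3, which would step past the end of the input.
    offset += bytesConsumed;
}
```

Advancing by `rune.Utf8SequenceLength` instead would skip the trailing `'B'` entirely here, and a span-slicing loop written that way would throw once the slice start exceeds the remaining length.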

gewarren · 2022-06-06T19:04:08Z

xml/System.Text/Rune.xml

+> [!CAUTION]
+> When calling this method in a loop and slicing the `source` span, use the returned `charsConsumed` value instead of the returned `result`'s <xref:System.Text.Rune.Utf16SequenceLength> property.
+>
+> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8> for more information.


Suggested change

-> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8> for more information.
+> While these two values will be identical for UTF-16 scenarios, they aren't guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications that initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. For more information, see the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8%2A>.

GrabYourPitchforks · 2022-06-07T15:25:30Z

Thanks all for the feedback. :)

Since it looked like reception was positive, I'll also update the FromUtf16 docs in the next iteration. Should be able to get to that in a few days after knocking out some other higher-priority work.

eiriktsarpalis · 2023-03-20T15:48:51Z

@GrabYourPitchforks I'm going to convert this to a draft for now. Feel free to revisit whenever you can.

Update Utf8 and Rune docs

d2f34dc

GrabYourPitchforks added the area-System.Text.Encoding label Jun 3, 2022

GrabYourPitchforks requested a review from tarekgh June 3, 2022 01:43

GrabYourPitchforks requested a review from a team as a code owner June 3, 2022 01:43

ghost assigned GrabYourPitchforks Jun 3, 2022

tarekgh reviewed Jun 3, 2022

tarekgh approved these changes Jun 3, 2022

gfoidl reviewed Jun 3, 2022

gewarren approved these changes Jun 6, 2022

eiriktsarpalis marked this pull request as draft March 20, 2023 15:48
