Skip to content

Update Utf8 and Rune docs #8128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

GrabYourPitchforks
Copy link
Member

Summary

Significantly fleshes out the Remarks section for the System.Text.Unicode.Utf8 APIs, complete with sample code. Also improves some docs for System.Text.Rune.

Feedback requested: What do folks think of the Utf8.FromUtf16 docs here? I want to update ToUtf16 to largely follow the same pattern. Figured it'd be best to solicit some feedback first before dedicating time to it.

@ghost
Copy link

ghost commented Jun 3, 2022

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

Significantly fleshes out the Remarks section for the System.Text.Unicode.Utf8 APIs, complete with sample code. Also improves some docs for System.Text.Rune.

Feedback requested: What do folks think of the Utf8.FromUtf16 docs here? I want to update ToUtf16 to largely follow the same pattern. Figured it'd be best to solicit some feedback first before dedicating time to it.

Author: GrabYourPitchforks
Assignees: -
Labels:

area-System.Text.Encoding

Milestone: -


If 'replaceInvalidSequences' is `true`, the method never returns <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>. If 'isFinalBlock' is `true`, the method never returns <xref:System.Buffers.OperationStatus.NeedMoreData?displayProperty=nameWithType>.
Span<byte> utf8DestinationBytes = new byte[64];
string utf16InputChars = "¿Cómo estás?"; // "How are you?" in Spanish
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"How are you?" in Spanish

we should have this comment with the voice :-) ...just kidding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll attach a wav file. :)

In all seriousness, if there's any other sample text that's preferred here, I'm open to suggestions. The best sample for this particular scenario is text which is mostly-ASCII but with a handful of non-ASCII characters thrown in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a reasonable sample text. It is good enough demonstrating the idea.

@opbld31
Copy link

opbld31 commented Jun 3, 2022

Docs Build status updates of commit d2f34dc:

⚠️ Validation status: warnings

File Status Preview URL Details
xml/System.Text/Rune.xml ⚠️Warning View Details
xml/System.Text.Unicode/Utf8.xml ✅Succeeded View

xml/System.Text/Rune.xml

  • Line 0, Column 0: [Warning: xref-not-found] Cross reference not found: 'System.Text.Rune.DecodeFromUtf8'.
  • Line 0, Column 0: [Warning: xref-not-found] Cross reference not found: 'System.Text.Rune.DecodeLastFromUtf8'.

For more details, please refer to the build report.

If you see build warnings/errors with permission issues, it might be due to single sign-on (SSO) enabled on Microsoft's GitHub organizations. Please follow instructions here to re-authorize your GitHub account to Docs Build.

Note: Broken links written as relative paths are included in the above build report. For broken links written as absolute paths or external URLs, see the broken link report.

Note: Your PR may contain errors or warnings unrelated to the files you changed. This happens when external dependencies like GitHub alias, Microsoft alias, cross repo links are updated. Please use these instructions to resolve them.

For any questions, please:

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for writing it. It is a valuable addition for the users who are not that familiar with the UTF-8 encoding or not paying much attention to corner cases.

Copy link
Member

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit.

It's a very clear read with good examples 👍🏻
I guess that even people that only know the name UTF-8 will be able to use these APIs correctly (if they read the docs).

Comment on lines +133 to +135
MemoryStream outputStream = new MemoryStream();
string stringToWrite = "Hello world!";
await WriteStringToStreamAsync(stringToWrite, outputStream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the MemoryStream async doesn't make much sense. I understand the idea behind using async here, but should there be a commet indicating this?
Or maybe

Suggested change
MemoryStream outputStream = new MemoryStream();
string stringToWrite = "Hello world!";
await WriteStringToStreamAsync(stringToWrite, outputStream);
Stream outputStream = GetOutputStream(); // get the stream from somewhere
string stringToWrite = "Hello world!";
await WriteStringToStreamAsync(stringToWrite, outputStream);

?

Comment on lines +149 to +150
Debug.Assert(opStatus == OperationStatus.Done || opStatus == OperationStatus.DestinationTooSmall);
Debug.Assert(bytesWritten > 0, "Scratch buffer is too small for loop to make forward progress.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 (perfect for pushing more people towards using Debug.Asserts)

Copy link
Contributor

@gewarren gewarren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a lot - thanks Levi!


async Task WriteStringToStreamAsync(string dataToWrite, Stream outputStream)
{
// For this example we'll use a 1024-byte scratch buffer, but you can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// For this example we'll use a 1024-byte scratch buffer, but you can
// This example uses a 1024-byte scratch buffer, but you can

async Task WriteStringToStreamAsync(string dataToWrite, Stream outputStream)
{
// For this example we'll use a 1024-byte scratch buffer, but you can
// use pooled arrays or a differently-sized buffer depending on your
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// use pooled arrays or a differently-sized buffer depending on your
// use pooled arrays or a different-sized buffer depending on your


In the output, the leading `"AB"` is successfully transcoded into its UTF-8 representation `[ 41 42 ]`. However, the standalone high surrogate char `'\ud800'` cannot be represented in UTF-8, so the replacement character sequence `[ EF BF BD ]` is written to the destination instead. Finally, the trailing `"YZ"` does transcode successfully to `[ 59 5A ]` and is written to the destination.

If you set `replaceInvalidSequences` to `false`, substitution of ill-formed input data not take place. Instead, the `ToUtf8` method will stop processing input immediately upon seeing ill-formed input data and return <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>, as shown in the following example.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If you set `replaceInvalidSequences` to `false`, substitution of ill-formed input data not take place. Instead, the `ToUtf8` method will stop processing input immediately upon seeing ill-formed input data and return <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>, as shown in the following example.
If you set `replaceInvalidSequences` to `false`, ill-formed input data is not substituted. Instead, the `ToUtf8` method stops processing input and returns <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType> as soon as it finds ill-formed data, as shown in the following example.

> [!CAUTION]
> When calling this method in a loop and slicing the `source` span, use the returned `charsConsumed` value instead of the returned `result`'s <xref:System.Text.Rune.Utf16SequenceLength> property.
>
> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8> for more information.
Copy link
Contributor

@gewarren gewarren Jun 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8> for more information.
> While these two values will be identical for UTF-16 scenarios, they aren't guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications that initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. For more information, see the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8%2A>.

> [!CAUTION]
> When calling this method in a loop and slicing the `source` span, use the returned `charsConsumed` value instead of the returned `result`'s <xref:System.Text.Rune.Utf16SequenceLength> property.
>
> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8> for more information.
Copy link
Contributor

@gewarren gewarren Jun 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8> for more information.
> While these two values will be identical for UTF-16 scenarios, they aren't guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications that initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. For more information, see the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8%2A>.

@GrabYourPitchforks
Copy link
Member Author

Thanks all for the feedback. :)

Since it looked like reception was positive, I'll also update the FromUtf16 docs in the next iteration. Should be able to get to that in a few days after knocking out some other higher-priority work.

@eiriktsarpalis
Copy link
Member

@GrabYourPitchforks I'm going to converting this to a draft for now. Feel free to revisit whenever you can.

@eiriktsarpalis eiriktsarpalis marked this pull request as draft March 20, 2023 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants