-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Update Utf8 and Rune docs #8128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-encoding Issue DetailsSummarySignificantly fleshes out the Remarks section for the Feedback requested: What do folks think of the
|
|
||
If 'replaceInvalidSequences' is `true`, the method never returns <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>. If 'isFinalBlock' is `true`, the method never returns <xref:System.Buffers.OperationStatus.NeedMoreData?displayProperty=nameWithType>. | ||
Span<byte> utf8DestinationBytes = new byte[64]; | ||
string utf16InputChars = "¿Cómo estás?"; // "How are you?" in Spanish |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll attach a wav file. :)
In all seriousness, if there's any other sample text that's preferred here, I'm open to suggestions. The best sample for this particular scenario is text which is mostly-ASCII but with a handful of non-ASCII characters thrown in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a reasonable sample text. It is good enough demonstrating the idea.
Docs Build status updates of commit d2f34dc:
|
File | Status | Preview URL | Details |
---|---|---|---|
xml/System.Text/Rune.xml | View | Details | |
xml/System.Text.Unicode/Utf8.xml | ✅Succeeded | View |
xml/System.Text/Rune.xml
- Line 0, Column 0: [Warning: xref-not-found]
Cross reference not found: 'System.Text.Rune.DecodeFromUtf8'.
- Line 0, Column 0: [Warning: xref-not-found]
Cross reference not found: 'System.Text.Rune.DecodeLastFromUtf8'.
For more details, please refer to the build report.
If you see build warnings/errors with permission issues, it might be due to single sign-on (SSO) enabled on Microsoft's GitHub organizations. Please follow instructions here to re-authorize your GitHub account to Docs Build.
Note: Broken links written as relative paths are included in the above build report. For broken links written as absolute paths or external URLs, see the broken link report.
Note: Your PR may contain errors or warnings unrelated to the files you changed. This happens when external dependencies like GitHub alias, Microsoft alias, cross repo links are updated. Please use these instructions to resolve them.
For any questions, please:
- Try searching the docs.microsoft.com contributor guides
- Post your question in the Docs support channel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for writing it. It is a valuable addition for the users who are not that familiar with the UTF-8 encoding or not paying much attention to corner cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit.
It's a very clear read with good examples 👍🏻
I guess that even people that only know the name UTF-8 will be able to use these APIs correctly (if they read the docs).
MemoryStream outputStream = new MemoryStream(); | ||
string stringToWrite = "Hello world!"; | ||
await WriteStringToStreamAsync(stringToWrite, outputStream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the MemoryStream async doesn't make much sense. I understand the idea behind using async here, but should there be a commet indicating this?
Or maybe
MemoryStream outputStream = new MemoryStream(); | |
string stringToWrite = "Hello world!"; | |
await WriteStringToStreamAsync(stringToWrite, outputStream); | |
Stream outputStream = GetOutputStream(); // get the stream from somewhere | |
string stringToWrite = "Hello world!"; | |
await WriteStringToStreamAsync(stringToWrite, outputStream); |
?
Debug.Assert(opStatus == OperationStatus.Done || opStatus == OperationStatus.DestinationTooSmall); | ||
Debug.Assert(bytesWritten > 0, "Scratch buffer is too small for loop to make forward progress."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻 (perfect for pushing more people towards using Debug.Assert
s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a lot - thanks Levi!
|
||
async Task WriteStringToStreamAsync(string dataToWrite, Stream outputStream) | ||
{ | ||
// For this example we'll use a 1024-byte scratch buffer, but you can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// For this example we'll use a 1024-byte scratch buffer, but you can | |
// This example uses a 1024-byte scratch buffer, but you can |
async Task WriteStringToStreamAsync(string dataToWrite, Stream outputStream) | ||
{ | ||
// For this example we'll use a 1024-byte scratch buffer, but you can | ||
// use pooled arrays or a differently-sized buffer depending on your |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// use pooled arrays or a differently-sized buffer depending on your | |
// use pooled arrays or a different-sized buffer depending on your |
|
||
In the output, the leading `"AB"` is successfully transcoded into its UTF-8 representation `[ 41 42 ]`. However, the standalone high surrogate char `'\ud800'` cannot be represented in UTF-8, so the replacement character sequence `[ EF BF BD ]` is written to the destination instead. Finally, the trailing `"YZ"` does transcode successfully to `[ 59 5A ]` and is written to the destination. | ||
|
||
If you set `replaceInvalidSequences` to `false`, substitution of ill-formed input data not take place. Instead, the `ToUtf8` method will stop processing input immediately upon seeing ill-formed input data and return <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>, as shown in the following example. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set `replaceInvalidSequences` to `false`, substitution of ill-formed input data not take place. Instead, the `ToUtf8` method will stop processing input immediately upon seeing ill-formed input data and return <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType>, as shown in the following example. | |
If you set `replaceInvalidSequences` to `false`, ill-formed input data is not substituted. Instead, the `ToUtf8` method stops processing input and returns <xref:System.Buffers.OperationStatus.InvalidData?displayProperty=nameWithType> as soon as it finds ill-formed data, as shown in the following example. |
> [!CAUTION] | ||
> When calling this method in a loop and slicing the `source` span, use the returned `charsConsumed` value instead of the returned `result`'s <xref:System.Text.Rune.Utf16SequenceLength> property. | ||
> | ||
> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8> for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8> for more information. | |
> While these two values will be identical for UTF-16 scenarios, they aren't guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications that initially call `DecodeFromUtf16` but which are refactored to eventually call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. For more information, see the Remarks section in <xref:System.Text.Rune.DecodeFromUtf8%2A>. |
> [!CAUTION] | ||
> When calling this method in a loop and slicing the `source` span, use the returned `charsConsumed` value instead of the returned `result`'s <xref:System.Text.Rune.Utf16SequenceLength> property. | ||
> | ||
> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8> for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications which initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8> for more information. | |
> While these two values will be identical for UTF-16 scenarios, they aren't guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications that initially call `DecodeLastFromUtf16` but which are refactored to eventually call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. For more information, see the Remarks section in <xref:System.Text.Rune.DecodeLastFromUtf8%2A>. |
Thanks all for the feedback. :) Since it looked like reception was positive, I'll also update the FromUtf16 docs in the next iteration. Should be able to get to that in a few days after knocking out some other higher-priority work. |
@GrabYourPitchforks I'm going to converting this to a draft for now. Feel free to revisit whenever you can. |
Summary
Significantly fleshes out the Remarks section for the
System.Text.Unicode.Utf8
APIs, complete with sample code. Also improves some docs forSystem.Text.Rune
.Feedback requested: What do folks think of the
Utf8.FromUtf16
docs here? I want to updateToUtf16
to largely follow the same pattern. Figured it'd be best to solicit some feedback first before dedicating time to it.