Background
This is potentially a duplicate of #352, but that issue has the special wrinkle that stdin really wants to be read a very small number of bytes at a time; the same problem exists for stdout/err, just on a broader scale most of the time.
I'm encountering it in Fabric 2 thusly:
- Integration test does a variant on `cat /usr/share/dict/words`, which on some systems contains UTF-8 encoded text
- Fabric 2's runner subclass reads 1000 bytes at a time from the network
- "Normally", this did not cause a problem in my particular test cases, even though it theoretically could have (as noted in Attempt to solve for proper encoding of multibyte stdin/in_stream #352, this problem seems impossible to naively solve unless you are able to somehow read() the entire stream in one go).
- Intermittently, one of the reads catches the network stack (i.e. either a client or server buffer) low enough that the `recv` call under the hood is only given, say, 384 bytes
- This throws off all of the byte boundaries for the stream from "normal", and for whatever reason the readjustment of all the subsequent (usually still 1000b) 'windows' is highly likely to catch some multibyte character partway through
- Then during the read-and-decode step, both halves of the split character's bytes get replace'd into the Unicode replacement character
- Final result is a word that should be e.g. `Lumiére` but is instead `Lumi��re`
Again, the details there are less important than the core problem of any chunked data transfer potentially encountering this issue.
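For concreteness, here's a tiny standalone reproduction of the boundary effect (plain Python, no Fabric/Invoke code involved):

```python
data = "Lumiére".encode("utf-8")  # b'Lumi\xc3\xa9re' - 'é' is the two bytes 0xC3 0xA9
chunks = [data[:5], data[5:]]     # a chunk boundary that happens to land mid-character

# Decoding each chunk on its own, as a per-read decode would:
print("".join(c.decode("utf-8", errors="replace") for c in chunks))
# -> 'Lumi��re' (each half of the split character becomes U+FFFD)

# Decoding the same bytes in one go is fine:
print(data.decode("utf-8"))       # -> 'Lumiére'
```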
Brainstorm
First, it'd be nice to see how other codebases handle this because quite obviously it's not new to us...
Left on my own, I could see some solutions including:
- An 'easy', half-baked solution is to ensure that we capture (vs mirror) the streams as raw bytes and only decode the captured bytes once command execution is complete (see the first sketch after this list). (Implied is that mirroring simply works as it does now, attempting to decode each chunk individually for immediate display.)
- This should mean that my particular case of `Connection().run('command').stdout.stuff` would "work right"
- Though it doesn't solve the display issue, if one were to be mirroring the same stream.
- That isn't the worst crime ever (especially given its expected rarity - the product of multibyte characters crossed with 'window' size) because I expect all "useful" processing (storing to file, parsing, etc) to be using the captured data and not the emitted/mirrored stdout.
- A more complex but possibly more widely applicable option is to attempt progressive decoding: if a given decode action results in errors/replacements, defer final decoding and storage until you can try it with the sum of that read plus the next one (see the second sketch after this list).
- This would fix my particular issue pretty well...
- It could cause stuttering or delays in mirroring, but practically speaking I can't see it being a common issue.
- Especially as we could adjust this so the next read only asks for as many bytes as the current encoding's character size, i.e. reading another 1-2 bytes instead of 1000 or whatever.
- In situations where the output is seriously garbage and there would always be a lot of encoding errors/replacements, this could end up in a worst-case scenario of not displaying any output until end of session (because we'd keep trying for "the rest of the bytes", thinking on each read that we'd run into the same situation on the opposite end of the chunk).
- This seems pretty unlikely, but hey.
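To make the first option concrete, here's a rough sketch of a read loop that keeps mirroring lossy per-chunk decodes (as today) but stores raw bytes and decodes the captured copy only once at the end. `read_chunk` is a stand-in for whatever actually pulls bytes off the channel, not a real Invoke/Fabric API:

```python
import sys

def run_and_capture(read_chunk, encoding="utf-8"):
    """Mirror each chunk as it arrives (lossy at unlucky boundaries, as today),
    but capture the raw bytes and decode them only once at the end."""
    captured = bytearray()
    while True:
        chunk = read_chunk(1000)  # e.g. the 1000-byte network reads
        if not chunk:
            break
        captured.extend(chunk)    # capture side: raw bytes, decoded later
        # mirror side: immediate display, may show U+FFFD at chunk boundaries
        sys.stdout.write(chunk.decode(encoding, errors="replace"))
    # decoding the whole capture at once can't split a character across reads
    return captured.decode(encoding, errors="replace")
```

And here's a sketch of the progressive-decoding option, using the stdlib's incremental decoder, which already does the "hold back an incomplete trailing sequence until more bytes arrive" bookkeeping; again, `read_chunk` is hypothetical:

```python
import codecs

def mirror_incrementally(read_chunk, encoding="utf-8"):
    """Decode chunks with an incremental decoder: if a chunk ends mid-character,
    the decoder buffers the partial bytes and completes the character when the
    next chunk is fed in, so read boundaries never produce replacement chars."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    pieces = []
    while True:
        chunk = read_chunk(1000)
        if not chunk:
            break
        pieces.append(decoder.decode(chunk))        # may emit a bit less than fed
    pieces.append(decoder.decode(b"", final=True))  # flush any dangling bytes
    return "".join(pieces)
```

Worth noting: the incremental decoder should only hold back a trailing sequence that could still turn out to be a valid character; bytes that are definitively invalid get replaced immediately, which would sidestep the "no output until end of session" worst case described above.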
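(Both sketches are illustrations of the two brainstormed approaches, not proposals for where this would actually live in the runner code.)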