Skip to content

Add OsStr::from_platform_bytes() #27390

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from
Closed

Add OsStr::from_platform_bytes() #27390

wants to merge 5 commits into from

Conversation

remram44
Copy link
Contributor

Rebased #26730

OsString::from_bytes() turns Vec<u8> into OsString.

This adds the same thing, from &[u8] to &OsStr, and renames both of them to from_platform_bytes() to avoid conflict with OsStrExt::from_bytes() already available on UNIX.

I don't know if the last commit is needed (should from_bytes() just be removed without deprecation? It was never stable), let me know.

remram44 added 3 commits July 29, 2015 17:50
Should be from_bytes(), since that the name of the method in OsString;
however this is already used in OsStrExt on Unix. Thus
from_bytes_slice(), which makes it clear why the name is different (this
one operates on slices without copying).
OsString::from_bytes() is now OsString::from_platform_bytes(), to match
OsStr::from_platform_bytes().

The method was still unstable.
@rust-highfive
Copy link
Contributor

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@remram44
Copy link
Contributor Author

r? @alexcrichton

@rust-highfive rust-highfive assigned alexcrichton and unassigned aturon Jul 29, 2015
@Diggsey
Copy link
Contributor

Diggsey commented Jul 29, 2015

This is incorrect on windows: OsString does not use UTF8 encoding on windows, it uses WTF8, a superset of UTF8 designed to have a one-to-one mapping with "potentially ill-formed UTF16", which is the platform encoding of unicode strings on windows. This function can't exist in a cross-platform way in its current form: it should either restrict the input to valid UTF8 across all platforms, or it should put no additional restrictions on the input, by taking a &[u16] on windows and converting losslessly to WTF8 internally (which is the correct way to process strings returned by any windows API functions).

@nagisa
Copy link
Member

nagisa commented Jul 29, 2015

I see absolutely no reason to rename; there’s UFCS for resolving such ambiguities.

As far as conversion goes, you really want to convert from &[u16] rather than &[u8] on Windows, hence why OsStrExt::from_bytes() exists on UNIX, but not Windows. Overall, 👎 from me.

@retep998
Copy link
Member

We already have OsStringExt::from_wide for converting from &[u16] anyway.

I believe that exposing OsString::from_platform_bytes will give people the false idea that they can use it for narrow strings from system functions on Windows, so I'm somewhat against this.

@remram44
Copy link
Contributor Author

Interesting detail: I'm not adding functionality here, just providing the zero-copy &OsStr version with the same semantics as the existing (unstable) owned OsString method.


As for the functionality, I know it's not perfectly right, but it's still closer than parsing UTF-8 then converting String to OsString. Non-UTF-8 strings definitely appear on UNIX systems in day-to-day situations (they are not an error, they are not an oversight; people still use non-UTF-8 locales), whereas non-unicode strings on Windows are (arguably) an error people shouldn't really rely on; I'm ok not supporting it in some methods (like this one) with proper warnings (which I can add if you want).

You have to understand that the alternative is for everyone trying to get an OsString (and thus a Path) from bytes is to write a UNIX and a Windows version themselves. My view is that nobody is going to do that, everyone is going to use either this or go from String (UTF-8) and back (if we remove this feature), and it'll be a lot more broken (see things like getopts which just panics if you feed it non-UTF-8 arguments)! People will not get this right, don't take away the easy tool to get this right just because it misses one impossible-to-cover corner-case on one platform.

@aturon
Copy link
Member

aturon commented Jul 30, 2015

There seems to be some misunderstanding about this PR. As @remram44, this is just filling out a variant of functionality we already have. While it's true that in general on Windows strings may be arbitrary u16 sequences, there are often cases where you are given some byte sequence and need to view it as an OS string -- and on Windows, often the only way to make sense of that is to interpret that byte sequence as UTF8.

@remram44
Copy link
Contributor Author

btw tests don't pass because of the last commit, which should arguably just be dropped (no need to deprecate something that has never been stable?)

@nagisa
Copy link
Member

nagisa commented Jul 30, 2015

@aturon I still don’t see how assumption that any byte sequence on Windows will be UTF-8 is correct; is there a document anywhere that mandates all bytestrings returned by the Windows “platform” to be valid UTF-8?

EDIT: oooh, it returns an Option. Yeah, ok. I have nothing else to say here.

@aturon
Copy link
Member

aturon commented Jul 30, 2015

@nagisa Consider the existing from_bytes function. The point is that if you want to interpret a byte sequence as an OsString on Windows, the only way to do so is to interpret the bytes as UTF8. Or, put differently: if you want to take a &[u8] and produce an OsString (or &OsStr), what can you possibly do if the slice isn't in UTF8 format? The ways that an arbitrary u8 sequence can violate UTF8 are distinct from how an arbitrary u16 sequence can vioalte UTF16.

(The WTF8 encoding is a private representation detail for our OsString API; it's not a representation actually used in the wild. Furthermore, not every u8 sequence is valid WTF8.)

if cfg!(windows) {
str::from_utf8(bytes).ok().map(|s| s.as_ref())
} else {
Some(unsafe { mem::transmute(bytes) })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this use the safe OsStrExt::from_bytes instead?

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jul 30, 2015
@Diggsey
Copy link
Contributor

Diggsey commented Jul 30, 2015

Consider the existing from_bytes function.

I don't think that function should have been added in the first place in this form, but it's a bit late for that.

on Windows, the only way to do so is to interpret the bytes as UTF8. Or, put differently: if you want to take a &[u8] and produce an OsString (or &OsStr), what can you possibly do if the slice isn't in UTF8 format?

On windows, strings are not byte sequences, so any code which blindly takes a byte sequence and converts to an OsString using this method will be broken.

The closest approximation would be when using the non-unicode version of the windows API, in which case, assuming that the bytes are UTF8 is still wrong: the string should first be converted to a wide string using the correct code-page with the functions windows provides (MultiByteToWideChar)., and only then stored in an OsString.

Can you give a single example where this would give the correct behaviour on both platforms?

(The WTF8 encoding is a private representation detail for our OsString API; it's not a representation actually used in the wild. Furthermore, not every u8 sequence is valid WTF8.)

Agreed, which is precisely why there shouldn't be a way to construct an OsString from a byte sequence. There should be platform specific function, which converts to/from a raw platform string on each platform, taking a &[u8] on linux, and a &[u16] on windows, a conversion to/from the built-in string type, and maybe a helper function for using multi-byte strings on windows, implemented as I described above, nothing else.

Non-UTF-8 strings definitely appear on UNIX systems in day-to-day situations (they are not an error, they are not an oversight; people still use non-UTF-8 locales), whereas non-unicode strings on Windows are (arguably) an error people shouldn't really rely on

That's absurd, the situation is the same on both platforms with regard to ill-formed UTF8/UTF16, you're just giving your personal preference for one platform over another.

@retep998
Copy link
Member

OsStr is specifically for working with strings provided by OS APIs. Windows does not use UTF-8 for its APIs. It either uses UCS-2 wide strings, or narrow strings with an arbitrary encoding that is most likely not UTF-8. I don't know of any APIs that only work with narrow strings, everything always has a wide version. Therefore you should never be getting a byte sequence in the first place since all correct code will be using the wide versions to handle unicode properly. If you're getting a byte sequence from somewhere other than the OS, what in the world are you using OsStr for?

@remram44
Copy link
Contributor Author

OsStr is not just for things you get from the OS, obviously: it's also for things you feed to the OS. One very obvious example of this is paths. Why wouldn't you want to read path names from a file or a socket, for example? (file contents are bytes)

@Diggsey
Copy link
Contributor

Diggsey commented Jul 31, 2015

Why wouldn't you want to read path names from a file or a socket, for example? (file contents are bytes)

In which case you must know how those path names are encoded. The file format or protocol will dictate what encoding should be used - this function misrepresents itself as being valid regardless of such context. Furthermore, it has different behaviour on each platform, and so is useless for data from files or sockets, which must read the same everywhere.

@retep998
Copy link
Member

@remram44 Also, if you're reading/writing bytes instead of UTF-8, while that allows for *nix paths which might not be unicode, it completely disregards Windows paths which might not be unicode. Blatant disregard for non-unicode paths on one side is not going to make your code cross-platform.

@alexcrichton
Copy link
Member

I personally would like to see these sorts of methods added and stabilized (in addition to the to_bytes, to_cstring, and perhaps a future to_str methods). I personally find myself commonly writing functions that converts between a list of bytes and an OS string, for example:

  • Cargo does this because paths are encoded into file names.
  • git2-rs does this because the libgit2 API does not expose an OS string, but all paths going in and out are *const c_char

I feel that the use case for "this path needs to be representable as bytes" really does come up from time to time. I'm quite thankful that I'm forced to acknowledge somewhere along the line that the path probably needs to be unicode on some platforms, but I don't think that we should never adds these convenience methods because they're seen as "wrong to call". In each of the above situations there are a few courses of action I could take:

  • For putting paths in files, I could create some form of encoding scheme which isn't lossy on Windows, but this doesn't exist as far as I know and it's much harder to do than just requiring unicode on Windows.
  • For libgit2 I don't believe there's any other recourse. The library wants a byte string there has to be a way to convert a path to that.

I totally agree that we should have proper names for all these to indicate that they're lossy, and that they should be very clearly documented with why they're returning an Option and how it can fail on some platforms. Despite the hazards of them, though, I do think the standard library should continue to provide them. These methods will be written in downstream crates whether the standard library contains them or not, so I'm not sure there's much benefit to forcing the implementation to be rewritten downstream (perhaps not as robustly in some cases as well).

@Diggsey
Copy link
Contributor

Diggsey commented Jul 31, 2015

@alexcrichton Both of those cases you describe are cases where you shouldn't be using this function - the problem is not that its lossy, but that it has entirely different behaviour on the two platforms.

For example, by using this implementation for cargo, it's now possible for crates to be published from linux inadvertantly using non-UTF8 filenames, which are then completely inaccessible to windows users.

The best option IMO, when you need paths to be modestly portable, is to just pick an encoding such as UTF8, and enforce it everywhere, requiring that paths be well-formed strings. That at least allows problems to be caught early.

The only exception, which your git example may fall under, is if you have to deal with old software which doesn't cater for strings not being arbitrary byte sequences: in this specific case, the best option is still not to just fail on windows when presented with invalid UTF8 - instead you have some (maybe lossy) mapping from byte sequences to valid windows filenames. For reference, MultiByteToWideChar will, by default, strip out invalid byte sequences rather than failing altogether, and this seems to be how TortoiseGit maps paths. Normally you'd want such a mapping to be customizable by the user, (eg. by having a config file tell you what encoding the filenames are expected to be in).

How exactly would you name this function such that its behaviour would not be surpising? from_utf8_on_windows_but_unencoded_bytes_elsewhere()? UTF8 is not even native to windows - the entire OS works in what is essentially UCS-2. All the windows API functions which take raw byte sequences in place of strings expect them to be encoded using the system codepage, not UTF8, and all C/C++ programs written in the recommended way by using TCHARs for strings will also use the system codepage, not UTF8 when compiled with multi-byte character set selected.

@alexcrichton
Copy link
Member

Both of those cases you describe are cases where you shouldn't be using this function

I personally don't quite agree with this. For example why should Cargo be hamstrung on Unix because it also supports Windows. Like the standard library, Cargo should work to the fullest extent on all platforms, and you just have to opt-in to cross platform support yourself. The only difference between Cargo and the standard library is that with Cargo you don't ever declare "allow non-unicode filenames".

The only exception, which your git example may fall under, is if you have to deal with old software which doesn't cater for strings not being arbitrary byte sequences

I don't really consider this the "only exception". It's quite ubiquitous to have filenames stored as an array of bytes. For example I just remembered the other day that the tarball archive format stores paths as bytes and sure enough I have a function to do the necessary platform logic.

the best option is still not to just fail on windows when presented with invalid UTF8

I agree! This is what current implementations do, but it would be much nicer if the origin of this error (e.g. the "this isn't utf-8" statement) came from the standard library instead of each crate that needs this logic.

How exactly would you name this function such that its behaviour would not be surpising?

It sounds like we should focus more on whether this function should exist in the standard library first, because that's what seems in question here. I am not personally a huge fan of from_platform_bytes but lack of a name right now does not mean that we shouldn't land this in any form.

@nagisa
Copy link
Member

nagisa commented Aug 3, 2015

tarball archive format stores paths as bytes and sure enough I have a function to do the necessary platform logic.

Looks like the “platform” part of from_platform_bytes in your example is very unix-centric (sure, tar is a unix-centric format) and towards the OS side you try to convert to the proper representation (and fail; only on windows, for some inputs). str::from_utf8 at least has the same semantics on all platforms, so maybe just go to OsStr through str instead?

At least you won’t have to pretend that any corner case is completely fine as long as the case is noted in the documentation.

@remram44
Copy link
Contributor Author

remram44 commented Aug 3, 2015

str::from_utf8 at least has the same semantics on all platforms, so maybe just go to OsStr through str instead?

That does the same thing except it chokes on valid filenames on UNIX. Why would anyone do that?

@nagisa
Copy link
Member

nagisa commented Aug 3, 2015

That does the same thing except it chokes on valid filenames on UNIX. Why would anyone do that?

That has the same semantics on all platforms.

I might get a little less opposed to this if there’s another already stable method/function that panics for same input on some platforms, but not others, and the suggested behaviour is as (or more) misleading (no such thing as “OS” bytes on windows).

Otherwise, I honestly care little about windows, so am not inclined to waste my energy arguing over this anymore, since I’m sure you all already got my point.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 3, 2015

Why should Cargo be hamstrung on Unix because it also supports Windows.

Why should Cargo be hamstrung on Windows because it also supports Unix?

Like the standard library, Cargo should work to the fullest extent on all platforms, and you just have to opt-in to cross platform support yourself. The only difference between Cargo and the standard library is that with Cargo you don't ever declare "allow non-unicode filenames".

With the standard libary, portability is the default. If you want to use platform specific features you can, but you have to opt-in explicitly by using std::os::linux or similar. That seems like a very good precedent to follow.

It's quite ubiquitous to have filenames stored as an array of bytes.

Among tools developed solely for unix-like platforms, yes. Unsurprisingly, few of those see extensive use on windows, outside of compatibility layers like cygwin and mingw.

It would be much nicer if the origin of this error (e.g. the "this isn't utf-8" statement) came from the standard library instead of each crate that needs this logic.

It would be much nicer if it didn't error at all. Any API which can error on some platforms and not others will suffer from programmers not bothering to handle the error, forwarding it to their users instead, as is demonstrated by your tar example, which will give up completely if it encounters a non-unicode filename!

Personally, I think this functionality should exist in a "unix compatibility" module, maybe in an external crate, and should perform a best-effort, potentially lossy mapping from byte arrays to OsStrings, which will always succeed. Invalid UTF8 sequences could be removed or replaced with unicode substitution characters on windows.

One could imagine a similar "windows compatibility" module, which takes a &[u16] and will perform a best-effort mapping to OsStrings. There could even be a PortableOsString type which remembers the source encoding, can be serialized to a file, and will perform a best-effort conversion to an OsString as-needed.

@alexcrichton
Copy link
Member

@Diggsey I think we're talking in circles, the point I'm trying to get across is that there are situations where you have no choice but to convert an array of bytes to a path. This typically happens because you're doing something like:

  • Using a C library that wants or gives you a *const c_char
  • Working with a predefined file format that uses bytes

The standard library can't just pretend these situations don't exist, and it would be quite nice if there were convenience functions for handling these cases as opposed to every crate having to reinvent the same logic. On unix this bytes-to-path conversion is fine, but on Windows there just fundamentally isn't that many courses of action to take. The only real ubiquitous one which the standard library can take is attempt to interpret the data as UTF-8. This of course means that some input will be rejected, but that's just a fact that can't be worked around.

These convenience functions being added in this PR will always be well documented in terms of what they mean for portability. They seem desired enough to also warrant inclusion into the standard library itself as well.


@nagisa

Looks like the “platform” part of from_platform_bytes in your example is very unix-centric (sure, tar is a unix-centric format) and towards the OS side you try to convert to the proper representation (and fail; only on windows, for some inputs). str::from_utf8 at least has the same semantics on all platforms, so maybe just go to OsStr through str instead?

I think @remram44 answered this as well, but there's just nothing else that can be done. This library (and I suspect many others) want to be as lossless as possible, so it doesn't always make sense to require that filenames (even on unix) be valid unicode when there's an otherwise valid interpretation.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 3, 2015

I think we're talking in circles, the point I'm trying to get across is that there are situations where you have no choice but to convert an array of bytes to a path.

I'm not disagreeing there, I'm simply saying that such a function should absolutely not take the form it does in the existing (thankfully unstable) from_bytes implementation, and that this implementation is inherently non-portable.

I think @remram44 answered this as well, but there's just nothing else that can be done.

There's alot else that can be done, as I've been trying to say. Why do you insist on an implementation that will (and already has, by looking at your own examples) lead to programs being broken on non-unix platforms? Especially when it is incompatible with how existing software like git for windows and 7-zip solve the exact same problem.

@remram44
Copy link
Contributor Author

remram44 commented Aug 3, 2015

There's alot else that can be done, as I've been trying to say. Why do you insist on an implementation that will (and already has, by looking at your own examples) lead to programs being broken on non-unix platforms? Especially when it is incompatible with how existing software like git for windows and 7-zip solve the exact same problem.

My opinion is that, on Windows, the goal was for filenames to be unicode, and on UNIX, bytes. The fact that Windows accepts invalid unicode data is a bug, and while it's good to support broken strings when it is possible, having them be handled as second-class citizens is fine. That is why I am ok with these functions choking on this (non-silently, and in a recoverable way), if the doc explains clearly why/when it happens.

How do "software like git or 7-zip solve the exact same problem"? Apparently MSYS-Git decodes UTF-8 if valid, else defaults to some other fixed-length encoding. Might be better, but it's still broken (try checking this out); how much complexity do we want to put in here?

I'm ok with touching up the Windows implementation (though it wouldn't be able to return references anymore), but of course if from_platform_bytes(to_playform_bytes(.)) doesn't roundtrip, it's in no way safer for the user.

Let me know what you think. Of course, all this has nothing to do with the PR.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 3, 2015

My opinion is that, on Windows, the goal was for filenames to be unicode, and on UNIX, bytes. The fact that Windows accepts invalid unicode data is a bug.

I still disagree with that sentiment, but discussing that is somewhat pointless.

it's still broken (try checking this out);

That behaviour is far better: equality between paths has always been ill-defined, even on a single platform. You can have different filesystems mounted, network filesystems, etc. some of which may be case-insensitive, or have other peculiarities. The fact that two very similar paths map to the same thing under some circumstances can happen on any platform, is not particularly surpising, and there's no way to avoid it, save for not using such similar paths.

However, the most important thing, is that I can check out that repository. I can even commit changes to either file, and although it may be difficult to modify both files in a single commit, I could probably tweak some config files and get it to work. If git relied on a function like this, I wouldn't even be able check out the repository!

if from_platform_bytes(to_playform_bytes(.)) doesn't roundtrip, it's in no way safer for the user.

This has nothing to do with "platform bytes": we're not talking about how byte sequences should be decoded on the current platform (which would be according the system codepage on windows), so let's call it what it actually is: from_ill_formed_utf8_lossy/to_ill_formed_utf8_lossy (utf16 versions would also be useful).

Valid unicode will round-trip on all platforms. Invalid unicode is not guaranteed to round-trip, but may (and will on linux). It doesn't need a complicated algorithm: replacing invalid bytes with the unicode replacement character is easy enough and works well (it's also how github renders paths with invalid utf8).

Being able to round-trip invalid UTF8 is not a guarantee that's necessary to get a far superior user experience, as your git example shows.

Of course, all this has nothing to do with the PR.

When I started this discussion I wasn't aware of the existing from_bytes method - we can continue this on the forums instead if you want. However, it would probably be a bad idea to add these new methods, and then soon after remove them in favour of something else, even if they don't get stabilized.

@remram44
Copy link
Contributor Author

remram44 commented Aug 4, 2015

You're starting to convince me here. Of course, we should consider whether this only makes sense for paths, and move it there if it does.

@alexcrichton
Copy link
Member

this implementation is inherently non-portable

True, but the concept of an array of bytes representing a path is also just inherently non-portable, so your hands are tied in situations like this.

There's alot else that can be done, as I've been trying to say. Why do you insist on an implementation that will (and already has, by looking at your own examples) lead to programs being broken on non-unix platforms? Especially when it is incompatible with how existing software like git for windows and 7-zip solve the exact same problem.

That's the problem with a non-portable encoding, there's no correct interpretation of a non-utf8 array of bytes on Windows. Each program can end up doing something subtly different and there's nothing guaranteeing that everyone agrees. Are you saying there's a "well accepted standard" for what do do in this case, though?


Overall I totally agree that we should try to work with as many paths as possible on all platforms, but attempting to work with non-utf8 arrays on Windows is just a hard operation to do. Applications surely need a pretty heft amount of support behind them to support this, and it's something that could easily exist as an external crates.io library. The standard library, however, can always provide a convenience when this isn't possible.

Correctly saying that a path can't be handled on Windows is in my opinion better than incorrectly attempting to translate it, later leading to conflicts and/or confusion which can corrupt unrelated files.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 4, 2015

Each program can end up doing something subtly different and there's nothing guaranteeing that everyone agrees.

At least if this is in the standard library, then all rust code will agree.

Are you saying there's a "well accepted standard" for what do do in this case, though?

Replacing invalid bytes with the unicode replacement character is a standard way to interpret ill-formed utf8, and one of the main reasons for the inclusion of that character within unicode.

Correctly saying that a path can't be handled on Windows is in my opinion better than incorrectly attempting to translate it, later leading to conflicts and/or confusion which can corrupt unrelated files.

Unix is normally case sensitive, so you can have two files, 'A' and 'a' which on windows will refer to the same thing. Are you suggesting that it should also return an error on windows unless all characters are in lower-case? (good luck figuring out which case-conversion algorithm microsoft used!)

Returning an error is actively bad because it implies that this function has any sort of guarantees at all about uniqueness and equality of paths, which it doesn't, even on linux.

@alexcrichton
Copy link
Member

@Diggsey I think that this is going off into the weeds so let's try to reel this back in. I don't understand where you're coming from about path equality, that seems pretty orthogonal to this helper function in the standard library. The purpose here is to provide a function which can return an interpretation of an array of bytes as an OsStr in a lossless fashion. There are some cases where that cannot be done, so it returns None.

We already have methods of dealing with a lossy conversion, such as replacing with the unicode replacement character, via to_string_lossy. This helper method is simply for situations where this lossiness is unacceptable.

Whether or not a lossy or lossless conversion should be done is up to the application at hand, for example one may wish to do something like:

let bytes: &[u8] = ...;
let os_string = OsString::from_platform_bytes(bytes).unwrap_or_else(|| {
    OsString::from(String::from_utf8_lossy(bytes))
});

What you seem to be advocating is that this function should not exist in the standard library because it is infallible on Unix and may fail on Windows, and this fallibility may lead to code unnecessarily failing on Windows where it would otherwise succeed on Unix. I think that it is certainly admirable to get as many applications working on all platforms as robustly as possible, but it shouldn't come at the cost of forcing all code to be less ergonomic or reimplemented conversions.

There are some situations where you simply must do a conversion of bytes to a path, and the application at hand is the one that gets to decide what to do in the erroneous cases. Just because there are erroneous cases doesn't mean that we should avoid any sort of support in the standard library.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 5, 2015

I don't understand where you're coming from about path equality

You previously objected to using a lossy conversion on the basis that it may cause two paths to refer to the same file, possibly resulting in one overwriting the other. My point was that this can happen anyway, even on linux with lossless conversions: it's not something you can realistically prevent.

What you seem to be advocating is that this function should not exist in the standard library because it is infallible on Unix and may fail on Windows

Yes. More specifically, it has additional requirements on the input for windows.

it shouldn't come at the cost of forcing all code to be less ergonomic or reimplemented conversions.

I think in many situations it is worth trading off ergonomics for portability. However, all of the uses of this function which we've discussed would be better served by using the potentially lossy version of this function. If that is provided as the de-facto way to convert bytes to a path, then they all get a consistent, sensible fallback mechanism. The alternative means that many users will not provide a fallback at all, and the rest will likely implement it inconsistently.

@retep998
Copy link
Member

Using a C library that wants or gives you a *const c_char

In this case you're not working with the OS so OsString isn't appropriate here. Plus if the C library is doing OS calls with narrow strings, then it is completely not unicode whatsoever, so attempting to interpret it as UTF-8 will fail flat on its face. For this case, where a C library is calling narrow OS functions, perhaps we should provide a method that calls to MultiByteToWideChar to convert from the current code page to unicode (and also include a big warning that you really should be using wide versions of Windows functions).

Working with a predefined file format that uses bytes

What sort of file format doesn't specify the encoding on text? If no encoding is specified, then there is absolutely no way to interpret it correctly. At best you can guess. Also, again, this isn't working with the OS so OsString still isn't relevant.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 10, 2015

@retep998 For most cases you're absolutely right on both counts. The one exception is when dealing with filesystem paths. Rust's PathBuf uses an OsString to store these, but some C libraries and file formats use unencoded bytes to match how paths are stored on linux.

On linux, it's desirable to be able to handle ill-formed utf8 in these paths. The question is how to handle conversion from bytes to path in a portable way, given the existence of non-linux platforms. My preferred option is to only provide a potentially lossy "bytes -> path" conversion which will always succeed, rather than a reversible "bytes -> path" conversion which may fail, but only when not on linux. The justification being that programs which rely on the conversion function being perfectly injective are inherently broken anyway (case insensitivity ftw!), and that it will result in a massively better compatibility story on windows.

Furthermore, the name of the function should at the very least include "utf8" and not "platform", and should exist for Path/PathBuf not OsString/OsStr.

@alexcrichton
Copy link
Member

@Diggsey

You previously objected to using a lossy conversion on the basis that it may cause two paths to refer to the same file, possibly resulting in one overwriting the other. My point was that this can happen anyway, even on linux with lossless conversions: it's not something you can realistically prevent.

I believe that issuing syscalls vs the interpretations of those issuings are orthogonal concepts. If we had our own lossy translation then we'd be issuing syscalls with the exact same set of parameters, whereas if various encodings were later treated the same then that's up to the filesystem and out of our control 100% of the time. It's the our job to represent data as faithfully as possible to the point where it hits the filesystem to enable all possible use cases.

I think in many situations it is worth trading off ergonomics for portability.

I feel that this is not the tradeoff in play here, however. Omitting functions like this are forcing applications on Unix to either:

  • Use unwrap or return errors more eagerly than they should.
  • Re-implement the same functionality described here, perhaps incorrectly in some cases.

Adding these functions means that strictly more cases are seamlessly handled across all platforms. It's true that some paths will be accepted on Unix where they'll be rejected on Windows, but it's not the case that every application has this kind of behavior and would run into it.

However, all of the uses of this function which we've discussed would be better served by using the potentially lossy version of this function.

This is possible to do today with the to_string_lossy function, this PR is pretty orthogonal to this as it's providing a function for those who want to deal with a non-lossy version.


@retep998

Using a C library that wants or gives you a *const c_char

In this case you're not working with the OS so OsString isn't appropriate here.

I don't think it's correct to make a blanket statement like this. These libraries exist, and they work on Windows, they've made the decision to take *const c_char, and most of the time this simply cannot change. We can't just turn a blind eye and present an unergonomic interface to all of these libraries, idiomatic Rust bindings will want to interpret these values as OsStr or some variant of it.

As a result there aren't many choices when binding these libraries. You have some bytes, and it's intended to be used as a Path in Rust, so you're basically forced to expose it somehow. I do think that you'd of course also want to expose the raw bytes for those extra-robust use cases.

The difference between OS strings on Windows and Unix should not plauge all of Rust and constantly make any OS interaction as unergonomic as possible. We've selected a pretty nice solution in the standard library, but that does not constitute the entire world. Working with other libraries on all platforms is also a strong point of Rust!


@Diggsey

Furthermore, the name of the function should at the very least include "utf8" and not "platform", and should exist for Path/PathBuf not OsString/OsStr.

I disagree with this because that's just the meaning on Windows, on Unix there's no utf-8 guarantee needed at all.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 10, 2015

I feel that this is not the tradeoff in play here, however. Omitting functions like this are forcing applications on Unix to either:

  • Use unwrap or return errors more eagerly than they should.
  • Re-implement the same functionality described here, perhaps incorrectly in some cases.

I don't follow? Surely those are both arguments against a version returning an Option! A version which always succeeds cannot result in more calls to unwrap, and there's no need to reimplement the functionality: the lossy version behaves exactly the same on linux as the non-lossy version does, except you don't need to unwrap it! The difference is that they don't have to reimplement the lossy fallback themselves, which as you point out, they are likely to omit or do incorrectly.

This is possible to do today with the to_string_lossy function, this PR is pretty orthogonal to this as it's providing a function for those who want to deal with a non-lossy version.

to_string_lossy is not the same, and although the function I'm advocating can be implemented in terms of it, that's no guarantee that people will do so correctly or consistently. (to clarify, the function I am advocating is equivalent to 'to_string_lossy' on windows, and a no-op on linux).

Also, a non-lossy version only makes sense on linux: there is no sensible non-lossy conversion from bytes to strings on windows, so just leave it in the platform specific part of the standard library!

I disagree with this because that's just the meaning on Windows, on Unix there's no utf-8 guarantee needed at all.

NO, it's NOT the meaning on windows. UTF8 has absolutely no connection to windows AT ALL. The only reason it even remotely makes sense to try decoding the paths as UTF8 is because they are coming from linux, where utf8 is the norm. This is why it should not include "platform" in the name.

Now, for why it should include UTF8 in the name: in rust, paths do have an encoding specified: potentially ill-formed UTF8. This is why, when you convert from a Path/OsString to a String, you don't need to specify an encoding (whereas to go from a Vec<u8> to a String, you do).

You're confusing the concepts of data which has an encoding, and data where the encoding is enforced. Paths are UTF8, it's just that they're not guaranteed to be valid UTF8: whenever you display a path in rust, it will interpret it as UTF8, replacing invalid bytes with escape sequences.

@aturon aturon added the I-needs-decision Issue: In need of a decision. label Aug 10, 2015
@aturon
Copy link
Member

aturon commented Aug 11, 2015

As explained in this comment, this issue has reached the point that an RFC is warranted to take any action around these APIs (including stabilization of the existing API). As such, I'm going to close out this PR, and hope everyone will continue discussion in a new RFC thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants