Add OsStr::from_platform_bytes() #27390
Conversation
Should be from_bytes(), since that's the name of the method in OsString; however, this is already used in OsStrExt on Unix. Thus from_bytes_slice(), which makes it clear why the name is different (this one operates on slices without copying).
OsString::from_bytes() is now OsString::from_platform_bytes(), to match OsStr::from_platform_bytes(). The method was still unstable.
r? @aturon (rust_highfive has picked a reviewer for you, use r? to override)
This is incorrect on windows: OsString does not use the UTF8 encoding on windows; it uses WTF8, a superset of UTF8 designed to have a one-to-one mapping with "potentially ill-formed UTF16", which is the platform encoding of unicode strings on windows. This function can't exist in a cross-platform way in its current form: it should either restrict the input to valid UTF8 across all platforms, or it should put no additional restrictions on the input, by taking a
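For readers less familiar with "potentially ill-formed UTF16", a minimal self-contained sketch (not from this PR) of why a lone surrogate is representable in a Windows file name but has no UTF8 form:

```rust
fn main() {
    // "foo" followed by a lone surrogate: allowed in Windows file names,
    // but not valid UTF-16, so it has no UTF-8 representation at all.
    let ill_formed: [u16; 4] = [0x0066, 0x006F, 0x006F, 0xD800];
    let last = std::char::decode_utf16(ill_formed.iter().copied())
        .last()
        .unwrap();
    assert!(last.is_err()); // the lone surrogate cannot be decoded into a char
}
```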
I see absolutely no reason to rename; there’s UFCS for resolving such ambiguities. As far as conversion goes, you really want to convert from
We already have I believe that exposing
Interesting detail: I'm not adding functionality here, just providing the zero-copy &OsStr version with the same semantics as the existing (unstable) owned OsString method.

As for the functionality, I know it's not perfectly right, but it's still closer than parsing UTF-8 then converting String to OsString. Non-UTF-8 strings definitely appear on UNIX systems in day-to-day situations (they are not an error, they are not an oversight; people still use non-UTF-8 locales), whereas non-unicode strings on Windows are (arguably) an error people shouldn't really rely on; I'm ok not supporting it in some methods (like this one) with proper warnings (which I can add if you want).

You have to understand that the alternative is for everyone trying to get an OsString (and thus a Path) from bytes to write a UNIX and a Windows version themselves. My view is that nobody is going to do that; everyone is going to use either this or go from String (UTF-8) and back (if we remove this feature), and it'll be a lot more broken (see things like getopts, which just panics if you feed it non-UTF-8 arguments)! People will not get this right; don't take away the easy tool to get this right just because it misses one impossible-to-cover corner case on one platform.
There seems to be some misunderstanding about this PR. As @remram44 says, this is just filling out a variant of functionality we already have. While it's true that in general on Windows strings may be arbitrary
btw tests don't pass because of the last commit, which should arguably just be dropped (no need to deprecate something that has never been stable?) |
@aturon I still don’t see how the assumption that any byte sequence on Windows will be UTF-8 is correct; is there a document anywhere that mandates that all bytestrings returned by the Windows “platform” be valid UTF-8? EDIT: oooh, it returns an Option. Yeah, ok. I have nothing else to say here.
@nagisa Consider the existing (The WTF8 encoding is a private representation detail for our
```rust
if cfg!(windows) {
    str::from_utf8(bytes).ok().map(|s| s.as_ref())
} else {
    Some(unsafe { mem::transmute(bytes) })
}
```
Can this use the safe OsStrExt::from_bytes instead?
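For reference, a minimal sketch (assuming a Unix target) of the safe conversion the comment is referring to:

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait

// On Unix, any byte slice can be viewed as an &OsStr without copying or unsafe code.
fn view_as_os_str(bytes: &[u8]) -> &OsStr {
    OsStr::from_bytes(bytes)
}
```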
I don't think that function should have been added in the first place in this form, but it's a bit late for that.
On windows, strings are not byte sequences, so any code which blindly takes a byte sequence and converts to an OsString using this method will be broken. The closest approximation would be when using the non-unicode version of the windows API, in which case, assuming that the bytes are UTF8 is still wrong: the string should first be converted to a wide string using the correct code-page with the functions windows provides.

Can you give a single example where this would give the correct behaviour on both platforms?
Agreed, which is precisely why there shouldn't be a way to construct an OsString from a byte sequence. There should be a platform-specific function, which converts to/from a raw platform string on each platform, taking a
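For context, conversions of this platform-specific shape already exist as stable extension traits; a brief sketch (not part of this PR's changes):

```rust
use std::ffi::OsString;

// Unix: an OS string is an arbitrary byte sequence.
#[cfg(unix)]
fn from_native(raw: Vec<u8>) -> OsString {
    use std::os::unix::ffi::OsStringExt;
    OsString::from_vec(raw)
}

// Windows: an OS string is a potentially ill-formed UTF-16 sequence.
#[cfg(windows)]
fn from_native(raw: &[u16]) -> OsString {
    use std::os::windows::ffi::OsStringExt;
    OsString::from_wide(raw)
}
```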
That's absurd: the situation is the same on both platforms with regard to ill-formed UTF8/UTF16; you're just giving your personal preference for one platform over another.
In which case you must know how those path names are encoded. The file format or protocol will dictate what encoding should be used - this function misrepresents itself as being valid regardless of such context. Furthermore, it has different behaviour on each platform, and so is useless for data from files or sockets, which must read the same everywhere.
@remram44 Also, if you're reading/writing bytes instead of UTF-8, while that allows for *nix paths which might not be unicode, it completely disregards Windows paths which might not be unicode. Blatant disregard for non-unicode paths on one side is not going to make your code cross-platform.
I personally would like to see these sorts of methods added and stabilized (in addition to the
I feel that the use case for "this path needs to be representable as bytes" really does come up from time to time. I'm quite thankful that I'm forced to acknowledge somewhere along the line that the path probably needs to be unicode on some platforms, but I don't think that we should never add these convenience methods because they're seen as "wrong to call". In each of the above situations there are a few courses of action I could take:
I totally agree that we should have proper names for all these to indicate that they're lossy, and that they should be very clearly documented with why they're returning an
@alexcrichton Both of those cases you describe are cases where you shouldn't be using this function - the problem is not that it's lossy, but that it has entirely different behaviour on the two platforms. For example, by using this implementation for cargo, it's now possible for crates to be published from linux inadvertently using non-UTF8 filenames, which are then completely inaccessible to windows users.

The best option IMO, when you need paths to be modestly portable, is to just pick an encoding such as UTF8, and enforce it everywhere, requiring that paths be well-formed strings. That at least allows problems to be caught early. The only exception, which your

How exactly would you name this function such that its behaviour would not be surprising?
I personally don't quite agree with this. For example, why should Cargo be hamstrung on Unix because it also supports Windows? Like the standard library, Cargo should work to the fullest extent on all platforms, and you just have to opt in to cross-platform support yourself. The only difference between Cargo and the standard library is that with Cargo you don't ever declare "allow non-unicode filenames".
I don't really consider this the "only exception". It's quite ubiquitous to have filenames stored as an array of bytes. For example, I just remembered the other day that the tarball archive format stores paths as bytes, and sure enough I have a function to do the necessary platform logic.
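A hedged sketch of that kind of per-platform helper (the name bytes_to_path is hypothetical; the non-Unix branch shows the restrict-to-UTF8 behaviour this PR proposes):

```rust
use std::ffi::OsStr;
use std::path::Path;

// Unix: archive entry names are usable as paths exactly as stored.
#[cfg(unix)]
fn bytes_to_path(bytes: &[u8]) -> Option<&Path> {
    use std::os::unix::ffi::OsStrExt;
    Some(Path::new(OsStr::from_bytes(bytes)))
}

// Elsewhere: only accept entry names that are valid UTF-8.
#[cfg(not(unix))]
fn bytes_to_path(bytes: &[u8]) -> Option<&Path> {
    std::str::from_utf8(bytes).ok().map(Path::new)
}
```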
I agree! This is what current implementations do, but it would be much nicer if the origin of this error (e.g. the "this isn't utf-8" statement) came from the standard library instead of each crate that needs this logic.
It sounds like we should focus more on whether this function should exist in the standard library first, because that's what seems in question here. I am not personally a huge fan of
Looks like the “platform” part of

At least you won’t have to pretend that any corner case is completely fine as long as the case is noted in the documentation.
That does the same thing except it chokes on valid filenames on UNIX. Why would anyone do that?
That has the same semantics on all platforms. I might get a little less opposed to this if there’s another already stable method/function that panics for the same input on some platforms, but not others, and the suggested behaviour is as (or more) misleading (no such thing as “OS” bytes on windows). Otherwise, I honestly care little about windows, so am not inclined to waste my energy arguing over this anymore, since I’m sure you all already got my point.
Why should Cargo be hamstrung on Windows because it also supports Unix?
With the standard library, portability is the default. If you want to use platform-specific features you can, but you have to opt in explicitly by using
Among tools developed solely for unix-like platforms, yes. Unsurprisingly, few of those see extensive use on windows, outside of compatibility layers like cygwin and mingw.
It would be much nicer if it didn't error at all. Any API which can error on some platforms and not others will suffer from programmers not bothering to handle the error, forwarding it to their users instead, as is demonstrated by your

Personally, I think this functionality should exist in a "unix compatibility" module, maybe in an external crate, and should perform a best-effort, potentially lossy mapping from byte arrays to OsStrings, which will always succeed. Invalid UTF8 sequences could be removed or replaced with unicode substitution characters on windows. One could imagine a similar "windows compatibility" module, which takes a
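A minimal sketch of the best-effort mapping proposed here (bytes_to_os_string_lossy is a hypothetical name; invalid UTF8 becomes U+FFFD on non-Unix platforms):

```rust
use std::ffi::OsString;

// Unix: keep the bytes exactly as given (always lossless there).
#[cfg(unix)]
fn bytes_to_os_string_lossy(bytes: &[u8]) -> OsString {
    use std::os::unix::ffi::OsStringExt;
    OsString::from_vec(bytes.to_vec())
}

// Other platforms: substitute the replacement character for invalid UTF-8.
#[cfg(not(unix))]
fn bytes_to_os_string_lossy(bytes: &[u8]) -> OsString {
    OsString::from(String::from_utf8_lossy(bytes).into_owned())
}
```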
@Diggsey I think we're talking in circles; the point I'm trying to get across is that there are situations where you have no choice but to convert an array of bytes to a path. This typically happens because you're doing something like:
The standard library can't just pretend these situations don't exist, and it would be quite nice if there were convenience functions for handling these cases as opposed to every crate having to reinvent the same logic. On unix this bytes-to-path conversion is fine, but on Windows there just fundamentally aren't that many courses of action to take. The only real ubiquitous one which the standard library can take is to attempt to interpret the data as UTF-8. This of course means that some input will be rejected, but that's just a fact that can't be worked around. These convenience functions being added in this PR will always be well documented in terms of what they mean for portability. They seem desired enough to warrant inclusion into the standard library itself as well.
I think @remram44 answered this as well, but there's just nothing else that can be done. This library (and I suspect many others) wants to be as lossless as possible, so it doesn't always make sense to require that filenames (even on unix) be valid unicode when there's an otherwise valid interpretation.
I'm not disagreeing there; I'm simply saying that such a function should absolutely not take the form it does in the existing (thankfully unstable) from_bytes implementation, and that this implementation is inherently non-portable.
There's a lot else that can be done, as I've been trying to say. Why do you insist on an implementation that will (and already has, by looking at your own examples) lead to programs being broken on non-unix platforms? Especially when it is incompatible with how existing software like git for windows and 7-zip solves the exact same problem.
My opinion is that, on Windows, the goal was for filenames to be unicode, and on UNIX, bytes. The fact that Windows accepts invalid unicode data is a bug, and while it's good to support broken strings when it is possible, having them be handled as second-class citizens is fine. That is why I am ok with these functions choking on this (non-silently, and in a recoverable way), if the doc explains clearly why/when it happens.

How do "software like git or 7-zip solve the exact same problem"? Apparently MSYS-Git decodes UTF-8 if valid, else defaults to some other fixed-length encoding. Might be better, but it's still broken (try checking this out); how much complexity do we want to put in here? I'm ok with touching up the Windows implementation (though it wouldn't be able to return references anymore), but of course if

Let me know what you think. Of course, all this has nothing to do with the PR.
I still disagree with that sentiment, but discussing that is somewhat pointless.
That behaviour is far better: equality between paths has always been ill-defined, even on a single platform. You can have different filesystems mounted, network filesystems, etc., some of which may be case-insensitive, or have other peculiarities. The fact that two very similar paths map to the same thing under some circumstances can happen on any platform, is not particularly surprising, and there's no way to avoid it, save for not using such similar paths. However, the most important thing is that I can check out that repository. I can even commit changes to either file, and although it may be difficult to modify both files in a single commit, I could probably tweak some config files and get it to work. If git relied on a function like this, I wouldn't even be able to check out the repository!
This has nothing to do with "platform bytes": we're not talking about how byte sequences should be decoded on the current platform (which would be according to the system codepage on windows), so let's call it what it actually is:

Valid unicode will round-trip on all platforms. Invalid unicode is not guaranteed to round-trip, but may (and will on linux). It doesn't need a complicated algorithm: replacing invalid bytes with the unicode replacement character is easy enough and works well (it's also how github renders paths with invalid utf8). Being able to round-trip invalid UTF8 is not a guarantee that's necessary to get a far superior user experience, as your git example shows.
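The replacement-character behaviour described here already exists for plain strings via String::from_utf8_lossy; a small illustration:

```rust
fn main() {
    let bytes = b"caf\xC3\xA9 \xFF.txt";        // valid UTF-8 plus one invalid byte (0xFF)
    let lossy = String::from_utf8_lossy(bytes); // Cow<str>; this never fails
    assert_eq!(lossy, "café \u{FFFD}.txt");     // the invalid byte becomes U+FFFD
}
```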
When I started this discussion I wasn't aware of the existing
You're starting to convince me here. Of course, we should consider whether this only makes sense for paths, and move it there if it does.
True, but the concept of an array of bytes representing a path is also just inherently non-portable, so your hands are tied in situations like this.
That's the problem with a non-portable encoding: there's no correct interpretation of a non-utf8 array of bytes on Windows. Each program can end up doing something subtly different and there's nothing guaranteeing that everyone agrees. Are you saying there's a "well accepted standard" for what to do in this case, though?

Overall I totally agree that we should try to work with as many paths as possible on all platforms, but attempting to work with non-utf8 arrays on Windows is just a hard operation to do. Applications surely need a pretty hefty amount of support behind them to support this, and it's something that could easily exist as an external crates.io library. The standard library, however, can always provide a convenience when this isn't possible. Correctly saying that a path can't be handled on Windows is in my opinion better than incorrectly attempting to translate it, later leading to conflicts and/or confusion which can corrupt unrelated files.
At least if this is in the standard library, then all rust code will agree.
Replacing invalid bytes with the unicode replacement character is a standard way to interpret ill-formed utf8, and one of the main reasons for the inclusion of that character within unicode.
Unix is normally case-sensitive, so you can have two files, 'A' and 'a', which on windows will refer to the same thing. Are you suggesting that it should also return an error on windows unless all characters are in lower-case? (good luck figuring out which case-conversion algorithm microsoft used!) Returning an error is actively bad because it implies that this function has any sort of guarantees at all about uniqueness and equality of paths, which it doesn't, even on linux.
@Diggsey I think that this is going off into the weeds so let's try to reel this back in. I don't understand where you're coming from about path equality; that seems pretty orthogonal to this helper function in the standard library. The purpose here is to provide a function which can return an interpretation of an array of bytes as an

We already have methods of dealing with a lossy conversion, such as replacing with the unicode replacement character, via

Whether or not a lossy or lossless conversion should be done is up to the application at hand, for example one may wish to do something like:
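A hedged sketch of one such application-level choice, using only stable std APIs rather than the method proposed in this PR:

```rust
use std::ffi::OsString;

// Prefer an exact interpretation of the bytes; fall back to a lossy one
// (U+FFFD substitution) instead of failing outright.
fn interpret(bytes: &[u8]) -> OsString {
    match std::str::from_utf8(bytes) {
        Ok(s) => OsString::from(s),
        Err(_) => OsString::from(String::from_utf8_lossy(bytes).into_owned()),
    }
}
```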
What you seem to be advocating is that this function should not exist in the standard library because it is infallible on Unix and may fail on Windows, and this fallibility may lead to code unnecessarily failing on Windows where it would otherwise succeed on Unix. I think that it is certainly admirable to get as many applications working on all platforms as robustly as possible, but it shouldn't come at the cost of forcing all code to be less ergonomic or to reimplement conversions. There are some situations where you simply must do a conversion of bytes to a path, and the application at hand is the one that gets to decide what to do in the erroneous cases. Just because there are erroneous cases doesn't mean that we should avoid any sort of support in the standard library.
You previously objected to using a lossy conversion on the basis that it may cause two paths to refer to the same file, possibly resulting in one overwriting the other. My point was that this can happen anyway, even on linux with lossless conversions: it's not something you can realistically prevent.
Yes. More specifically, it has additional requirements on the input for windows.
I think in many situations it is worth trading off ergonomics for portability. However, all of the uses of this function which we've discussed would be better served by using the potentially lossy version of this function. If that is provided as the de-facto way to convert bytes to a path, then they all get a consistent, sensible fallback mechanism. The alternative means that many users will not provide a fallback at all, and the rest will likely implement it inconsistently.
In this case you're not working with the OS so
What sort of file format doesn't specify the encoding of text? If no encoding is specified, then there is absolutely no way to interpret it correctly. At best you can guess. Also, again, this isn't working with the OS so
@retep998 For most cases you're absolutely right on both counts. The one exception is when dealing with filesystem paths. Rust's PathBuf uses an OsString to store these, but some C libraries and file formats use unencoded bytes to match how paths are stored on linux. On linux, it's desirable to be able to handle ill-formed utf8 in these paths. The question is how to handle conversion from bytes to path in a portable way, given the existence of non-linux platforms.

My preferred option is to only provide a potentially lossy "bytes -> path" conversion which will always succeed, rather than a reversible "bytes -> path" conversion which may fail, but only when not on linux. The justification is that programs which rely on the conversion function being perfectly injective are inherently broken anyway (case insensitivity ftw!), and that it will result in a massively better compatibility story on windows. Furthermore, the name of the function should at the very least include "utf8" and not "platform", and should exist for Path/PathBuf, not OsString/OsStr.
I believe that issuing syscalls and the interpretation of what is issued are orthogonal concepts. If we had our own lossy translation then we'd be issuing syscalls with the exact same set of parameters, whereas if various encodings were later treated the same then that's up to the filesystem and out of our control 100% of the time. It's our job to represent data as faithfully as possible to the point where it hits the filesystem, to enable all possible use cases.
I feel that this is not the tradeoff in play here, however. Omitting functions like this is forcing applications on Unix to either:
Adding these functions means that strictly more cases are seamlessly handled across all platforms. It's true that some paths will be accepted on Unix where they'll be rejected on Windows, but it's not the case that every application has this kind of behavior and would run into it.
This is possible to do today with the
I don't think it's correct to make a blanket statement like this. These libraries exist, and they work on Windows; they've made the decision to take

As a result there aren't many choices when binding these libraries. You have some bytes, and it's intended to be used as a

The difference between OS strings on Windows and Unix should not plague all of Rust and constantly make any OS interaction as unergonomic as possible. We've selected a pretty nice solution in the standard library, but that does not constitute the entire world. Working with other libraries on all platforms is also a strong point of Rust!
I disagree with this because that's just the meaning on Windows; on Unix there's no utf-8 guarantee needed at all.
I don't follow? Surely those are both arguments against a version returning an Option! A version which always succeeds cannot result in more calls to unwrap, and there's no need to reimplement the functionality: the lossy version behaves exactly the same on linux as the non-lossy version does, except you don't need to unwrap it! The difference is that they don't have to reimplement the lossy fallback themselves, which, as you point out, they are likely to omit or do incorrectly.
Also, a non-lossy version only makes sense on linux: there is no sensible non-lossy conversion from bytes to strings on windows, so just leave it in the platform specific part of the standard library!
NO, it's NOT the meaning on windows. UTF8 has absolutely no connection to windows AT ALL. The only reason it even remotely makes sense to try decoding the paths as UTF8 is because they are coming from linux, where utf8 is the norm. This is why it should not include "platform" in the name.

Now, for why it should include UTF8 in the name: in rust, paths do have an encoding specified: potentially ill-formed UTF8. This is why, when you convert from a Path/OsString to a String, you don't need to specify an encoding (whereas to go from a

You're confusing the concepts of data which has an encoding, and data where the encoding is enforced. Paths are UTF8; it's just that they're not guaranteed to be valid UTF8: whenever you display a path in rust, it will interpret it as UTF8, replacing invalid bytes with escape sequences.
As explained in this comment, this issue has reached the point where an RFC is warranted to take any action around these APIs (including stabilization of the existing API). As such, I'm going to close out this PR, and I hope everyone will continue the discussion in a new RFC thread.
Rebased #26730

OsString::from_bytes() turns Vec<u8> into OsString. This adds the same thing, from &[u8] to &OsStr, and renames both of them to from_platform_bytes() to avoid conflict with OsStrExt::from_bytes() already available on UNIX.

I don't know if the last commit is needed (should from_bytes() just be removed without deprecation? It was never stable), let me know.