RFC: Stabilize std::{c_str, c_vec} #494
Stabilize the `std::{c_str, c_vec}` modules by re-working their interfaces and refocusing each primitive for one particular task. The three broad categories of interoperating with C will work via:
1. If you have a Rust string/byte slice which needs to be given to C, then the `CString` type will be used to statically guarantee that a terminating nul character and no interior nuls exist.
2. If C hands you a string which you want to inspect, but not own, then a helper function will assist in converting the C string to a byte slice.
3. If C hands you a string which you want to inspect and own, then a helper type will consume ownership and will act as a `Box<[u8]>` in essence.
Note that this is largely an alternative to #435
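For readers following along, the first two categories can be sketched roughly as below. This is a minimal sketch that assumes today's `std::ffi` types (`CString::new`, `CStr::from_ptr`), not the constructors proposed in this RFC; the function names `to_c_string` and `borrow_c_bytes` are made up for illustration. The owning case (category 3) is not shown.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Rust -> C: copy the bytes and append the terminating nul.
/// Returns None when the input contains an interior nul byte.
/// (Hypothetical helper name; not part of the RFC.)
fn to_c_string(bytes: &[u8]) -> Option<CString> {
    CString::new(bytes).ok()
}

/// C -> Rust, borrowed: view the bytes before the terminator without
/// taking ownership. (Hypothetical helper name; not part of the RFC.)
///
/// Safety: `ptr` must be non-null, point to a nul-terminated string,
/// and remain valid and unmodified for the lifetime `'a`.
unsafe fn borrow_c_bytes<'a>(ptr: *const c_char) -> &'a [u8] {
    let c_str: &'a CStr = unsafe { CStr::from_ptr(ptr) };
    c_str.to_bytes()
}
```

The pointer returned by `CString::as_ptr` stays valid only while the `CString` is alive, so a caller would keep the owned value around for the duration of the foreign call.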
impl CString {
    pub fn from_slice(s: &[u8]) -> CString { /* ... */ }
    pub fn from_vec(s: Vec<u8>) -> CString { /* ... */ }
Should these be `Option<CString>`?
I chose for this to take the route of `Path`, where it panics by default, as we expect that to be the overwhelming default, and the invariant can always be checked beforehand to prevent a panic.
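The "check the invariant beforehand" pattern might look like the sketch below. It assumes current `std::ffi::CString`, whose `new` returns a `Result`; `is_nul_free` and `from_slice_or_panic` are hypothetical names that only mimic the panicking behaviour described for the proposed `from_slice`.

```rust
use std::ffi::CString;

/// The invariant a caller can check up front: no nul byte anywhere in the input.
/// (Hypothetical helper, not part of the RFC.)
fn is_nul_free(bytes: &[u8]) -> bool {
    !bytes.contains(&0)
}

/// A panicking constructor in the spirit of the proposed `from_slice`
/// (hypothetical; built on today's checked `CString::new`).
fn from_slice_or_panic(bytes: &[u8]) -> CString {
    CString::new(bytes).expect("input must not contain interior nul bytes")
}
```

A caller that cannot tolerate a panic would test `is_nul_free` first and only then construct the value, at the cost of scanning the input twice.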
A couple of comments: Regarding Also, if we are working with CString there is a chance that our function is called from C code; in that case panics are unacceptable, and CString should probably grow some non-panicking versions of its methods in the future. It would be nice to have a convenience constructor, taking
The scenario 3 - "A C type was handed to Rust, and Rust owns it" and Also, could you explicitly mention why Everything else seems okay :) |
@alexcrichton: Calling |
I think stabilizing these would be premature. The niche for interacting with a C string API that's not a path is so small that it's nearly non-existent. It's rarely correct because C strings are incompatible with both binary data and modern text. UTF-8 allows inner NUL so it's not possible to convert Rust strings to C strings in general. Features should not be in the standard libraries unless there are widespread, correct use cases for them. I think this belongs in a repository outside of the standard library, at least until there is proof that it is widely applicable. |
@petrochenkov the major reason that In terms of panics this is somewhat of an orthogonal issue to We generally have been moving away from providing various convenience functions for types like I'm also not sure I see what you mean about case 3. Despite it being rare, and despite it having dangers, this case does happen and it would be nice to have facilities for it somehow. Do you think it's rare enough that it doesn't warrant inclusion in the standard library itself? @thestinger to be clear do you think that the standard library shouldn't provide Do you also think that we shouldn't be stabilizing anything related to this area of programming? I'm not sure I agree that this is necessarily a niche use case in that one of the major targets for Rust is the "embedded programming" case where Rust is talking to other languages via a C-like API. In cases like this transferring strings across boundaries is quite a common task, and just within the standard distribution itself the fallout is quite nontrivial (as you can see in the branch above). Do you have reservations, other than panicking constructors, which lead you to believe these are too unstable for inclusion at this time? |
@alexcrichton Much of There's definitely need for a typesafe libc/ffi helper/half-unsafe-but-compiles-away wrapper library, but it wouldn't need |
@Jurily: I don't think we could use @alexcrichton: A couple things:
|
I'm not talking about always keeping terminating 0 in I suggest to answer the question: How
Suddenly, there's no
If the main use of |
I don't think it's worth providing any form of this with memory management. It doesn't have common use cases that are correct and it's not portable. Carrying around a function pointer is much uglier than simply using a new type for the specific use case.
The most common use case for C strings is for file paths, because C strings are incompatible with binary data and only support a subset of encodings like UTF-8. They're not correct in general, and it's advisable to do cross-language interactions with a pointer and length anyway.
I've explained why I have reservations about it. It's not correct in general and not portable. |
Using an unboxed closure would make you end up with a different type for each usage. It's not comparable to custom types. |
@Jurily I believe that @erickt pointed out the reason why which is to give a guarantee that there are no interior nul bytes as well as precisely one trailing nul byte, which largely rules out using @erickt we talked in person but I'll respond inline as well:
This is because
We should! That's what
Due to the comments on this thread, it sounds like while this is a possible pattern that it's not one the standard library should support at this time. I'm thinking of removing
I've chosen to assert instead of check as it follows what @petrochenkov I don't quite understand your code snippet you've got there. If you'd like an answer to your question of how @thestinger I think I'm ok removing I'm not really sure I understand your stance on "C strings are not correct in general" because they're a fact that we have to live with. Many C libraries do not pass around a pointer/length pair but instead consume and pass C strings. I'm also focusing on the "borrowing" aspect where ownership of strings is not crossing boundaries, just the contents for inspection. It is true that C strings are neither UTF-8 nor arbitrary data, and it is also true that you may not wish to write new C APIs with a C string-like interface, but it is a fact that many existing libraries do. These libraries need to be able to interoperate with Rust, and the purpose of |
I've updated with a revision to remove |
The conversation seems to be peaking here, so I'm going to be looking to merge this soon pending further comments. |
It doesn't matter that Rust happily assumes the system locale is UTF-8 anyway, why is this such a big issue? |
@Jurily I'm not sure I understand your concerns. The Also, I'm not sure what you mean by strlen-allocate-copy-free because this RFC currently removes I do believe the assertion that most Can you also explain where locales come into this? The purpose of the |
@Jurily: UTF-8 includes inner NUL and C strings only support a subset of it. Using C strings is not a common use case, because it's rarely correct. It does not work for either binary or text. It only works for domain specific encodings without inner NUL like file paths. Rust is doing the same thing as other languages with real strings like Java, C#, Go, Python, etc. |
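The point about inner NUL can be checked directly. A small sketch, assuming current `std::ffi::CString` (whose `new` reports the offending index through `NulError::nul_position`); `first_interior_nul` is a hypothetical helper name:

```rust
use std::ffi::CString;

/// Returns the index of the first nul byte that prevents `s` from being
/// converted to a C string, or None if the conversion would succeed.
/// (Hypothetical helper for illustration.)
fn first_interior_nul(s: &str) -> Option<usize> {
    match CString::new(s) {
        Ok(_) => None,
        Err(e) => Some(e.nul_position()),
    }
}
```

A string such as `"a\0b"` is a perfectly valid `&str`, yet it has no faithful representation as a nul-terminated C string.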
It does not rely on their absence. Buffers represented as a pointer and length are very common in C. Even legacy software has support for this because it's a necessity for dealing with binary data. C strings are not interoperable because they can't be mixed with real UTF-8 string implementations. I don't know of a string implementation in a modern language without support for Unicode's inner NUL. |
@alexcrichton Locales are interesting in this case because C strings are not UTF-8 in any standard. Their encoding is defined by the system locale, which is often UTF-8 nowadays, but not required. If we can ignore that complexity in favor of the common case, why not interior nulls? A @thestinger I'm not sure why you dismiss the filesystem API as not common. It's not even just paths that are null-terminated strings, so are the command line, envvars and half a century's worth of C. We can't get rid of them without writing our own OS and porting everything. Whether or not it's good design, it exists. Existing code is already aware of the distinction between text and binary; I don't see how we could unify that even with interior nulls. The "UTF-8 goes here" and "pointer + size" camps don't seem to overlap much. Text takes no size, binary doesn't require valid UTF-8. The gtk family deals only with null-terminated UTF-8, glib is used by half of the Linux desktop. The C++ Is there a major code base that produces or expects interior nulls in UTF-8 that doesn't also take random binary data? |
There is no UTF-8 guarantee for any of those so it has no relevance to Rust's
C string APIs rarely have a UTF-8 guarantee / requirement. They are only able to handle a subset of UTF-8 anyway, so they couldn't be directly mixed with real Unicode strings even if they did.
In general it doesn't enforce or require a specific string encoding. It has Unicode manipulation functions but strings throughout GTK/glib aren't guaranteed to be UTF-8.
This doesn't have much to do with whether text is assumed / guaranteed to be UTF-8.
The You have to explicitly use the facet stuff for anything locale dependent, and it's almost entirely useless because it operates on
QByteArray allows inner NUL and makes no guarantee about encoding.
QString is done with pointer/length just like std::string, QByteArray and other real Unicode string types. UTF-16 isn't compatible with C strings either. |
Yes, there's a lot of C code handling UTF-8 text correctly. There's also a huge amount of code treating everything as a binary blob. It's not possible to take views into C strings without dynamic memory allocation so they're unusable in many niches. There are hacks like I don't really understand what you're trying to argue. Rust is not going to regress from existing modern languages by supporting only a subset of UTF-8. There used to be a NUL at the end of |
@alexcrichton: The on-stack conversion optimization will be entirely redundant once there's a small vector type. I don't think an API solely for that purpose should be stabilized. |
I really don't think this is a problem that the standard library should take on. Calling into C code is fundamentally Converting a string / vector with an inner NUL is a logic error (unintentional truncation) but I don't think paying the cost of an O(n) search at all of the FFI boundaries is going to be acceptable in general, especially for large texts. It's far better to do the work to maintain that guarantee up-front and avoid paying the cost everywhere. |
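The "unintentional truncation" mentioned here can be illustrated without calling into C at all. The sketch below models what a nul-terminated consumer would see when handed a buffer; `as_seen_by_c` is a hypothetical name, and the `position` call is the same kind of O(n) scan the comment is concerned about.

```rust
/// Models what a C consumer sees when handed `bytes` as a C string:
/// everything up to, but not including, the first nul byte.
/// Returns None if there is no nul at all, in which case real C code
/// would read past the end of the buffer. (Hypothetical helper.)
fn as_seen_by_c(bytes: &[u8]) -> Option<&[u8]> {
    // Linear scan for the first nul byte.
    let end = bytes.iter().position(|&b| b == 0)?;
    Some(&bytes[..end])
}
```

For a buffer such as `b"user\0admin\0"`, the consumer sees only `user`; the rest is silently dropped.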
Some concerns raised in #435 that I think should be noted here as well:
|
@mzabaluev: C does require that It has nothing to do with future proofing the language. It has to do with legacy systems that C had to support but that Rust is never going to support. |
Code using C strings isn't very performance aware anyway. See below. I do think that an I don't think it belongs in the standard library and all of the disagreeing voices about the design back up that opinion. If there isn't a clear way to do it properly, then it should be left for third party libraries until a time when there's strong consensus on a specific solution.
That's not true. A pre-computed length nearly always improves performance relative to a sentinel at the end. It's much friendlier to compiler auto-vectorization and CPU pipelining. C strings are less efficient nearly across the board before you even get to the major issues like inability to do slicing without copies. C strings mandate pervasive dynamic allocation or wasteful over-allocation via oversized static buffers. |
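To make the sentinel-versus-length contrast concrete, here is a rough sketch of the two styles of equality check. Both function names are made up for illustration, and no performance numbers are implied; the sketch only shows that the sentinel version has to discover the end of the data while it compares, whereas the slice version knows both lengths before it reads any byte.

```rust
/// Sentinel style: walk both strings until a mismatch or the terminating nul.
///
/// Safety: both pointers must be non-null and point to nul-terminated buffers.
unsafe fn eq_c(a: *const u8, b: *const u8) -> bool {
    let mut i = 0;
    loop {
        let (x, y) = unsafe { (*a.add(i), *b.add(i)) };
        if x != y {
            return false;
        }
        if x == 0 {
            // Both strings ended at the same index.
            return true;
        }
        i += 1;
    }
}

/// Length style: unequal lengths are rejected without reading any bytes,
/// and the remaining comparison runs over a range known in advance.
fn eq_slice(a: &[u8], b: &[u8]) -> bool {
    a.len() == b.len() && a == b
}
```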
Isn't it the whole reason for
I totally agree with that.
Let's specifically consider the two traits I mentioned. Note that one reason for implementing them on
All things considered, I would not declare conversion to slices, much less to owned copies, a clearly better alternative here without some testing. And those are basically the only operations I'd like to be directly available on
That's one irreducible issue with the design proposed here. The current design of |
@thestinger If the bit width of |
It doesn't matter that semantically a byte-by-byte comparison is required. It can and will be auto-vectorized and pipelined. Using a length instead of a sentinel makes the code run faster. It's exactly what I said above:
|
That's not what I'm talking about. The
It doesn't make sense to do an O(n) scan rather than just upholding the guarantee. At most it should check for a trailing |
I didn't say anything like that. |
Doing it as I suggested fully eliminates the performance cost and eliminates the API surface of the standard library. It doesn't make sense to define reusable interfaces when they're slow and don't provide an improvement over doing it by hand. |
I don't think requiring that it is 8 bits (like POSIX and Windows) is related to whether or not it should match the platform definition. Rust just won't work on platforms without 8 bit bytes or without either little or big endian byte ordering (PDP endian). I think it makes sense for the definitions to match how they are defined by the platform. It could be a distinct type rather than an alias but it would be more complicated... and I don't think it's important. |
While in theory C strings would be a pointer/length pair going across the boundary, in reality there are many many C libraries which use a null-terminated string to interoperate with foreign code. The standard library needs to support this fact of life both for its own purposes as well as for the benefit of users of Rust. One of the main targets for Rust is being embedded into other applications or languages where having strings cross the boundary is largely just a requirement. You seem to have concerns about the unsafety of this type and operation, but this is all pushed on to the user if necessary (e.g. calling the ffi function at the boundary). The
We may have to disagree on the point about allowing truncation, but this is why there is an unchecked constructor.
This is true! I expect this idiom to be fairly rare, however. We could grow support for this over time, but for now I think it's ok to focus solely on the lending C -> Rust and Rust -> C cases, not the more flavorful C -> Rust -> C cases or transferring ownership cases. Also note that the
Note that this RFC is now explicitly not handling transferring ownership across language boundaries, and bytes can always be copied with |
To think of it, there could be functions to promote a
pub fn take_bytes(src: Vec<u8>) -> Option<CString> { /* ... */ }
pub unsafe fn take_bytes_unchecked(src: Vec<u8>) -> CString { /* ... */ }
pub fn take_string(src: String) -> Option<CString> { /* ... */ }
pub unsafe fn take_string_unchecked(src: String) -> CString { /* ... */ }
... or better, generically, a |
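One way the four signatures above could be filled in, as a sketch only: it assumes today's `std::ffi::CString` (`CString::new` and `CString::from_vec_unchecked`), which did not exist in this form when the comment was written, so the bodies are an assumption rather than what the commenter had in mind.

```rust
use std::ffi::CString;

/// Checked: consumes the vector, scans it once for nul bytes, appends the terminator.
pub fn take_bytes(src: Vec<u8>) -> Option<CString> {
    CString::new(src).ok()
}

/// Unchecked: no scan is performed.
///
/// Safety: `src` must not contain any nul bytes.
pub unsafe fn take_bytes_unchecked(src: Vec<u8>) -> CString {
    unsafe { CString::from_vec_unchecked(src) }
}

/// Checked conversion from an owned `String`, reusing its byte buffer.
pub fn take_string(src: String) -> Option<CString> {
    take_bytes(src.into_bytes())
}

/// Unchecked conversion from an owned `String`.
///
/// Safety: `src` must not contain any nul bytes.
pub unsafe fn take_string_unchecked(src: String) -> CString {
    unsafe { take_bytes_unchecked(src.into_bytes()) }
}
```

The checked variants pay for one linear scan; the unchecked variants push that responsibility onto the caller, which is the trade-off debated earlier in the thread.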
* `std::str::from_c_str` - this function should be replaced with
  `c_str::from_raw_buf` plus one of `str::from_utf8` or
  `str::from_utf8_unchecked`.
from_raw_buf takes *c_char but from_utf8 produces Vec<u8>. Seems like there is another step here. What is it?
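The two-step conversion being asked about can be sketched with present-day APIs. This assumes `CStr::from_ptr` as the stand-in for the proposed `c_str::from_raw_buf` (which is not the same function), followed by `std::str::from_utf8`, which takes a byte slice and returns a `Result`; `c_ptr_to_str` is a hypothetical name.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;
use std::str::Utf8Error;

/// Step 1: view the C string as bytes (terminator excluded).
/// Step 2: validate those bytes as UTF-8.
/// (Hypothetical helper; uses today's `CStr`, not the RFC's `from_raw_buf`.)
///
/// Safety: `ptr` must be non-null, nul-terminated, and valid for `'a`.
unsafe fn c_ptr_to_str<'a>(ptr: *const c_char) -> Result<&'a str, Utf8Error> {
    let c_str: &'a CStr = unsafe { CStr::from_ptr(ptr) };
    std::str::from_utf8(c_str.to_bytes())
}
```

No copy is made in either step; the returned `&str` borrows the memory the pointer refers to.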
FYI, I have now incorporated many ideas from this RFC and the discussion into #435. |
I've also added a commit now indicating that the new functionality will instead be provided under |
After consideration of this RFC and #435, and the discussions on both, the core team has decided to go with this more conservative design for now. Note that the RFC places the module under a new Thanks everyone for pitching in to the design here, and @mzabaluev for your work on the counter-proposal! |
I like that the general idea of this design ( However, the design feels incomplete, and has inconsistent naming in some places (e.g. The most problematic use case is a safe FFI wrapper that passes on a string to a C function.
fn f(s: &CString) {
    ffi::f(s.as_ptr());
}
but this forces callers to provide a heap-allocated string. If I instead make my function accept I feel that, just as I'm currently writing an implementation of |
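For comparison, a wrapper that borrows rather than demanding an owned `CString` might look like the sketch below. It assumes the borrowed `std::ffi::CStr` type that exists in current Rust; `ffi_len` is a hypothetical stand-in for the foreign function, written in Rust here only so the sketch is self-contained.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Stand-in for a C function that consumes a nul-terminated string
/// (hypothetical; a real binding would be declared in an extern block).
unsafe extern "C" fn ffi_len(s: *const c_char) -> usize {
    let mut n = 0;
    while unsafe { *s.add(n) } != 0 {
        n += 1;
    }
    n
}

/// Safe wrapper that accepts any borrowed C string, heap-allocated or not.
fn f(s: &CStr) -> usize {
    // `s` is guaranteed to be nul-terminated, so the callee's scan stays in bounds.
    unsafe { ffi_len(s.as_ptr()) }
}
```

With this signature a caller can pass a static literal via `CStr::from_bytes_with_nul(b"hello\0")` without allocating, while an owned `CString` still works through deref coercion (`f(&owned)`).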
@dgrunwald, have you looked at my project for the rejected RFC #435? My I plan to release the crate as |
@mzabaluev: I looked at that, but I don't quite like that approach. You usually shouldn't pass ownership of strings from rust to C or back -- if the C library is statically linked, it might use a different libc::free() than Rust does. On Windows, it's quite normal to have multiple C runtimes in the same process. Here's my approach for a DST CStr. |
@dgrunwald: Sometimes you don't have a choice: a library function hands you an allocated string and tells you to free it with another function, typically provided by the same library. This is what GLib does a lot, for example. To address this, there are generic destructors, and no default destructor on the Regarding the DST borrow: as I said, you can't borrow your way out of the fact that Rust strings are not NUL-terminated, and an inner NUL in a dynamically created Rust string cannot be passed to C without causing interpretation problems and potential security exploits. You can append every owned string you are given with a NUL, but then the string becomes useless on the Rust side unless further sliced or truncated back, so you might as well consume it towards the invariant-enforcing value. |
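The "library hands you a string and a function to free it with" situation can be sketched as an owning handle that carries its deallocation function. Everything here is hypothetical: `OwnedCStr`, `lib_alloc`, and `lib_free` are illustrative names, the allocator pair is simulated with `CString::into_raw` and `CString::from_raw`, and this is not the design of the crate mentioned above.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts calls to `lib_free`, only so the sketch can show that the destructor ran.
static FREED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a library function that returns an allocated C string.
fn lib_alloc(text: &str) -> *mut c_char {
    CString::new(text).expect("no interior nul").into_raw()
}

/// Hypothetical stand-in for the library's matching free function.
unsafe fn lib_free(ptr: *mut c_char) {
    drop(unsafe { CString::from_raw(ptr) });
    FREED.fetch_add(1, Ordering::SeqCst);
}

/// Owns a C string and releases it with the function supplied at construction.
struct OwnedCStr {
    ptr: *mut c_char,
    free_fn: unsafe fn(*mut c_char),
}

impl OwnedCStr {
    /// Safety: `ptr` must be a valid nul-terminated string that `free_fn`
    /// is allowed to release exactly once.
    unsafe fn new(ptr: *mut c_char, free_fn: unsafe fn(*mut c_char)) -> OwnedCStr {
        OwnedCStr { ptr, free_fn }
    }

    /// Borrow the contents (terminator excluded) for inspection.
    fn as_bytes(&self) -> &[u8] {
        let c_str = unsafe { CStr::from_ptr(self.ptr) };
        c_str.to_bytes()
    }
}

impl Drop for OwnedCStr {
    fn drop(&mut self) {
        // Release with the function that matches the original allocator.
        unsafe { (self.free_fn)(self.ptr) }
    }
}
```

Storing a function pointer in every value is the cost objected to earlier in the thread; a dedicated newtype per library avoids it at the price of more types.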
@dgrunwald, thanks for prodding me in the right direction. I have now implemented deref/borrow introducing my innovative irrelevantly sized type. |
Filed a follow-on RFC PR: #592 |