Commit d35f558

Merge pull request #4 from Manishearth/invalid-values
Section on invalid values + minor stuff
2 parents b40f0b0 + 7988899 commit d35f558

File tree

4 files changed

+188
-2
lines changed

src/SUMMARY.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,14 +3,14 @@
33
- [Introduction](./introduction.md)
44
- [Undefined behavior](./undefined_behavior.md)
55
- [Core unsafety](./core_unsafety.md)
6-
- [Invalid values](./core_unsafety/invalid_values.md)
76
- [Dangling and unaligned pointers](./core_unsafety/dangling_and_unaligned_pointers.md)
87
- [Data races](./core_unsafety/data_races.md)
98
- [Intrinsics](./core_unsafety/intrinsics.md)
109
- [ABI and FFI](./core_unsafety/abi_and_ffi.md)
1110
- [Platform features](./core_unsafety/platform_features.md)
1211
- [Inline assembly](./core_unsafety/inline_assembly.md)
1312
- [Advanced unsafety](./advanced_unsafety.md)
13+
- [Invalid values](./core_unsafety/invalid_values.md)
1414
- [Pointer aliasing](./advanced_unsafety/pointer_aliasing.md)
1515
- [Immutable data](./advanced_unsafety/immutable_data.md)
1616
- [Atomic ordering](./advanced_unsafety/atomic_ordering.md)
Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# Invalid values
2+
3+
> _“If you tell the truth, you don't have to remember anything.”_
4+
> _Mark Twain_
5+
6+
Values of a particular type in Rust may never have an "invalid" bit pattern for that type. This is true even if that value is never read from afterwards, or if that value simply exists behind an unread reference. Merely _producing_ an invalid value is undefined behavior, and [the reference] defines "producing" broadly:
7+
8+
> "Producing" a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation.
9+
10+
11+
12+
A lot of basic types _don't_ have any rules about invalid values. For example, all bit patterns of the integer types (and arrays of the integer types) are valid. But most other types have some concept of validity.
13+
14+
## Types of invalid values
15+
16+
### Uninitialized memory
17+
18+
Values of _any_ type can be "uninitialized", which is considered instantly UB even for types like integers. We discuss this further in [the chapter on uninitialized memory][uninit-chapter]. For now this chapter will largely cover cases where a type may have an invalid _bit pattern_, rather than other cases where it may be invalid due to e.g. not having an initialized bit representation at all.
19+
20+
### Primitive types with invalid values
21+
22+
`bool`s that have bit patterns other than those for `true` and `false` are invalid. The same goes for `char`s representing byte patterns that are considered invalid in UTF-32 (anything that is either a surrogate character, or greater than `char::MAX`).
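
As a minimal sketch of how to avoid producing these invalid primitives, the checked conversions below build a `bool` or `char` from a raw integer without reinterpreting any bits. The helper name `bool_from_byte` is hypothetical and exists only for illustration; `char::from_u32` is the standard library's checked constructor.

```rust
/// Hypothetical helper: accept only the two bit patterns that are valid for `bool`.
fn bool_from_byte(byte: u8) -> Option<bool> {
    match byte {
        0 => Some(false),
        1 => Some(true),
        _ => None, // any other byte is not a valid `bool`
    }
}

fn main() {
    assert_eq!(bool_from_byte(1), Some(true));
    assert_eq!(bool_from_byte(2), None);

    // `char::from_u32` rejects surrogates and values above `char::MAX`.
    assert_eq!(char::from_u32(0x41), Some('A'));
    assert_eq!(char::from_u32(0xD800), None); // surrogate
    assert_eq!(char::from_u32(0x11_0000), None); // greater than `char::MAX`
}
```

Returning `Option` pushes the "what if the bits are wrong?" question onto the caller instead of silently creating an invalid value.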
23+
24+
25+
### Pointers with invalid values
26+
27+
`&T` and `&mut T` may not be null, nor may they be [unaligned] for values of type `T`.
28+
29+
`fn` pointers and the metadata part of `dyn Trait` may not be null either.
30+
31+
Most smart pointer types like `Box<T>` and `Rc<T>` are invalid when null. Library types may achieve the same behavior using the [`NonNull<T>`] pointer type.
32+
33+
It's also currently invalid for `Vec<T>` to have a null pointer for its buffer! `Vec<T>` uses [`NonNull<T>`] internally, and empty vectors use a pointer value equal to the alignment of `T`.
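
The following sketch illustrates the point about empty vectors and `NonNull<T>`. It only checks properties that should hold for any non-null, aligned pointer; the exact address an empty `Vec<T>` uses is an implementation detail and is deliberately not asserted here.

```rust
use std::mem::{align_of, size_of};
use std::ptr::NonNull;

fn main() {
    // An empty vector has not allocated, yet its buffer pointer is
    // still non-null and aligned for `u64`.
    let empty: Vec<u64> = Vec::new();
    let addr = empty.as_ptr() as usize;
    assert_ne!(addr, 0);
    assert_eq!(addr % align_of::<u64>(), 0);

    // `NonNull::dangling()` hands out the same kind of placeholder pointer.
    let dangling = NonNull::<u64>::dangling();
    assert_eq!(dangling.as_ptr() as usize % align_of::<u64>(), 0);

    // Because null is not a valid `NonNull<T>`, `Option` can reuse it for `None`.
    assert_eq!(size_of::<Option<NonNull<u64>>>(), size_of::<*mut u64>());
}
```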
34+
35+
There are a lot of other reasons that a pointer type may not be valid, but these are the ones where the bit pattern is statically known to be invalid regardless of context. We'll be covering the other reasons in more depth in other chapters (@@note: where?). For example, all of these pointers must not only be non-null, they must also point to an actual valid instance of that type at all times (except `Vec<T>`, which is allowed to refer to invalid-but-aligned-and-non-null memory when it is empty).
36+
37+
38+
### Enums with invalid values
39+
40+
41+
Any bit pattern not covered by a variant of an enum is also invalid. For example, with the following enum:
42+
43+
```rust
enum Colors {
    Red = 1,
    Orange = 2,
    Yellow = 3,
    Green = 4,
    Blue = 5,
    Indigo = 6,
    Violet = 7,
}
```
54+
55+
a bit pattern of `8` or `0` (assuming that it gets represented as the explicit discriminant integers) is undefined behavior.
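
One way to stay clear of this is to convert raw integers with an explicit `match` rather than a transmute or pointer cast. The sketch below redefines `Colors` with `#[repr(u8)]` and adds a hypothetical `from_raw` constructor; both the attribute and the method name are assumptions made for illustration, not part of the enum shown above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)] // assumption: pin the representation so the discriminants are one byte
enum Colors {
    Red = 1,
    Orange = 2,
    Yellow = 3,
    Green = 4,
    Blue = 5,
    Indigo = 6,
    Violet = 7,
}

impl Colors {
    /// Hypothetical checked constructor: returns `None` for any integer
    /// that is not the discriminant of a variant.
    fn from_raw(raw: u8) -> Option<Colors> {
        match raw {
            1 => Some(Colors::Red),
            2 => Some(Colors::Orange),
            3 => Some(Colors::Yellow),
            4 => Some(Colors::Green),
            5 => Some(Colors::Blue),
            6 => Some(Colors::Indigo),
            7 => Some(Colors::Violet),
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(Colors::from_raw(4), Some(Colors::Green));
    // `0` and `8` never become a `Colors` value, so no invalid value is produced.
    assert_eq!(Colors::from_raw(0), None);
    assert_eq!(Colors::from_raw(8), None);
}
```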
56+
57+
Or in this enum:
58+
59+
```rust
enum Stuff {
    Char(char),
    Number(u32),
}
```
65+
66+
setting the discriminant to something that is not the discriminant of `Char` or `Number` is invalid. Similarly, setting the discriminant to that of `Char` while the payload is not a valid `char` is also invalid.
67+
68+
### `str`
69+
70+
The string slice type `str` does not actually have any validity constraints: despite being intended only for UTF-8 encoded strings, it is valid for a `str` to contain any bit pattern, provided you do not call any methods on the string other than those that directly access the memory behind it.
71+
72+
Basically, the UTF-8 validity of `str` is an implicit safety requirement for most of its methods, but it is fine to _hold on to_ an `&str` that points to random bytes. This is a difference between things being "insta-UB" and "UB on use": invalid value UB is typically "insta-UB" (it's UB even if you don't _do_ anything with the invalid value), but here you're allowed to do this as long as you don't use the data in certain ways.
73+
74+
This is something that can be relied on when doing things like manipulating or constructing `str`s byte-by-byte, where there may be intermediate invalid states.
75+
76+
Of course, reference types like `&str` must still satisfy all of the rules about reference validity (being non-null, etc).
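
When the bytes come from an untrusted source, the checked constructor avoids the question entirely. This is a small sketch using `std::str::from_utf8`, which validates the bytes before handing back a `&str`; the byte values are made up for the example.

```rust
fn main() {
    // Valid UTF-8 passes the check and yields a `&str` without copying.
    let good: &[u8] = b"caf\xC3\xA9";
    assert_eq!(std::str::from_utf8(good), Ok("café"));

    // A lone continuation byte is not valid UTF-8, so the checked
    // constructor refuses it instead of producing a broken `&str`.
    let bad: &[u8] = &[0x80];
    assert!(std::str::from_utf8(bad).is_err());
}
```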
77+
78+
### Invalid values for general library types
79+
80+
In general, types may have various invalid values based on their internal representation (which may not be stable!).
81+
In addition to [`NonNull<T>`], the Rust standard library provides [`NonZeroUsize`] and a bunch of other similar `NonZero` integer types that work as its integer counterparts, and libraries may use these internally.
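
As a brief illustration, `NonZeroUsize` can only be built through a checked constructor in safe code, and the forbidden zero pattern is what lets `Option<NonZeroUsize>` stay the same size as `usize`:

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

fn main() {
    // Zero is the one invalid bit pattern, so the checked constructor rejects it.
    assert!(NonZeroUsize::new(0).is_none());

    let five = NonZeroUsize::new(5).expect("5 is nonzero");
    assert_eq!(five.get(), 5);

    // The zero pattern is free to represent `None`.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), size_of::<usize>());
}
```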
82+
83+
84+
Note that Rust's default representation for types is not stable! What might be a valid bit pattern one day may become invalid later, unless you're only relying on things that are known to be invariant. Converting a type to its bits, sending it over the network, and converting it back is extremely fragile, and will break if the two sides are on different platforms or even Rust versions.
85+
86+
As a library user you may not assume anything about the representation of a library type unless it is explicitly documented as such, or if it has a public representation that is known to be stable (for example a public `#[repr(C)]` enum).
87+
88+
89+
90+
## When you might end up making an invalid value
91+
92+
93+
Invalid values have a chance to crop up when you're reinterpreting a chunk of memory as a value of a different type. This can happen when calling [`mem::transmute()`], [`mem::transmute_copy()`], or [`mem::zeroed()`], when casting a reference to a region of memory into one of a different type, or when accessing the wrong variant of a `union`. The value need not be on the stack to be considered invalid: if you gin up an `&bool` that points to a bit pattern that is not a valid `bool`, that is instantly UB even if you don't read from the reference.
94+
95+
They can also happen when receiving values over FFI where either the signature of the function is incorrect (e.g. saying an FFI function accepts `bool` when the other side thinks it accepts a `u8`), or where there are differences in notions of validity across languages.
96+
97+
A subtle case of this comes up occasionally in FFI code due to differences in expectations between how enums are used in Rust and C.
98+
99+
In C, it is common to use enums to represent _bitmasks_, doing something like this:
100+
101+
```c
typedef enum {
    /* Enumerators are separated by commas, and each flag gets its own bit. */
    Active = 0x01,
    Visible = 0x02,
    Updating = 0x04,
    Focused = 0x08,
} NodeStatus;
```
109+
110+
where the value may take states like `Active | Focused | Visible`. These combined values, as well as the "no flags set" value `0`, are invalid in Rust. If this type is represented as an enum in Rust ([even if it is `#[repr(C)]`][reprc-enum]!), it is UB to receive any such value over FFI from C. In such cases it is generally recommended to use an integer type instead, and to represent the mask values as constants.
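
A minimal sketch of that recommendation follows. It assumes the C side passes the enum as a 32-bit unsigned integer, which depends on the target ABI and should be checked against the real header. The wrapper type, constant names, and `contains` helper are hypothetical, and only two of the flags are spelled out.

```rust
/// Hypothetical Rust-side mirror of the C `NodeStatus` bitmask.
/// Every `u32` is a valid value, so any combination of flags
/// (including "no flags set") is fine to receive over FFI.
#[repr(transparent)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeStatus(pub u32);

impl NodeStatus {
    pub const ACTIVE: u32 = 0x01;
    pub const VISIBLE: u32 = 0x02;
    // ...the remaining flags would follow the same pattern.

    /// Returns true if every bit in `flag` is set.
    pub fn contains(self, flag: u32) -> bool {
        self.0 & flag == flag
    }
}

fn main() {
    let status = NodeStatus(NodeStatus::ACTIVE | NodeStatus::VISIBLE);
    assert!(status.contains(NodeStatus::ACTIVE));
    assert!(status.contains(NodeStatus::VISIBLE));

    // The empty mask is an ordinary value here, not an invalid enum.
    let empty = NodeStatus(0);
    assert!(!empty.contains(NodeStatus::ACTIVE));
}
```

Because the wrapper is `#[repr(transparent)]`, it has the same layout as the underlying integer, so it can appear directly in an `extern "C"` signature under the assumption stated above.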
111+
112+
113+
## Signs an invalid value was involved
114+
115+
The compiler is allowed to assume that values are never invalid, and it may use invalid states to signal other things or to pack types into smaller spaces.
116+
117+
For example, the type `Option<Box<T>>` uses the fact that a `Box<T>` cannot be null to fit the entire type into the same space `Box<T>` takes up, with the null pointer state representing `None`.
118+
119+
This can go even further with stuff like `Option<Option<Option<bool>>>` fitting into a single byte, up to and including the type with 254 `Option`s surrounding one `bool`. This general class of optimization is known as a "niche optimization", with bits representing invalid values being called "niches".
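
The sizes involved can be checked directly with `size_of`. In this sketch, the first two assertions reflect layouts the standard library documents; the last two describe what current compilers do in practice and should be read as observations rather than guarantees.

```rust
use std::mem::size_of;

fn main() {
    // Documented: null is the niche, so `Option` adds no space.
    assert_eq!(size_of::<Option<Box<u32>>>(), size_of::<Box<u32>>());
    assert_eq!(size_of::<Option<&u32>>(), size_of::<&u32>());

    // Observed on current compilers: nested `Option`s keep borrowing
    // unused `bool` bit patterns, so everything still fits in one byte.
    assert_eq!(size_of::<Option<Option<Option<bool>>>>(), 1);

    // `u8` has no invalid bit patterns, so `Option<u8>` needs extra room.
    assert_eq!(size_of::<Option<u8>>(), 2);
}
```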
120+
121+
In such scenarios, an invalid value may be interpreted as a different value entirely. For example, an `Option<NodeStatus>` using the enum from above could be interpreted as `None` if `NodeStatus` were represented as a Rust enum and an "empty status" value were received from C.
122+
123+
Furthermore, invalid values will break `match` statements, usually (but not necessarily) leading to an abort.
124+
125+
Debuggers also tend to behave strangely with invalid values, displaying incorrect values, or even having the value change from read to read.
126+
127+
This is not an exhaustive list: ultimately, having an invalid value is UB and it remains illegal even if there are no optimizations that will break.
128+
129+
130+
131+
[unaligned]: ../core_unsafety/dangling_and_unaligned_pointers.md
132+
[uninit-chapter]: ../undef_memory.md
133+
[`mem::transmute()`]: https://doc.rust-lang.org/stable/std/mem/fn.transmute.html
134+
[`mem::transmute_copy()`]: https://doc.rust-lang.org/stable/std/mem/fn.transmute_copy.html
135+
[`mem::zeroed()`]: https://doc.rust-lang.org/stable/std/mem/fn.zeroed.html
136+
[`NonNull<T>`]: https://doc.rust-lang.org/stable/std/ptr/struct.NonNull.html
137+
[`NonZeroUsize`]: https://doc.rust-lang.org/stable/std/num/struct.NonZeroUsize.html
138+
[reprc-enum]: https://doc.rust-lang.org/reference/type-layout.html#reprc-field-less-enums
139+
[the reference]: https://doc.rust-lang.org/reference/behavior-considered-undefined.html

src/core_unsafety/invalid_values.md

Lines changed: 0 additions & 1 deletion
This file was deleted.

src/undefined_behavior.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,9 @@
11
# Undefined behavior
22

3+
> _“People shouldn't call for demons unless they really mean what they say.”_
4+
>
5+
> _C.S. Lewis, The Last Battle_
6+
37
"Undefined behavior" is a bit of a strange notion. On one hand, the reference
48
[clearly defines][reference_ub] some (but not all) causes of undefined behavior.
59
This list includes some causes that are generally well-known: dereferencing a
@@ -160,5 +164,49 @@ code written upholds all of the conditions required to avoid undefined behavior.
160164
Any unsafe code that can trigger undefined behavior _even when its safety
161165
conditions are upheld_ is unsound.
162166

167+
168+
## Common misconceptions
169+
170+
There are a couple of misconceptions about UB that often muddy the waters when talking about it.
171+
172+
### "If it works, it's sound"
173+
174+
Undefined Behavior may be present even if the compiler does end up compiling the
175+
code according to the programmer's intent. A future version of the compiler may
176+
behave differently, or future changes to an innocuous portion of the code may
177+
cause it to fall to the other side of an invisible threshold. Technically it
178+
may even compile differently but only on Tuesdays, though that type of
179+
nondeterminism is generally rare.
180+
181+
182+
### "UB is about what the optimizer is allowed to do"
183+
184+
This is to _some extent_ true but the actual situation is far more nuanced.
185+
186+
It's common for people to think about UB in terms of what an optimizer "is and
187+
isn't allowed to do", and in terms of optimizations they know can occur. For
188+
example, it's pretty straightforward to see that sneakily writing to memory
189+
that you're not supposed to can cause undefined behavior when the optimizer
190+
decides to elide a memory read that occurs after your illicit write.
191+
192+
Firstly, some forms of UB just have to do with rules the underlying processor
193+
enforces.
194+
195+
But more than that, there are plenty of miscompiles that are hard to explain by
196+
simply thinking in terms of why the optimizer would do such a thing.
197+
198+
This is because it's less about what the optimizer is "allowed to do" and more
199+
about what it is "allowed to assume". When code has UB, the optimizer may
200+
make an incorrect assumption that snowballs into bigger and bigger incorrect
201+
assumptions that cause very unexpected behavior.
202+
203+
It's often very _useful_ to think of potential optimizations the optimizer may
204+
do around your code, but that is not sufficient for evaluating whether your
205+
code has UB.
206+
207+
Throughout this book there will be examples of how various optimizations may
208+
break code exhibiting undefined behavior. However, it is crucial to learn the
209+
rule behind the breakage rather than just the nature of the optimization.
210+
163211
[reference_ub]: https://doc.rust-lang.org/reference/behavior-considered-undefined.html
164212
[ferrocene]: https://ferrous-systems.com/blog/the-ferrocene-language-specification-is-here/
