Skip to content

Commit 0f19b6d

Browse files
committed
Fix pathological performance in trait solver cycles with errors
Fuchsia's Starnix system has had a multi-year long bug where occasionally a typo could cause the rust compiler to take 10+ hours to report an error. This was particularly hard to trace down since Starnix's codebase is massive, over 384 thousand lines as of writing. With the help of treereduce, cargo-minimize, and rustmerge, after about a month of running we reduced it down to a couple [lines of code], which only takes about 35 seconds to report an error on my machine. The bug also appears to happen with `-Z next-solver=no` and `-Z next-solver=coherence`, but does not occur with `-Z next-solver` or `-Z next-solver=globally`. I used Gemini to help diagnose the problem and proposed solution (which is the one proposed in this patch): 1. The trait solver gets stuck in an exponential loop evaluating auto-trait bounds (like Send and Sync) on cyclic types that contain compilation errors (TyKind::Error). 2. Normally, if the solver detects a cycle, it prevents the result from being stored in the Global Cache because the result depends on the current evaluation stack. However, when an error is involved, the depth tracking gets pinned to a low value, forcing the solver to rely on the short-lived Provisional Cache. Since the provisional cache is cleared between high-level iterations of the fulfillment loop, the solver ends up re-discovering and re-evaluating the same large cycle thousands of times. 3. Allow global caching of results even if they appear stack-dependent, provided that the inference context is already "tainted by errors" (`self.infcx.tainted_by_errors().is_some()`). This violates the strict invariant that global cache entries shouldn't depend on the stack, but it is safe because the compilation is already guaranteed to fail due to the presence of errors. Prioritizing compiler responsiveness and termination over perfect correctness in error states is the correct trade-off here. I added the reduction as the test case for this. However, I don't see an easy way to catch if this bug comes back. Should we add some way to timeout the test if it takes longer than 10 seconds to compile? That could be a source of flakes though. I don't have any experience with the trait solver code, but I did try to review the code to the best of my ability. This approach seems a bit of a bandaid to the solution, but I don't see a better solution. We could try to teach the solver to not clear the provisional cache in this circumstance, but I suspect that'd be a pretty invasive change. I'm guessing if this does cause problems, it might report an incorrect error, but I (and Gemini) were unable to come up with an example that reported a different error with and without this fix. [lines of code]: https://gist.github.com/erickt/255bc4006292cac88de906bd6bd9220a
1 parent a5c825c commit 0f19b6d

3 files changed

Lines changed: 147 additions & 0 deletions

File tree

compiler/rustc_trait_selection/src/traits/select/mod.rs

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1113,6 +1113,14 @@ impl<'cx, 'tcx> SelectionContext<'cx, 'tcx> {
11131113
debug!("CACHE MISS");
11141114
self.insert_evaluation_cache(param_env, fresh_trait_pred, dep_node, result);
11151115
stack.cache().on_completion(stack.dfn);
1116+
} else if let Some(_guar) = self.infcx.tainted_by_errors() {
1117+
// When an error has occurred, we allow global caching of results even if they
1118+
// appear stack-dependent. This prevents exponential re-evaluation of cycles
1119+
// in the presence of errors, avoiding compiler hangs like #150907.
1120+
// This is safe because compilation will fail anyway.
1121+
debug!("CACHE MISS (tainted by errors)");
1122+
self.insert_evaluation_cache(param_env, fresh_trait_pred, dep_node, result);
1123+
stack.cache().on_completion(stack.dfn);
11161124
} else {
11171125
debug!("PROVISIONAL");
11181126
debug!(
Lines changed: 125 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,125 @@
1+
// Test that we don't hang or take exponential time when evaluating
2+
// auto-traits for cyclic types containing errors, based on the reproducer
3+
// provided in issue #150907.
4+
5+
use std::sync::Arc;
6+
use std::marker::PhantomData;
7+
8+
struct Weak<T>(PhantomData<T>);
9+
unsafe impl<T: Sync + Send> Send for Weak<T> {}
10+
unsafe impl<T: Sync + Send> Sync for Weak<T> {}
11+
12+
struct BTreeMap<K, V>(K, V);
13+
14+
trait DeviceOps: Send {}
15+
16+
struct SerialDevice {
17+
terminal: Weak<Terminal>,
18+
}
19+
20+
impl DeviceOps for SerialDevice {}
21+
22+
struct TtyState {
23+
terminals: Weak<Terminal>,
24+
}
25+
26+
struct TerminalMutableState {
27+
controller: TerminalController,
28+
}
29+
30+
struct Terminal {
31+
weak_self: Weak<Self>,
32+
state: Arc<TtyState>,
33+
mutable_state: Weak<TerminalMutableState>,
34+
}
35+
36+
struct TerminalController {
37+
session: Weak<Session>,
38+
}
39+
40+
struct Kernel {
41+
weak_self: Weak<Kernel>,
42+
kthreads: KernelThreads,
43+
pids: Weak<PidTable>,
44+
}
45+
46+
struct KernelThreads {
47+
system_task: SystemTask,
48+
kernel: Weak<Kernel>,
49+
}
50+
51+
struct SystemTask {
52+
system_thread_group: Weak<ThreadGroup>,
53+
}
54+
55+
enum ProcessEntry {
56+
ThreadGroup(Weak<ThreadGroup>),
57+
}
58+
59+
struct PidEntry {
60+
task: Arc<Task>,
61+
process: ProcessEntry,
62+
}
63+
64+
struct PidTable {
65+
table: PidEntry,
66+
process_groups: Arc<ProcessGroup>,
67+
}
68+
69+
struct ProcessGroupMutableState {
70+
thread_groups: Weak<ThreadGroup>,
71+
}
72+
73+
struct ProcessGroup {
74+
session: Arc<Session>,
75+
mutable_state: Arc<ProcessGroupMutableState>,
76+
}
77+
78+
struct SessionMutableState {
79+
process_groups: BTreeMap<(), Weak<ProcessGroup>>,
80+
controlling_terminal: ControllingTerminal,
81+
}
82+
83+
struct Session {
84+
mutable_state: Weak<SessionMutableState>,
85+
}
86+
87+
struct ControllingTerminal {
88+
terminal: Arc<Terminal>,
89+
}
90+
91+
struct TaskPersistentInfoState {
92+
thread_group_key: ThreadGroupKey,
93+
}
94+
95+
struct Task {
96+
thread_group_key: ThreadGroupKey,
97+
kernel: Arc<Kernel>,
98+
thread_group: Arc<ThreadGroup>,
99+
persistent_info: Arc<TaskPersistentInfoState>,
100+
vfork_event: Arc, //~ ERROR missing generics for struct `Arc`
101+
}
102+
103+
struct ThreadGroupKey {
104+
thread_group: Arc<ThreadGroup>,
105+
}
106+
107+
struct ThreadGroupMutableState {
108+
tasks: BTreeMap<(), TaskContainer>,
109+
children: BTreeMap<(), Weak<ThreadGroup>>,
110+
process_group: Arc<ProcessGroup>,
111+
}
112+
113+
struct ThreadGroup {
114+
kernel: Arc<Kernel>,
115+
mutable_state: Weak<ThreadGroupMutableState>,
116+
}
117+
118+
struct TaskContainer(Arc<Task>, Arc<TaskPersistentInfoState>);
119+
120+
fn main() {
121+
// Trigger auto-trait check for one of the cyclic types
122+
is_send::<Kernel>();
123+
}
124+
125+
fn is_send<T: Send>() {}
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
error[E0107]: missing generics for struct `Arc`
2+
--> $DIR/cycle-cache-err-150907.rs:100:18
3+
|
4+
LL | vfork_event: Arc,
5+
| ^^^ expected at least 1 generic argument
6+
|
7+
help: add missing generic argument
8+
|
9+
LL | vfork_event: Arc<T>,
10+
| +++
11+
12+
error: aborting due to 1 previous error
13+
14+
For more information about this error, try `rustc --explain E0107`.

0 commit comments

Comments
 (0)