Commit 0512f5a

Tighten compilation cache invariants around eagle
I'm recording my understanding of how eagle and the compilation cache work after discussing vllm-project#17211 with @luyuzhe111 and @WoosukKwon. In the future we will likely want to torch.compile multiple pieces of code (e.g. the decoder and the encoder separately), and then we'll need to refactor the system to support that (each compiled region needs its own cache directory with its own hash). Until then, the current design seems fine.

Signed-off-by: rzou <[email protected]>
1 parent 22481fb commit 0512f5a

File tree

1 file changed: 16 additions, 0 deletions

vllm/compilation/backends.py

Lines changed: 16 additions & 0 deletions
@@ -417,6 +417,22 @@ def __call__(self, graph: fx.GraphModule, example_inputs) -> Callable:
             self.compilation_config.cache_dir = cache_dir
 
         if compilation_counter.num_graphs_seen > 0:
+            # NOTE: Eagle compilation
+            # The eagle head is a separate model that gets run, so it needs
+            # its own cache dir (each cache dir is 1:1 with a model.forward).
+            #
+            # We currently assume that the eagle head does not need its own
+            # hash: in the vLLM repo, the hash of the original model currently
+            # entirely determines the config of the eagle head.
+            # It's very possible that this assumption will change in the
+            # future and we'll need to update this code.
+            #
+            # If you are here because you are using multiple torch.compile
+            # calls in a single model, please open an issue and let's discuss.
+            speculative_config = self.vllm_config.speculative_config
+            assert speculative_config is not None
+            assert speculative_config.method.use_eagle()
             cache_dir = self.compilation_config.cache_dir + \
                 f'-{compilation_counter.num_graphs_seen}'
         else:
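The naming scheme the diff relies on can be sketched as follows. This is a simplified illustration, not vLLM's actual code: the helper name `cache_dir_for_graph` is hypothetical, but it captures the invariant the commit documents, i.e. the first compiled graph (the target model) owns the base cache directory, and each additional compiled graph (such as the eagle head) gets its own suffixed directory so compiled artifacts never collide.

```python
# Hypothetical sketch of the per-graph cache-dir scheme described in the
# commit; the function name is invented for illustration.

def cache_dir_for_graph(base_cache_dir: str, num_graphs_seen: int) -> str:
    """Return the cache directory for the num_graphs_seen-th compiled graph.

    Graph 0 (the main model) uses the base directory unchanged; every
    later graph gets its own '-<index>'-suffixed directory.
    """
    if num_graphs_seen == 0:
        return base_cache_dir
    return f"{base_cache_dir}-{num_graphs_seen}"


print(cache_dir_for_graph("/tmp/vllm_cache/abc123", 0))  # /tmp/vllm_cache/abc123
print(cache_dir_for_graph("/tmp/vllm_cache/abc123", 1))  # /tmp/vllm_cache/abc123-1
```

Note that the suffix reuses the graph index rather than a new hash, which is exactly the assumption called out in the diff's comment: the eagle head's config is entirely determined by the original model's hash, so a distinct directory (not a distinct hash) is sufficient for now.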
