1 parent e26d5c6 commit 1131acd
examples/profiling/README.md
@@ -312,4 +312,7 @@ kernels, so this tiny sync balloons to 2.3s.
* As mentioned above, we profiled with regional compilation so it's possible that
there are still some gaps outside the compiled regions. A full compilation
will likely mitigate it. In case it doesn't, the above observations could
-be useful to mitigate that.
+be useful to mitigate that.
+* CUDA Graphs can also help mitigate CPU-overhead issues. Using the
+"reduce-overhead" or "max-autotune" mode in `torch.compile` triggers the
+use of CUDA Graphs.
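
As a rough illustration of the CUDA Graphs note above, the following sketch wraps `torch.compile` with one of the two modes it names. The helper `compile_with_cuda_graphs` and the constant `CUDA_GRAPH_MODES` are hypothetical names introduced only for this example and are not part of this repository or of PyTorch. The fallback to the uncompiled module when no GPU is present is likewise an assumption about what a caller might want.

```python
# Modes that, per the note above, make torch.compile use CUDA Graphs.
CUDA_GRAPH_MODES = ("reduce-overhead", "max-autotune")


def compile_with_cuda_graphs(module, mode="reduce-overhead"):
    """Compile `module` with a mode that enables CUDA Graphs (hypothetical helper)."""
    if mode not in CUDA_GRAPH_MODES:
        raise ValueError(
            f"mode must be one of {CUDA_GRAPH_MODES}, got {mode!r}"
        )

    # Imported lazily so the mode check above works without PyTorch installed.
    import torch

    if not torch.cuda.is_available():
        # Assumption: without a GPU there is nothing to capture in a CUDA
        # Graph, so return the module unchanged.
        return module

    return torch.compile(module, mode=mode)
```

A caller could then write something like `model = compile_with_cuda_graphs(model)` and profile again to check whether the CPU-side gaps shrink. Whether they do depends on the model and on how much of it ends up inside captured graphs.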