# ExecuTorch Vulkan Delegate

The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
built on top of the cross-platform Vulkan GPU API standard. It is primarily
designed to leverage the GPU to accelerate model inference on Android devices,
but can be used on any platform that supports an implementation of Vulkan:
laptops, servers, and edge devices.

::::{note}
The Vulkan delegate is currently under active development, and its components
are subject to change.
::::

## What is Vulkan?

Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
It is designed to give developers more explicit control over the GPU than
previous specifications, in order to reduce overhead and maximize the
capabilities of modern graphics hardware.

Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
desktop and mobile) in the market support Vulkan. Vulkan is also included in
Android from Android 7.0 onwards.

**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it
provides a way to execute compute and graphics operations on a GPU, but does not
come with a built-in library of performant compute kernels.

## The Vulkan Compute Library

The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
provide GPU implementations for PyTorch operators via GLSL compute shaders.

The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
The core components of the PyTorch Vulkan backend were forked into ExecuTorch
and adapted for an ahead-of-time (AOT), graph-mode style of model inference (as
opposed to PyTorch's eager execution style of model inference).

The components of the Vulkan Compute Library are contained in the
`executorch/backends/vulkan/runtime/` directory. The core components are listed
and described below:

```
runtime/
├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
└── graph/ .................. ComputeGraph class which implements graph mode inference
    └── ops/ ................ Base directory for operator implementations
        ├── glsl/ ........... GLSL compute shaders
        │   ├── *.glsl
        │   └── conv2d.glsl
        └── impl/ ........... C++ code to dispatch GPU compute shaders
            ├── *.cpp
            └── Conv2d.cpp
```

## Features

The Vulkan delegate currently supports the following features:

* **Memory Planning**
  * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
* **Capability Based Partitioning**
  * A graph can be partially lowered to the Vulkan delegate via a partitioner, which identifies nodes (i.e. operators) that are supported by the Vulkan delegate and lowers only the supported subgraphs.
* **Support for upper-bound dynamic shapes**
  * Tensors can change shape between inferences as long as their current shapes are smaller than the bounds specified during lowering.
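
To make the memory planning idea above concrete, here is a toy sketch of lifetime-based buffer sharing. This is an illustrative example only, not ExecuTorch's actual planner: tensors whose live ranges do not overlap are assigned to the same allocation, so peak memory is bounded by the number of simultaneously-live tensors rather than the total number of intermediates.

```python
# Toy illustration (NOT ExecuTorch's actual memory planner): tensors whose
# lifetimes [first_use, last_use] do not overlap may share one buffer.
def plan_memory(lifetimes):
    """lifetimes: dict name -> (first_use, last_use). Returns name -> buffer id."""
    buffers = []      # buffer id -> last_use of the tensor currently occupying it
    assignment = {}
    # Visit tensors in order of first use.
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for buf_id, busy_until in enumerate(buffers):
            if busy_until < start:        # previous occupant is dead; reuse buffer
                buffers[buf_id] = end
                assignment[name] = buf_id
                break
        else:                             # no free buffer; allocate a new one
            buffers.append(end)
            assignment[name] = len(buffers) - 1
    return assignment

plan = plan_memory({"a": (0, 1), "b": (1, 2), "c": (2, 3)})
# "a" and "c" are never live at the same time, so they share buffer 0:
# {"a": 0, "b": 1, "c": 0} -- two allocations instead of three
```

The real planner works on the lowered graph's tensor lifetimes, but the core interval-reuse idea is the same.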

In addition to increasing operator coverage, the following features are
currently in development:

* **Quantization Support**
  * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
* **Memory Layout Management**
  * Memory layout is an important factor in optimizing performance. We plan to introduce graph passes that insert memory layout transitions throughout a graph to optimize memory-layout-sensitive operators such as Convolution and Matrix Multiplication.
* **Selective Build**
  * We plan to make it possible to control build size by selecting which operators/shaders to include in the build.

## End to End Example

To further understand the features of the Vulkan Delegate and how to use it,
consider the following end-to-end example with MobileNet V2.

### Compile and lower a model to the Vulkan Delegate

Assuming ExecuTorch has been set up and installed, the following script can be
used to produce a lowered MobileNet V2 model as `vulkan_mobilenetv2.pte`.

```python
import torch
import torchvision.models as models

from torch.export import export, ExportedProgram
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
from executorch.exir import EdgeProgramManager, ExecutorchProgramManager, to_edge
from executorch.exir.backend.backend_api import to_backend

mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

exported_program: ExportedProgram = export(mobilenet_v2, sample_inputs)
edge: EdgeProgramManager = to_edge(exported_program)

# Lower the model to the Vulkan backend
edge = edge.to_backend(VulkanPartitioner())

exec_prog = edge.to_executorch()

with open("vulkan_mobilenetv2.pte", "wb") as file:
    exec_prog.write_to_file(file)
```

Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
using the `to_backend()` API. The Vulkan Delegate implements the
`VulkanPartitioner` class, which identifies nodes (i.e. operators) in the graph
that are supported by the Vulkan delegate and separates compatible sections of
the model to be executed on the GPU.

This means that a model can be lowered to the Vulkan delegate even if it
contains some unsupported operators; only the supported parts of the graph will
be executed on the GPU.
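
To illustrate capability-based partitioning, here is a toy sketch (not the real `VulkanPartitioner` logic, and the set of supported operator names is invented for the example): maximal runs of supported operators are grouped into segments that would be handed to the delegate, while everything else stays on the portable CPU operators.

```python
# Toy illustration of capability-based partitioning (NOT the real
# VulkanPartitioner). The supported-op set below is hypothetical.
SUPPORTED = {"add", "mul", "conv2d"}

def partition(ops):
    """Group maximal runs of supported ops into delegated segments."""
    segments, current = [], []
    for op in ops:
        if op in SUPPORTED:
            current.append(op)       # extend the current delegated segment
        else:
            if current:              # unsupported op breaks the segment
                segments.append(current)
                current = []
    if current:
        segments.append(current)
    return segments

partition(["conv2d", "add", "softmax", "mul"])
# -> [["conv2d", "add"], ["mul"]]: two delegated segments, with "softmax"
#    left to run on the portable operators between them
```

The real partitioner operates on graph nodes rather than op-name strings, but the outcome is the same: each contiguous supported region becomes a lowered subgraph.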

::::{note}
The [Vulkan partitioner code](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/vulkan_partitioner.py)
can be inspected to examine which ops are currently implemented in the Vulkan
delegate.
::::

### Build Vulkan Delegate libraries

The easiest way to build and test the Vulkan Delegate is to build for Android
and test on a local Android device. Android devices have built-in support for
Vulkan, and the Android NDK ships with a GLSL compiler, which is needed to
compile the Vulkan Compute Library's GLSL compute shaders.

The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
when building with CMake.

First, make sure that you have the Android NDK installed - Android NDK r25c is
recommended. The Android SDK should also be installed so that you have access
to `adb`.

```shell
# Recommended version is Android NDK r25c.
export ANDROID_NDK=<path_to_ndk>
# Select an appropriate Android ABI
export ANDROID_ABI=arm64-v8a
# All subsequent commands should be performed from the ExecuTorch repo root
cd <path_to_executorch_root>
# Make sure adb works
adb --version
```

To build and install ExecuTorch libraries (for Android) with the Vulkan
Delegate:

```shell
# From executorch root directory
(rm -rf cmake-android-out && \
  cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=$ANDROID_ABI \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-android-out && \
  cmake --build cmake-android-out -j16 --target install)
```

### Run the Vulkan model on device

::::{note}
Since operator support is currently limited, only binary arithmetic operators
will run on the GPU. Expect inference to be slow as the majority of operators
are being executed via Portable operators.
::::

Now, the partially delegated model can be executed on your device's GPU!

```shell
# Build a model runner binary linked with the Vulkan delegate libs
cmake --build cmake-android-out --target vulkan_executor_runner -j32

# Push model to device
adb push vulkan_mobilenetv2.pte /data/local/tmp/vulkan_mobilenetv2.pte
# Push binary to device
adb push cmake-android-out/backends/vulkan/vulkan_executor_runner /data/local/tmp/runner_bin

# Run the model
adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vulkan_mobilenetv2.pte
```