AQLM custom kernels for Android #1

Open · wants to merge 15 commits into base: v0.3.0_branch

Conversation

BlackSamorez (Owner) commented Aug 7, 2024

This PR contains all the modifications to ExecuTorch v0.3.0 needed to run AQLM models on an Android device.

It is designed to be compatible with the build and deploy process of the ExecuTorch Llama demo app.

@@ -467,6 +467,9 @@ if(EXECUTORCH_BUILD_KERNELS_CUSTOM)
add_subdirectory(
${CMAKE_CURRENT_SOURCE_DIR}/examples/models/llama2/custom_ops
)
add_subdirectory(
BlackSamorez (Owner Author):

Instruct CMake to traverse the subdirectory containing the AQLM kernels. Keeping the directory in the same CMake tree makes it easier to link the custom operator libraries.

CMakeLists.txt Outdated
@@ -633,13 +636,15 @@ if(EXECUTORCH_BUILD_PYBIND)
# TODO(larryliu): Fix macOS 2 dylibs having 2 sets of static variables issue
if(EXECUTORCH_BUILD_KERNELS_CUSTOM_AOT AND NOT APPLE)
list(APPEND _dep_libs custom_ops_aot_lib)
list(APPEND _dep_libs aqlm)
BlackSamorez (Owner Author):

Link the custom kernel to the portable_lib library, but only if custom operators are being compiled.

@@ -633,13 +636,16 @@ if(EXECUTORCH_BUILD_PYBIND)
# TODO(larryliu): Fix macOS 2 dylibs having 2 sets of static variables issue
if(EXECUTORCH_BUILD_KERNELS_CUSTOM_AOT AND NOT APPLE)
list(APPEND _dep_libs custom_ops_aot_lib)
list(APPEND _dep_libs aqlm_aot_lib)
BlackSamorez (Owner Author), Aug 7, 2024:

Add torch bindings for AQLM to the portable_lib.

endif()
# TODO(laryliu): Fix linux duplicate registation problem. In GH CI worker
# libcustom_ops.a doesn't dedup with the one indirectly linked from
# libcustom_ops_aot_lib.a
if(EXECUTORCH_BUILD_KERNELS_CUSTOM AND APPLE)
target_link_options_shared_lib(custom_ops)
list(APPEND _dep_libs custom_ops)
target_link_options_shared_lib(aqlm)
BlackSamorez (Owner Author), Aug 7, 2024:

Force the linkage of core aqlm ops to portable_lib. This step is NECESSARY for the EXECUTORCH_LIBRARY macro to work. Otherwise, kernels won't be properly loaded during startup.

@@ -699,7 +705,7 @@ if(EXECUTORCH_BUILD_PYBIND)
PROPERTIES # Assume that this library will be installed in
# `site-packages/executorch/extension/pybindings`, and that
# the custom_ops_aot_lib should be found with relative path.
BUILD_RPATH "$ORIGIN:$ORIGIN/../../examples/models/llama2/custom_ops"
BUILD_RPATH "$ORIGIN:$ORIGIN/../../examples/models/llama2/custom_ops:$ORIGIN/../../examples/models/llama2/aqlm"
BlackSamorez (Owner Author):

IDK

@@ -87,6 +87,7 @@ endif()
# custom ops library
if(EXECUTORCH_BUILD_KERNELS_CUSTOM)
add_subdirectory(custom_ops)
add_subdirectory(aqlm)
BlackSamorez (Owner Author):

Traverse the subdirectory containing the AQLM code.

@@ -129,6 +130,9 @@ list(APPEND link_libraries quantized_kernels quantized_ops_lib)
if(EXECUTORCH_BUILD_KERNELS_CUSTOM)
target_link_options_shared_lib(custom_ops)
list(APPEND link_libraries custom_ops)

target_link_options_shared_lib(aqlm)
BlackSamorez (Owner Author):

Force the linkage of core aqlm ops to the llama_runner executable. This step is NECESSARY for the EXECUTORCH_LIBRARY macro to work. Otherwise, kernels won't be properly loaded during startup.

@@ -0,0 +1,111 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
BlackSamorez (Owner Author):

Mostly copied from examples/models/llama2/custom_ops/CMakeLists.txt.

# list(APPEND aqlm_libs OpenMP::OpenMP_CXX)
# list(APPEND aqlm_libs omp)

add_library(aqlm ${_aqlm__srcs})
BlackSamorez (Owner Author), Aug 7, 2024:

A library containing the core AQLM ops and an EXECUTORCH_LIBRARY macro invocation for their automatic registration into the ExecuTorch runtime when linked with target_link_options_shared_lib.
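For reference, the registration presumably looks something like the minimal sketch below; the exact kernel signature is an assumption for illustration, not the PR's actual code.

#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {
namespace native {

// Out-variant kernel implemented in lut_kernel.cpp; it fills `out` and returns it.
Tensor& code2x8_lut_matmat_out(
    RuntimeContext& ctx,
    const Tensor& input,
    const Tensor& codes,
    const Tensor& codebooks,
    const Tensor& scales,
    Tensor& out);

} // namespace native
} // namespace executor
} // namespace torch

// Registers the kernel under the "aqlm" namespace via a static initializer.
// That initializer is exactly what the linker drops unless the library is
// linked whole-archive, hence the target_link_options_shared_lib calls above.
EXECUTORCH_LIBRARY(
    aqlm,
    "code2x8_lut_matmat.out",
    torch::executor::native::code2x8_lut_matmat_out);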

# Add a AOT library
find_package(Torch CONFIG REQUIRED)
add_library(
aqlm_aot_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/lut_kernel_pytorch.cpp
BlackSamorez (Owner Author):

A library to be loaded into PyTorch with torch.ops.load_library. It contains TORCH_LIBRARY macro invocations to register the AQLM operations and provide implementations for them.
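Such a binding follows the standard PyTorch custom-op pattern; a sketch is below. The op namespace and name match the torch.ops.aqlm.code2x8_lut_matmat call seen later in this PR, but the schema and the stub body are assumptions, not the exact contents of lut_kernel_pytorch.cpp.

#include <ATen/ATen.h>
#include <torch/library.h>

// Stub standing in for the real AOT implementation.
at::Tensor code2x8_lut_matmat(
    const at::Tensor& input,
    const at::Tensor& codes,
    const at::Tensor& codebooks,
    const at::Tensor& scales,
    const c10::optional<at::Tensor>& bias) {
  TORCH_CHECK(false, "illustrative stub only");
  return at::Tensor();
}

// Declares the op schema and binds the implementation. After
// torch.ops.load_library(...) the op becomes callable from Python as
// torch.ops.aqlm.code2x8_lut_matmat(...).
TORCH_LIBRARY(aqlm, m) {
  m.def(
      "code2x8_lut_matmat(Tensor input, Tensor codes, Tensor codebooks, "
      "Tensor scales, Tensor? bias) -> Tensor");
  m.impl("code2x8_lut_matmat", code2x8_lut_matmat);
}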

self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous()

def forward(self, input: torch.Tensor) -> torch.Tensor:
return torch.ops.aqlm.code2x8_lut_matmat(
BlackSamorez (Owner Author), Aug 7, 2024:

Invoke the custom op loaded from lut_kernel.cpp; the op is loaded when lut_kernel.py is imported.

self.register_parameter("bias", None)

def transpose_codes(self):
self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous()
BlackSamorez (Owner Author):

The codes layout used by the C++ kernels differs from the CUDA one, so the weights need some preprocessing.

#include <numeric>
#include <functional>

#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
BlackSamorez (Owner Author):

Needed for the EXECUTORCH_LIBRARY macro.

#include <executorch/runtime/core/exec_aten/util/dim_order_util.h>
#include <executorch/runtime/core/exec_aten/util/scalar_type_util.h>

#include <executorch/kernels/optimized/blas/CPUBlas.h>
BlackSamorez (Owner Author):

For ::executorch::cpublas::gemm

namespace torch {
namespace executor {
namespace native {
Tensor& code2x8_lut_matmat_out(
BlackSamorez (Owner Author):

These are torch::executor::Tensor, which offers far fewer operations than torch::Tensor.
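In practice that means working with sizes and raw data pointers directly; a small illustrative sketch (not the PR's code, and the helper name is made up):

#include <cstddef>
#include <executorch/runtime/core/exec_aten/exec_aten.h>

using exec_aten::Tensor;

// There is no operator library on these tensors (no matmul, no permute, no
// views); you get metadata and raw pointers and build everything else yourself.
void scale_in_place(Tensor& t, float factor) {
  float* data = t.mutable_data_ptr<float>();
  const size_t n = static_cast<size_t>(t.numel());
  for (size_t i = 0; i < n; ++i) {
    data[i] *= factor;
  }
}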

auto num_input_vectors = std::accumulate(input_sizes.begin(), input_sizes.end(), 1, std::multiplies<int64_t>()) / input_vector_size;

// Allocate LUT
auto lut_data = ctx.allocate_temp(
BlackSamorez (Owner Author):

We need to manually allocate all the temporary memory we use. One way to do so is to invoke allocate_temp on the operation's RuntimeContext; just make sure that the context is provided with a temp allocator.
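Roughly, the pattern is the one sketched below (an illustrative helper with a made-up name, not the PR's exact code):

#include <cstddef>
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {
namespace native {

// allocate_temp returns a Result<void*>; it only succeeds if the runtime
// context was created with a temp allocator, so the result must be checked.
float* allocate_float_scratch(RuntimeContext& ctx, size_t num_floats) {
  auto result = ctx.allocate_temp(num_floats * sizeof(float));
  return result.ok() ? static_cast<float*>(result.get()) : nullptr;
}

} // namespace native
} // namespace executor
} // namespace torch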

out_features
);

return out;
BlackSamorez (Owner Author):

The _out variant of an operation returns its out argument.

@@ -489,7 +489,7 @@ def run(self):
"-DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON", # add llama sdpa ops to pybindings.
"-DEXECUTORCH_BUILD_KERNELS_CUSTOM_AOT=ON",
]
build_args += ["--target", "custom_ops_aot_lib"]
BlackSamorez (Owner Author):

Build aqlm_aot_lib for the python library.

@@ -569,6 +569,13 @@ def get_ext_modules() -> list[Extension]:
"executorch/examples/models/llama2/custom_ops",
)
)
ext_modules.append(
BlackSamorez (Owner Author):

Add the compiled dynamic library with AQLM bindings to the pip installation.

dtype=torch.int8,
),
requires_grad=False,
) # [num_in_groups, num_out_groups, num_codebooks]
BlackSamorez (Owner Author):

NOTE: different from the usual AQLM layout.

).get();

// A @ B.T
::executorch::cpublas::gemm(
BlackSamorez (Owner Author):

There is no matmul available on these tensors, so we have to use low-level BLAS ops.
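For reference, mapping a row-major A @ B.T onto the gemm call looks roughly like the sketch below. The transpose flags and leading dimensions follow the standard column-major BLAS convention and are not copied from the PR.

#include <cstdint>
#include <executorch/kernels/optimized/blas/CPUBlas.h>

// C = A @ B^T with row-major A [M x K], B [N x K], C [M x N].
// gemm uses the column-major BLAS convention, so the row-major buffers are
// reinterpreted as transposed column-major matrices.
void matmul_a_bt(
    const float* A, const float* B, float* C,
    int64_t M, int64_t N, int64_t K) {
  using ::executorch::cpublas::TransposeType;
  ::executorch::cpublas::gemm(
      TransposeType::Transpose,    // B's [K x N] column-major view, transposed
      TransposeType::NoTranspose,  // A's [K x M] column-major view
      /*m=*/N, /*n=*/M, /*k=*/K,
      /*alpha=*/1.0f,
      /*a=*/B, /*lda=*/K,
      /*b=*/A, /*ldb=*/K,
      /*beta=*/0.0f,
      /*c=*/C, /*ldc=*/N);
}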

@@ -130,7 +130,7 @@ class ExecuTorchLlamaJni
facebook::jni::alias_ref<ExecuTorchLlamaCallbackJni> callback) {
runner_->generate(
prompt->toStdString(),
128,
BlackSamorez (Owner Author):

Increase context length for more meaningful generations.

target_include_directories(
aqlm PRIVATE "${CMAKE_CURRENT_BINARY_DIR}/../../../../include"
)
target_link_libraries(aqlm PUBLIC ${aqlm_libs} -fopenmp -static-openmp)
BlackSamorez (Owner Author):

Additional flags for OMP to link on Android

cpublas
eigen_blas
quantized_kernels
quantized_ops_lib
-fopenmp
BlackSamorez (Owner Author):

Additional flags for OMP to link on Android

BlackSamorez (Owner Author):

Update: added OpenMP for a 2.5x speedup on 4 cores.

larryliu0820:

@BlackSamorez thanks for working on this; I really appreciate the feedback you gave in the blog post. Looking at this PR, I have some takeaways on how to improve ExecuTorch to make this flow easier. Let me know if they make sense!

  • Documentation:
    • README.md should include instructions on how to write CMake for custom kernels.
    • Maybe provide an overview of the CMake build system?
  • Packaging:
    • Provide better tools for users like you to use ExecuTorch as a library, instead of pulling the source code and modifying it in place.

I'm curious to learn: where did you spend most of your time in order to make this work?

Comment on lines +53 to +56
values_vec = vmulq_f32(values_vec, scales_vec);
if (bias != nullptr) {
values_vec = vaddq_f32(values_vec, bias_vec);
}


You probably want to issue an FMA (https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#fused-multiply-accumulate) here if bias is not nullptr. I would also recommend generating a separate function (e.g., by adding a template parameter to ignore the bias) so that you get a separate kernel for the with-bias and without-bias cases and can be sure not to pay the cost of the test and branch on every iteration.

Some amount of loop unrolling is also probably advisable; hopefully the compiler will do that for you, but I would recommend checking the generated assembly.
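A sketch of that suggestion, as a hypothetical standalone scale-and-bias epilogue rather than the PR's actual kernel:

#include <arm_neon.h>

// Templating on HasBias produces two specialized kernels, so the bias check is
// resolved at compile time rather than branching on every iteration. With a
// bias, vfmaq_f32 fuses the multiply and the add into a single instruction.
// Tail elements (n % 4) are omitted for brevity.
template <bool HasBias>
void scale_and_bias(float* values, const float* scales, const float* bias, int n) {
  for (int i = 0; i + 4 <= n; i += 4) {
    float32x4_t v = vld1q_f32(values + i);
    float32x4_t s = vld1q_f32(scales + i);
    if (HasBias) {
      float32x4_t b = vld1q_f32(bias + i);
      v = vfmaq_f32(b, v, s);  // b + v * s
    } else {
      v = vmulq_f32(v, s);
    }
    vst1q_f32(values + i, v);
  }
}

// Dispatch once, outside the hot loop:
//   bias != nullptr ? scale_and_bias<true>(values, scales, bias, n)
//                   : scale_and_bias<false>(values, scales, nullptr, n);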

const int b_alt_stride = 2 * out_features;

for (int input = 0; input < num_inputs; ++input) {
#pragma omp parallel for num_threads(4)


FWIW, I don't think OMP works on iOS.

Comment on lines +30 to +33
for (int i = 0; i < out_features; ++i) {
output_vec[input * out_features + i] += lut_ptr[b_alt_ptr[i * 2]];
output_vec[input * out_features + i] += lut_ptr[256 + b_alt_ptr[i * 2 + 1]];
}


I would recommend experimenting with unrolling this loop. With clang, you put #pragma unroll 4 (or whatever unroll count) on the line before the for; Google says (https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html) the GCC equivalent would be #pragma GCC unroll 4.
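Applied to the loop quoted above, that looks like this (same body, clang pragma syntax):

// Ask the compiler to unroll the accumulation loop by 4
// (#pragma GCC unroll 4 under GCC).
#pragma unroll 4
for (int i = 0; i < out_features; ++i) {
  output_vec[input * out_features + i] += lut_ptr[b_alt_ptr[i * 2]];
  output_vec[input * out_features + i] += lut_ptr[256 + b_alt_ptr[i * 2 + 1]];
}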

swolchok:

one other note: have you compared performance with the AQLM paper's numba kernel? It looks like that setup is using a JIT to specialize over (in_group_size, out_features, in_features, num_codebooks), which should do a better job of exposing optimization opportunities to the compiler; if you find there is a performance gap (e.g., running both on an ARM or x86 server/laptop/whatever) you might want to experiment with templatizing your kernel over a similar suite of parameters.

BlackSamorez (Owner Author):

It looks like that setup is using a JIT to specialize over (in_group_size, out_features, in_features, num_codebooks)

No, in reality, it's also using [num_in_groups, num_out_groups, num_codebooks]. The comment with shapes is wrong.
That layout is much faster than [num_out_groups, num_in_groups, num_codebooks]. Mostly because the LUT memory accesses are into a contiguous memory array in the innermost loop when in_features is ~last dim.

swolchok:

It looks like that setup is using a JIT to specialize over (in_group_size, out_features, in_features, num_codebooks)

No, in reality, it's also using [num_in_groups, num_out_groups, num_codebooks]. The comment with shapes is wrong. That layout is much faster than [num_out_groups, num_in_groups, num_codebooks]. Mostly because the LUT memory accesses are into a contiguous memory array in the innermost loop when in_features is ~last dim.

I'm not talking so much about the layout as I am talking about specializing over specific values as loop trip counts.

BlackSamorez (Owner Author):

specializing over specific values as loop trip counts

Oh, I see. I forgot that I did that for the Numba kernel. Yes, we could templatize the code and instantiate all the Llama shapes there are with an eager fallback. Thanks!
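A sketch of that specialization idea, using a simplified single-input version of the inner loop quoted earlier; the function names and the two hard-coded shapes are illustrative:

#include <cstdint>

// Fallback with runtime trip counts.
void lut_accumulate_generic(
    const float* lut, const uint8_t* codes, float* out, int out_features) {
  for (int i = 0; i < out_features; ++i) {
    out[i] += lut[codes[i * 2]] + lut[256 + codes[i * 2 + 1]];
  }
}

// Same loop with the trip count baked in at compile time, so the compiler can
// fully unroll and vectorize each instantiated shape.
template <int kOutFeatures>
void lut_accumulate_fixed(const float* lut, const uint8_t* codes, float* out) {
  for (int i = 0; i < kOutFeatures; ++i) {
    out[i] += lut[codes[i * 2]] + lut[256 + codes[i * 2 + 1]];
  }
}

// Dispatch on the shapes that actually occur in the model; fall back otherwise.
void lut_accumulate(
    const float* lut, const uint8_t* codes, float* out, int out_features) {
  switch (out_features) {
    case 4096:  lut_accumulate_fixed<4096>(lut, codes, out); break;
    case 11008: lut_accumulate_fixed<11008>(lut, codes, out); break;
    default:    lut_accumulate_generic(lut, codes, out, out_features); break;
  }
}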
