This issue was moved to a discussion.
Add GPU support to ggml #914
Labels: enhancement, hardware, help wanted, research 🔬
### Intro
This issue is more suitable for the https://github.com/ggerganov/ggml repo, but I am adding it here for more visibility.

First, I don't see us adding a GPU framework that is tightly integrated with `ggml` anytime soon, because that usually comes with significant maintenance drawbacks, architecture changes and issues. However, there is an alternative approach that should be relatively easy to implement, and I think it would be a very cool way for new developers to join in and help.

### Description
`ggml` produces computation graphs, which are basically directed acyclic graphs (DAGs) that can be easily exported, iterated, etc. A graph contains the information about all tensor operations and buffers needed to evaluate the model. The idea is to first add basic `ggml` functionality for exporting the graphs in some trivial text format, which can then be parsed as a second step by a separate `ggml` tool. Having the exported graphs, one can process them and construct hardware-specific code for evaluating them.

For example, a `ggml-cuda` tool could parse the exported graph and construct the necessary CUDA kernels and GPU buffers to evaluate it on an NVIDIA GPU. Another tool, for example `ggml-mps`, could do the same for Metal Performance Shaders, and so on.

This approach preserves the cross-platform nature of `ggml` and allows custom hardware support via compiler-like translation of the exported computation graphs. Still, implementing the respective kernels remains the biggest obstacle.
I think this decoupled approach would make the development process much easier and could potentially allow for some interesting optimizations. My biggest fear about adding a tightly integrated GPU backend to `ggml` is that I don't know the important details of supporting the respective backend, which could lead to bad software design decisions that in turn could negatively affect even the core CPU implementation. With the approach proposed in this issue, we eliminate this risk and allow multiple independent implementations to be provided without any negative side effects on the core `ggml` implementation.

Another cool thing about this idea is that there could be separate lead developers for each backend. So if you have good knowledge and understanding of a certain hardware architecture, you are one step away from initiating the kernel "translation" process and making a very significant contribution to the project.
### Guiding principles
I don't know all the specifics of GPU code, but I believe one could try to adopt the fundamental principles of `ggml`.

For example, there could be a single memory buffer allocated up front, with all tensors placed at certain offsets within it. Each graph operation would correspond to a kernel that takes source tensors as input and writes to a destination tensor, all of them living inside that single memory buffer allocated at the start of execution.
Additionally, I don't think we need to explicitly add 3rd-party dependencies (e.g. the CUDA SDK, OpenCL, etc.) to `ggml` to achieve this. The new `ggml` tools will simply generate code, which will be up to the user to compile and run.

I've heard of the concept of "super-shaders" / "super-kernels" - probably this is something we should try to achieve. Taking shortcuts and making custom hacks in favor of better performance is very welcome.
### Why?
Currently, `ggml` is one of the few ML frameworks that provides efficient 4-bit quantization and demonstrates its effective application to transformer evaluation. The code is compact and easily comprehensible, with very little bloat. I think `ggml` has a slight leading edge in this regard compared to other general-purpose frameworks, and if we utilize it now, it has the potential of becoming a very respectable machine learning framework with a focus on on-device inference.

### Links
Starting point: .dot file of ggml_graph can not be generated to .png file #589 (comment)
Sample graph for LLaMA 7B: