A fine-grained layer-level LLM profiler that calculates the algorithmic operations (e.g. ADD, MULT), parameters, cache (buffers) and activations.
We use virtualenv in this repository. To set up the Python environment, run the following in the terminal:
$ virtualenv env
$ source env/bin/activate
(env) $ pip install -r requirements.txt
In the root directory of this repo, run:
(env) $ python3 simple_test.py
The output in the terminal should be
model: meta-llama/Llama-3.2-1B-Instruct
transformer with input batchsize 16 and sequence length 200:
ADD: 58.28 (10^9)
MULT: 59.17 (10^9)
DIV: 3.49 (10^6)
SQRT: 16
PARAMS: 1.24 (10^9)
CACHE: 59.10 (10^6)
ACT: 327.18 (10^6)
The following instructions are based on the structure of simple_test.py. Give the model name (MODEL_ID, which should be available on HuggingFace), the input batch size BATCH_SIZE, and the maximum sequence length TEXT_LENGTH. If you don't have a HuggingFace access token, create one according to here.
from src import get_model_config, get_profile
MODEL_ID="meta-llama/Llama-3.2-1B-Instruct"
HF_TOKEN="Your HuggingFace access token here"
BATCH_SIZE=16
TEXT_LENGTH=200
model_config = get_model_config(model_id = MODEL_ID, hf_token = HF_TOKEN)
kwargs = {
    'model_config': model_config,
    'input_shape': [BATCH_SIZE, TEXT_LENGTH]
}
Get the total transformer FLOPs (including ADD and MULT) during inference (a single forward pass):
get_profile(kwargs, "flops")
The output in the terminal should be something like
transformer with input batchsize 16 and sequence length 200:
FLOPs (ADD and MULT): 117.30 (10^9)
get_profile(kwargs, "mult")
The output in the terminal should be something like
transformer with input batchsize 16 and sequence length 200:
MULT: 59.02 (10^9)
get_profile(kwargs, "mult", "attn")
The output in the terminal should be something like
attn with input batchsize 16 and sequence length 200:
MULT: 45.22 (10^9)
get_profile(kwargs, "params", "attn")
The output in the terminal should be something like
attn with input batchsize 16 and sequence length 200:
PARAMS: 2.68 (10^9)
get_profile(kwargs, "act")
The output in the terminal should be something like
transformer with input batchsize 16 and sequence length 200:
ACT: 15.58 (10^9)
get_profile(kwargs, "act", "mlp")
The output in the terminal should be something like
mlp with input batchsize 16 and sequence length 200:
ACT: 12.88 (10^9)
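Taken together, the calls above follow one pattern: get_profile(kwargs, METRIC) for the whole transformer and get_profile(kwargs, METRIC, SCOPE) for a sub-module. The sketch below simply sweeps the metric strings and scopes demonstrated above; it is not part of simple_test.py, and it relies only on get_profile printing its result (no return value is assumed).

```python
# Sketch: sweep the metrics and scopes demonstrated in this README.
# Only the metric strings ("flops", "mult", "params", "act") and the
# scopes ("attn", "mlp") shown above are used.
from src import get_model_config, get_profile

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"
HF_TOKEN = "Your HuggingFace access token here"
BATCH_SIZE = 16
TEXT_LENGTH = 200

model_config = get_model_config(model_id=MODEL_ID, hf_token=HF_TOKEN)
kwargs = {
    'model_config': model_config,
    'input_shape': [BATCH_SIZE, TEXT_LENGTH]
}

for metric in ["flops", "mult", "params", "act"]:
    get_profile(kwargs, metric)             # whole transformer
    for scope in ["attn", "mlp"]:
        get_profile(kwargs, metric, scope)  # attention / MLP layers only
```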
For how to set kwargs and what kinds of data are available, please refer to docs/kwargs.
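For example, since kwargs is a plain dict, sweeping input shapes only requires rebuilding it. The sketch below uses nothing beyond the get_model_config/get_profile calls shown above; the shape values are arbitrary examples.

```python
# Sketch: profile the same model under several (batch size, sequence length)
# pairs by rebuilding kwargs. The shape values here are arbitrary examples.
from src import get_model_config, get_profile

model_config = get_model_config(model_id="meta-llama/Llama-3.2-1B-Instruct",
                                hf_token="Your HuggingFace access token here")

for batch_size, text_length in [(1, 128), (16, 200), (32, 1024)]:
    kwargs = {
        'model_config': model_config,
        'input_shape': [batch_size, text_length]
    }
    get_profile(kwargs, "flops")  # total ADD + MULT for this input shape
```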
Arithmetic Operations that are considered:
- ADD: element-wise addition
- MULT: element-wise multiplication
- DIV: vector or tensor division; can be implemented differently according to hardware architectures, etc.
- SQRT: vector or tensor square root; can be implemented differently according to the computing systems, etc.
- FLOPs: element-wise addition and multiplication
Dynamics that are considered:
- PARAMS: the layers' parameters or the transformer parameters
- ACT: activations (intermediate outputs between the layers) of the layers or the transformer
- CACHE: temporary inner buffers of the layers or the transformer (not necessarily the KV cache) during the forward pass
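The quick-start output above reports all of these quantities at once. For individual queries, only the "flops", "mult", "params", and "act" metric strings are demonstrated in this README; if the remaining quantities follow the same lowercase naming, they would be queried as in the sketch below, but those strings are an assumption, not something shown in simple_test.py.

```python
# Hypothetical: the metric strings "add" and "cache" are assumed to mirror
# the lowercase names used for "mult"/"params"/"act"; they are not
# demonstrated in simple_test.py. Reuses the kwargs built in the setup above.
get_profile(kwargs, "add")            # element-wise additions
get_profile(kwargs, "cache")          # temporary inner buffers
get_profile(kwargs, "cache", "attn")  # buffers of the attention layers only
```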
NOT considered:
- logistic sigmoid (in activation functions)
- output_hidden_states=True or output_attentions=True in the transformer forward pass (e.g. in LlamaModel.forward), i.e. attention weights outputs are not currently supported by sdpa_attention_forward
- dynamic rotary embedding (currently only the llama3 rotary embedding is supported)
- systematic peak memory (can vary across different hardware)
Plan to consider:
- [ ] self-defined Model Configuration
- [ ] other LLM architectures (e.g. Gemma)
- [ ] other scopes like torch.nn.Linear and torch.nn.Module.named_parameters (e.g. model.layers.12.self_attn.k_proj and model.layers.1.mlp.up_proj)
NOTE: For detailed reference code and an explanation of the FLOPs calculation, sample reference source code is in docs/flops.
The emphasis of llm-profiler is the layer level and fine-grained algorithmic operations, rather than simply describing everything with FLOPs. We separate the intermediate outputs between the layers, the temporary buffers, and the parameters, while other profilers mainly focus on overall operations and system peak memory.
Other profilers are:
- Gives FLOPs per layer; however, it doesn't distinguish addition and multiplication, which can be very different (in latency and energy consumption) when LLMs are deployed on different devices.
- A coarse-grained profiler that gives the FLOPs, memory, CPU usage, etc. at the whole-model level rather than the layer level.
- A more visualized, systematic model-level profiler based on PyTorch Profiler.