A fine-grained layer-level LLM profiler that calculates the algorithmic operations (e.g. ADD, MULT), parameters, cache (buffers) and activations.
We use virtualenv in this repository. To set up the Python environment, run the following in the terminal:
$ virtualenv env
$ source env/bin/activate
(env) $ pip install -r requirements.txt
In the root directory of this repo, run:
(env) $ python3 simple_test.py
The output in the terminal should be
model: meta-llama/Llama-3.2-1B-Instruct
transformer with input batchsize 16 and sequence length 200:
ADD: 58.28 (10^9)
MULT: 59.17 (10^9)
DIV: 3.49 (10^6)
SQRT: 16
PARAMS: 1.24 (10^9)
CACHE: 59.10 (10^6)
ACT: 327.18 (10^6)
The following instructions are based on the structure of simple_test.py. Give the model name (MODEL_ID, which should be available on HuggingFace), the input batch size BATCH_SIZE, and the maximum sequence length TEXT_LENGTH. If you don't have a HuggingFace access token, create one according to here.
from src import get_model_config, get_profile
MODEL_ID="meta-llama/Llama-3.2-1B-Instruct"
HF_TOKEN="Your HuggingFace access token here"
BATCH_SIZE=16
TEXT_LENGTH=200
model_config = get_model_config(model_id = MODEL_ID, hf_token = HF_TOKEN)
kwargs = {
    'model_config': model_config,
    'input_shape': [BATCH_SIZE, TEXT_LENGTH]
}
Get the total transformer FLOPs (including ADD and MULT) during inference (a single forward pass):
get_profile(kwargs, "flops")
The output in the terminal should be something like
transformer with input batchsize 16 and sequence length 200:
FLOPs (ADD and MULT): 117.30 (10^9)
get_profile(kwargs, "mult")
The output in the terminal should be something like
transformer with input batchsize 16 and sequence length 200:
MULT: 59.02 (10^9)
get_profile(kwargs, "mult", "attn")
The output in the terminal should be something like
attn with input batchsize 16 and sequence length 200:
MULT: 45.22 (10^9)
get_profile(kwargs, "params", "attn")
The output in the terminal should be something like
attn with input batchsize 16 and sequence length 200:
PARAMS: 2.68 (10^9)
get_profile(kwargs, "act")
The output in the terminal should be something like
transformer with input batchsize 16 and sequence length 200:
ACT: 15.58 (10^9)
get_profile(kwargs, "act", "mlp")
The output in the terminal should be something like
mlp with input batchsize 16 and sequence length 200:
ACT: 12.88 (10^9)
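Taken together, the calls above follow one pattern: get_profile(kwargs, METRIC) for the whole transformer and get_profile(kwargs, METRIC, SCOPE) for a sub-module. The sketch below simply sweeps the metric strings and scopes demonstrated above; it is not part of simple_test.py, and it relies only on get_profile printing its result (no return value is assumed).

```python
# Sketch: sweep the metrics and scopes demonstrated in this README.
# Only the metric strings ("flops", "mult", "params", "act") and the
# scopes ("attn", "mlp") shown above are used.
from src import get_model_config, get_profile

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"
HF_TOKEN = "Your HuggingFace access token here"
BATCH_SIZE = 16
TEXT_LENGTH = 200

model_config = get_model_config(model_id=MODEL_ID, hf_token=HF_TOKEN)
kwargs = {
    'model_config': model_config,
    'input_shape': [BATCH_SIZE, TEXT_LENGTH]
}

for metric in ["flops", "mult", "params", "act"]:
    get_profile(kwargs, metric)             # whole transformer
    for scope in ["attn", "mlp"]:
        get_profile(kwargs, metric, scope)  # attention / MLP layers only
```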
For how to set kwargs and what kinds of data are available, please refer to docs/kwargs.
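For example, since kwargs is a plain dict, sweeping input shapes only requires rebuilding it. The sketch below uses nothing beyond the get_model_config/get_profile calls shown above; the shape values are arbitrary examples.

```python
# Sketch: profile the same model under several (batch size, sequence length)
# pairs by rebuilding kwargs. The shape values here are arbitrary examples.
from src import get_model_config, get_profile

model_config = get_model_config(model_id="meta-llama/Llama-3.2-1B-Instruct",
                                hf_token="Your HuggingFace access token here")

for batch_size, text_length in [(1, 128), (16, 200), (32, 1024)]:
    kwargs = {
        'model_config': model_config,
        'input_shape': [batch_size, text_length]
    }
    get_profile(kwargs, "flops")  # total ADD + MULT for this input shape
```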
Arithmetic Operations that are considered:
- ADD: element-wise addition
- MULT: element-wise multiplication
- DIV: vector or tensor division; can be implemented differently according to hardware architectures, etc.
- SQRT: vector or tensor square root; can be implemented differently according to the computing systems, etc.
- FLOPs: element-wise addition and multiplication
Dynamics that are considered:
- PARAMS: the layers' parameters or the transformer parameters
- ACT: activations (intermediate outputs between the layers) of the layers or the transformer
- CACHE: temporary inner buffers of the layers or the transformer (not necessarily the KV cache) during the forward pass
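The quick-start output above reports all of these quantities at once. For individual queries, only the "flops", "mult", "params", and "act" metric strings are demonstrated in this README; if the remaining quantities follow the same lowercase naming, they would be queried as in the sketch below, but those strings are an assumption, not something shown in simple_test.py.

```python
# Hypothetical: the metric strings "add" and "cache" are assumed to mirror
# the lowercase names used for "mult"/"params"/"act"; they are not
# demonstrated in simple_test.py. Reuses the kwargs built in the setup above.
get_profile(kwargs, "add")            # element-wise additions
get_profile(kwargs, "cache")          # temporary inner buffers
get_profile(kwargs, "cache", "attn")  # buffers of the attention layers only
```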
NOT considered:
- logistic sigmoid (in activation functions)
- output_hidden_states=True or output_attentions=True in the transformer forward pass (e.g. in LlamaModel.forward), i.e. attention weights outputs are not currently supported by sdpa_attention_forward
- dynamic rotary embedding (currently only the llama3 rotary embedding is supported)
- systematic peak memory (can vary across different hardware)
Plan to consider:
- [ ] self-defined Model Configuration
- [ ] other LLM architectures (e.g. Gemma)
- [ ] other scopes like torch.nn.Linear and torch.nn.Module.named_parameters (e.g. model.layers.12.self_attn.k_proj and model.layers.1.mlp.up_proj)
NOTE: For detailed reference code and an explanation of the FLOPs calculation, sample reference source code is in docs/flops.
The emphasis of llm-profiler is the layer level and fine-grained algorithmic operations, rather than simply describing everything with FLOPs. We separate the intermediate outputs between the layers, the temporary buffers, and the parameters, while other profilers mainly focus on overall operations and system peak memory.
Other profilers are:
- Gives FLOPs per layer; however, it doesn't distinguish addition and multiplication, which can be very different (in latency and energy consumption) when LLMs are deployed on different devices.
- A coarse-grained profiler that gives the FLOPs, memory, CPU usage, etc. at the whole-model level rather than the layer level.
- A more visualized, systematic model-level profiler based on PyTorch Profiler.