
LLM Profiler

A fine-grained layer-level LLM profiler that calculates the algorithmic operations (e.g. ADD, MULT), parameters, cache (buffers) and activations.

Setup Environment

We use virtualenv in this repository. To set up the Python environment, run the following in the terminal:

$ virtualenv env
$ source env/bin/activate
(env) $ pip install -r requirements.txt

Quick Start

In the root directory of this repo, run:

(env) $ python3 simple_test.py

The output in the terminal should be

model:	meta-llama/Llama-3.2-1B-Instruct
transformer with input batchsize 16 and sequence length 200:
    ADD:	58.28	(10^9)
    MULT:	59.17	(10^9)
    DIV:	3.49	(10^6)
    SQRT:	16
    PARAMS:	1.24	(10^9)
    CACHE:	59.10	(10^6)
    ACT:	327.18	(10^6)

The following instructions are based on the structure of simple_test.py.

Instructions

Preparation

Give the model name (MODEL_ID, which should be available on HuggingFace), the input batch size BATCH_SIZE, and the maximum sequence length TEXT_LENGTH. If you don't have a HuggingFace access token, create one according to here.

from src import get_model_config, get_profile

MODEL_ID="meta-llama/Llama-3.2-1B-Instruct" 
HF_TOKEN="Your HuggingFace access token here" 
BATCH_SIZE=16
TEXT_LENGTH=200
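
If you prefer not to hard-code the token, it can also be read from an environment variable. A minimal sketch (the variable name HF_ACCESS_TOKEN is only an illustrative choice, not something llm-profiler requires):

import os

# Illustrative only: read the HuggingFace token from an environment variable
# instead of hard-coding it in the script.
HF_TOKEN = os.environ.get("HF_ACCESS_TOKEN", "Your HuggingFace access token here")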

Get the fine-grained FLOPs of a single forward pass

model_config = get_model_config(model_id = MODEL_ID, hf_token = HF_TOKEN)
kwargs = {
    'model_config': model_config,
    'input_shape': [BATCH_SIZE, TEXT_LENGTH]
}

Get the total transformer FLOPs (including ADD and MULT) during inference (a single forward pass):

get_profile(kwargs, "flops")

The output in the terminal should be something like

transformer with input batchsize 16 and sequence length 200:
    FLOPs (ADD and MULT):      117.30  (10^9)

Get the total transformer element-wise multiplication operations during inference

get_profile(kwargs, "mult")                   

The output in the terminal should be something like

transformer with input batchsize 16 and sequence length 200:
    MULT:       59.02  (10^9)

Get the element-wise multiplication operations involved in all the attention layers

get_profile(kwargs, "mult", "attn")         

The output in the terminal should be something like

attn with input batchsize 16 and sequence length 200:
    MULT:       45.22  (10^9)

Get the model parameters involved in all the attention layers

get_profile(kwargs, "params", "attn")       

The output in the terminal should be something like

attn with input batchsize 16 and sequence length 200:
    PARAMS:     2.68  (10^9)

Get the activations (intermediate output between the layers) of the transformer

get_profile(kwargs, "act")                

The output in the terminal should be something like

transformer with input batchsize 16 and sequence length 200:
    ACT:        15.58  (10^9)

Get the activations (intermediate output between the layers) of all the mlp layers

get_profile(kwargs, "act", "mlp")        

The output in the terminal should be something like

mlp with input batchsize 16 and sequence length 200:
    ACT:        12.88   (10^9)

For how to set kwargs and what kinds of data are available, please refer to docs/kwargs.
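
Putting the snippets above together, a minimal end-to-end script (a sketch based on the snippets in this section and the structure of simple_test.py, using only the metric and scope names documented above) looks like this:

from src import get_model_config, get_profile

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"
HF_TOKEN = "Your HuggingFace access token here"
BATCH_SIZE = 16
TEXT_LENGTH = 200

# Fetch the model configuration from HuggingFace and build the profiler arguments.
model_config = get_model_config(model_id=MODEL_ID, hf_token=HF_TOKEN)
kwargs = {
    'model_config': model_config,
    'input_shape': [BATCH_SIZE, TEXT_LENGTH]
}

# Whole-transformer metrics for a single forward pass.
get_profile(kwargs, "flops")            # ADD + MULT
get_profile(kwargs, "act")              # intermediate outputs between the layers

# Layer-scoped metrics: attention and mlp layers only.
get_profile(kwargs, "mult", "attn")
get_profile(kwargs, "params", "attn")
get_profile(kwargs, "act", "mlp")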

Detailed Explanation

What is considered and not considered in llm-profiler?

Arithmetic Operations that are considered:

  • ADD: element-wise addition
  • MULT: element-wise multiplication
  • DIV: vector or tensor division; its implementation can vary across hardware architectures
  • SQRT: vector or tensor square root; its implementation can vary across computing systems
  • FLOPs: element-wise addition and multiplication

Dynamics that are considered:

  • PARAMS: the layers' parameters or the transformer parameters
  • ACT: activations (intermediate output between the layers) of the layers or the transformer
  • CACHE: temporary inner buffers of the layers or the transformer (not necessarily KV cache) during the forward pass

NOT considered:

  • logistic sigmoid (in activation functions)
  • output_hidden_states=True or output_attentions=True in the transformer forward pass (e.g. in LlamaModel.forward); attention weight outputs are not currently supported by sdpa_attention_forward
  • dynamic rotary embedding (currently only llama3 rotary embedding is supported)
  • system peak memory (varies across different hardware)

Plan to consider:

  • self-defined model configuration
  • other LLM architectures (e.g. Gemma)
  • other scopes such as torch.nn.Linear and torch.nn.Module.named_parameters (e.g. model.layers.12.self_attn.k_proj and model.layers.1.mlp.up_proj)

NOTE: For detailed reference code and an explanation of the FLOPs calculation, a sample reference source code is in docs/flops.

Differences between llm-profiler and other profilers

llm-profiler emphasizes layer-level profiling and fine-grained algorithmic operations, rather than describing everything with a single FLOPs count. We separate intermediate outputs between the layers, temporary buffers, and parameters, while other profilers mainly focus on overall operations and system peak memory.

Other profilers are:

  • One gives FLOPs per layer but does not distinguish addition from multiplication, which can differ greatly (in latency and energy consumption) when LLMs are deployed on different devices.
  • Coarse-grained profilers give FLOPs, memory, CPU usage, etc. at the whole-model level rather than the layer level.
  • More visual, system-oriented, model-level profilers are built on the PyTorch Profiler.
