Adding basic API for memory profiling (src/training_hub/profiling) #11
Merged
Commits
31e59e9
First pass at memory estimator
f3ca036
Fixing bugs with the OSFT implementation
mazam-lab fcc9383
Merge branch 'Red-Hat-AI-Innovation-Team:main' into main
mazam-lab 7ae33de
Cleanup prior to draft PR
mazam-lab dba04d0
Updating PR with updated thresholds
mazam-lab 4590ead
Restructing Memory estimator as a class
mazam-lab 9c24220
Restructing Memory estimator as a class
mazam-lab 7d0e05c
Fixing bug in OSFT
mazam-lab 581fa5e
Polished up the documentation and added verbosity feature for the PR
mazam-lab 4f9ebee
Notebook giving an example on how to use the memory estimator
mazam-lab 88ad308
Addressing coderabbit comments
mazam-lab a6b7375
Patching in a simpler estimator for OSFT, updating the notebook, hotf…
mazam-lab e0f2752
Addressing coderabbit review
mazam-lab 7cd94a9
Hotfixing coderabbit issue
mazam-lab c781b7c
Addressing Mustafa's comments on the readme, adjusting an typcheck fr…
mazam-lab c06e934
Simplifying the linear mapping based on Nikhil's comment
mazam-lab
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,357 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "187e6115", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Memory Estimator \n", | ||
| "\n", | ||
| "This notebook will provide some examples on how to use the memory_estimator API\n", | ||
| "to estimate the amount of GPU memory consumed when fine-tuning an LLM model in Training Hub.\n", | ||
| "This notebook will cover:\n", | ||
| "1. How the package's primary class is implemented,\n", | ||
| "2. How it can be subclassed for further extensions,\n", | ||
| "3. How it can be used via both class instantiation and a convenience function.\n", | ||
| "\n", | ||
| "Tips on how LLM memory usage is calculated and how the memory can be reduced will also be mentioned as needed." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "a08d4d7c", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Setup" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "b8274236", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from training_hub.profiling.memory_estimator import BasicEstimator, OSFTEstimator, OSFTEstimatorExperimental, estimate" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "c61f401a", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "The estimation depends on several key factors that should be user inputted. These are:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "3e3515f5", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "#### The Pre-Trained Model to be Fine-Tuned" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "66208c2a", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "model_path = \"ibm-granite/granite-3.3-2b-instruct\" " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "b98e920e", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "#### The Number and Size of Your GPUs\n", | ||
| "\n", | ||
| "The given default values will assume you are training on 2x L40s, each containing 48 GB of memory." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "70462895", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "num_gpus = 2\n", | ||
| "gpu_memory = 48 * (2**30) # 48 GiB in bytes" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "5cf719d0", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "#### The Maximum Number of Tokens You'll Place Onto a GPU\n", | ||
| "\n", | ||
| "Note that in Training Hub, minibatches are processed in such a way that\n", | ||
| "the number of tokens on the GPU never exceeds this value." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "a735ecbc", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "max_tokens_per_gpu = 8192" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "5b643b37", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "#### The Unfreeze Rank Ratio\n", | ||
| "\n", | ||
| "This is the OSFT parameter that determines what proportion of the parameters can be updated\n", | ||
| "during the OSFT fine-tuning step. Setting this to 0.33 should give you an estimation similar to SFT,\n", | ||
| "and setting this to 1 should give you an estimation about twice as large as SFT's." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "117917b3", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "unfreeze_rank_ratio = 0.25" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "62eba14c", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Profiler Overview\n", | ||
| "\n", | ||
| "At a lower level, the profiling module provides a class `BasicEstimator` that implements the memory estimation for training an LLM normally (via SFT).\n", | ||
| "\n", | ||
| "The estimator computes these values in the `estimate` function through the following procedure:\n", | ||
| "\n", | ||
| "1. Calculate the memory needed to store the model parameters (`_calc_model_params`)\n", | ||
| "\n", | ||
| "2. Calculate the memory needed to store the model's gradients (`_calc_gradients`)\n", | ||
| "\n", | ||
| "3. Calculate the memory needed to store the model's optimizer states (`_calc_optimizer`)\n", | ||
| " - The values of Steps 1-3 are proportional to the number of parameters within the model.\n", | ||
| " - This estimator assumes the AdamW optimizer, which stores 2 optimizer parameters per model parameter\n", | ||
| " - Some non-Adam optimizers use only 1 optimizer parameter, although Training Hub uses AdamW by default\n", | ||
| "\n", | ||
| "4. Calculate the memory needed to store the intermediate activations within the model (`_calc_intermediate_activations`)\n", | ||
| " - This value is the product of the number of tokens being passed onto a GPU, the number of layers in the model, and the model's hidden dimensionality\n", | ||
| "\n", | ||
| "5. Calculate the memory needed to store the activated outputs of the model (`_calc_outputs`)\n", | ||
| " - This value is the product of the number of tokens being passed onto a GPU and the vocabulary size of the model.\n", | ||
| "\n", | ||
| "6. Calculate any additional memory the model might use (this value is 0 for SFT) (`_calc_additional`)\n", | ||
| "\n", | ||
| "7. Sum up the memory calculated in Steps 1-6\n", | ||
| "\n", | ||
| "8. Apply multipliers representing possible overhead to get the lower bound (1x), expected (1.1x), and upper bound (1.3x) for the memory usage of this model (`_apply_overhead`)\n", | ||
| "\n", | ||
| "Note that Training Hub assumes that all of the above values are stored in Float32 (4 bytes per tensor entry)\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "7e2277c9", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Basic SFT Estimation" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "e0486d2a", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "my_sft_estimator = BasicEstimator(num_gpus=num_gpus,\n", | ||
| " gpu_memory=gpu_memory,\n", | ||
| " model_path=model_path,\n", | ||
| " max_tokens_per_gpu=max_tokens_per_gpu,\n", | ||
| " verbose=2\n", | ||
| " )\n", | ||
| "\n", | ||
| "sft_lower_bound, sft_expected, sft_upper_bound = my_sft_estimator.estimate()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "4396155f", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## OSFT Estimation and Subclassing\n", | ||
| "Training Hub plans to implement a wide variety of different methods for training LLMs,\n", | ||
| "with OSFT having been recently implemented.\n", | ||
| "\n", | ||
| "Because the estimator is implemented as a class, the individual components for\n", | ||
| "calculating the memory are their own functions, and LLM methods tend to have similarities\n", | ||
| "in how they consume memory, we can create new estimators by simply subclassing `BasicEstimator`\n", | ||
| "and overriding any of the respective methods for the individual pieces of memory computation\n", | ||
| "with formulas that are more accurate for that training method.\n", | ||
| "\n", | ||
| "For example, the estimator for OSFT is implemented as the subclass `OSFTEstimator`.\n", | ||
| "On top of some under-the-hood changes, its main adjustment is overriding `_calc_model_params`\n", | ||
| "to use the U, Sigma, and V matrices obtained through SVD calculation instead of the typical\n", | ||
| "model weight matrix." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "93b6afc4", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "my_osft_estimator = OSFTEstimator(num_gpus=num_gpus,\n", | ||
| " gpu_memory=gpu_memory,\n", | ||
| " model_path=model_path,\n", | ||
| " max_tokens_per_gpu=max_tokens_per_gpu,\n", | ||
| " verbose=2,\n", | ||
| " unfreeze_rank_ratio=unfreeze_rank_ratio\n", | ||
| " )\n", | ||
| "\n", | ||
| "osft_lower_bound, osft_expected, osft_upper_bound = my_osft_estimator.estimate()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "eaefc58e", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## OSFT Estimation with Liger Kernels\n", | ||
| "`BasicEstimator` includes support for Liger Kernels. Liger Kernels aim to drastically\n", | ||
| "speed up the time needed to fine-tune LLM models as well as reduce the memory footprint\n", | ||
| "of the fine-tuning process.\n", | ||
| "\n", | ||
| "Empirically, the main memory optimization of Liger Kernels is to recalculate the activated outputs\n", | ||
| "of the model rather than directly storing them on the GPU for future use. This can drastically\n", | ||
| "improve the memory footprint when training uses very large batch sizes.\n", | ||
| "\n", | ||
| "For the purposes of this estimator, enabling Liger Kernels will force `_calc_outputs` to always be 0.\n", | ||
| "\n", | ||
| "In Training Hub, OSFT uses Liger Kernels by default." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "12a4e81a", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "my_liger_estimator = OSFTEstimator(num_gpus=num_gpus,\n", | ||
| " gpu_memory=gpu_memory,\n", | ||
| " model_path=model_path,\n", | ||
| " max_tokens_per_gpu=max_tokens_per_gpu,\n", | ||
| " verbose=2,\n", | ||
| " use_liger=True,\n", | ||
| " unfreeze_rank_ratio=unfreeze_rank_ratio\n", | ||
| " )\n", | ||
| "\n", | ||
| "liger_lower_bound, liger_expected, liger_upper_bound = my_liger_estimator.estimate()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "48c82224", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Perform Estimation with the convenience function\n", | ||
| "\n", | ||
| "For higher level usage, rather than needing to directly instantiate an estimator object,\n", | ||
| "we have provided a simple convenience function named `estimate`, in which you can\n", | ||
| "provide the standard initialization arguments for your estimator as well as the\n", | ||
| "type of training method you want to estimate for, and you can immediately obtain the estimation bounds.\n", | ||
| "\n", | ||
| "To specify the estimation type, you can pass in `\"sft\"` to the `training_method` argument to\n", | ||
| "estimate for SFT, or `\"osft\"` to estimate for OSFT." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "f8ede8ca", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "conv_sft_lower_bound, conv_sft_expected, conv_sft_upper_bound = estimate(\n", | ||
| " training_method=\"sft\",\n", | ||
| " num_gpus=num_gpus,\n", | ||
| " gpu_memory=gpu_memory,\n", | ||
| " model_path=model_path,\n", | ||
| " max_tokens_per_gpu=max_tokens_per_gpu,\n", | ||
| " verbose=2\n", | ||
| " )" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "d2e3f526", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "conv_osft_lower_bound, conv_osft_expected, conv_osft_upper_bound = estimate(\n", | ||
| " training_method=\"osft\",\n", | ||
| " num_gpus=num_gpus,\n", | ||
| " gpu_memory=gpu_memory,\n", | ||
| " model_path=model_path,\n", | ||
| " max_tokens_per_gpu=max_tokens_per_gpu,\n", | ||
| " verbose=2,\n", | ||
| " unfreeze_rank_ratio=unfreeze_rank_ratio\n", | ||
| " )" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "ab863891", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "conv_liger_lower_bound, conv_liger_expected, conv_liger_upper_bound = estimate(\n", | ||
| " training_method=\"osft\",\n", | ||
| " num_gpus=num_gpus,\n", | ||
| " gpu_memory=gpu_memory,\n", | ||
| " model_path=model_path,\n", | ||
| " max_tokens_per_gpu=max_tokens_per_gpu,\n", | ||
| " verbose=2,\n", | ||
| " use_liger=True,\n", | ||
| " unfreeze_rank_ratio=unfreeze_rank_ratio\n", | ||
| " )" | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "th_dev", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.12.12" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } | ||
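The eight-step procedure in the notebook's Profiler Overview cell, and the subclass-and-override pattern in its OSFT section, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ToyEstimator`, `ToyLigerEstimator`, their method names, and the model dimensions used below are hypothetical and are not the `training_hub` implementation or any real model's configuration. The sketch assumes one Float32 value per parameter for weights and gradients, two per parameter for AdamW state, and no sharding across GPUs.

```python
from dataclasses import dataclass

BYTES_PER_VALUE = 4  # Float32, as stated in the notebook


@dataclass
class ToyEstimator:
    """Hypothetical stand-in for illustration; NOT training_hub's BasicEstimator."""
    num_params: int
    num_layers: int
    hidden_size: int
    vocab_size: int
    max_tokens_per_gpu: int

    def calc_model_params(self) -> int:
        # Step 1: one value per model parameter (assumption).
        return self.num_params * BYTES_PER_VALUE

    def calc_gradients(self) -> int:
        # Step 2: one gradient value per model parameter (assumption).
        return self.num_params * BYTES_PER_VALUE

    def calc_optimizer(self) -> int:
        # Step 3: AdamW keeps 2 optimizer values per model parameter.
        return 2 * self.num_params * BYTES_PER_VALUE

    def calc_intermediate_activations(self) -> int:
        # Step 4: tokens on the GPU x layers x hidden dimensionality.
        return (self.max_tokens_per_gpu * self.num_layers
                * self.hidden_size * BYTES_PER_VALUE)

    def calc_outputs(self) -> int:
        # Step 5: tokens on the GPU x vocabulary size.
        return self.max_tokens_per_gpu * self.vocab_size * BYTES_PER_VALUE

    def calc_additional(self) -> int:
        # Step 6: nothing extra for plain SFT.
        return 0

    def estimate(self) -> tuple[float, float, float]:
        # Step 7: sum the components.
        total = (self.calc_model_params() + self.calc_gradients()
                 + self.calc_optimizer() + self.calc_intermediate_activations()
                 + self.calc_outputs() + self.calc_additional())
        # Step 8: lower bound (1x), expected (1.1x), upper bound (1.3x).
        return total * 1.0, total * 1.1, total * 1.3


class ToyLigerEstimator(ToyEstimator):
    """Overrides a single component, mirroring the subclassing idea."""

    def calc_outputs(self) -> int:
        # The notebook states that enabling Liger Kernels forces this term to 0.
        return 0


# Placeholder dimensions chosen only to exercise the sketch.
dims = dict(num_params=2_000_000_000, num_layers=40, hidden_size=2048,
            vocab_size=49_152, max_tokens_per_gpu=8192)
for cls in (ToyEstimator, ToyLigerEstimator):
    low, expected, high = cls(**dims).estimate()
    print(f"{cls.__name__}: {low / 2**30:.1f} / {expected / 2**30:.1f} / "
          f"{high / 2**30:.1f} GiB")
```

Keeping each component in its own method is what lets a variant such as the Liger case change one term without touching the summation or the overhead multipliers.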
"LLM Model is redundant"