Adding basic API for memory profiling (src/training_hub/profiling) #11

Conversation
Walkthrough: Adds a memory estimation module implementing SFT and two OSFT estimators with a unified estimate() entry point.
Sequence Diagram(s):

sequenceDiagram
autonumber
participant U as Caller
participant Est as Estimator (Basic / OSFT)
participant HF as AutoModel.from_pretrained
participant CFG as Model Config
U->>Est: call estimate(...) or instantiate estimator
Est->>HF: AutoModel.from_pretrained(model_path, trust_remote_code=...)
HF-->>Est: model instance
Est->>CFG: read metadata (num_parameters, trainable_params, layers, hidden_size)
CFG-->>Est: config values
rect #F0FFF4
note right of Est: compute tokens_per_gpu, dtype sizes\nand constants
Est->>Est: per-layer analysis (_check_layer_params for OSFT experimental)
Est->>Est: _calc_model_params / _calc_osft_params
end
Est->>Est: compute components: params, grads, optimizer, activations, outputs
Est->>Est: apply overhead multipliers -> (lower, expected, upper)
Est-->>U: return bounds (and verbose prints if enabled)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches:
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@CodeRabbit review

✅ Actions performed: Review triggered.

1 similar comment

✅ Actions performed: Review triggered.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.2)
src/training_hub/profiling/memory_estimator.py
130-131: Do not assign a lambda expression, use a def
Rewrite calc_subtotal as a def
(E731)
132-133: Do not assign a lambda expression, use a def
Rewrite calc_overhead as a def
(E731)
143-143: Do not assign a lambda expression, use a def
Rewrite rounded_helper as a def
(E731)
228-228: Multiple statements on one line (colon)
(E701)
229-229: Multiple statements on one line (colon)
(E701)
230-230: Multiple statements on one line (colon)
(E701)
231-231: Multiple statements on one line (colon)
(E701)
# NOTE: Typically, K is somewhere in [10, 30], but some research suggests that K=17.
# TODO: Add papers
gpu_vram_act_low: int = tokens_per_gpu * FLOAT32_BYTES_N * 10 * num_layers * hidden_size
Empirically if we see these estimates are not too far from actual memory this should be fine. But isn't it possible to calculate the exact upper bound given max sequence length and effective batch size? Can't we go through each layer of the model and sum the activation size?
Very shortly after I did this PR I made some discoveries that change this formula:
Basically, the activations are made up of two components:
- The mid-layer activations (gpu_vram_act), which can be calculated as max_tokens_per_gpu * FLOAT32_BYTES_N * num_layers * hidden_size (set K to 1)
- The output tensor (gpu_vram_out), which can be calculated as max_tokens_per_gpu * vocab_size * FLOAT32_BYTES_N.
From what I've seen empirically, you can get a reasonably good estimate if you use gpu_vram_act + (gpu_vram_out * 2) as your activation memory estimate
With liger kernels enabled you can simply use gpu_vram_act for the entire activation estimate
But the theoretical justification on using these two formulas is still a bit wishy-washy, so I want to tighten that up a bit if possible
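For reference, a minimal Python sketch of the two-component activation estimate described above; the function and variable names here are illustrative placeholders, not the PR's actual API:

FLOAT32_BYTES_N = 4  # bytes per float32 element

def estimate_activation_bytes(max_tokens_per_gpu: int, num_layers: int,
                              hidden_size: int, vocab_size: int,
                              use_liger: bool = False) -> int:
    # Mid-layer activations: roughly one (tokens x hidden_size) float32 tensor per layer
    gpu_vram_act = max_tokens_per_gpu * FLOAT32_BYTES_N * num_layers * hidden_size
    # Output logits tensor: (tokens x vocab_size) in float32
    gpu_vram_out = max_tokens_per_gpu * vocab_size * FLOAT32_BYTES_N
    # Empirical rule of thumb from this thread: with Liger kernels enabled,
    # the output-tensor term can be dropped entirely
    return gpu_vram_act if use_liger else gpu_vram_act + 2 * gpu_vram_out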
@mazam-lab yeah I think this formula should get us quite close to the actual activation usage. One thing to note though is that not all matrices/layers have the same output dimension. Example some matrices may have dimensions of [4h, h] in which case the output size would be four times larger. But overall I think this should bring us pretty close to the correct value.
lambda tot_val, act_val: tot_val - gpu_vram_par - gpu_vram_opt - gpu_vram_grad - act_val

# Perform those calculations using the lambdas!
gpu_vram_total_high: int = int(1.3 * calc_subtotal(gpu_vram_act_high))
What are these scalars 1.3, 1.2, etc?
Most formulas for calculating memory add about 10-30% to the initial subtotal to account for overhead, hence the scalars used here
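To make that concrete, here is a small illustrative sketch of how such overhead bounds could be produced; the exact multipliers and names are assumptions based on this thread, not the PR's final code:

def apply_overhead(subtotal_bytes: int) -> tuple[int, int, int]:
    # Add roughly 10-30% on top of the component subtotal to cover
    # CUDA context, fragmentation, and other runtime overhead
    lower = int(1.1 * subtotal_bytes)
    expected = int(1.2 * subtotal_bytes)
    upper = int(1.3 * subtotal_bytes)
    return lower, expected, upper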
NikhilNayak-debug
left a comment
Overall this looks like a good draft!
the given hardware to run the model, and note how much more memory
is needed to make the training more feasible.

Note that this estimate assumes training_hub will be used,
Note that this estimate assumes training_hub will be used, in which all data types are float32 and the optimizer is always AdamW.
This reads like training hub houses the implementation for all algorithms, rather than just being an interface which exposes them. While these are probably the best settings in 95% of training scenarios, it also may not always be the case (e.g. FP8 training could be supported by an underlying backend).
Can you please update the messaging in your docs to instead be general?
E.g. This estimate assumes mixed precision training settings (storage in full precision/FP32, compute in half precision/BF16)
The point of this comment is to let the user know that we'll be assuming all tensors placed on VRAM will be Float32. So just to make sure I'm understanding correctly, should we let the user know that the underlying backend calculations might not necessarily be entirely done in Float32?
@mazam-lab No, the FP32 clarification is perfect, you don't need to change that. But right now it sounds like the float32 is because of training hub specifically, rather than being an established practice which we refer to as mixed precision training.
RobotSail
left a comment
Thank you for the PR @mazam-lab, I took a look and left a few comments. Since the calculator is implemented as a script, maybe you could also add a notebook that provides a few examples on its usage?
# TODO: Validate these estimations empirically

# Load model directly
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, dtype="auto")
Would it make sense to have this load as dtype="float32", or does "auto" end up making the calculations work well here? How does the rest of the script differ if the base model itself is stored as float32 vs. the same model in bfloat16?
The script is hardcoded to use Float32 for all calculations regardless of the AutoModel's loaded data type, so changing this doesn't matter anymore (I originally had a data type toggle as a feature, but removed it since it wasn't too necessary if everything in training_hub is Float32 in practice)
@mazam-lab That makes sense, thanks!
Yeah I can do that; should the notebook provide some empirical results on what the actual memory usage for some select models and training setups are? (i.e. have a table showing the actual memory usage of Granite 2B on OSFT with Liger Kernels and 16,384 tokens)

@NikhilNayak-debug @RobotSail Thank you both so much for the comments!

Class structure has been implemented and an example notebook has been added.

@Maxusmusti @RobotSail @NikhilNayak-debug I'm ready to move this to a normal PR now.
Actionable comments posted: 7
♻️ Duplicate comments (2)
src/training_hub/profiling/memory_estimator.py (2)
316-320: Critical: Incorrect SVD storage calculation for OSFT.
The calculation for U and V matrices uses the full matrix dimensions (shape[0] * shape[0] and shape[1] * shape[1]), which is incorrect for SVD decomposition. SVD produces matrices U (m×k), S (k), and V (k×n) where k = min(m, n). Based on past review comments.
Apply this diff to fix the SVD calculation:

 for term in TARGET_TERMS:
     if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
-        U_bytes_n: int = layer_data.shape[0] * layer_data.shape[0]
-        S_bytes_n: int = layer_data.shape[0] if layer_data.shape[0] < \
-            layer_data.shape[1] else layer_data.shape[1]
-        V_bytes_n: int = layer_data.shape[1] * layer_data.shape[1]
+        m = layer_data.shape[0]
+        n = layer_data.shape[1]
+        k = min(m, n)
+        U_bytes_n: int = m * k
+        S_bytes_n: int = k
+        V_bytes_n: int = n * k
         return (U_bytes_n + S_bytes_n + V_bytes_n) * FLOAT32_BYTES_N
227-228: Generalize documentation to avoid training_hub-specific assumptions.
The comment suggests memory characteristics are specific to training_hub, but these are standard mixed-precision training practices. Generalize the message as discussed in past review comments.
Based on past review comments.
Apply this diff:
-    Note that this estimate assumes training_hub will be used,
-    in which all data types are float32 and the optimizer is always AdamW.
+    Note that this estimate assumes mixed precision training settings
+    (storage in full precision/FP32, compute in half precision/BF16)
+    and the AdamW optimizer.
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (2)
14-14: Remove unused constant.
FLOAT8_BYTES_N is defined but never used in the codebase. Consider removing this constant or documenting its intended future use:

 FLOAT32_BYTES_N: int = 4
 FLOAT16_BYTES_N: int = 2
-FLOAT8_BYTES_N: int = 1
 ADAMW_PARAMS_N: int = 2
122-126: Remove unused kwargs parameter.
The **kwargs parameter in _calc_additional is never used. Either remove it or document its purpose if it's intended for future subclass extensions. Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/notebooks/memory_estimator_example.ipynb (1 hunks)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
src/training_hub/profiling/memory_estimator.py
18-18: Do not assign a lambda expression, use a def
Rewrite ROUNDER as a def
(E731)
46-46: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
47-47: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
48-48: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
122-122: Unused method argument: kwargs
(ARG002)
281-281: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
282-282: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
283-283: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
351-351: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
352-352: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
353-353: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
🔇 Additional comments (2)
examples/notebooks/memory_estimator_example.ipynb (1)
1-539: LGTM! Excellent documentation and examples. The notebook provides clear, comprehensive examples of the memory estimator API, including:
- Proper setup and configuration
- Usage of both BasicEstimator and OSFTEstimator classes
- Demonstration of the estimate() convenience function
- Explanation of how Liger Kernels affect memory usage
- Good pedagogical flow explaining the calculation methodology
The examples effectively illustrate the API's capabilities and will help users understand how to estimate memory requirements for their training scenarios.
src/training_hub/profiling/memory_estimator.py (1)
117-117: Verify embed_tokens presence or add defensive fallback.
Confirm that all target Granite models expose model.embed_tokens.num_embeddings, or implement fallback logic using model.config.vocab_size or model.get_input_embeddings(), and raise a clear error if none are available.
Coderabbit comments have been addressed
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/training_hub/profiling/memory_estimator.py (1)
344-406: Duplicate: trust_remote_code parameter issue.
This function attempts to pass trust_remote_code to OSFTEstimator (line 393), which will fail due to the missing parameter in its __init__ method. This issue is addressed in the review comment for lines 273-288.
🧹 Nitpick comments (1)
src/training_hub/profiling/memory_estimator.py (1)
120-124: Remove unused kwargs parameter.
The **kwargs parameter is unused and flagged by static analysis. Since no subclass currently uses it, consider removing it for clarity. Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/notebooks/memory_estimator_example.ipynb (1 hunks)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
120-120: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (6)
src/training_hub/profiling/memory_estimator.py (5)
1-16: LGTM! The imports are minimal and necessary, constants are well-defined, and the ROUNDER function has been correctly converted from a lambda to a proper function definition as recommended in previous reviews.
17-80: LGTM! The class definition and initialization look good. Past review feedback has been properly addressed:
- Type hints now use explicit int | None syntax (lines 43-45, 48)
- trust_remote_code parameter has been added (line 48)
- Model loading correctly uses the trust_remote_code parameter (line 57)
127-264: LGTM! The overhead application, printing methods, and main estimate orchestration are well-implemented. The logic is clear and the breakdown provides useful information to users.
291-342: LGTM! The OSFT-specific parameter calculations look correct. The SVD decomposition logic for trainable layers and the override of _calc_model_params are properly implemented.
110-118: Guard against missing embed_tokens by falling back to config.vocab_size.
Line 115 assumes self.model.embed_tokens.num_embeddings exists, which can break on architectures without that attribute. Replace with:

-        return (self.tokens_per_gpu * self.main_dtype_bytes * self.model.embed_tokens.num_embeddings) * self.output_constant
+        vocab_layer = getattr(self.model, "embed_tokens", None)
+        vocab_size = vocab_layer.num_embeddings if vocab_layer else self.model.config.vocab_size
+        return (self.tokens_per_gpu * self.main_dtype_bytes * vocab_size) * self.output_constant

examples/notebooks/memory_estimator_example.ipynb (1)
1-539: LGTM! The example notebook is well-structured and provides clear demonstrations of:
- BasicEstimator usage for SFT
- OSFTEstimator usage for OSFT
- OSFT with Liger Kernels
- The estimate() convenience function

The explanations are educational and help users understand how memory estimation works. The past review comment about the inline comment has been correctly addressed (line 82 now correctly states "48 GB in bytes").
| """ | ||
| Calculate the VRAM for storing the model's intermediate activations | ||
| """ | ||
| return (self.tokens_per_gpu * self.main_dtype_bytes * self.num_layers * self.hidden_size) |
@mazam-lab each layer has more than one matrix, and you need to account for that. For example each Llama layer has 4 attention projections and 3 MLP projections. Also, some layers like the MLP don't always have output dimension h (they expand to a larger width and then project back to h), but we can probably ignore this for now since it won't make a big difference and keeps the tool simple.
Can you clarify what you mean by this? Because from what I understand from reading blogs, you don't need to store the intermediate matrix computations, you only store the final activations of each layer. So even though Llama has all of those matrices, I think the only thing that gets stored in memory is whatever gets passed into the SiLU at the end, which should be a matrix of size (tokens x hidden_dim):
I don't think this needs adjustment because it's not a large part of the calculation and this lines up with empirical results since the SFT estimator (which does make use of this formula) has been really accurate from my testing.
From what I understand, you store this temporarily so you can do the backpropagation efficiently.
@mazam-lab if empirically it is accurate I think it should be okay. But these intermediate activations do add up because output of each matrix is required to compute the gradient unless one trains with activation checkpointing in which case you calculate the gradients layer by layer. But if SFT estimates are accurate, this should be fine!
if layer_data.dim() < 2:
    return layer_data.numel() * FLOAT32_BYTES_N
for term in TARGET_TERMS:
    if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
@mazam-lab this wouldn't work for rectangular matrices where dimension m is not equal to n. In that case U would be (m, min(m, n)) and V would be (min(m, n), n). Can we make that change?
Okay will do; So I thought U was supposed to be of size m x m and V was supposed to be size n x n, because that's the only way you can perform matrix multiplication with S of size m x n.
If I'm understanding things right, the reason U and V need to be size (m x min(m, n)) and (min(m, n) x n) respectively is because S is actually of size (min(m, n) x min(m, n))?
@mazam-lab yeah so for a matrix W of shape (m, n), the call U, S, Vt = torch.linalg.svd(W) returns:
U → shape (m, k)
S → shape (k,)
Vt → shape (k, n)
where k=min(m,n)
The code where we do this is here - https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer/blob/cb8907090945f12f6eb8d70b14eb69f9e5af5a9e/src/mini_trainer/osft_utils.py#L321C9-L321C60
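As a quick standalone illustration of those reduced-SVD shapes (this sketch assumes the reduced decomposition, i.e. full_matrices=False, and uses an arbitrary example weight size):

import torch

FLOAT32_BYTES_N = 4

# Example rectangular weight of shape (m, n)
W = torch.randn(512, 2048)
U, S, Vt = torch.linalg.svd(W, full_matrices=False)

k = min(W.shape)
assert U.shape == (W.shape[0], k)   # (m, k)
assert S.shape == (k,)              # (k,)
assert Vt.shape == (k, W.shape[1])  # (k, n)

# Bytes needed to hold the decomposition in float32
svd_bytes = (U.numel() + S.numel() + Vt.numel()) * FLOAT32_BYTES_N
print(svd_bytes)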
# If the layer is trainable, i.e. it contains any of these terms,
# then we will need to store a SVD decomposition of the layer in memory.
TARGET_TERMS: list[str] = ['self_attn.q_proj',
@mazam-lab these are the right parameter keys for Llama, Qwen, Mistral. But other model families have different names that we do the SVD on. For the full list refer to https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer/blob/cb8907090945f12f6eb8d70b14eb69f9e5af5a9e/src/mini_trainer/osft_utils.py#L100 .
ah perfect, thank you!
@mazam-lab also, right now in the PR we are not accounting for the unfreeze rank ratio, that is, how much of U and V we are actually updating. The total parameter count will use the full matrix size of U and V, but gradients and optimizer states should scale with the rank ratio since only that fraction of U/V participates in training.
…ixing __init__ issues
Actionable comments posted: 7
🧹 Nitpick comments (5)
src/training_hub/profiling/memory_estimator.py (5)
4-4: Remove unused import.
The numpy module is imported but never used in this file. Apply this diff to remove the unused import:

-import numpy as np

11-13: Remove or document unused constants.
FLOAT16_BYTES_N and FLOAT8_BYTES_N are defined but never used. If these are planned for future precision support, add a comment; otherwise, remove them to reduce clutter. Apply this diff if not planning to use them:

 FLOAT32_BYTES_N: int = 4
-FLOAT16_BYTES_N: int = 2
-FLOAT8_BYTES_N: int = 1
 ADAMW_PARAMS_N: int = 2

17-17: Extract magic number to named constant.
The value 1073741824 (bytes per GiB) is a magic number. Extract it to a named constant for clarity. Apply this diff:

+# Conversion constant
+BYTES_PER_GIB: int = 1073741824
+
 # Helper function to do the rounding when printing
-def ROUNDER(value: int) -> str: return str(round(value / 1073741824, 1))
+def ROUNDER(value: int) -> str: return str(round(value / BYTES_PER_GIB, 1))

123-127: Remove unused kwargs parameter.
The kwargs parameter in _calc_additional() is unused (as flagged by static analysis). If it's intended for future subclass extensions, add a docstring note; otherwise, remove it. As per static analysis hints. Apply this diff if not needed:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0

423-423: Document the polynomial ratio formula.
Line 423 uses a hardcoded polynomial formula: -0.7802 * x^2 + 2.5302 * x + 0.25. This appears to be empirically derived but lacks explanation. Add a comment explaining the source and derivation of these coefficients:

     @override
     def _apply_overhead(self, subtotal):
         """
         In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio.
         The ratio_val polynomial was empirically derived from [source/experiment] to approximate
         OSFT's memory scaling behavior relative to SFT.
         """
         # Polynomial approximation: f(x) = -0.7802*x^2 + 2.5302*x + 0.25
         ratio_val = -0.7802 * (self.unfreeze_rank_ratio ** 2) + 2.5302 * self.unfreeze_rank_ratio + 0.25
         return super()._apply_overhead(subtotal * ratio_val)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/README.md (1 hunks)
examples/notebooks/memory_estimator_example.ipynb (1 hunks)
src/training_hub/__init__.py (2 hunks)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/notebooks/memory_estimator_example.ipynb
🧰 Additional context used
🧬 Code graph analysis (1)
src/training_hub/__init__.py (1)
src/training_hub/profiling/memory_estimator.py (7)
BasicEstimator (20-267), OSFTEstimatorExperimental (270-382), estimate (218-267), estimate (366-382), estimate (427-431), estimate (434-500), OSFTEstimator (385-431)
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
123-123: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (2)
src/training_hub/__init__.py (1)
5-5: LGTM! The imports and exports are correctly structured and consistent with each other. The new memory estimation API is properly exposed to users of the training_hub package.
Also applies to: 18-22
src/training_hub/profiling/memory_estimator.py (1)
434-500: LGTM! The module-level estimate() function provides a clean, unified API for both SFT and OSFT estimation. The implementation correctly dispatches to the appropriate estimator class based on the training_method parameter.
if effective_batch_size is None or max_seq_len is None:
    self.tokens_per_gpu: int = max_tokens_per_gpu
elif max_tokens_per_gpu is None:
    self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
Convert float division to int.
Division by num_gpus produces a float, but tokens_per_gpu is annotated as int (line 72) and should remain an integer for consistent calculations. Use integer division (//) or explicit int() cast.
Apply this diff:
elif max_tokens_per_gpu is None:
- self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+ self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus)
else:
self.tokens_per_gpu: int = min(max_tokens_per_gpu,
- effective_batch_size * max_seq_len / self.num_gpus)
+                                       int(effective_batch_size * max_seq_len / self.num_gpus))

Also applies to: 76-77
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 74 and also lines
76-77, the division operations produce floats while tokens_per_gpu (and related
variables) are annotated as ints; change the float divisions to integer results
by using integer division (//) or wrapping the expressions with int(...) so the
assigned values are integers and downstream calculations remain consistent.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/training_hub/profiling/memory_estimator.py (1)
77-83: Cast division results to int to match type annotation.
Lines 80 and 82–83 perform division by num_gpus, which produces a float, but tokens_per_gpu is annotated as int. This type mismatch may cause issues with type checkers or downstream integer-only operations. Apply this diff to cast the results:

 if effective_batch_size is None or max_seq_len is None:
     self.tokens_per_gpu: int = max_tokens_per_gpu
 elif max_tokens_per_gpu is None:
-    self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+    self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus)
 else:
     self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                   effective_batch_size * max_seq_len / self.num_gpus)
+                                   int(effective_batch_size * max_seq_len / self.num_gpus))
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (2)
19-22: Document the derivation or source of the quadratic formula.
The quadratic coefficients in OSFT_RATIO (−0.7802, 2.5302, 0.25) are unexplained beyond the boundary conditions. Consider adding a comment or docstring that clarifies:
- The empirical or theoretical basis for this formula
- Whether it's derived from experimental data or a mathematical model
- Any references or sources that justify these specific coefficients
This will help future maintainers understand and validate the scaling behavior.
143-147: Remove unused kwargs parameter.
The kwargs parameter in _calc_additional is declared but never used. If it's not needed for future extensibility or polymorphism, consider removing it to keep the signature clean. Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
130-130: Do not use bare except
(E722)
133-133: Do not use bare except
(E722)
136-136: Do not use bare except
(E722)
137-137: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
137-137: Avoid specifying long messages outside the exception class
(TRY003)
143-143: Unused method argument: kwargs
(ARG002)
def _calc_gradients(self):
    """
    Override the optimizer parameter calculation by calculating based on OSFT's parameters
    """
    return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
Add _calc_optimizer override to scale optimizer states by unfreeze ratio.
Currently, _calc_gradients is scaled by OSFT_RATIO(unfreeze_rank_ratio), but optimizer states are not. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.
As per user feedback from NikhilNayak-debug.
Add this override after _calc_gradients:
@override
def _calc_gradients(self):
"""
Override the optimizer parameter calculation by calculating based on OSFT's parameters
"""
return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+ @override
+ def _calc_optimizer(self):
+ """
+ Override the optimizer calculation to scale with unfreeze_rank_ratio
+ """
+ return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _calc_gradients(self): | |
| """ | |
| Override the optimizer parameter calculation by calculating based on OSFT's parameters | |
| """ | |
| return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) | |
| def _calc_gradients(self): | |
| """ | |
| Override the optimizer parameter calculation by calculating based on OSFT's parameters | |
| """ | |
| return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) | |
| @override | |
| def _calc_optimizer(self): | |
| """ | |
| Override the optimizer calculation to scale with unfreeze_rank_ratio | |
| """ | |
| return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio) |
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 374 to 378, add
an override method named _calc_optimizer immediately after _calc_gradients that
returns the optimizer state memory sized to the unfrozen parameter fraction:
calculate base optimizer parameter count via self._calc_model_params(), multiply
by OSFT_RATIO(self.unfreeze_rank_ratio) to account for only the unfrozen
parameters participating, and multiply by 2 to reflect the two state tensors
AdamW keeps per parameter (i.e., optimizer memory = _calc_model_params() *
OSFT_RATIO(self.unfreeze_rank_ratio) * 2); implement as a concise method that
parallels _calc_gradients.
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/training_hub/profiling/memory_estimator.py (2)
374-379: Scale optimizer states by unfreeze_rank_ratio (Experimental OSFT).
Gradients are scaled but optimizer states aren't. AdamW's two state tensors should scale with the unfrozen fraction too.

 @override
 def _calc_gradients(self):
     """
     Override the optimizer parameter calculation by calculating based on OSFT's parameters
     """
     return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+@override
+def _calc_optimizer(self):
+    """
+    Scale optimizer state memory with the unfrozen fraction (two states for AdamW)
+    """
+    return self._calc_model_params() * self.opt_params * OSFT_RATIO(self.unfreeze_rank_ratio)

80-83: Make tokens_per_gpu an integer.
Float division produces a float; this should be an integer for consistent byte math.

-    self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+    self.tokens_per_gpu: int = (effective_batch_size * max_seq_len) // self.num_gpus
 else:
-    self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                   effective_batch_size * max_seq_len / self.num_gpus)
+    self.tokens_per_gpu: int = min(
+        max_tokens_per_gpu,
+        (effective_batch_size * max_seq_len) // self.num_gpus,
+    )
🧹 Nitpick comments (9)
src/training_hub/profiling/memory_estimator.py (9)
1-1: Ensure override import works on Python < 3.12.
typing.override is 3.12+. Add a fallback to typing_extensions or switch to it directly to avoid ImportError on older runtimes.

-from typing import override
+try:
+    from typing import override  # Python 3.12+
+except ImportError:  # pragma: no cover
+    from typing_extensions import override  # type: ignore

4-4: Remove unused NumPy import.
numpy as np is not used.

-import numpy as np

16-17: Broaden ROUNDER type and format.
It's called with floats; update the hint and use a multi-line def for readability.

-def ROUNDER(value: int) -> str: return str(round(value / 1073741824, 1))
+def ROUNDER(value: float) -> str:
+    return str(round(value / 1073741824, 1))

89-91: Explain magic output_constant.
Document the 8/3 factor (empirical vs theoretical) to aid maintainability.

-    self.output_constant = 8/3
+    # Empirical scaling for output tensor footprint (see PR notes); tune as needed
+    self.output_constant = 8/3

143-147: Silence unused kwargs warning.
Rename to _kwargs to address linter ARG002 without changing behavior.

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self, **_kwargs):

321-325: Stop at first matching model family.
Break after the first match to avoid accidental overrides when multiple keys match.

-    for key in MODEL_CONFIGS.keys():
-        if self.model_path.find(key) > -1:
-            self.target_terms = MODEL_CONFIGS[key]['patterns']
+    for key in MODEL_CONFIGS.keys():
+        if self.model_path.find(key) > -1:
+            self.target_terms = MODEL_CONFIGS[key]['patterns']
+            break

342-343: Use min for clarity.
Minor readability improvement.

-    lowest_dim = layer_data.shape[0] if layer_data.shape[0] < \
-        layer_data.shape[1] else layer_data.shape[1]
+    lowest_dim = min(layer_data.shape[0], layer_data.shape[1])

419-419: Validate unfreeze_rank_ratio input (OSFT).
Same check for the basic OSFT estimator.

     self.unfreeze_rank_ratio = unfreeze_rank_ratio
+    if not (0.0 <= self.unfreeze_rank_ratio <= 1.0):
+        raise ValueError("unfreeze_rank_ratio must be within [0, 1]")

127-137: Minor: Tighten exception message (TRY003).
Prefer a concise message.

-    except AttributeError as e:
-        raise ValueError("Could not find the given model's vocabulary size") from e
+    except AttributeError as e:
+        raise ValueError("Could not resolve vocab_size") from e
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
137-137: Avoid specifying long messages outside the exception class
(TRY003)
143-143: Unused method argument: kwargs
(ARG002)
@@ -0,0 +1,516 @@
from typing import override
from transformers import AutoModel
from mini_trainer.osft_utils import MODEL_CONFIGS
Avoid hard dependency at import time on mini_trainer
Importing mini_trainer.osft_utils at module import breaks SFT-only use if mini_trainer isn’t installed. Defer import to the OSFT Experimental constructor and handle absence gracefully.
Apply both changes:
- Remove the module-level import:
-from mini_trainer.osft_utils import MODEL_CONFIGS

- Load lazily inside OSFTEstimatorExperimental.__init__ and fall back safely:
- # Check to see which terms need to be included in the search for valid layers
- self.target_terms = MODEL_CONFIGS['default']['patterns']
- for key in MODEL_CONFIGS.keys():
- if self.model_path.find(key) > -1:
- self.target_terms = MODEL_CONFIGS[key]['patterns']
+ # Resolve target layer patterns lazily to avoid hard dependency on mini_trainer
+ try:
+ from mini_trainer.osft_utils import MODEL_CONFIGS
+ except Exception:
+ MODEL_CONFIGS = {"default": {"patterns": []}} # minimal safe default
+ self.target_terms = MODEL_CONFIGS.get("default", {}).get("patterns", [])
+ for key in MODEL_CONFIGS.keys():
+ if key in self.model_path:
+ self.target_terms = MODEL_CONFIGS[key]["patterns"]
+ break📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from mini_trainer.osft_utils import MODEL_CONFIGS | |
| ++ b/src/training_hub/profiling/memory_estimator.py | |
| @@ –1,5 +1,4 @@ | |
| -from mini_trainer.osft_utils import MODEL_CONFIGS | |
| import torch | |
| from typing import Callable, Optional, override | |
| from transformers import AutoModel | |
| @@ class OSFTEstimatorExperimental(BasicEstimator): | |
| def __init__( | |
| self, | |
| num_gpus: int = 8, | |
| gpu_memory: int = 85899345920, | |
| model_path: str = "ibm-granite/granite-3.3-8b-instruct", | |
| effective_batch_size: int | None = None, | |
| max_seq_len: int | None = None, | |
| max_tokens_per_gpu: int | None = None, | |
| use_liger: bool = False, | |
| verbose: int = 1, | |
| ): | |
| super().__init__(...) | |
| - # Check to see which terms need to be included in the search for valid layers | |
| - self.target_terms = MODEL_CONFIGS['default']['patterns'] | |
| - for key in MODEL_CONFIGS.keys(): | |
| - if self.model_path.find(key) > -1: | |
| # Resolve target layer patterns lazily to avoid hard dependency on mini_trainer | |
| try: | |
| from mini_trainer.osft_utils import MODEL_CONFIGS | |
| except Exception: | |
| MODEL_CONFIGS = {"default": {"patterns": []}} # minimal safe default | |
| self.target_terms = MODEL_CONFIGS.get("default", {}).get("patterns", []) | |
| for key, cfg in MODEL_CONFIGS.items(): | |
| if key in self.model_path: | |
| self.target_terms = cfg.get("patterns", []) | |
| break |
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around line 3, remove the
module-level import of mini_trainer.osft_utils to avoid a hard dependency at
import time; instead, in the OSFTEstimatorExperimental.__init__ perform a lazy
import inside a try/except ImportError block (e.g., try to import MODEL_CONFIGS
from mini_trainer.osft_utils and assign it to an instance attribute or module
variable) and on ImportError set that attribute to None or a safe default and
log or raise a clear runtime error only when OSFT functionality is actually
invoked; also update any code paths in the class to reference the
instance/module attribute and handle the None/default case gracefully so
SFT-only usage works without mini_trainer installed.
def _apply_overhead(self, subtotal):
    """
    In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio
    """
    ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio)
    return super()._apply_overhead(subtotal * ratio_val)
Keep subtotal logic clean; scale only grads/optimizer (OSFT)
Scaling the entire subtotal in _apply_overhead makes the component breakdown inconsistent. Override _calc_gradients and _calc_optimizer instead, and leave _apply_overhead to just apply overhead.
def _apply_overhead(self, subtotal):
"""
In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio
"""
- ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio)
- return super()._apply_overhead(subtotal * ratio_val)
+    return super()._apply_overhead(subtotal)

Add these overrides:
+ @override
+ def _calc_gradients(self):
+ return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+ @override
+ def _calc_optimizer(self):
+ return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _apply_overhead(self, subtotal): | |
| """ | |
| In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio | |
| """ | |
| ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio) | |
| return super()._apply_overhead(subtotal * ratio_val) | |
| def _apply_overhead(self, subtotal): | |
| """ | |
| In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio | |
| """ | |
| return super()._apply_overhead(subtotal) | |
| @override | |
| def _calc_gradients(self): | |
| return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio) | |
| @override | |
| def _calc_optimizer(self): | |
| return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio) |
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 422 to 427,
currently _apply_overhead multiplies the entire subtotal by
OSFT_RATIO(self.unfreeze_rank_ratio); instead, remove that scaling so
_apply_overhead only applies the 0-30% overhead to the subtotal, and implement
two new overrides: _calc_gradients and _calc_optimizer which call the parent
implementations (or existing internal logic) and then multiply only the
gradients and optimizer component values by
OSFT_RATIO(self.unfreeze_rank_ratio). Ensure you do not change the return shape
or keys of those methods, and keep the OSFT scaling isolated to those two
overrides so the component breakdown remains consistent.
| "# Memory Estimator \n", | ||
| "\n", | ||
| "This notebook will provide some examples on how to use the memory_estimator API\n", | ||
| "to estimate the amount of GPU memory consumed when fine-tuning an LLM model in Training Hub.\n", |
"LLM Model is redundant"
Only other thing is adding a note in the
…om coderabbit, some other documentation cleaning
Actionable comments posted: 1
♻️ Duplicate comments (4)
src/training_hub/profiling/memory_estimator.py (4)
427-433: Incorrect scaling approach: scale only gradients/optimizer, not entire subtotal. Scaling the entire subtotal in
_apply_overhead makes the component breakdown inconsistent and applies the OSFT scaling to activations/outputs which shouldn't be affected. Only gradients and optimizer states should scale with unfreeze_rank_ratio.
Remove the scaling from
_apply_overheadand add overrides for_calc_gradientsand_calc_optimizer:@override def _apply_overhead(self, subtotal): """ - In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio + Apply the standard 0-30% overhead """ - ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio) - return super()._apply_overhead(subtotal * ratio_val) + return super()._apply_overhead(subtotal) + + @override + def _calc_gradients(self): + """ + Override to scale gradients by unfreeze_rank_ratio + """ + return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio) + + @override + def _calc_optimizer(self): + """ + Override to scale optimizer states by unfreeze_rank_ratio + """ + return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)
3-3: Avoid hard dependency onmini_trainerat module import.The module-level import of
mini_trainer.osft_utilscreates a hard dependency that breaks SFT-only usage whenmini_traineris not installed. This import is only needed forOSFTEstimatorExperimental.Move the import inside
OSFTEstimatorExperimental.__init__with graceful fallback:-from mini_trainer.osft_utils import MODEL_CONFIGS import numpy as npThen in
OSFTEstimatorExperimental.__init__(after line 322):if not (0.0 <= self.unfreeze_rank_ratio <= 1.0): raise ValueError("Ratio must be in the range [0, 1]") # Check to see which terms need to be included in the search for valid layers + try: + from mini_trainer.osft_utils import MODEL_CONFIGS + except ImportError: + MODEL_CONFIGS = {"default": {"patterns": []}} self.target_terms = MODEL_CONFIGS['default']['patterns'] for key in MODEL_CONFIGS.keys(): if self.model_path.find(key) > -1: self.target_terms = MODEL_CONFIGS[key]['patterns']
80-83: Convert float division results to int.Division by
num_gpusproduces float values, buttokens_per_gpuis annotated asint. Use explicitint()cast to maintain type consistency.Apply this diff:
elif max_tokens_per_gpu is None: - self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus + self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus) else: self.tokens_per_gpu: int = min(max_tokens_per_gpu, - effective_batch_size * max_seq_len / self.num_gpus) + int(effective_batch_size * max_seq_len / self.num_gpus))
377-383: Add_calc_optimizeroverride to scale optimizer states.Currently
_calc_gradientsscales byOSFT_RATIO(unfreeze_rank_ratio), but optimizer states are not scaled. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.As per user feedback from NikhilNayak-debug.
Add this override after
_calc_gradients:@override def _calc_gradients(self): """ Override the optimizer parameter calculation by calculating based on OSFT's parameters """ return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) + + @override + def _calc_optimizer(self): + """ + Override the optimizer calculation to scale with unfreeze_rank_ratio. + AdamW maintains two state tensors per parameter, and only the unfrozen + fraction participates in training. + """ + return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) * ADAMW_PARAMS_N
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (1)
145-149: Remove unusedkwargsparameter.The
**kwargsparameter in_calc_additionalis never used and creates unnecessary API surface.As per static analysis hints.
Apply this diff:
- def _calc_additional(self, **kwargs): + def _calc_additional(self): """ Calculate any additional VRAM that this training method might need """ return 0examples/README.md (1)
83-83: Simplify wordy phrase.The phrase "built off of" is unnecessarily wordy. Consider using a more concise alternative.
As per static analysis hints.
Apply this diff:
-training_hub includes a library for estimating the expected amount of GPU memory that will be allocated during the fine-tuning of a given model using SFT or OSFT. The calculations are built off of formulas presented in the blog post [How To Calculate GPU VRAM Requirements for an Large-Language Model](https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm). +training_hub includes a library for estimating the expected amount of GPU memory that will be allocated during the fine-tuning of a given model using SFT or OSFT. The calculations are based on formulas presented in the blog post [How To Calculate GPU VRAM Requirements for an Large-Language Model](https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/README.md(1 hunks)examples/notebooks/memory_estimator_example.ipynb(1 hunks)src/training_hub/profiling/memory_estimator.py(1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/README.md
[style] ~83-~83: ‘off of’ might be wordy. Consider a shorter alternative.
Context: ...SFT or OSFT. The calculations are built off of formulas presented in the blog post [Ho...
(EN_WORDINESS_PREMIUM_OFF_OF)
[grammar] ~83-~83: There might be a mistake here.
Context: ...calculate-vram-requirements-for-an-llm). NOTE: This feature is still a work in-pr...
(QB_NEW_EN)
[grammar] ~87-~87: There might be a mistake here.
Context: ...l results may still vary. Tutorials: - [Memory Estimation Example](notebooks/mem...
(QB_NEW_EN)
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
139-139: Avoid specifying long messages outside the exception class
(TRY003)
145-145: Unused method argument: kwargs
(ARG002)
322-322: Avoid specifying long messages outside the exception class
(TRY003)
425-425: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
examples/notebooks/memory_estimator_example.ipynb (1)
1-357: LGTM! Well-structured educational notebook.The notebook provides a comprehensive, step-by-step tutorial on using the memory estimation API. It effectively demonstrates:
- Core concepts and memory calculation components
- Both class-based instantiation and convenience function usage
- SFT, OSFT, and Liger Kernels scenarios
- Clear explanations of parameters and their impact
The examples align well with the implemented API and will help users understand how to estimate memory requirements for their training scenarios.
examples/README.md (1)
100-100: Fix syntax error in code example.Missing comma after
use_liger=Trueparameter.Apply this diff:
estimate(training_method='osft', num_gpus=2, model_path="/path/to/model", max_tokens_per_gpu=8192, use_liger=True, verbose=2, - unfreeze_rank_ratio: float = 0.25 + unfreeze_rank_ratio=0.25 )Likely an incorrect or invalid review comment.
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
🛠️ Refactor suggestion | 🟠 Major
Document the magic constant 0.33 in OSFT_RATIO.
The quadratic mapping formula uses 0.33 as a pivot point where OSFT equals SFT memory usage, but this isn't explained. Extract to a named constant with documentation.
Apply this refactor:
+# Default unfreeze ratio where OSFT memory usage equals SFT memory usage (empirically determined)
+DEFAULT_UNFREEZE_RATIO: float = 0.33
+
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
-# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
-def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
+# 0 is 0.25 of SFT's value, DEFAULT_UNFREEZE_RATIO is equal to SFT's value, and 1 is twice of SFT's value)
+def OSFT_RATIO(value: float) -> float:
+ """Calculate OSFT memory scaling factor based on unfreeze_rank_ratio."""
+ return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Helper function to calculate how much the given unfrozen_rank_ratio | |
| # will be affecting the OSFT estimation (through a quadratic mapping where | |
| # 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value) | |
| def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25 | |
| # Default unfreeze ratio where OSFT memory usage equals SFT memory usage (empirically determined) | |
| DEFAULT_UNFREEZE_RATIO: float = 0.33 | |
| # Helper function to calculate how much the given unfrozen_rank_ratio | |
| # will be affecting the OSFT estimation (through a quadratic mapping where | |
| # 0 is 0.25 of SFT's value, DEFAULT_UNFREEZE_RATIO is equal to SFT's value, and 1 is twice of SFT's value) | |
| def OSFT_RATIO(value: float) -> float: | |
| """Calculate OSFT memory scaling factor based on unfreeze_rank_ratio.""" | |
| return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25 |
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 19-22, the
quadratic mapping in OSFT_RATIO uses the unexplained magic pivot value 0.33;
extract that pivot into a named constant (e.g.
UNFROZEN_RANK_RATIO_SFT_EQUIVALENT = 0.33) with a short docstring that explains
"pivot point where OSFT memory equals SFT memory (OSFT_RATIO == 1.0)". Replace
the literal 0.33 in comments/description with the constant and add a brief
docstring above the OSFT_RATIO function describing the quadratic mapping, the
pivot constant, and what the function returns.
RobotSail
left a comment
LGTM
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
@mazam-lab can't we just linearly interpolate here instead of using a quadratic formula? Starting from half of SFT for unfreeze rank ratio of 0 to 2x of SFT for unfreeze rank ratio of 1.
Sure, that could work too. If you're confident in that estimation I'll go ahead and implement it.
did
@mazam-lab we don't need to scale OSFT params by the rank ratio. Just reuse the exact SFT total and linearly interpolate a scale factor between 0.5× (ratio=0) and 2.0× (ratio=1), then apply it to the SFT memory consumption, that's it.
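A minimal sketch of that suggestion (names are illustrative; sft_estimate_bytes stands in for whatever the SFT path actually returns):

def osft_scale(unfreeze_rank_ratio: float) -> float:
    # Linear interpolation: 0.5x of SFT at ratio=0, 2.0x of SFT at ratio=1
    return 0.5 + 1.5 * unfreeze_rank_ratio

sft_estimate_bytes = 100 * 1024**3                       # placeholder SFT estimate (100 GiB)
osft_estimate_bytes = int(osft_scale(0.25) * sft_estimate_bytes)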
Actionable comments posted: 0
♻️ Duplicate comments (4)
src/training_hub/profiling/memory_estimator.py (4)
377-382: Critical: Missing _calc_optimizer override to scale optimizer states. The
_calc_gradientsmethod is scaled byOSFT_RATIO(self.unfreeze_rank_ratio), but optimizer states are not. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.As per user feedback from NikhilNayak-debug and past review comments.
Add this override immediately after
_calc_gradients:@override def _calc_gradients(self): """ Override the optimizer parameter calculation by calculating based on OSFT's parameters """ return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) + + @override + def _calc_optimizer(self): + """ + Override the optimizer calculation to scale with unfreeze_rank_ratio + """ + return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) * self.opt_params
427-433: Critical: Incorrect scaling approach—scale only gradients and optimizer, not entire subtotal. Scaling the entire
subtotalin_apply_overheadmakes the component breakdown inconsistent and incorrectly scales model parameters and activations. Only gradients and optimizer states should scale withunfreeze_rank_ratiobecause only the unfrozen fraction participates in training.As per past review comments.
Replace the
_apply_overheadoverride with proper overrides for_calc_gradientsand_calc_optimizer:- @override - def _apply_overhead(self, subtotal): - """ - In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio - """ - ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio) - return super()._apply_overhead(subtotal * ratio_val) + @override + def _calc_gradients(self): + """ + Override the gradients calculation to scale with unfreeze_rank_ratio + """ + return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio) + + @override + def _calc_optimizer(self): + """ + Override the optimizer calculation to scale with unfreeze_rank_ratio + """ + return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)Based on past review comments.
3-3: Critical: Hard dependency on mini_trainer at import time blocks SFT-only usage. Importing
mini_trainer.osft_utilsat the module level prevents users from usingBasicEstimator(SFT) ifmini_traineris not installed. This violates the stated goal of making the API usable for SFT-only scenarios.Move the import inside
OSFTEstimatorExperimental.__init__and handleImportErrorgracefully:from typing import override from transformers import AutoModel -from mini_trainer.osft_utils import MODEL_CONFIGS import numpy as npThen in
OSFTEstimatorExperimental.__init__(around line 324):self.unfreeze_rank_ratio = unfreeze_rank_ratio if not (0.0 <= self.unfreeze_rank_ratio <= 1.0): raise ValueError("Ratio must be in the range [0, 1]") # Check to see which terms need to be included in the search for valid layers + try: + from mini_trainer.osft_utils import MODEL_CONFIGS + except ImportError: + MODEL_CONFIGS = {"default": {"patterns": []}} self.target_terms = MODEL_CONFIGS['default']['patterns'] for key in MODEL_CONFIGS.keys(): if self.model_path.find(key) > -1: self.target_terms = MODEL_CONFIGS[key]['patterns']Based on past review comments.
80-80: Convert float division to int. Division by
num_gpusproduces afloat, buttokens_per_gpuis annotated asint(line 78). This inconsistency can lead to unexpected behavior in downstream calculations.Apply this diff:
elif max_tokens_per_gpu is None: - self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus + self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus) else: self.tokens_per_gpu: int = min(max_tokens_per_gpu, - effective_batch_size * max_seq_len / self.num_gpus) + int(effective_batch_size * max_seq_len / self.num_gpus))Based on past review comments.
🧹 Nitpick comments (1)
src/training_hub/profiling/memory_estimator.py (1)
346-347: Prefer built-in min() for clarity. The ternary expression is unnecessarily verbose.
Apply this diff:
 for term in self.target_terms:
-    lowest_dim = layer_data.shape[0] if layer_data.shape[0] < \
-        layer_data.shape[1] else layer_data.shape[1]
+    lowest_dim = min(layer_data.shape[0], layer_data.shape[1])
     if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/notebooks/memory_estimator_example.ipynb (1 hunks)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/notebooks/memory_estimator_example.ipynb
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
139-139: Avoid specifying long messages outside the exception class
(TRY003)
145-145: Unused method argument: kwargs
(ARG002)
322-322: Avoid specifying long messages outside the exception class
(TRY003)
425-425: Avoid specifying long messages outside the exception class
(TRY003)
The backend logic still needs some work on my end, but I'm opening the PR now to ensure the API signature and code standards are looking good.
Please note that for 3.0, we'll only be adding a very basic memory profiling API. Hardware discovery, runtime estimations, hyperparameter and hardware recommendations, etc. are planned for 3.1+
Summary by CodeRabbit
New Features
Documentation