
Conversation

@mazam-lab
Contributor

@mazam-lab mazam-lab commented Sep 29, 2025

The backend logic still needs some work on my end, but I'm opening the PR now to ensure the API signature and code standards are looking good.

Please note that for 3.0, we'll only be adding a very basic memory profiling API. Hardware discovery, runtime estimations, hyperparameter and hardware recommendations, etc. are planned for 3.1+

Summary by CodeRabbit

  • New Features

    • Adds a memory estimation tool for fine‑tuning that reports per‑GPU VRAM needs (parameters, optimizer state, gradients, activations, outputs) and returns low/expected/high bounds.
    • Supports SFT and OSFT (including an experimental OSFT mode), Liger‑aware adjustments, and a unified top‑level estimate entry point.
  • Documentation

    • Adds a notebook and README section demonstrating SFT/OSFT workflows, estimator options, and example outputs.

@coderabbitai

coderabbitai bot commented Sep 29, 2025

Walkthrough

Adds a memory estimation module implementing an SFT estimator and two OSFT estimators with a unified estimate(...) entry point, exports these symbols from the package, and includes a notebook plus README docs demonstrating usage and examples.

Changes

Cohort / File(s) Summary
Profiling — Memory estimator module
src/training_hub/profiling/memory_estimator.py
New module adding BasicEstimator, OSFTEstimatorExperimental, OSFTEstimator, and a module-level estimate(...). Loads models via AutoModel.from_pretrained (supports trust_remote_code), extracts model metadata, computes per‑GPU VRAM components (parameters, gradients, optimizer state, activations, outputs), applies overhead bounds (lower/expected/upper), and includes OSFT-specific layer/parameter calculations and experimental SVD/Liger handling.
Public API — Package export
src/training_hub/__init__.py
Imports and exposes new symbols by adding BasicEstimator, OSFTEstimatorExperimental, OSFTEstimator, and estimate to the package __all__.
Examples — Notebook
examples/notebooks/memory_estimator_example.ipynb
New Jupyter notebook demonstrating class-based and convenience-function usage for SFT and OSFT (including Liger kernel notes), showing per‑GPU breakdowns, bounds, and example outputs.
Docs — Examples README
examples/README.md
Adds a Memory Estimation section documenting the experimental GPU memory estimation feature, usage notes, and a quick example calling training_hub.estimate(...).
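
For orientation, a hypothetical call sketch; the parameter names mirror the README example quoted later in this thread and should be checked against the merged API:

# Hypothetical usage sketch; parameter names mirror the README example in this thread.
from training_hub import estimate

estimate(
    training_method="osft",       # "sft" is also supported
    num_gpus=2,
    model_path="/path/to/model",
    max_tokens_per_gpu=8192,
    use_liger=True,               # Liger kernels reduce the activation estimate
    unfreeze_rank_ratio=0.25,     # OSFT-only knob
    verbose=2,                    # print the per-GPU breakdown and low/expected/high bounds
)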

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Caller
  participant Est as Estimator (Basic / OSFT)
  participant HF as AutoModel.from_pretrained
  participant CFG as Model Config

  U->>Est: call estimate(...) or instantiate estimator
  Est->>HF: AutoModel.from_pretrained(model_path, trust_remote_code=...)
  HF-->>Est: model instance
  Est->>CFG: read metadata (num_parameters, trainable_params, layers, hidden_size)
  CFG-->>Est: config values

  rect #F0FFF4
    note right of Est: compute tokens_per_gpu, dtype sizes\nand constants
    Est->>Est: per-layer analysis (_check_layer_params for OSFT experimental)
    Est->>Est: _calc_model_params / _calc_osft_params
  end

  Est->>Est: compute components: params, grads, optimizer, activations, outputs
  Est->>Est: apply overhead multipliers -> (lower, expected, upper)
  Est-->>U: return bounds (and verbose prints if enabled)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through tensors, byte by byte,
Counting layers in the soft moonlight.
SFT hums steady, OSFT peels thin,
I tally VRAM with a twitching chin.
A rabbit's estimate—precise and bright. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 69.57%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly describes the addition of a basic memory profiling API in the specified module and directly reflects the primary change in the pull request.


@RobotSail
Collaborator

@CodeRabbit review

@coderabbitai

coderabbitai bot commented Oct 2, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc2175d and 7ae33de.

📒 Files selected for processing (1)
  • src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.2)
src/training_hub/profiling/memory_estimator.py

130-131: Do not assign a lambda expression, use a def

Rewrite calc_subtotal as a def

(E731)


132-133: Do not assign a lambda expression, use a def

Rewrite calc_overhead as a def

(E731)


143-143: Do not assign a lambda expression, use a def

Rewrite rounded_helper as a def

(E731)


228-228: Multiple statements on one line (colon)

(E701)


229-229: Multiple statements on one line (colon)

(E701)


230-230: Multiple statements on one line (colon)

(E701)


231-231: Multiple statements on one line (colon)

(E701)


# NOTE: Typically, K is somewhere in [10, 30], but some research suggests that K=17.
# TODO: Add papers
gpu_vram_act_low: int = tokens_per_gpu * FLOAT32_BYTES_N * 10 * num_layers * hidden_size
Collaborator

Empirically, if these estimates are not too far from actual memory usage, this should be fine. But isn't it possible to calculate the exact upper bound given the max sequence length and effective batch size? Can't we go through each layer of the model and sum the activation sizes?

Contributor Author

@mazam-lab mazam-lab Oct 3, 2025

Very shortly after opening this PR, I made some discoveries that change this formula:

Basically, the activations are made up of two components:

  1. The mid-layer activations (gpu_vram_act), which can be calculated as max_tokens_per_gpu * FLOAT32_BYTES_N * num_layers * hidden_size (set K to 1)
  2. The output tensor (gpu_vram_out), which can be calculated as max_tokens_per_gpu * vocab_size * FLOAT32_BYTES_N.

From what I've seen empirically, you can get a reasonably good estimate if you use gpu_vram_act + (gpu_vram_out * 2) as your activation memory estimate.

With Liger kernels enabled, you can simply use gpu_vram_act for the entire activation estimate.

But the theoretical justification for using these two formulas is still a bit wishy-washy, so I want to tighten that up a bit if possible.
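
As a rough sketch of how those two components combine (made-up model numbers, not the estimator's actual code):

# Rough sketch with made-up model numbers; not the estimator's actual code.
FLOAT32_BYTES_N = 4

max_tokens_per_gpu = 16_384
num_layers = 40
hidden_size = 4_096
vocab_size = 49_152

# 1. Mid-layer activations (K set to 1)
gpu_vram_act = max_tokens_per_gpu * FLOAT32_BYTES_N * num_layers * hidden_size
# 2. Output tensor over the vocabulary
gpu_vram_out = max_tokens_per_gpu * vocab_size * FLOAT32_BYTES_N

activations = gpu_vram_act + 2 * gpu_vram_out   # without Liger kernels
activations_liger = gpu_vram_act                # with Liger kernels

GIB = 2**30
print(f"{activations / GIB:.1f} GiB without Liger, {activations_liger / GIB:.1f} GiB with Liger")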

Collaborator

@mazam-lab yeah, I think this formula should get us quite close to the actual activation usage. One thing to note, though, is that not all matrices/layers have the same output dimension. For example, some matrices may have dimensions of [4h, h], in which case the output size would be four times larger. But overall I think this should bring us pretty close to the correct value.

lambda tot_val, act_val: tot_val - gpu_vram_par - gpu_vram_opt - gpu_vram_grad - act_val

# Perform those calculations using the lambdas!
gpu_vram_total_high: int = int(1.3 * calc_subtotal(gpu_vram_act_high))
Collaborator

What are these scalars 1.3, 1.2, etc?

Contributor Author

Most formulas for calculating memory add about 10-30% to the initial subtotal to account for overhead, hence the scalars used here.
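
For instance, with a made-up subtotal:

# Made-up subtotal, just to show how the overhead scalars produce the bounds.
GIB = 2**30
subtotal = 52 * GIB  # params + gradients + optimizer state + activations

lower    = 1.1 * subtotal  # assumed +10% lower-bound overhead
expected = 1.2 * subtotal  # +20%
upper    = 1.3 * subtotal  # +30%, the scalar applied to calc_subtotal above

print(f"{lower / GIB:.1f} / {expected / GIB:.1f} / {upper / GIB:.1f} GiB")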

Collaborator

@NikhilNayak-debug NikhilNayak-debug left a comment

Overall this looks like a good draft!

the given hardware to run the model, and note how much more memory
is needed to make the training more feasible.

Note that this estimate assumes training_hub will be used,
Collaborator

Note that this estimate assumes training_hub will be used, in which all data types are float32 and the optimizer is always AdamW.

This reads like training_hub houses the implementation for all algorithms, rather than just being an interface that exposes them. While these are probably the best settings in 95% of training scenarios, that may not always be the case (e.g. FP8 training could be supported by an underlying backend).

Can you please update the messaging in your docs to instead be general?
E.g. "This estimate assumes mixed precision training settings (storage in full precision/FP32, compute in half precision/BF16)."

Contributor Author

@mazam-lab mazam-lab Oct 6, 2025

The point of this comment is to let the user know that we'll be assuming all tensors placed on VRAM will be Float32. So just to make sure I'm understanding correctly, should we let the user know that the underlying backend calculations might not necessarily be entirely done in Float32?

Collaborator

@mazam-lab No, the FP32 clarification is perfect, you don't need to change that. But right now it sounds like float32 is used because of training_hub specifically, rather than it being an established practice that we refer to as mixed precision training.

Collaborator

@RobotSail RobotSail left a comment

Thank you for the PR @mazam-lab, I took a look and left a few comments. Since the calculator is implemented as a script, maybe you could also add a notebook that provides a few examples on its usage?

# TODO: Validate these estimations empirically

# Load model directly
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, dtype="auto")
Collaborator

Would it make sense to have this load as dtype="float32", or does "auto" end up making the calculations work well here? How does the rest of the script differ if the base model itself is stored as float32 vs. the same model in bfloat16?

Contributor Author

The script is hardcoded to use Float32 for all calculations regardless of the AutoModel's loaded data type, so changing this doesn't matter anymore (I originally had a data type toggle as a feature, but removed it since it wasn't too necessary if everything in training_hub is Float32 in practice)

Collaborator

@mazam-lab That makes sense, thanks!

@mazam-lab
Contributor Author

Thank you for the PR @mazam-lab, I took a look and left a few comments. Since the calculator is implemented as a script, maybe you could also add a notebook that provides a few examples on its usage?

Yeah, I can do that; should the notebook provide some empirical results on what the actual memory usage is for some select models and training setups? (i.e. have a table showing the actual memory usage of Granite 2B on OSFT with Liger Kernels and 16,384 tokens)

@mazam-lab
Contributor Author

@NikhilNayak-debug @RobotSail Thank you both so much for the comments!

@mazam-lab
Contributor Author

The class structure has been implemented and an example notebook has been added at examples/notebooks/memory_estimator_example.ipynb

@mazam-lab mazam-lab marked this pull request as ready for review October 10, 2025 18:34
@mazam-lab
Contributor Author

@Maxusmusti @RobotSail @NikhilNayak-debug I'm ready to move this to a normal PR now.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (2)
src/training_hub/profiling/memory_estimator.py (2)

316-320: Critical: Incorrect SVD storage calculation for OSFT.

The calculation for U and V matrices uses the full matrix dimensions (shape[0] * shape[0] and shape[1] * shape[1]), which is incorrect for SVD decomposition. SVD produces matrices U (m×k), S (k), and V (k×n) where k = min(m, n).

Based on past review comments.

Apply this diff to fix the SVD calculation:

         for term in TARGET_TERMS:
             if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
-                U_bytes_n: int = layer_data.shape[0] * layer_data.shape[0] 
-                S_bytes_n: int = layer_data.shape[0] if layer_data.shape[0] < \
-                            layer_data.shape[1] else layer_data.shape[1]
-                V_bytes_n: int = layer_data.shape[1] * layer_data.shape[1]
+                m = layer_data.shape[0]
+                n = layer_data.shape[1]
+                k = min(m, n)
+                U_bytes_n: int = m * k
+                S_bytes_n: int = k
+                V_bytes_n: int = n * k
                 return (U_bytes_n + S_bytes_n + V_bytes_n) * FLOAT32_BYTES_N

227-228: Generalize documentation to avoid training_hub-specific assumptions.

The comment suggests memory characteristics are specific to training_hub, but these are standard mixed-precision training practices. Generalize the message as discussed in past review comments.

Based on past review comments.

Apply this diff:

-        Note that this estimate assumes training_hub will be used, 
-        in which all data types are float32 and the optimizer is always AdamW.
+        Note that this estimate assumes mixed precision training settings 
+        (storage in full precision/FP32, compute in half precision/BF16) 
+        and the AdamW optimizer.
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (2)

14-14: Remove unused constant.

FLOAT8_BYTES_N is defined but never used in the codebase.

Consider removing this constant or documenting its intended future use:

 FLOAT32_BYTES_N: int = 4
 FLOAT16_BYTES_N: int = 2
-FLOAT8_BYTES_N: int = 1
 ADAMW_PARAMS_N: int = 2

122-126: Remove unused kwargs parameter.

The **kwargs parameter in _calc_additional is never used. Either remove it or document its purpose if it's intended for future subclass extensions.

Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ae33de and 4f9ebee.

📒 Files selected for processing (2)
  • examples/notebooks/memory_estimator_example.ipynb (1 hunks)
  • src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
src/training_hub/profiling/memory_estimator.py

18-18: Do not assign a lambda expression, use a def

Rewrite ROUNDER as a def

(E731)


46-46: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


47-47: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


48-48: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


122-122: Unused method argument: kwargs

(ARG002)


281-281: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


282-282: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


283-283: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


351-351: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


352-352: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


353-353: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🔇 Additional comments (2)
examples/notebooks/memory_estimator_example.ipynb (1)

1-539: LGTM! Excellent documentation and examples.

The notebook provides clear, comprehensive examples of the memory estimator API, including:

  • Proper setup and configuration
  • Usage of both BasicEstimator and OSFTEstimator classes
  • Demonstration of the estimate() convenience function
  • Explanation of how Liger Kernels affect memory usage
  • Good pedagogical flow explaining the calculation methodology

The examples effectively illustrate the API's capabilities and will help users understand how to estimate memory requirements for their training scenarios.

src/training_hub/profiling/memory_estimator.py (1)

117-117: Verify embed_tokens presence or add defensive fallback
Confirm that all target Granite models expose model.embed_tokens.num_embeddings, or implement fallback logic using model.config.vocab_size or model.get_input_embeddings(), and raise a clear error if none are available.

@mazam-lab
Copy link
Contributor Author

Coderabbit comments have been addressed


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/training_hub/profiling/memory_estimator.py (1)

344-406: Duplicate: trust_remote_code parameter issue.

This function attempts to pass trust_remote_code to OSFTEstimator (line 393), which will fail due to the missing parameter in its __init__ method. This issue is addressed in the review comment for lines 273-288.

🧹 Nitpick comments (1)
src/training_hub/profiling/memory_estimator.py (1)

120-124: Remove unused kwargs parameter.

The **kwargs parameter is unused and flagged by static analysis. Since no subclass currently uses it, consider removing it for clarity.

Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f9ebee and 88ad308.

📒 Files selected for processing (2)
  • examples/notebooks/memory_estimator_example.ipynb (1 hunks)
  • src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py

120-120: Unused method argument: kwargs

(ARG002)

🔇 Additional comments (6)
src/training_hub/profiling/memory_estimator.py (5)

1-16: LGTM!

The imports are minimal and necessary, constants are well-defined, and the ROUNDER function has been correctly converted from a lambda to a proper function definition as recommended in previous reviews.


17-80: LGTM!

The class definition and initialization look good. Past review feedback has been properly addressed:

  • Type hints now use explicit int | None syntax (lines 43-45, 48)
  • trust_remote_code parameter has been added (line 48)
  • Model loading correctly uses the trust_remote_code parameter (line 57)

127-264: LGTM!

The overhead application, printing methods, and main estimate orchestration are well-implemented. The logic is clear and the breakdown provides useful information to users.


291-342: LGTM!

The OSFT-specific parameter calculations look correct. The SVD decomposition logic for trainable layers and the override of _calc_model_params are properly implemented.


110-118: Guard against missing embed_tokens by falling back to config.vocab_size.
Line 115 assumes self.model.embed_tokens.num_embeddings exists, which can break on architectures without that attribute. Replace with:

- return (self.tokens_per_gpu * self.main_dtype_bytes * self.model.embed_tokens.num_embeddings) * self.output_constant
+ vocab_layer = getattr(self.model, "embed_tokens", None)
+ vocab_size  = vocab_layer.num_embeddings if vocab_layer else self.model.config.vocab_size
+ return (self.tokens_per_gpu * self.main_dtype_bytes * vocab_size) * self.output_constant
examples/notebooks/memory_estimator_example.ipynb (1)

1-539: LGTM!

The example notebook is well-structured and provides clear demonstrations of:

  • BasicEstimator usage for SFT
  • OSFTEstimator usage for OSFT
  • OSFT with Liger Kernels
  • The estimate() convenience function

The explanations are educational and help users understand how memory estimation works. The past review comment about the inline comment has been correctly addressed (line 82 now correctly states "48 GB in bytes").

"""
Calculate the VRAM for storing the model's intermediate activations
"""
return (self.tokens_per_gpu * self.main_dtype_bytes * self.num_layers * self.hidden_size)
Collaborator

@mazam-lab each layer has more than one matrix, and you need to account for that. For example, each Llama layer has 4 attention projections and 3 MLP projections. Also, some layers like the MLP don't always have output dimension h (they expand to a larger width and then project back to h), but we can probably ignore this for now since it won't make a big difference and keeps the tool simple.

Contributor Author

@mazam-lab mazam-lab Oct 14, 2025

Can you clarify what you mean by this? From what I understand from reading blogs, you don't need to store the intermediate matrix computations; you only store the final activations of each layer. So even though Llama has all of those matrices, I think the only thing that gets stored in memory is whatever gets passed into the SiLU at the end, which should be a matrix of size (tokens x hidden_dim):

Screenshot From 2025-10-14 10-21-00

I don't think this needs adjustment because it's not a large part of the calculation, and it lines up with empirical results: the SFT estimator (which does make use of this formula) has been really accurate in my testing.

Contributor Author

From what I understand, you store this temporarily so you can do the backpropagation efficiently.

Collaborator

@mazam-lab if empirically it is accurate, I think it should be okay. But these intermediate activations do add up, because the output of each matrix is required to compute the gradient, unless one trains with activation checkpointing, in which case you calculate the gradients layer by layer. But if the SFT estimates are accurate, this should be fine!
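
For reference, a back-of-the-envelope comparison with assumed Llama-like shapes (not the estimator's code) of the simplified one-tensor-per-layer estimate versus counting each projection's output:

# Back-of-the-envelope comparison with assumed Llama-like shapes; not the estimator's code.
tokens = 16_384        # tokens resident on one GPU (assumption)
hidden = 4_096         # hidden size h (assumption)
inter = 11_008         # MLP intermediate size (assumption)
layers = 32
bytes_per = 4          # float32

# Simplified estimate: one (tokens x hidden) tensor per layer
simple = tokens * hidden * bytes_per * layers

# Counting each projection's output: q/k/v/o of width hidden, gate/up of width inter,
# down of width hidden (ignores grouped-query attention, attention scores, norms, etc.)
per_layer = (4 * hidden + 2 * inter + hidden) * tokens * bytes_per
detailed = per_layer * layers

GIB = 2**30
print(f"simplified: {simple / GIB:.1f} GiB, per-projection: {detailed / GIB:.1f} GiB")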

if layer_data.dim() < 2:
    return layer_data.numel() * FLOAT32_BYTES_N
for term in TARGET_TERMS:
    if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
Collaborator

@mazam-lab this wouldn't work for rectangular matrices where dimension m is not equal to n. In that case U would be (m, min(m, n)) and V would be (min(m, n), n). Can we make that change?

Contributor Author

@mazam-lab mazam-lab Oct 13, 2025

Okay, will do. So I thought U was supposed to be of size m x m and V was supposed to be of size n x n, because that's the only way you can perform matrix multiplication with S of size m x n.

If I'm understanding things right, the reason U and V need to be of size (m x min(m, n)) and (min(m, n) x n) respectively is because S is actually of size (min(m, n) x min(m, n))?

Collaborator

@mazam-lab yeah so for a matrix W of shape (m, n), the call U, S, Vt = torch.linalg.svd(W) returns:
U → shape (m, k)
S → shape (k,)
Vt → shape (k, n)
where k=min(m,n)

The code where we do this is here - https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer/blob/cb8907090945f12f6eb8d70b14eb69f9e5af5a9e/src/mini_trainer/osft_utils.py#L321C9-L321C60
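
A quick shape check of the above (sketch, using the reduced SVD):

# Sketch: verify the reduced-SVD shapes for a rectangular weight of shape (m, n).
import torch

m, n = 1024, 4096                 # example rectangular projection (assumed sizes)
W = torch.randn(m, n)

U, S, Vt = torch.linalg.svd(W, full_matrices=False)  # reduced SVD
k = min(m, n)
assert U.shape == (m, k) and S.shape == (k,) and Vt.shape == (k, n)

# float32 bytes needed to store the decomposition vs. the original matrix
svd_bytes = (U.numel() + S.numel() + Vt.numel()) * 4
print(f"SVD storage: {svd_bytes / 2**20:.0f} MiB vs original: {W.numel() * 4 / 2**20:.0f} MiB")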


# If the layer is trainable, i.e. it contains any of these terms,
# then we will need to store a SVD decomposition of the layer in memory.
TARGET_TERMS: list[str] = ['self_attn.q_proj',
Collaborator

@mazam-lab these are the right parameter keys for Llama, Qwen, Mistral. But other model families have different names that we do the SVD on. For the full list refer to https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer/blob/cb8907090945f12f6eb8d70b14eb69f9e5af5a9e/src/mini_trainer/osft_utils.py#L100 .

Contributor Author

ah perfect, thank you!

@NikhilNayak-debug
Collaborator

@mazam-lab also, right now in the PR we are not accounting for the unfreeze rank ratio, that is, how much of U and V we are actually updating. The total parameter count will use the full matrix size of U and V, but gradients and optimizer states should scale with the rank ratio, since only that fraction of U/V participates in training.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (5)
src/training_hub/profiling/memory_estimator.py (5)

4-4: Remove unused import.

The numpy module is imported but never used in this file.

Apply this diff to remove the unused import:

-import numpy as np

11-13: Remove or document unused constants.

FLOAT16_BYTES_N and FLOAT8_BYTES_N are defined but never used. If these are planned for future precision support, add a comment; otherwise, remove them to reduce clutter.

Apply this diff if not planning to use them:

 FLOAT32_BYTES_N: int = 4
-FLOAT16_BYTES_N: int = 2
-FLOAT8_BYTES_N: int = 1
 ADAMW_PARAMS_N: int = 2

17-17: Extract magic number to named constant.

The value 1073741824 (bytes per GiB) is a magic number. Extract it to a named constant for clarity.

Apply this diff:

+# Conversion constant
+BYTES_PER_GIB: int = 1073741824
+
 # Helper function to do the rounding when printing 
-def ROUNDER(value: int) -> str: return str(round(value / 1073741824, 1))
+def ROUNDER(value: int) -> str: return str(round(value / BYTES_PER_GIB, 1))

123-127: Remove unused kwargs parameter.

The kwargs parameter in _calc_additional() is unused (as flagged by static analysis). If it's intended for future subclass extensions, add a docstring note; otherwise, remove it.

As per static analysis hints.

Apply this diff if not needed:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0

423-423: Document the polynomial ratio formula.

Line 423 uses a hardcoded polynomial formula: -0.7802 * x^2 + 2.5302 * x + 0.25. This appears to be empirically derived but lacks explanation.

Add a comment explaining the source and derivation of these coefficients:

    @override
    def _apply_overhead(self, subtotal):
        """
        In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio.
        
        The ratio_val polynomial was empirically derived from [source/experiment] to approximate
        OSFT's memory scaling behavior relative to SFT.
        """
        # Polynomial approximation: f(x) = -0.7802*x^2 + 2.5302*x + 0.25
        ratio_val = -0.7802 * (self.unfreeze_rank_ratio ** 2) + 2.5302 * self.unfreeze_rank_ratio + 0.25
        return super()._apply_overhead(subtotal * ratio_val)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 88ad308 and a6b7375.

📒 Files selected for processing (3)
  • examples/notebooks/memory_estimator_example.ipynb (1 hunks)
  • src/training_hub/__init__.py (2 hunks)
  • src/training_hub/profiling/memory_estimator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/notebooks/memory_estimator_example.ipynb
🧰 Additional context used
🧬 Code graph analysis (1)
src/training_hub/__init__.py (1)
src/training_hub/profiling/memory_estimator.py (7)
  • BasicEstimator (20-267)
  • OSFTEstimatorExperimental (270-382)
  • estimate (218-267)
  • estimate (366-382)
  • estimate (427-431)
  • estimate (434-500)
  • OSFTEstimator (385-431)
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py

123-123: Unused method argument: kwargs

(ARG002)

🔇 Additional comments (2)
src/training_hub/__init__.py (1)

5-5: LGTM!

The imports and exports are correctly structured and consistent with each other. The new memory estimation API is properly exposed to users of the training_hub package.

Also applies to: 18-22

src/training_hub/profiling/memory_estimator.py (1)

434-500: LGTM!

The module-level estimate() function provides a clean, unified API for both SFT and OSFT estimation. The implementation correctly dispatches to the appropriate estimator class based on the training_method parameter.

if effective_batch_size is None or max_seq_len is None:
    self.tokens_per_gpu: int = max_tokens_per_gpu
elif max_tokens_per_gpu is None:
    self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus

⚠️ Potential issue | 🟡 Minor

Convert float division to int.

Division by num_gpus produces a float, but tokens_per_gpu is annotated as int (line 72) and should remain an integer for consistent calculations. Use integer division (//) or explicit int() cast.

Apply this diff:

         elif max_tokens_per_gpu is None:
-            self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+            self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus)
         else:
             self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                            effective_batch_size * max_seq_len / self.num_gpus)
+                                            int(effective_batch_size * max_seq_len / self.num_gpus))

Also applies to: 76-77

🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 74 and also lines
76-77, the division operations produce floats while tokens_per_gpu (and related
variables) are annotated as ints; change the float divisions to integer results
by using integer division (//) or wrapping the expressions with int(...) so the
assigned values are integers and downstream calculations remain consistent.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/training_hub/profiling/memory_estimator.py (1)

77-83: Cast division results to int to match type annotation.

Lines 80 and 82–83 perform division by num_gpus, which produces a float, but tokens_per_gpu is annotated as int. This type mismatch may cause issues with type checkers or downstream integer-only operations.

Apply this diff to cast the results:

         if effective_batch_size is None or max_seq_len is None:
             self.tokens_per_gpu: int = max_tokens_per_gpu
         elif max_tokens_per_gpu is None:
-            self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+            self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus)
         else:
             self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                            effective_batch_size * max_seq_len / self.num_gpus)
+                                            int(effective_batch_size * max_seq_len / self.num_gpus))
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (2)

19-22: Document the derivation or source of the quadratic formula.

The quadratic coefficients in OSFT_RATIO (−0.7802, 2.5302, 0.25) are unexplained beyond the boundary conditions. Consider adding a comment or docstring that clarifies:

  • The empirical or theoretical basis for this formula
  • Whether it's derived from experimental data or a mathematical model
  • Any references or sources that justify these specific coefficients

This will help future maintainers understand and validate the scaling behavior.
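
For example, a quick numeric check of the anchor points stated in the code comment (sketch):

# Sketch: evaluate the quadratic at its documented anchor points.
def osft_ratio(x: float) -> float:
    return -0.7802 * x * x + 2.5302 * x + 0.25

for x in (0.0, 0.33, 1.0):
    print(x, round(osft_ratio(x), 3))
# 0.0 -> 0.25 (quarter of SFT), 0.33 -> ~1.0 (equal to SFT), 1.0 -> 2.0 (twice SFT)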


143-147: Remove unused kwargs parameter.

The kwargs parameter in _calc_additional is declared but never used. If it's not needed for future extensibility or polymorphism, consider removing it to keep the signature clean.

Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6b7375 and e0f2752.

📒 Files selected for processing (1)
  • src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py

130-130: Do not use bare except

(E722)


133-133: Do not use bare except

(E722)


136-136: Do not use bare except

(E722)


137-137: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


137-137: Avoid specifying long messages outside the exception class

(TRY003)


143-143: Unused method argument: kwargs

(ARG002)

Comment on lines +374 to +378
def _calc_gradients(self):
    """
    Override the optimizer parameter calculation by calculating based on OSFT's parameters
    """
    return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)

⚠️ Potential issue | 🟠 Major

Add _calc_optimizer override to scale optimizer states by unfreeze ratio.

Currently, _calc_gradients is scaled by OSFT_RATIO(unfreeze_rank_ratio), but optimizer states are not. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.

As per user feedback from NikhilNayak-debug.

Add this override after _calc_gradients:

     @override
     def _calc_gradients(self):
         """
         Override the optimizer parameter calculation by calculating based on OSFT's parameters
         """
         return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+    @override
+    def _calc_optimizer(self):
+        """
+        Override the optimizer calculation to scale with unfreeze_rank_ratio
+        """
+        return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 374 to 378, add
an override method named _calc_optimizer immediately after _calc_gradients that
returns the optimizer state memory sized to the unfrozen parameter fraction:
calculate base optimizer parameter count via self._calc_model_params(), multiply
by OSFT_RATIO(self.unfreeze_rank_ratio) to account for only the unfrozen
parameters participating, and multiply by 2 to reflect the two state tensors
AdamW keeps per parameter (i.e., optimizer memory = _calc_model_params() *
OSFT_RATIO(self.unfreeze_rank_ratio) * 2); implement as a concise method that
parallels _calc_gradients.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
src/training_hub/profiling/memory_estimator.py (2)

374-379: Scale optimizer states by unfreeze_rank_ratio (Experimental OSFT)

Gradients are scaled but optimizer states aren’t. AdamW’s two state tensors should scale with the unfrozen fraction too.

     @override
     def _calc_gradients(self):
         """
         Override the optimizer parameter calculation by calculating based on OSFT's parameters
         """
         return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+    @override
+    def _calc_optimizer(self):
+        """
+        Scale optimizer state memory with the unfrozen fraction (two states for AdamW)
+        """
+        return self._calc_model_params() * self.opt_params * OSFT_RATIO(self.unfreeze_rank_ratio)

80-83: Make tokens_per_gpu an integer

Float division produces a float; this should be an integer for consistent byte math.

-            self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+            self.tokens_per_gpu: int = (effective_batch_size * max_seq_len) // self.num_gpus
         else:
-            self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                            effective_batch_size * max_seq_len / self.num_gpus)
+            self.tokens_per_gpu: int = min(
+                max_tokens_per_gpu,
+                (effective_batch_size * max_seq_len) // self.num_gpus,
+            )
🧹 Nitpick comments (9)
src/training_hub/profiling/memory_estimator.py (9)

1-1: Ensure override import works on Python < 3.12

typing.override is 3.12+. Add a fallback to typing_extensions or switch to it directly to avoid ImportError on older runtimes.

-from typing import override
+try:
+    from typing import override  # Python 3.12+
+except ImportError:  # pragma: no cover
+    from typing_extensions import override  # type: ignore

4-4: Remove unused NumPy import

numpy as np is not used.

-import numpy as np

16-17: Broaden ROUNDER type and format

It’s called with floats; update the hint and use a multi-line def for readability.

-def ROUNDER(value: int) -> str: return str(round(value / 1073741824, 1))
+def ROUNDER(value: float) -> str:
+    return str(round(value / 1073741824, 1))

89-91: Explain magic output_constant

Document the 8/3 factor (empirical vs theoretical) to aid maintainability.

-        self.output_constant = 8/3
+        # Empirical scaling for output tensor footprint (see PR notes); tune as needed
+        self.output_constant = 8/3

143-147: Silence unused kwargs warning

Rename to _kwargs to address linter ARG002 without changing behavior.

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self, **_kwargs):

321-325: Stop at first matching model family

Break after the first match to avoid accidental overrides when multiple keys match.

-        for key in MODEL_CONFIGS.keys():
-            if self.model_path.find(key) > -1:
-                self.target_terms = MODEL_CONFIGS[key]['patterns']
+        for key in MODEL_CONFIGS.keys():
+            if self.model_path.find(key) > -1:
+                self.target_terms = MODEL_CONFIGS[key]['patterns']
+                break

342-343: Use min for clarity

Minor readability improvement.

-            lowest_dim = layer_data.shape[0] if layer_data.shape[0] < \
-                            layer_data.shape[1] else layer_data.shape[1]
+            lowest_dim = min(layer_data.shape[0], layer_data.shape[1])

419-419: Validate unfreeze_rank_ratio input (OSFT)

Same check for the basic OSFT estimator.

         self.unfreeze_rank_ratio = unfreeze_rank_ratio
+        if not (0.0 <= self.unfreeze_rank_ratio <= 1.0):
+            raise ValueError("unfreeze_rank_ratio must be within [0, 1]")

127-137: Minor: Tighten exception message (TRY003)

Prefer a concise message.

-                    except AttributeError as e:
-                        raise ValueError("Could not find the given model's vocabulary size") from e
+                    except AttributeError as e:
+                        raise ValueError("Could not resolve vocab_size") from e
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0f2752 and 7cd94a9.

📒 Files selected for processing (1)
  • src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py

137-137: Avoid specifying long messages outside the exception class

(TRY003)


143-143: Unused method argument: kwargs

(ARG002)

@@ -0,0 +1,516 @@
from typing import override
from transformers import AutoModel
from mini_trainer.osft_utils import MODEL_CONFIGS

⚠️ Potential issue | 🟠 Major

Avoid hard dependency at import time on mini_trainer

Importing mini_trainer.osft_utils at module import breaks SFT-only use if mini_trainer isn’t installed. Defer import to the OSFT Experimental constructor and handle absence gracefully.

Apply both changes:

  1. Remove the module-level import:
-from mini_trainer.osft_utils import MODEL_CONFIGS
  1. Load lazily inside OSFTEstimatorExperimental.__init__ and fall back safely:
-        # Check to see which terms need to be included in the search for valid layers
-        self.target_terms = MODEL_CONFIGS['default']['patterns']
-        for key in MODEL_CONFIGS.keys():
-            if self.model_path.find(key) > -1:
-                self.target_terms = MODEL_CONFIGS[key]['patterns']
+        # Resolve target layer patterns lazily to avoid hard dependency on mini_trainer
+        try:
+            from mini_trainer.osft_utils import MODEL_CONFIGS
+        except Exception:
+            MODEL_CONFIGS = {"default": {"patterns": []}}  # minimal safe default
+        self.target_terms = MODEL_CONFIGS.get("default", {}).get("patterns", [])
+        for key in MODEL_CONFIGS.keys():
+            if key in self.model_path:
+                self.target_terms = MODEL_CONFIGS[key]["patterns"]
+                break
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around line 3, remove the
module-level import of mini_trainer.osft_utils to avoid a hard dependency at
import time; instead, in the OSFTEstimatorExperimental.__init__ perform a lazy
import inside a try/except ImportError block (e.g., try to import MODEL_CONFIGS
from mini_trainer.osft_utils and assign it to an instance attribute or module
variable) and on ImportError set that attribute to None or a safe default and
log or raise a clear runtime error only when OSFT functionality is actually
invoked; also update any code paths in the class to reference the
instance/module attribute and handle the None/default case gracefully so
SFT-only usage works without mini_trainer installed.

Comment on lines +422 to +427
def _apply_overhead(self, subtotal):
    """
    In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio
    """
    ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio)
    return super()._apply_overhead(subtotal * ratio_val)

⚠️ Potential issue | 🟠 Major

Keep subtotal logic clean; scale only grads/optimizer (OSFT)

Scaling the entire subtotal in _apply_overhead makes the component breakdown inconsistent. Override _calc_gradients and _calc_optimizer instead, and leave _apply_overhead to just apply overhead.

     def _apply_overhead(self, subtotal):
         """
         In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio
         """
-        ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio)
-        return super()._apply_overhead(subtotal * ratio_val)        
+        return super()._apply_overhead(subtotal)

Add these overrides:

+    @override
+    def _calc_gradients(self):
+        return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+    @override
+    def _calc_optimizer(self):
+        return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 422 to 427,
currently _apply_overhead multiplies the entire subtotal by
OSFT_RATIO(self.unfreeze_rank_ratio); instead, remove that scaling so
_apply_overhead only applies the 0-30% overhead to the subtotal, and implement
two new overrides: _calc_gradients and _calc_optimizer which call the parent
implementations (or existing internal logic) and then multiply only the
gradients and optimizer component values by
OSFT_RATIO(self.unfreeze_rank_ratio). Ensure you do not change the return shape
or keys of those methods, and keep the OSFT scaling isolated to those two
overrides so the component breakdown remains consistent.

"# Memory Estimator \n",
"\n",
"This notebook will provide some examples on how to use the memory_estimator API\n",
"to estimate the amount of GPU memory consumed when fine-tuning an LLM model in Training Hub.\n",
Contributor

"LLM Model is redundant"

@Maxusmusti
Contributor

The only other thing is adding a note in the examples dir README linking to the newly added notebook and briefly explaining the "Memory Estimator API (In-Progress/Experimental)". It can also be added as a reference at the bottom of the main README.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
src/training_hub/profiling/memory_estimator.py (4)

427-433: Incorrect scaling approach: scale only gradients/optimizer, not entire subtotal.

Scaling the entire subtotal in _apply_overhead makes the component breakdown inconsistent and applies the OSFT scaling to activations/outputs which shouldn't be affected. Only gradients and optimizer states should scale with unfreeze_rank_ratio.

As per user feedback from NikhilNayak-debug.

Remove the scaling from _apply_overhead and add overrides for _calc_gradients and _calc_optimizer:

     @override
     def _apply_overhead(self, subtotal):
         """
-        In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio
+        Apply the standard 0-30% overhead
         """
-        ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio)
-        return super()._apply_overhead(subtotal * ratio_val)        
+        return super()._apply_overhead(subtotal)
+    
+    @override
+    def _calc_gradients(self):
+        """
+        Override to scale gradients by unfreeze_rank_ratio
+        """
+        return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio)
+    
+    @override
+    def _calc_optimizer(self):
+        """
+        Override to scale optimizer states by unfreeze_rank_ratio
+        """
+        return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)

3-3: Avoid hard dependency on mini_trainer at module import.

The module-level import of mini_trainer.osft_utils creates a hard dependency that breaks SFT-only usage when mini_trainer is not installed. This import is only needed for OSFTEstimatorExperimental.

Move the import inside OSFTEstimatorExperimental.__init__ with graceful fallback:

-from mini_trainer.osft_utils import MODEL_CONFIGS
 import numpy as np

Then in OSFTEstimatorExperimental.__init__ (after line 322):

         if not (0.0 <= self.unfreeze_rank_ratio <= 1.0):
             raise ValueError("Ratio must be in the range [0, 1]")
 
         # Check to see which terms need to be included in the search for valid layers
+        try:
+            from mini_trainer.osft_utils import MODEL_CONFIGS
+        except ImportError:
+            MODEL_CONFIGS = {"default": {"patterns": []}}
         self.target_terms = MODEL_CONFIGS['default']['patterns']
         for key in MODEL_CONFIGS.keys():
             if self.model_path.find(key) > -1:
                 self.target_terms = MODEL_CONFIGS[key]['patterns']

80-83: Convert float division results to int.

Division by num_gpus produces float values, but tokens_per_gpu is annotated as int. Use explicit int() cast to maintain type consistency.

Apply this diff:

         elif max_tokens_per_gpu is None:
-            self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+            self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus)
         else:
             self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                            effective_batch_size * max_seq_len / self.num_gpus)
+                                            int(effective_batch_size * max_seq_len / self.num_gpus))

377-383: Add _calc_optimizer override to scale optimizer states.

Currently _calc_gradients scales by OSFT_RATIO(unfreeze_rank_ratio), but optimizer states are not scaled. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.

As per user feedback from NikhilNayak-debug.

Add this override after _calc_gradients:

     @override
     def _calc_gradients(self):
         """
         Override the optimizer parameter calculation by calculating based on OSFT's parameters
         """
         return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+    @override
+    def _calc_optimizer(self):
+        """
+        Override the optimizer calculation to scale with unfreeze_rank_ratio.
+        AdamW maintains two state tensors per parameter, and only the unfrozen
+        fraction participates in training.
+        """
+        return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) * ADAMW_PARAMS_N
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (1)

145-149: Remove unused kwargs parameter.

The **kwargs parameter in _calc_additional is never used and creates unnecessary API surface.

As per static analysis hints.

Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
examples/README.md (1)

83-83: Simplify wordy phrase.

The phrase "built off of" is unnecessarily wordy. Consider using a more concise alternative.

As per static analysis hints.

Apply this diff:

-training_hub includes a library for estimating the expected amount of GPU memory that will be allocated during the fine-tuning of a given model using SFT or OSFT. The calculations are built off of formulas presented in the blog post [How To Calculate GPU VRAM Requirements for an Large-Language Model](https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm).
+training_hub includes a library for estimating the expected amount of GPU memory that will be allocated during the fine-tuning of a given model using SFT or OSFT. The calculations are based on formulas presented in the blog post [How To Calculate GPU VRAM Requirements for an Large-Language Model](https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm).
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cd94a9 and c781b7c.

📒 Files selected for processing (3)
  • examples/README.md (1 hunks)
  • examples/notebooks/memory_estimator_example.ipynb (1 hunks)
  • src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/README.md

[style] ~83-~83: ‘off of’ might be wordy. Consider a shorter alternative.
Context: ...SFT or OSFT. The calculations are built off of formulas presented in the blog post [Ho...

(EN_WORDINESS_PREMIUM_OFF_OF)


[grammar] ~83-~83: There might be a mistake here.
Context: ...calculate-vram-requirements-for-an-llm). NOTE: This feature is still a work in-pr...

(QB_NEW_EN)


[grammar] ~87-~87: There might be a mistake here.
Context: ...l results may still vary. Tutorials: - [Memory Estimation Example](notebooks/mem...

(QB_NEW_EN)

🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py

139-139: Avoid specifying long messages outside the exception class

(TRY003)


145-145: Unused method argument: kwargs

(ARG002)


322-322: Avoid specifying long messages outside the exception class

(TRY003)


425-425: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (2)
examples/notebooks/memory_estimator_example.ipynb (1)

1-357: LGTM! Well-structured educational notebook.

The notebook provides a comprehensive, step-by-step tutorial on using the memory estimation API. It effectively demonstrates:

  • Core concepts and memory calculation components
  • Both class-based instantiation and convenience function usage
  • SFT, OSFT, and Liger Kernels scenarios
  • Clear explanations of parameters and their impact

The examples align well with the implemented API and will help users understand how to estimate memory requirements for their training scenarios.

examples/README.md (1)

100-100: Fix syntax error in code example.

The call uses parameter-annotation syntax (unfreeze_rank_ratio: float = 0.25), which is not valid inside a function call.

Apply this diff:

 estimate(training_method='osft',
     num_gpus=2,
     model_path="/path/to/model",
     max_tokens_per_gpu=8192,
     use_liger=True,
     verbose=2,
-    unfreeze_rank_ratio: float = 0.25
+    unfreeze_rank_ratio=0.25
 )

Likely an incorrect or invalid review comment.

Comment on lines 19 to 22
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25

🛠️ Refactor suggestion | 🟠 Major

Document the magic constant 0.33 in OSFT_RATIO.

The quadratic mapping formula uses 0.33 as a pivot point where OSFT equals SFT memory usage, but this isn't explained. Extract to a named constant with documentation.

Apply this refactor:

+# Default unfreeze ratio where OSFT memory usage equals SFT memory usage (empirically determined)
+DEFAULT_UNFREEZE_RATIO: float = 0.33
+
 # Helper function to calculate how much the given unfrozen_rank_ratio 
 # will be affecting the OSFT estimation (through a quadratic mapping where
-# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
-def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
+# 0 is 0.25 of SFT's value, DEFAULT_UNFREEZE_RATIO is equal to SFT's value, and 1 is twice of SFT's value)
+def OSFT_RATIO(value: float) -> float: 
+    """Calculate OSFT memory scaling factor based on unfreeze_rank_ratio."""
+    return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
# Default unfreeze ratio where OSFT memory usage equals SFT memory usage (empirically determined)
DEFAULT_UNFREEZE_RATIO: float = 0.33
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
# 0 is 0.25 of SFT's value, DEFAULT_UNFREEZE_RATIO is equal to SFT's value, and 1 is twice of SFT's value)
def OSFT_RATIO(value: float) -> float:
"""Calculate OSFT memory scaling factor based on unfreeze_rank_ratio."""
return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 19-22, the
quadratic mapping in OSFT_RATIO uses the unexplained magic pivot value 0.33;
extract that pivot into a named constant (e.g.
UNFROZEN_RANK_RATIO_SFT_EQUIVALENT = 0.33) with a short docstring that explains
"pivot point where OSFT memory equals SFT memory (OSFT_RATIO == 1.0)". Replace
the literal 0.33 in comments/description with the constant and add a brief
docstring above the OSFT_RATIO function describing the quadratic mapping, the
pivot constant, and what the function returns.
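For a quick sanity check of the anchor points described in that comment, the snippet below re-evaluates the quadratic exactly as written in the diff; the rounded output values in the comments are computed here, not taken from the PR:

# Standalone check of the quadratic's anchor points (rounded output values).
def OSFT_RATIO(value: float) -> float:
    return -0.7802 * value * value + 2.5302 * value + 0.25

print(round(OSFT_RATIO(0.0), 4))   # 0.25 -> a quarter of the SFT estimate
print(round(OSFT_RATIO(0.33), 4))  # 1.0  -> equal to the SFT estimate
print(round(OSFT_RATIO(1.0), 4))   # 2.0  -> twice the SFT estimate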

Collaborator

@RobotSail RobotSail left a comment

LGTM

# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
Collaborator

@mazam-lab can't we just linearly interpolate here instead of using a quadratic formula? Starting from half of SFT for an unfreeze rank ratio of 0 to 2x of SFT for an unfreeze rank ratio of 1.

Contributor Author

Sure, that could work too. If you're confident in that estimation, I'll go ahead and implement it.

Contributor Author

Done.

@NikhilNayak-debug
Collaborator

@mazam-lab we don't need to scale OSFT params by the rank ratio. Just reuse the exact SFT total and linearly interpolate a scale factor between 0.5× (ratio=0) and 2.0× (ratio=1), then apply it to the SFT memory consumption; that's it.
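A minimal sketch of what that suggestion would look like (hypothetical helper names; the 0.5x and 2.0x endpoints come from the comments above, not from the merged code):

# Hypothetical sketch of the proposed linear interpolation; names are illustrative.
def linear_osft_scale(unfreeze_rank_ratio: float) -> float:
    # 0.5x of the SFT estimate at ratio=0, 2.0x at ratio=1, linear in between.
    return 0.5 + 1.5 * unfreeze_rank_ratio

def estimate_osft_total(sft_total_bytes: float, unfreeze_rank_ratio: float) -> float:
    # Reuse the SFT total as-is and apply the interpolated scale factor.
    return sft_total_bytes * linear_osft_scale(unfreeze_rank_ratio)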

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
src/training_hub/profiling/memory_estimator.py (4)

377-382: Critical: Missing _calc_optimizer override to scale optimizer states.

The _calc_gradients method is scaled by OSFT_RATIO(self.unfreeze_rank_ratio), but optimizer states are not. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.

As per user feedback from NikhilNayak-debug and past review comments.

Add this override immediately after _calc_gradients:

     @override
     def _calc_gradients(self):
         """
         Override the optimizer parameter calculation by calculating based on OSFT's parameters
         """
         return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+    @override
+    def _calc_optimizer(self):
+        """
+        Override the optimizer calculation to scale with unfreeze_rank_ratio
+        """
+        return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) * self.opt_params

427-433: Critical: Incorrect scaling approach—scale only gradients and optimizer, not entire subtotal.

Scaling the entire subtotal in _apply_overhead makes the component breakdown inconsistent and incorrectly scales model parameters and activations. Only gradients and optimizer states should scale with unfreeze_rank_ratio because only the unfrozen fraction participates in training.

As per past review comments.

Replace the _apply_overhead override with proper overrides for _calc_gradients and _calc_optimizer:

-    @override
-    def _apply_overhead(self, subtotal):
-        """
-        In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio
-        """
-        ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio)
-        return super()._apply_overhead(subtotal * ratio_val)        
+    @override
+    def _calc_gradients(self):
+        """
+        Override the gradients calculation to scale with unfreeze_rank_ratio
+        """
+        return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+    @override
+    def _calc_optimizer(self):
+        """
+        Override the optimizer calculation to scale with unfreeze_rank_ratio
+        """
+        return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)

Based on past review comments.


3-3: Critical: Hard dependency on mini_trainer at import time blocks SFT-only usage.

Importing mini_trainer.osft_utils at the module level prevents users from using BasicEstimator (SFT) if mini_trainer is not installed. This violates the stated goal of making the API usable for SFT-only scenarios.

Move the import inside OSFTEstimatorExperimental.__init__ and handle ImportError gracefully:

 from typing import override
 from transformers import AutoModel
-from mini_trainer.osft_utils import MODEL_CONFIGS
 import numpy as np

Then in OSFTEstimatorExperimental.__init__ (around line 324):

         self.unfreeze_rank_ratio = unfreeze_rank_ratio
         if not (0.0 <= self.unfreeze_rank_ratio <= 1.0):
             raise ValueError("Ratio must be in the range [0, 1]")
 
         # Check to see which terms need to be included in the search for valid layers
+        try:
+            from mini_trainer.osft_utils import MODEL_CONFIGS
+        except ImportError:
+            MODEL_CONFIGS = {"default": {"patterns": []}}
         self.target_terms = MODEL_CONFIGS['default']['patterns']
         for key in MODEL_CONFIGS.keys():
             if self.model_path.find(key) > -1:
                 self.target_terms = MODEL_CONFIGS[key]['patterns']

Based on past review comments.


80-80: Convert float division to int.

Division by num_gpus produces a float, but tokens_per_gpu is annotated as int (line 78). This inconsistency can lead to unexpected behavior in downstream calculations.

Apply this diff:

         elif max_tokens_per_gpu is None:
-            self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+            self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus)
         else:
             self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                            effective_batch_size * max_seq_len / self.num_gpus)
+                                            int(effective_batch_size * max_seq_len / self.num_gpus))

Based on past review comments.

🧹 Nitpick comments (1)
src/training_hub/profiling/memory_estimator.py (1)

346-347: Prefer built-in min() for clarity.

The ternary expression is unnecessarily verbose.

Apply this diff:

         for term in self.target_terms:
-            lowest_dim = layer_data.shape[0] if layer_data.shape[0] < \
-                            layer_data.shape[1] else layer_data.shape[1]
+            lowest_dim = min(layer_data.shape[0], layer_data.shape[1])
             if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c781b7c and c06e934.

📒 Files selected for processing (2)
  • examples/notebooks/memory_estimator_example.ipynb (1 hunks)
  • src/training_hub/profiling/memory_estimator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/notebooks/memory_estimator_example.ipynb
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py

139-139: Avoid specifying long messages outside the exception class

(TRY003)


145-145: Unused method argument: kwargs

(ARG002)


322-322: Avoid specifying long messages outside the exception class

(TRY003)


425-425: Avoid specifying long messages outside the exception class

(TRY003)

@Maxusmusti Maxusmusti merged commit e4e1b55 into Red-Hat-AI-Innovation-Team:main Oct 14, 2025
4 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Oct 23, 2025