Adding basic API for memory profiling (src/training_hub/profiling) #11

Conversation
Walkthrough: Adds a memory estimation module implementing SFT and two OSFT estimators with a unified estimate() entry point.
Sequence Diagram(s):

sequenceDiagram
autonumber
participant U as Caller
participant Est as Estimator (Basic / OSFT)
participant HF as AutoModel.from_pretrained
participant CFG as Model Config
U->>Est: call estimate(...) or instantiate estimator
Est->>HF: AutoModel.from_pretrained(model_path, trust_remote_code=...)
HF-->>Est: model instance
Est->>CFG: read metadata (num_parameters, trainable_params, layers, hidden_size)
CFG-->>Est: config values
rect #F0FFF4
note right of Est: compute tokens_per_gpu, dtype sizes\nand constants
Est->>Est: per-layer analysis (_check_layer_params for OSFT experimental)
Est->>Est: _calc_model_params / _calc_osft_params
end
Est->>Est: compute components: params, grads, optimizer, activations, outputs
Est->>Est: apply overhead multipliers -> (lower, expected, upper)
Est-->>U: return bounds (and verbose prints if enabled)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches:
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@CodeRabbit review

✅ Actions performed: Review triggered.

1 similar comment

✅ Actions performed: Review triggered.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.2)
src/training_hub/profiling/memory_estimator.py
130-131: Do not assign a lambda expression, use a def
Rewrite calc_subtotal as a def
(E731)
132-133: Do not assign a lambda expression, use a def
Rewrite calc_overhead as a def
(E731)
143-143: Do not assign a lambda expression, use a def
Rewrite rounded_helper as a def
(E731)
228-228: Multiple statements on one line (colon)
(E701)
229-229: Multiple statements on one line (colon)
(E701)
230-230: Multiple statements on one line (colon)
(E701)
231-231: Multiple statements on one line (colon)
(E701)
# NOTE: Typically, K is somewhere in [10, 30], but some research suggests that K=17.
# TODO: Add papers
gpu_vram_act_low: int = tokens_per_gpu * FLOAT32_BYTES_N * 10 * num_layers * hidden_size
Empirically if we see these estimates are not too far from actual memory this should be fine. But isn't it possible to calculate the exact upper bound given max sequence length and effective batch size? Can't we go through each layer of the model and sum the activation size?
Very shortly after I did this PR I made some discoveries that change this formula:
Basically, the activations are made up of two components:
- The mid-layer activations (gpu_vram_act), which can be calculated as max_tokens_per_gpu * FLOAT32_BYTES_N * num_layers * hidden_size (set K to 1)
- The output tensor (gpu_vram_out), which can be calculated as max_tokens_per_gpu * vocab_size * FLOAT32_BYTES_N.
From what I've seen empirically, you can get a reasonably good estimate if you use gpu_vram_act + (gpu_vram_out * 2) as your activation memory estimate
With liger kernels enabled you can simply use gpu_vram_act for the entire activation estimate
But the theoretical justification on using these two formulas is still a bit wishy-washy, so I want to tighten that up a bit if possible
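For reference, a minimal Python sketch of the two-component activation estimate described above; the function and variable names here are illustrative placeholders, not the PR's actual API:

FLOAT32_BYTES_N = 4  # bytes per float32 element

def estimate_activation_bytes(max_tokens_per_gpu: int, num_layers: int,
                              hidden_size: int, vocab_size: int,
                              use_liger: bool = False) -> int:
    # Mid-layer activations: roughly one (tokens x hidden_size) float32 tensor per layer
    gpu_vram_act = max_tokens_per_gpu * FLOAT32_BYTES_N * num_layers * hidden_size
    # Output logits tensor: (tokens x vocab_size) in float32
    gpu_vram_out = max_tokens_per_gpu * vocab_size * FLOAT32_BYTES_N
    # Empirical rule of thumb from this thread: with Liger kernels enabled,
    # the output-tensor term can be dropped entirely
    return gpu_vram_act if use_liger else gpu_vram_act + 2 * gpu_vram_out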
@mazam-lab yeah I think this formula should get us quite close to the actual activation usage. One thing to note though is that not all matrices/layers have the same output dimension. Example some matrices may have dimensions of [4h, h] in which case the output size would be four times larger. But overall I think this should bring us pretty close to the correct value.
lambda tot_val, act_val: tot_val - gpu_vram_par - gpu_vram_opt - gpu_vram_grad - act_val

# Perform those calculations using the lambdas!
gpu_vram_total_high: int = int(1.3 * calc_subtotal(gpu_vram_act_high))
What are these scalars 1.3, 1.2, etc?
Most formulas for calculating memory add about 10-30% to the initial subtotal to account for overhead, hence the scalars used here
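To make that concrete, here is a small illustrative sketch of how such overhead bounds could be produced; the exact multipliers and names are assumptions based on this thread, not the PR's final code:

def apply_overhead(subtotal_bytes: int) -> tuple[int, int, int]:
    # Add roughly 10-30% on top of the component subtotal to cover
    # CUDA context, fragmentation, and other runtime overhead
    lower = int(1.1 * subtotal_bytes)
    expected = int(1.2 * subtotal_bytes)
    upper = int(1.3 * subtotal_bytes)
    return lower, expected, upper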
NikhilNayak-debug
left a comment
Overall this looks like a good draft!
the given hardware to run the model, and note how much more memory
is needed to make the training more feasible.

Note that this estimate assumes training_hub will be used,
Note that this estimate assumes training_hub will be used, in which all data types are float32 and the optimizer is always AdamW.
This reads like training hub houses the implementation for all algorithms, rather than just being an interface which exposes them. While these are probably the best settings in 95% of training scenarios, it also may not always be the case (e.g. FP8 training could be supported by an underlying backend).
Can you please update the messaging in your docs to instead be general?
E.g. This estimate assumes mixed precision training settings (storage in full precision/FP32, compute in half precision/BF16)
The point of this comment is to let the user know that we'll be assuming all tensors placed on VRAM will be Float32. So just to make sure I'm understanding correctly, should we let the user know that the underlying backend calculations might not necessarily be entirely done in Float32?
@mazam-lab No, the FP32 clarification is perfect, you don't need to change that. But right now it sounds like the float32 is because of training hub specifically, rather than being an established practice which we refer to as mixed precision training.
RobotSail
left a comment
Thank you for the PR @mazam-lab, I took a look and left a few comments. Since the calculator is implemented as a script, maybe you could also add a notebook that provides a few examples on its usage?
# TODO: Validate these estimations empirically

# Load model directly
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, dtype="auto")
Would it make sense to have this load as dtype="float32", or does "auto" end up making the calculations work well here? How does the rest of the script differ if the base model itself is stored as float32 vs. the same model in bfloat16?
The script is hardcoded to use Float32 for all calculations regardless of the AutoModel's loaded data type, so changing this doesn't matter anymore (I originally had a data type toggle as a feature, but removed it since it wasn't too necessary if everything in training_hub is Float32 in practice)
@mazam-lab That makes sense, thanks!
Yeah I can do that; should the notebook provide some empirical results on what the actual memory usage for some select models and training setups are? (i.e. have a table showing the actual memory usage of Granite 2B on OSFT with Liger Kernels and 16,384 tokens)

@NikhilNayak-debug @RobotSail Thank you both so much for the comments!

Class structure has been implemented and an example notebook has been added.

@Maxusmusti @RobotSail @NikhilNayak-debug I'm ready to move this to a normal PR now.
Actionable comments posted: 7
♻️ Duplicate comments (2)
src/training_hub/profiling/memory_estimator.py (2)
316-320: Critical: Incorrect SVD storage calculation for OSFT.
The calculation for U and V matrices uses the full matrix dimensions (shape[0] * shape[0] and shape[1] * shape[1]), which is incorrect for SVD decomposition. SVD produces matrices U (m×k), S (k), and V (k×n) where k = min(m, n). Based on past review comments.
Apply this diff to fix the SVD calculation:

 for term in TARGET_TERMS:
     if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
-        U_bytes_n: int = layer_data.shape[0] * layer_data.shape[0]
-        S_bytes_n: int = layer_data.shape[0] if layer_data.shape[0] < \
-            layer_data.shape[1] else layer_data.shape[1]
-        V_bytes_n: int = layer_data.shape[1] * layer_data.shape[1]
+        m = layer_data.shape[0]
+        n = layer_data.shape[1]
+        k = min(m, n)
+        U_bytes_n: int = m * k
+        S_bytes_n: int = k
+        V_bytes_n: int = n * k
         return (U_bytes_n + S_bytes_n + V_bytes_n) * FLOAT32_BYTES_N
227-228: Generalize documentation to avoid training_hub-specific assumptions.
The comment suggests memory characteristics are specific to training_hub, but these are standard mixed-precision training practices. Generalize the message as discussed in past review comments.
Based on past review comments.
Apply this diff:
-    Note that this estimate assumes training_hub will be used,
-    in which all data types are float32 and the optimizer is always AdamW.
+    Note that this estimate assumes mixed precision training settings
+    (storage in full precision/FP32, compute in half precision/BF16)
+    and the AdamW optimizer.
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (2)
14-14: Remove unused constant.
FLOAT8_BYTES_N is defined but never used in the codebase. Consider removing this constant or documenting its intended future use:

 FLOAT32_BYTES_N: int = 4
 FLOAT16_BYTES_N: int = 2
-FLOAT8_BYTES_N: int = 1
 ADAMW_PARAMS_N: int = 2
122-126: Remove unused kwargs parameter.
The **kwargs parameter in _calc_additional is never used. Either remove it or document its purpose if it's intended for future subclass extensions. Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/notebooks/memory_estimator_example.ipynb (1 hunks)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
src/training_hub/profiling/memory_estimator.py
18-18: Do not assign a lambda expression, use a def
Rewrite ROUNDER as a def
(E731)
46-46: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
47-47: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
48-48: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
122-122: Unused method argument: kwargs
(ARG002)
281-281: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
282-282: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
283-283: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
351-351: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
352-352: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
353-353: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
🔇 Additional comments (2)
examples/notebooks/memory_estimator_example.ipynb (1)
1-539: LGTM! Excellent documentation and examples. The notebook provides clear, comprehensive examples of the memory estimator API, including:
- Proper setup and configuration
- Usage of both BasicEstimator and OSFTEstimator classes
- Demonstration of the estimate() convenience function
- Explanation of how Liger Kernels affect memory usage
- Good pedagogical flow explaining the calculation methodology
The examples effectively illustrate the API's capabilities and will help users understand how to estimate memory requirements for their training scenarios.
src/training_hub/profiling/memory_estimator.py (1)
117-117: Verify embed_tokens presence or add defensive fallback.
Confirm that all target Granite models expose model.embed_tokens.num_embeddings, or implement fallback logic using model.config.vocab_size or model.get_input_embeddings(), and raise a clear error if none are available.
Coderabbit comments have been addressed
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/training_hub/profiling/memory_estimator.py (1)
344-406: Duplicate: trust_remote_code parameter issue.
This function attempts to pass trust_remote_code to OSFTEstimator (line 393), which will fail due to the missing parameter in its __init__ method. This issue is addressed in the review comment for lines 273-288.
🧹 Nitpick comments (1)
src/training_hub/profiling/memory_estimator.py (1)
120-124: Remove unused kwargs parameter.
The **kwargs parameter is unused and flagged by static analysis. Since no subclass currently uses it, consider removing it for clarity. Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/notebooks/memory_estimator_example.ipynb (1 hunks)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
120-120: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (6)
src/training_hub/profiling/memory_estimator.py (5)
1-16: LGTM! The imports are minimal and necessary, constants are well-defined, and the ROUNDER function has been correctly converted from a lambda to a proper function definition as recommended in previous reviews.
17-80: LGTM! The class definition and initialization look good. Past review feedback has been properly addressed:
- Type hints now use explicit int | None syntax (lines 43-45, 48)
- trust_remote_code parameter has been added (line 48)
- Model loading correctly uses the trust_remote_code parameter (line 57)
127-264: LGTM! The overhead application, printing methods, and main estimate orchestration are well-implemented. The logic is clear and the breakdown provides useful information to users.
291-342: LGTM! The OSFT-specific parameter calculations look correct. The SVD decomposition logic for trainable layers and the override of _calc_model_params are properly implemented.
110-118: Guard against missing embed_tokens by falling back to config.vocab_size.
Line 115 assumes self.model.embed_tokens.num_embeddings exists, which can break on architectures without that attribute. Replace with:

-        return (self.tokens_per_gpu * self.main_dtype_bytes * self.model.embed_tokens.num_embeddings) * self.output_constant
+        vocab_layer = getattr(self.model, "embed_tokens", None)
+        vocab_size = vocab_layer.num_embeddings if vocab_layer else self.model.config.vocab_size
+        return (self.tokens_per_gpu * self.main_dtype_bytes * vocab_size) * self.output_constant

examples/notebooks/memory_estimator_example.ipynb (1)
1-539: LGTM! The example notebook is well-structured and provides clear demonstrations of:
- BasicEstimator usage for SFT
- OSFTEstimator usage for OSFT
- OSFT with Liger Kernels
- The estimate() convenience function

The explanations are educational and help users understand how memory estimation works. The past review comment about the inline comment has been correctly addressed (line 82 now correctly states "48 GB in bytes").
| """ | ||
| Calculate the VRAM for storing the model's intermediate activations | ||
| """ | ||
| return (self.tokens_per_gpu * self.main_dtype_bytes * self.num_layers * self.hidden_size) |
@mazam-lab each layer has more than one matrix, and you need to account for that. For example each Llama layer has 4 attention projections and 3 MLP projections. Also, some layers like the MLP don't always have output dimension h (they expand to a larger width and then project back to h), but we can probably ignore this for now since it won't make a big difference and keeps the tool simple.
Can you clarify what you mean by this? Because from what I understand from reading blogs, you don't need to store the intermediate matrix computations, you only store the final activations of each layer. So even though Llama has all of those matrices, I think the only thing that gets stored in memory is whatever gets passed into the SiLU at the end, which should be a matrix of size (tokens x hidden_dim):
I don't think this needs adjustment because it's not a large part of the calculation and this lines up with empirical results since the SFT estimator (which does make use of this formula) has been really accurate from my testing.
From what I understand, you store this temporarily so you can do the backpropagation efficiently.
@mazam-lab if empirically it is accurate I think it should be okay. But these intermediate activations do add up because output of each matrix is required to compute the gradient unless one trains with activation checkpointing in which case you calculate the gradients layer by layer. But if SFT estimates are accurate, this should be fine!
if layer_data.dim() < 2:
    return layer_data.numel() * FLOAT32_BYTES_N
for term in TARGET_TERMS:
    if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
@mazam-lab this wouldn't work for rectangular matrices where dimension m is not equal to n. In that case U would be (m, min(m, n)) and V would be (min(m, n), n). Can we make that change?
Okay will do; So I thought U was supposed to be of size m x m and V was supposed to be size n x n, because that's the only way you can perform matrix multiplication with S of size m x n.
If I'm understanding things right, the reason U and V need to be size (m x min(m, n)) and (min(m, n) x n) respectively is because S is actually of size (min(m, n) x min(m, n))?
@mazam-lab yeah so for a matrix W of shape (m, n), the call U, S, Vt = torch.linalg.svd(W) returns:
U → shape (m, k)
S → shape (k,)
Vt → shape (k, n)
where k=min(m,n)
The code where we do this is here - https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer/blob/cb8907090945f12f6eb8d70b14eb69f9e5af5a9e/src/mini_trainer/osft_utils.py#L321C9-L321C60
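As a quick standalone illustration of those reduced-SVD shapes (this sketch assumes the reduced decomposition, i.e. full_matrices=False, and uses an arbitrary example weight size):

import torch

FLOAT32_BYTES_N = 4

# Example rectangular weight of shape (m, n)
W = torch.randn(512, 2048)
U, S, Vt = torch.linalg.svd(W, full_matrices=False)

k = min(W.shape)
assert U.shape == (W.shape[0], k)   # (m, k)
assert S.shape == (k,)              # (k,)
assert Vt.shape == (k, W.shape[1])  # (k, n)

# Bytes needed to hold the decomposition in float32
svd_bytes = (U.numel() + S.numel() + Vt.numel()) * FLOAT32_BYTES_N
print(svd_bytes)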
# If the layer is trainable, i.e. it contains any of these terms,
# then we will need to store a SVD decomposition of the layer in memory.
TARGET_TERMS: list[str] = ['self_attn.q_proj',
@mazam-lab these are the right parameter keys for Llama, Qwen, Mistral. But other model families have different names that we do the SVD on. For the full list refer to https://github.com/Red-Hat-AI-Innovation-Team/mini_trainer/blob/cb8907090945f12f6eb8d70b14eb69f9e5af5a9e/src/mini_trainer/osft_utils.py#L100 .
ah perfect, thank you!
@mazam-lab also, right now in the PR we are not accounting for the unfreeze rank ratio, that is, how much of U and V we are actually updating. The total parameter count will use the full matrix size of U and V, but gradients and optimizer states should scale with the rank ratio since only that fraction of U/V participates in training.
…ixing __init__ issues
Actionable comments posted: 7
🧹 Nitpick comments (5)
src/training_hub/profiling/memory_estimator.py (5)
4-4: Remove unused import.
The numpy module is imported but never used in this file. Apply this diff to remove the unused import:

-import numpy as np

11-13: Remove or document unused constants.
FLOAT16_BYTES_N and FLOAT8_BYTES_N are defined but never used. If these are planned for future precision support, add a comment; otherwise, remove them to reduce clutter. Apply this diff if not planning to use them:

 FLOAT32_BYTES_N: int = 4
-FLOAT16_BYTES_N: int = 2
-FLOAT8_BYTES_N: int = 1
 ADAMW_PARAMS_N: int = 2

17-17: Extract magic number to named constant.
The value 1073741824 (bytes per GiB) is a magic number. Extract it to a named constant for clarity. Apply this diff:

+# Conversion constant
+BYTES_PER_GIB: int = 1073741824
+
 # Helper function to do the rounding when printing
-def ROUNDER(value: int) -> str: return str(round(value / 1073741824, 1))
+def ROUNDER(value: int) -> str: return str(round(value / BYTES_PER_GIB, 1))

123-127: Remove unused kwargs parameter.
The kwargs parameter in _calc_additional() is unused (as flagged by static analysis). If it's intended for future subclass extensions, add a docstring note; otherwise, remove it. As per static analysis hints. Apply this diff if not needed:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0

423-423: Document the polynomial ratio formula.
Line 423 uses a hardcoded polynomial formula: -0.7802 * x^2 + 2.5302 * x + 0.25. This appears to be empirically derived but lacks explanation. Add a comment explaining the source and derivation of these coefficients:

     @override
     def _apply_overhead(self, subtotal):
         """
         In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio.
         The ratio_val polynomial was empirically derived from [source/experiment] to approximate
         OSFT's memory scaling behavior relative to SFT.
         """
         # Polynomial approximation: f(x) = -0.7802*x^2 + 2.5302*x + 0.25
         ratio_val = -0.7802 * (self.unfreeze_rank_ratio ** 2) + 2.5302 * self.unfreeze_rank_ratio + 0.25
         return super()._apply_overhead(subtotal * ratio_val)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/README.md (1 hunks)
examples/notebooks/memory_estimator_example.ipynb (1 hunks)
src/training_hub/__init__.py (2 hunks)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/notebooks/memory_estimator_example.ipynb
🧰 Additional context used
🧬 Code graph analysis (1)
src/training_hub/__init__.py (1)
src/training_hub/profiling/memory_estimator.py (7)
BasicEstimator (20-267), OSFTEstimatorExperimental (270-382), estimate (218-267), estimate (366-382), estimate (427-431), estimate (434-500), OSFTEstimator (385-431)
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
123-123: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (2)
src/training_hub/__init__.py (1)
5-5: LGTM! The imports and exports are correctly structured and consistent with each other. The new memory estimation API is properly exposed to users of the training_hub package.
Also applies to: 18-22
src/training_hub/profiling/memory_estimator.py (1)
434-500: LGTM! The module-level estimate() function provides a clean, unified API for both SFT and OSFT estimation. The implementation correctly dispatches to the appropriate estimator class based on the training_method parameter.
if effective_batch_size is None or max_seq_len is None:
    self.tokens_per_gpu: int = max_tokens_per_gpu
elif max_tokens_per_gpu is None:
    self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
Convert float division to int.
Division by num_gpus produces a float, but tokens_per_gpu is annotated as int (line 72) and should remain an integer for consistent calculations. Use integer division (//) or explicit int() cast.
Apply this diff:
elif max_tokens_per_gpu is None:
- self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+ self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus)
else:
self.tokens_per_gpu: int = min(max_tokens_per_gpu,
- effective_batch_size * max_seq_len / self.num_gpus)
+                                       int(effective_batch_size * max_seq_len / self.num_gpus))

Also applies to: 76-77
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 74 and also lines
76-77, the division operations produce floats while tokens_per_gpu (and related
variables) are annotated as ints; change the float divisions to integer results
by using integer division (//) or wrapping the expressions with int(...) so the
assigned values are integers and downstream calculations remain consistent.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/training_hub/profiling/memory_estimator.py (1)
77-83: Cast division results to int to match type annotation.
Lines 80 and 82–83 perform division by num_gpus, which produces a float, but tokens_per_gpu is annotated as int. This type mismatch may cause issues with type checkers or downstream integer-only operations. Apply this diff to cast the results:

 if effective_batch_size is None or max_seq_len is None:
     self.tokens_per_gpu: int = max_tokens_per_gpu
 elif max_tokens_per_gpu is None:
-    self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+    self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus)
 else:
     self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                   effective_batch_size * max_seq_len / self.num_gpus)
+                                   int(effective_batch_size * max_seq_len / self.num_gpus))
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (2)
19-22: Document the derivation or source of the quadratic formula.
The quadratic coefficients in OSFT_RATIO (−0.7802, 2.5302, 0.25) are unexplained beyond the boundary conditions. Consider adding a comment or docstring that clarifies:
- The empirical or theoretical basis for this formula
- Whether it's derived from experimental data or a mathematical model
- Any references or sources that justify these specific coefficients
This will help future maintainers understand and validate the scaling behavior.
143-147: Remove unused kwargs parameter.
The kwargs parameter in _calc_additional is declared but never used. If it's not needed for future extensibility or polymorphism, consider removing it to keep the signature clean. Apply this diff:

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self):
         """
         Calculate any additional VRAM that this training method might need
         """
         return 0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
130-130: Do not use bare except
(E722)
133-133: Do not use bare except
(E722)
136-136: Do not use bare except
(E722)
137-137: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
137-137: Avoid specifying long messages outside the exception class
(TRY003)
143-143: Unused method argument: kwargs
(ARG002)
def _calc_gradients(self):
    """
    Override the optimizer parameter calculation by calculating based on OSFT's parameters
    """
    return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
Add _calc_optimizer override to scale optimizer states by unfreeze ratio.
Currently, _calc_gradients is scaled by OSFT_RATIO(unfreeze_rank_ratio), but optimizer states are not. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.
As per user feedback from NikhilNayak-debug.
Add this override after _calc_gradients:
@override
def _calc_gradients(self):
"""
Override the optimizer parameter calculation by calculating based on OSFT's parameters
"""
return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+ @override
+ def _calc_optimizer(self):
+ """
+ Override the optimizer calculation to scale with unfreeze_rank_ratio
+ """
+ return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _calc_gradients(self): | |
| """ | |
| Override the optimizer parameter calculation by calculating based on OSFT's parameters | |
| """ | |
| return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) | |
| def _calc_gradients(self): | |
| """ | |
| Override the optimizer parameter calculation by calculating based on OSFT's parameters | |
| """ | |
| return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) | |
| @override | |
| def _calc_optimizer(self): | |
| """ | |
| Override the optimizer calculation to scale with unfreeze_rank_ratio | |
| """ | |
| return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio) |
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 374 to 378, add
an override method named _calc_optimizer immediately after _calc_gradients that
returns the optimizer state memory sized to the unfrozen parameter fraction:
calculate base optimizer parameter count via self._calc_model_params(), multiply
by OSFT_RATIO(self.unfreeze_rank_ratio) to account for only the unfrozen
parameters participating, and multiply by 2 to reflect the two state tensors
AdamW keeps per parameter (i.e., optimizer memory = _calc_model_params() *
OSFT_RATIO(self.unfreeze_rank_ratio) * 2); implement as a concise method that
parallels _calc_gradients.
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/training_hub/profiling/memory_estimator.py (2)
374-379: Scale optimizer states by unfreeze_rank_ratio (Experimental OSFT).
Gradients are scaled but optimizer states aren't. AdamW's two state tensors should scale with the unfrozen fraction too.

 @override
 def _calc_gradients(self):
     """
     Override the optimizer parameter calculation by calculating based on OSFT's parameters
     """
     return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+@override
+def _calc_optimizer(self):
+    """
+    Scale optimizer state memory with the unfrozen fraction (two states for AdamW)
+    """
+    return self._calc_model_params() * self.opt_params * OSFT_RATIO(self.unfreeze_rank_ratio)

80-83: Make tokens_per_gpu an integer.
Float division produces a float; this should be an integer for consistent byte math.

-    self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus
+    self.tokens_per_gpu: int = (effective_batch_size * max_seq_len) // self.num_gpus
 else:
-    self.tokens_per_gpu: int = min(max_tokens_per_gpu,
-                                   effective_batch_size * max_seq_len / self.num_gpus)
+    self.tokens_per_gpu: int = min(
+        max_tokens_per_gpu,
+        (effective_batch_size * max_seq_len) // self.num_gpus,
+    )
🧹 Nitpick comments (9)
src/training_hub/profiling/memory_estimator.py (9)
1-1: Ensure override import works on Python < 3.12.
typing.override is 3.12+. Add a fallback to typing_extensions or switch to it directly to avoid ImportError on older runtimes.

-from typing import override
+try:
+    from typing import override  # Python 3.12+
+except ImportError:  # pragma: no cover
+    from typing_extensions import override  # type: ignore

4-4: Remove unused NumPy import.
numpy as np is not used.

-import numpy as np

16-17: Broaden ROUNDER type and format.
It's called with floats; update the hint and use a multi-line def for readability.

-def ROUNDER(value: int) -> str: return str(round(value / 1073741824, 1))
+def ROUNDER(value: float) -> str:
+    return str(round(value / 1073741824, 1))

89-91: Explain magic output_constant.
Document the 8/3 factor (empirical vs theoretical) to aid maintainability.

-    self.output_constant = 8/3
+    # Empirical scaling for output tensor footprint (see PR notes); tune as needed
+    self.output_constant = 8/3

143-147: Silence unused kwargs warning.
Rename to _kwargs to address linter ARG002 without changing behavior.

-    def _calc_additional(self, **kwargs):
+    def _calc_additional(self, **_kwargs):

321-325: Stop at first matching model family.
Break after the first match to avoid accidental overrides when multiple keys match.

-    for key in MODEL_CONFIGS.keys():
-        if self.model_path.find(key) > -1:
-            self.target_terms = MODEL_CONFIGS[key]['patterns']
+    for key in MODEL_CONFIGS.keys():
+        if self.model_path.find(key) > -1:
+            self.target_terms = MODEL_CONFIGS[key]['patterns']
+            break

342-343: Use min for clarity.
Minor readability improvement.

-    lowest_dim = layer_data.shape[0] if layer_data.shape[0] < \
-        layer_data.shape[1] else layer_data.shape[1]
+    lowest_dim = min(layer_data.shape[0], layer_data.shape[1])

419-419: Validate unfreeze_rank_ratio input (OSFT).
Same check for the basic OSFT estimator.

     self.unfreeze_rank_ratio = unfreeze_rank_ratio
+    if not (0.0 <= self.unfreeze_rank_ratio <= 1.0):
+        raise ValueError("unfreeze_rank_ratio must be within [0, 1]")

127-137: Minor: Tighten exception message (TRY003).
Prefer a concise message.

-    except AttributeError as e:
-        raise ValueError("Could not find the given model's vocabulary size") from e
+    except AttributeError as e:
+        raise ValueError("Could not resolve vocab_size") from e
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
137-137: Avoid specifying long messages outside the exception class
(TRY003)
143-143: Unused method argument: kwargs
(ARG002)
@@ -0,0 +1,516 @@
from typing import override
from transformers import AutoModel
from mini_trainer.osft_utils import MODEL_CONFIGS
Avoid hard dependency at import time on mini_trainer
Importing mini_trainer.osft_utils at module import breaks SFT-only use if mini_trainer isn’t installed. Defer import to the OSFT Experimental constructor and handle absence gracefully.
Apply both changes:
- Remove the module-level import:
-from mini_trainer.osft_utils import MODEL_CONFIGS

- Load lazily inside OSFTEstimatorExperimental.__init__ and fall back safely:
- # Check to see which terms need to be included in the search for valid layers
- self.target_terms = MODEL_CONFIGS['default']['patterns']
- for key in MODEL_CONFIGS.keys():
- if self.model_path.find(key) > -1:
- self.target_terms = MODEL_CONFIGS[key]['patterns']
+ # Resolve target layer patterns lazily to avoid hard dependency on mini_trainer
+ try:
+ from mini_trainer.osft_utils import MODEL_CONFIGS
+ except Exception:
+ MODEL_CONFIGS = {"default": {"patterns": []}} # minimal safe default
+ self.target_terms = MODEL_CONFIGS.get("default", {}).get("patterns", [])
+ for key in MODEL_CONFIGS.keys():
+ if key in self.model_path:
+ self.target_terms = MODEL_CONFIGS[key]["patterns"]
+ break📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from mini_trainer.osft_utils import MODEL_CONFIGS | |
| ++ b/src/training_hub/profiling/memory_estimator.py | |
| @@ –1,5 +1,4 @@ | |
| -from mini_trainer.osft_utils import MODEL_CONFIGS | |
| import torch | |
| from typing import Callable, Optional, override | |
| from transformers import AutoModel | |
| @@ class OSFTEstimatorExperimental(BasicEstimator): | |
| def __init__( | |
| self, | |
| num_gpus: int = 8, | |
| gpu_memory: int = 85899345920, | |
| model_path: str = "ibm-granite/granite-3.3-8b-instruct", | |
| effective_batch_size: int | None = None, | |
| max_seq_len: int | None = None, | |
| max_tokens_per_gpu: int | None = None, | |
| use_liger: bool = False, | |
| verbose: int = 1, | |
| ): | |
| super().__init__(...) | |
| - # Check to see which terms need to be included in the search for valid layers | |
| - self.target_terms = MODEL_CONFIGS['default']['patterns'] | |
| - for key in MODEL_CONFIGS.keys(): | |
| - if self.model_path.find(key) > -1: | |
| # Resolve target layer patterns lazily to avoid hard dependency on mini_trainer | |
| try: | |
| from mini_trainer.osft_utils import MODEL_CONFIGS | |
| except Exception: | |
| MODEL_CONFIGS = {"default": {"patterns": []}} # minimal safe default | |
| self.target_terms = MODEL_CONFIGS.get("default", {}).get("patterns", []) | |
| for key, cfg in MODEL_CONFIGS.items(): | |
| if key in self.model_path: | |
| self.target_terms = cfg.get("patterns", []) | |
| break |
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around line 3, remove the
module-level import of mini_trainer.osft_utils to avoid a hard dependency at
import time; instead, in the OSFTEstimatorExperimental.__init__ perform a lazy
import inside a try/except ImportError block (e.g., try to import MODEL_CONFIGS
from mini_trainer.osft_utils and assign it to an instance attribute or module
variable) and on ImportError set that attribute to None or a safe default and
log or raise a clear runtime error only when OSFT functionality is actually
invoked; also update any code paths in the class to reference the
instance/module attribute and handle the None/default case gracefully so
SFT-only usage works without mini_trainer installed.
def _apply_overhead(self, subtotal):
    """
    In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio
    """
    ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio)
    return super()._apply_overhead(subtotal * ratio_val)
Keep subtotal logic clean; scale only grads/optimizer (OSFT)
Scaling the entire subtotal in _apply_overhead makes the component breakdown inconsistent. Override _calc_gradients and _calc_optimizer instead, and leave _apply_overhead to just apply overhead.
def _apply_overhead(self, subtotal):
"""
In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio
"""
- ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio)
- return super()._apply_overhead(subtotal * ratio_val)
+    return super()._apply_overhead(subtotal)

Add these overrides:
+ @override
+ def _calc_gradients(self):
+ return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio)
+
+ @override
+ def _calc_optimizer(self):
+ return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _apply_overhead(self, subtotal): | |
| """ | |
| In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio | |
| """ | |
| ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio) | |
| return super()._apply_overhead(subtotal * ratio_val) | |
| def _apply_overhead(self, subtotal): | |
| """ | |
| In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio | |
| """ | |
| return super()._apply_overhead(subtotal) | |
| @override | |
| def _calc_gradients(self): | |
| return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio) | |
| @override | |
| def _calc_optimizer(self): | |
| return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio) |
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 422 to 427,
currently _apply_overhead multiplies the entire subtotal by
OSFT_RATIO(self.unfreeze_rank_ratio); instead, remove that scaling so
_apply_overhead only applies the 0-30% overhead to the subtotal, and implement
two new overrides: _calc_gradients and _calc_optimizer which call the parent
implementations (or existing internal logic) and then multiply only the
gradients and optimizer component values by
OSFT_RATIO(self.unfreeze_rank_ratio). Ensure you do not change the return shape
or keys of those methods, and keep the OSFT scaling isolated to those two
overrides so the component breakdown remains consistent.
| "# Memory Estimator \n", | ||
| "\n", | ||
| "This notebook will provide some examples on how to use the memory_estimator API\n", | ||
| "to estimate the amount of GPU memory consumed when fine-tuning an LLM model in Training Hub.\n", |
"LLM Model is redundant"
Only other thing is adding a note in the
…om coderabbit, some other documentation cleaning
Actionable comments posted: 1
♻️ Duplicate comments (4)
src/training_hub/profiling/memory_estimator.py (4)
427-433: Incorrect scaling approach: scale only gradients/optimizer, not entire subtotal. Scaling the entire subtotal in
_apply_overhead makes the component breakdown inconsistent and applies the OSFT scaling to activations/outputs which shouldn't be affected. Only gradients and optimizer states should scale with unfreeze_rank_ratio.
Remove the scaling from
_apply_overheadand add overrides for_calc_gradientsand_calc_optimizer:@override def _apply_overhead(self, subtotal): """ - In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio + Apply the standard 0-30% overhead """ - ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio) - return super()._apply_overhead(subtotal * ratio_val) + return super()._apply_overhead(subtotal) + + @override + def _calc_gradients(self): + """ + Override to scale gradients by unfreeze_rank_ratio + """ + return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio) + + @override + def _calc_optimizer(self): + """ + Override to scale optimizer states by unfreeze_rank_ratio + """ + return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)
3-3: Avoid hard dependency onmini_trainerat module import.The module-level import of
mini_trainer.osft_utilscreates a hard dependency that breaks SFT-only usage whenmini_traineris not installed. This import is only needed forOSFTEstimatorExperimental.Move the import inside
OSFTEstimatorExperimental.__init__with graceful fallback:-from mini_trainer.osft_utils import MODEL_CONFIGS import numpy as npThen in
OSFTEstimatorExperimental.__init__(after line 322):if not (0.0 <= self.unfreeze_rank_ratio <= 1.0): raise ValueError("Ratio must be in the range [0, 1]") # Check to see which terms need to be included in the search for valid layers + try: + from mini_trainer.osft_utils import MODEL_CONFIGS + except ImportError: + MODEL_CONFIGS = {"default": {"patterns": []}} self.target_terms = MODEL_CONFIGS['default']['patterns'] for key in MODEL_CONFIGS.keys(): if self.model_path.find(key) > -1: self.target_terms = MODEL_CONFIGS[key]['patterns']
80-83: Convert float division results to int.Division by
num_gpusproduces float values, buttokens_per_gpuis annotated asint. Use explicitint()cast to maintain type consistency.Apply this diff:
elif max_tokens_per_gpu is None: - self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus + self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus) else: self.tokens_per_gpu: int = min(max_tokens_per_gpu, - effective_batch_size * max_seq_len / self.num_gpus) + int(effective_batch_size * max_seq_len / self.num_gpus))
377-383: Add_calc_optimizeroverride to scale optimizer states.Currently
_calc_gradientsscales byOSFT_RATIO(unfreeze_rank_ratio), but optimizer states are not scaled. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.As per user feedback from NikhilNayak-debug.
Add this override after
_calc_gradients:@override def _calc_gradients(self): """ Override the optimizer parameter calculation by calculating based on OSFT's parameters """ return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) + + @override + def _calc_optimizer(self): + """ + Override the optimizer calculation to scale with unfreeze_rank_ratio. + AdamW maintains two state tensors per parameter, and only the unfrozen + fraction participates in training. + """ + return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) * ADAMW_PARAMS_N
🧹 Nitpick comments (2)
src/training_hub/profiling/memory_estimator.py (1)
145-149: Remove unusedkwargsparameter.The
**kwargsparameter in_calc_additionalis never used and creates unnecessary API surface.As per static analysis hints.
Apply this diff:
- def _calc_additional(self, **kwargs): + def _calc_additional(self): """ Calculate any additional VRAM that this training method might need """ return 0examples/README.md (1)
83-83: Simplify wordy phrase.The phrase "built off of" is unnecessarily wordy. Consider using a more concise alternative.
As per static analysis hints.
Apply this diff:
-training_hub includes a library for estimating the expected amount of GPU memory that will be allocated during the fine-tuning of a given model using SFT or OSFT. The calculations are built off of formulas presented in the blog post [How To Calculate GPU VRAM Requirements for an Large-Language Model](https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm). +training_hub includes a library for estimating the expected amount of GPU memory that will be allocated during the fine-tuning of a given model using SFT or OSFT. The calculations are based on formulas presented in the blog post [How To Calculate GPU VRAM Requirements for an Large-Language Model](https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/README.md(1 hunks)examples/notebooks/memory_estimator_example.ipynb(1 hunks)src/training_hub/profiling/memory_estimator.py(1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/README.md
[style] ~83-~83: ‘off of’ might be wordy. Consider a shorter alternative.
Context: ...SFT or OSFT. The calculations are built off of formulas presented in the blog post [Ho...
(EN_WORDINESS_PREMIUM_OFF_OF)
[grammar] ~83-~83: There might be a mistake here.
Context: ...calculate-vram-requirements-for-an-llm). NOTE: This feature is still a work in-pr...
(QB_NEW_EN)
[grammar] ~87-~87: There might be a mistake here.
Context: ...l results may still vary. Tutorials: - [Memory Estimation Example](notebooks/mem...
(QB_NEW_EN)
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
139-139: Avoid specifying long messages outside the exception class
(TRY003)
145-145: Unused method argument: kwargs
(ARG002)
322-322: Avoid specifying long messages outside the exception class
(TRY003)
425-425: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
examples/notebooks/memory_estimator_example.ipynb (1)
1-357: LGTM! Well-structured educational notebook.The notebook provides a comprehensive, step-by-step tutorial on using the memory estimation API. It effectively demonstrates:
- Core concepts and memory calculation components
- Both class-based instantiation and convenience function usage
- SFT, OSFT, and Liger Kernels scenarios
- Clear explanations of parameters and their impact
The examples align well with the implemented API and will help users understand how to estimate memory requirements for their training scenarios.
examples/README.md (1)
100-100: Fix syntax error in code example.Missing comma after
use_liger=Trueparameter.Apply this diff:
estimate(training_method='osft', num_gpus=2, model_path="/path/to/model", max_tokens_per_gpu=8192, use_liger=True, verbose=2, - unfreeze_rank_ratio: float = 0.25 + unfreeze_rank_ratio=0.25 )Likely an incorrect or invalid review comment.
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
🛠️ Refactor suggestion | 🟠 Major
Document the magic constant 0.33 in OSFT_RATIO.
The quadratic mapping formula uses 0.33 as a pivot point where OSFT equals SFT memory usage, but this isn't explained. Extract to a named constant with documentation.
Apply this refactor:
+# Default unfreeze ratio where OSFT memory usage equals SFT memory usage (empirically determined)
+DEFAULT_UNFREEZE_RATIO: float = 0.33
+
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
-# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
-def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
+# 0 is 0.25 of SFT's value, DEFAULT_UNFREEZE_RATIO is equal to SFT's value, and 1 is twice of SFT's value)
+def OSFT_RATIO(value: float) -> float:
+ """Calculate OSFT memory scaling factor based on unfreeze_rank_ratio."""
+ return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Helper function to calculate how much the given unfrozen_rank_ratio | |
| # will be affecting the OSFT estimation (through a quadratic mapping where | |
| # 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value) | |
| def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25 | |
| # Default unfreeze ratio where OSFT memory usage equals SFT memory usage (empirically determined) | |
| DEFAULT_UNFREEZE_RATIO: float = 0.33 | |
| # Helper function to calculate how much the given unfrozen_rank_ratio | |
| # will be affecting the OSFT estimation (through a quadratic mapping where | |
| # 0 is 0.25 of SFT's value, DEFAULT_UNFREEZE_RATIO is equal to SFT's value, and 1 is twice of SFT's value) | |
| def OSFT_RATIO(value: float) -> float: | |
| """Calculate OSFT memory scaling factor based on unfreeze_rank_ratio.""" | |
| return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25 |
🤖 Prompt for AI Agents
In src/training_hub/profiling/memory_estimator.py around lines 19-22, the
quadratic mapping in OSFT_RATIO uses the unexplained magic pivot value 0.33;
extract that pivot into a named constant (e.g.
UNFROZEN_RANK_RATIO_SFT_EQUIVALENT = 0.33) with a short docstring that explains
"pivot point where OSFT memory equals SFT memory (OSFT_RATIO == 1.0)". Replace
the literal 0.33 in comments/description with the constant and add a brief
docstring above the OSFT_RATIO function describing the quadratic mapping, the
pivot constant, and what the function returns.
RobotSail
left a comment
LGTM
# Helper function to calculate how much the given unfrozen_rank_ratio
# will be affecting the OSFT estimation (through a quadratic mapping where
# 0 is 0.25 of SFT's value, 0.33 is equal to SFT's value, and 1 is twice of SFT's value)
def OSFT_RATIO(value: float) -> float: return -0.7802 * (value) * (value) + 2.5302 * (value) + 0.25
@mazam-lab can't we just linearly interpolate here instead of using a quadratic formula? Starting from half of SFT for unfreeze rank ratio of 0 to 2x of SFT for unfreeze rank ratio of 1.
Sure, that could work too. If you're confident in that estimation I'll go ahead and implement it.
did
@mazam-lab we don't need to scale OSFT params by the rank ratio. Just reuse the exact SFT total and linearly interpolate a scale factor between 0.5× (ratio=0) and 2.0× (ratio=1), then apply it to the SFT memory consumption, that's it.
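A minimal sketch of that suggestion (names are illustrative; sft_estimate_bytes stands in for whatever the SFT path actually returns):

def osft_scale(unfreeze_rank_ratio: float) -> float:
    # Linear interpolation: 0.5x of SFT at ratio=0, 2.0x of SFT at ratio=1
    return 0.5 + 1.5 * unfreeze_rank_ratio

sft_estimate_bytes = 100 * 1024**3                       # placeholder SFT estimate (100 GiB)
osft_estimate_bytes = int(osft_scale(0.25) * sft_estimate_bytes)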
Actionable comments posted: 0
♻️ Duplicate comments (4)
src/training_hub/profiling/memory_estimator.py (4)
377-382: Critical: Missing _calc_optimizer override to scale optimizer states. The
_calc_gradientsmethod is scaled byOSFT_RATIO(self.unfreeze_rank_ratio), but optimizer states are not. Since AdamW maintains two state tensors per parameter and only the unfrozen fraction participates in training, optimizer memory should also scale with the rank ratio.As per user feedback from NikhilNayak-debug and past review comments.
Add this override immediately after
_calc_gradients:@override def _calc_gradients(self): """ Override the optimizer parameter calculation by calculating based on OSFT's parameters """ return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) + + @override + def _calc_optimizer(self): + """ + Override the optimizer calculation to scale with unfreeze_rank_ratio + """ + return self._calc_model_params() * OSFT_RATIO(self.unfreeze_rank_ratio) * self.opt_params
427-433: Critical: Incorrect scaling approach—scale only gradients and optimizer, not entire subtotal. Scaling the entire
subtotalin_apply_overheadmakes the component breakdown inconsistent and incorrectly scales model parameters and activations. Only gradients and optimizer states should scale withunfreeze_rank_ratiobecause only the unfrozen fraction participates in training.As per past review comments.
Replace the
_apply_overheadoverride with proper overrides for_calc_gradientsand_calc_optimizer:- @override - def _apply_overhead(self, subtotal): - """ - In addition to the 0-30% overhead, apply a multiplier based on the unfreeze_rank_ratio - """ - ratio_val = OSFT_RATIO(self.unfreeze_rank_ratio) - return super()._apply_overhead(subtotal * ratio_val) + @override + def _calc_gradients(self): + """ + Override the gradients calculation to scale with unfreeze_rank_ratio + """ + return super()._calc_gradients() * OSFT_RATIO(self.unfreeze_rank_ratio) + + @override + def _calc_optimizer(self): + """ + Override the optimizer calculation to scale with unfreeze_rank_ratio + """ + return super()._calc_optimizer() * OSFT_RATIO(self.unfreeze_rank_ratio)Based on past review comments.
3-3: Critical: Hard dependency on mini_trainer at import time blocks SFT-only usage. Importing
mini_trainer.osft_utilsat the module level prevents users from usingBasicEstimator(SFT) ifmini_traineris not installed. This violates the stated goal of making the API usable for SFT-only scenarios.Move the import inside
OSFTEstimatorExperimental.__init__and handleImportErrorgracefully:from typing import override from transformers import AutoModel -from mini_trainer.osft_utils import MODEL_CONFIGS import numpy as npThen in
OSFTEstimatorExperimental.__init__(around line 324):self.unfreeze_rank_ratio = unfreeze_rank_ratio if not (0.0 <= self.unfreeze_rank_ratio <= 1.0): raise ValueError("Ratio must be in the range [0, 1]") # Check to see which terms need to be included in the search for valid layers + try: + from mini_trainer.osft_utils import MODEL_CONFIGS + except ImportError: + MODEL_CONFIGS = {"default": {"patterns": []}} self.target_terms = MODEL_CONFIGS['default']['patterns'] for key in MODEL_CONFIGS.keys(): if self.model_path.find(key) > -1: self.target_terms = MODEL_CONFIGS[key]['patterns']Based on past review comments.
80-80: Convert float division to int. Division by
num_gpusproduces afloat, buttokens_per_gpuis annotated asint(line 78). This inconsistency can lead to unexpected behavior in downstream calculations.Apply this diff:
elif max_tokens_per_gpu is None: - self.tokens_per_gpu: int = effective_batch_size * max_seq_len / self.num_gpus + self.tokens_per_gpu: int = int(effective_batch_size * max_seq_len / self.num_gpus) else: self.tokens_per_gpu: int = min(max_tokens_per_gpu, - effective_batch_size * max_seq_len / self.num_gpus) + int(effective_batch_size * max_seq_len / self.num_gpus))Based on past review comments.
🧹 Nitpick comments (1)
src/training_hub/profiling/memory_estimator.py (1)
346-347: Prefer built-in min() for clarity. The ternary expression is unnecessarily verbose.
Apply this diff:
 for term in self.target_terms:
-    lowest_dim = layer_data.shape[0] if layer_data.shape[0] < \
-        layer_data.shape[1] else layer_data.shape[1]
+    lowest_dim = min(layer_data.shape[0], layer_data.shape[1])
     if layer_name.find(term) > -1 and layer_name.find('weight') > -1:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/notebooks/memory_estimator_example.ipynb (1 hunks)
src/training_hub/profiling/memory_estimator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/notebooks/memory_estimator_example.ipynb
🧰 Additional context used
🪛 Ruff (0.14.0)
src/training_hub/profiling/memory_estimator.py
139-139: Avoid specifying long messages outside the exception class
(TRY003)
145-145: Unused method argument: kwargs
(ARG002)
322-322: Avoid specifying long messages outside the exception class
(TRY003)
425-425: Avoid specifying long messages outside the exception class
(TRY003)
The backend logic still needs some work on my end, but I'm opening the PR now to ensure the API signature and code standards are looking good.
Please note that for 3.0, we'll only be adding a very basic memory profiling API. Hardware discovery, runtime estimations, hyperparameter and hardware recommendations, etc. are planned for 3.1+
Summary by CodeRabbit
New Features
Documentation