
Commit c781b7c

Addressing Mustafa's comments on the readme, adjusting a typecheck from coderabbit, some other documentation cleaning
1 parent 7cd94a9 commit c781b7c

3 files changed: +32 -3 lines changed


examples/README.md

Lines changed: 23 additions & 0 deletions
@@ -78,6 +78,29 @@ result = osft(
 )
 ```
 
+### Memory Estimation (Experimental / In-Development)
+
+training_hub includes a library for estimating the expected amount of GPU memory that will be allocated when fine-tuning a given model with SFT or OSFT. The calculations are built on the formulas presented in the blog post [How To Calculate GPU VRAM Requirements for an Large-Language Model](https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm).
+NOTE: This feature is still a work in progress. In particular, the estimates for OSFT may vary from your actual results; they mainly serve to give theoretical bounds.
+The estimates for SFT should be reasonably close to actual results when using training_hub, but your actual results may still vary.
+
+**Tutorials:**
+- [Memory Estimation Example](notebooks/memory_estimator_example.ipynb) - Interactive notebook showcasing how to use the memory estimator methods.
+
+**Quick Example:**
+```python
+from training_hub import estimate
+
+estimate(training_method='osft',
+         num_gpus=2,
+         model_path="/path/to/model",
+         max_tokens_per_gpu=8192,
+         use_liger=True,
+         verbose=2,
+         unfreeze_rank_ratio=0.25
+)
+```
+
 ## Getting Started
 
 1. **For detailed parameter documentation**: Check the relevant guide in `docs/`
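
The quick example added above covers the OSFT path; below is a minimal sketch of the equivalent SFT call. It reuses only the keyword arguments that appear in the README's quick example and assumes that `training_method='sft'` simply omits the OSFT-specific `unfreeze_rank_ratio`:

```python
# Hedged sketch: keyword arguments are the ones shown in the README's quick example;
# 'sft' as a training_method value follows from the README's statement that both SFT
# and OSFT are supported, and is otherwise an assumption.
from training_hub import estimate

estimate(training_method='sft',
         num_gpus=2,
         model_path="/path/to/model",   # placeholder path from the README example
         max_tokens_per_gpu=8192,
         use_liger=True,
         verbose=2)
```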

examples/notebooks/memory_estimator_example.ipynb

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@
 "# Memory Estimator \n",
 "\n",
 "This notebook will provide some examples on how to use the memory_estimator API\n",
-"to estimate the amount of GPU memory consumed when fine-tuning an LLM model in Training Hub.\n",
+"to estimate the amount of GPU memory consumed when fine-tuning in Training Hub.\n",
 "This notebook will cover:\n",
 "1. How the package's primary class is implemented, \n",
 "2. How it can be subclassed for further extensions,\n",
@@ -32,7 +32,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from training_hub.profiling.memory_estimator import BasicEstimator, OSFTEstimator, OSFTEstimatorExperimental, estimate"
+"from training_hub import BasicEstimator, OSFTEstimator, OSFTEstimatorExperimental, estimate"
 ]
 },
 {
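
The notebook's import cell now pulls the estimator classes from the package top level, and the notebook says it covers subclassing the primary class for further extensions. A minimal sketch of that pattern, under stated assumptions, could look like this:

```python
# Hedged sketch of the "subclass for further extensions" idea the notebook mentions.
# Assumptions: BasicEstimator (or a parent in its hierarchy) defines
# _apply_overhead(self, subtotal) -- the method name appears later in this commit's
# diff -- and the subclass name and flat 10% margin are purely illustrative.
from training_hub import BasicEstimator


class PaddedEstimator(BasicEstimator):
    def _apply_overhead(self, subtotal):
        # Add a 10% safety margin on top of the parent estimator's overhead term.
        return super()._apply_overhead(subtotal) * 1.10
```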

src/training_hub/profiling/memory_estimator.py

Lines changed: 7 additions & 1 deletion
@@ -122,9 +122,11 @@ def _calc_intermediate_activations(self):
 
     def _calc_outputs(self):
         """
-        Calculate the VRAM for storing the model's activated outputs
+        Calculate the VRAM for storing the model's activated outputs.
+        Note that this value is 0 if Liger Kernels are used.
         """
         if not self.use_liger:
+            # This nested try/except attempts to find the model's vocabulary size
             try:
                 vocab_size = self.model.embed_tokens.num_embeddings
             except AttributeError:
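
The docstring change above points at why `use_liger` zeroes this term: Liger's fused linear-plus-cross-entropy kernel avoids materializing the full logits tensor. As a rough, hedged illustration of how large that tensor can be (the estimator's exact constants are not visible in this diff):

```python
# Back-of-the-envelope size of the output logits tensor that _calc_outputs accounts
# for. The concrete numbers below (batch size, sequence length, vocabulary size,
# bf16 storage) are illustrative assumptions, not values taken from training_hub.
batch_size = 8
seq_len = 8192
vocab_size = 128_256   # e.g. a Llama-3-style vocabulary (assumption)
bytes_per_value = 2    # bf16

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_value
print(f"{logits_bytes / 1024**3:.1f} GiB")  # ~15.7 GiB with these example numbers
```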
@@ -316,6 +318,8 @@ def __init__(
                          use_liger, verbose, trust_remote_code)
         self.output_constant = 7/3
         self.unfreeze_rank_ratio = unfreeze_rank_ratio
+        if not (0.0 <= self.unfreeze_rank_ratio <= 1.0):
+            raise ValueError("Ratio must be in the range [0, 1]")
 
         # Check to see which terms need to be included in the search for valid layers
         self.target_terms = MODEL_CONFIGS['default']['patterns']
@@ -417,6 +421,8 @@ def __init__(
                          effective_batch_size, max_seq_len, max_tokens_per_gpu,
                          use_liger, verbose, trust_remote_code)
         self.unfreeze_rank_ratio = unfreeze_rank_ratio
+        if not (0.0 <= self.unfreeze_rank_ratio <= 1.0):
+            raise ValueError("Ratio must be in the range [0, 1]")
 
     @override
     def _apply_overhead(self, subtotal):
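
Both estimator constructors now reject out-of-range ratios. A hedged sketch of how that would surface through the `estimate()` entry point, assuming `estimate()` builds one of these estimators internally and that a real, loadable model path is supplied in practice:

```python
# Sketch only: keyword arguments are the ones shown in the README's quick example,
# the model path is a placeholder, and the assumption is that estimate() constructs
# an OSFT estimator whose __init__ performs the new range check.
from training_hub import estimate

try:
    estimate(training_method='osft',
             num_gpus=2,
             model_path="/path/to/model",   # placeholder; use a real checkpoint
             max_tokens_per_gpu=8192,
             unfreeze_rank_ratio=1.5)       # outside [0, 1], so the check fails
except ValueError as err:
    print(err)  # expected: "Ratio must be in the range [0, 1]"
```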

0 commit comments
