@Fiona-Waters Fiona-Waters commented Sep 22, 2025

While working on creating a runtime image that incorporates the Kubeflow Training and Training Hub related dependencies, I came across the following error when using the image to run osft_llama_example.py:

[rank0]: ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
[rank0]: │ /opt/app-root/lib64/python3.12/site-packages/mini_trainer/train.py:691 in    │
[rank0]: │ main                                                                         │
[rank0]: │                                                                              │
[rank0]: │   688 │   # If Orthogonal Subspace Learning is enabled, loads a model with d │
[rank0]: │   689 │   # Convert user-facing osft_unfreeze_rank_ratio to internal osft_ra │
[rank0]: │   690 │   osft_rank_ratio = None if osft_unfreeze_rank_ratio is None else (1 │
[rank0]: │ ❱ 691 │   model = setup_model(                                               │
[rank0]: │   692 │   │   model_name_or_path=model_name_or_path,                         │
[rank0]: │   693 │   │   save_dtype=save_dtype,                                         │
[rank0]: │   694 │   │   use_liger_kernels=use_liger_kernels,                           │
[rank0]: │                                                                              │
[rank0]: │ /opt/app-root/lib64/python3.12/site-packages/mini_trainer/setup_model_for_tr │
[rank0]: │ aining.py:197 in setup_model                                                 │
[rank0]: │                                                                              │
[rank0]: │   194 │   │   │   liger_fixed_fused_linear_cross_entropy_none_reduction,     │
[rank0]: │   195 │   │   )                                                              │
[rank0]: │   196 │   │                                                                  │
[rank0]: │ ❱ 197 │   │   patch_target_module(                                           │
[rank0]: │   198 │   │   │   "liger_kernel.transformers.model.loss_utils.fixed_fused_li │
[rank0]: │   199 │   │   │   liger_fixed_fused_linear_cross_entropy_none_reduction,     │
[rank0]: │   200 │   │   )                                                              │
[rank0]: │                                                                              │
[rank0]: │ /opt/app-root/lib64/python3.12/site-packages/mini_trainer/utils.py:61 in     │
[rank0]: │ patch_target_module                                                          │
[rank0]: │                                                                              │
[rank0]: │    58 │                                                                      │
[rank0]: │    59 │   to_patch, obj_name_to_patch = to_patch[:-1], to_patch[-1]          │
[rank0]: │    60 │   to_patch = ".".join(to_patch)                                      │
[rank0]: │ ❱  61 │   source = importlib.import_module(to_patch)                         │
[rank0]: │    62 │   setattr(source, obj_name_to_patch, replace_with)                   │
[rank0]: │    63                                                                        │
[rank0]: │    64                                                                        │
[rank0]: │                                                                              │
[rank0]: │ /usr/lib64/python3.12/importlib/__init__.py:90 in import_module              │
[rank0]: │                                                                              │
[rank0]: │    87 │   │   │   if character != '.':                                       │
[rank0]: │    88 │   │   │   │   break                                                  │
[rank0]: │    89 │   │   │   level += 1                                                 │
[rank0]: │ ❱  90 │   return _bootstrap._gcd_import(name[level:], package, level)        │
[rank0]: │    91                                                                        │
[rank0]: │    92                                                                        │
[rank0]: │    93 _RELOADING = {}                                                        │
[rank0]: │ in _gcd_import:1387                                                          │
[rank0]: │ in _find_and_load:1360                                                       │
[rank0]: │ in _find_and_load_unlocked:1310                                              │
[rank0]: │ in _call_with_frames_removed:488                                             │
[rank0]: │ in _gcd_import:1387                                                          │
[rank0]: │ in _find_and_load:1360                                                       │
[rank0]: │ in _find_and_load_unlocked:1310                                              │
[rank0]: │ in _call_with_frames_removed:488                                             │
[rank0]: │ in _gcd_import:1387                                                          │
[rank0]: │ in _find_and_load:1360                                                       │
[rank0]: │ in _find_and_load_unlocked:1331                                              │
[rank0]: │ in _load_unlocked:935                                                        │
[rank0]: │ in exec_module:999                                                           │
[rank0]: │ in _call_with_frames_removed:488                                             │
[rank0]: │                                                                              │
[rank0]: │ /opt/app-root/lib64/python3.12/site-packages/liger_kernel/transformers/__ini │
[rank0]: │ t__.py:1 in <module>                                                         │
[rank0]: │                                                                              │
[rank0]: │ ❱  1 from liger_kernel.transformers.auto_model import AutoLigerKernelForCaus │
[rank0]: │    2 from liger_kernel.transformers.cross_entropy import LigerCrossEntropyLo │
[rank0]: │    3 from liger_kernel.transformers.fused_linear_cross_entropy import LigerF │
[rank0]: │    4 from liger_kernel.transformers.fused_linear_jsd import LigerFusedLinear │
[rank0]: │                                                                              │
[rank0]: │ /opt/app-root/lib64/python3.12/site-packages/liger_kernel/transformers/auto_ │
[rank0]: │ model.py:6 in <module>                                                       │
[rank0]: │                                                                              │
[rank0]: │    3 from transformers import AutoConfig                                     │
[rank0]: │    4 from transformers import AutoModelForCausalLM                           │
[rank0]: │    5                                                                         │
[rank0]: │ ❱  6 from liger_kernel.transformers.monkey_patch import MODEL_TYPE_TO_APPLY_ │
[rank0]: │    7 from liger_kernel.transformers.monkey_patch import _apply_liger_kernel  │
[rank0]: │    8                                                                         │
[rank0]: │    9                                                                         │
[rank0]: │                                                                              │
[rank0]: │ /opt/app-root/lib64/python3.12/site-packages/liger_kernel/transformers/monke │
[rank0]: │ y_patch.py:16 in <module>                                                    │
[rank0]: │                                                                              │
[rank0]: │    13 from liger_kernel.transformers.functional import liger_cross_entropy   │
[rank0]: │    14 from liger_kernel.transformers.geglu import LigerGEGLUMLP              │
[rank0]: │    15 from liger_kernel.transformers.layer_norm import LigerLayerNorm        │
[rank0]: │ ❱  16 from liger_kernel.transformers.model.gemma import lce_forward as gemma │
[rank0]: │    17 from liger_kernel.transformers.model.gemma import lce_forward_deprecat │
[rank0]: │    18 from liger_kernel.transformers.model.gemma2 import lce_forward as gemm │
[rank0]: │    19 from liger_kernel.transformers.model.gemma2 import lce_forward_depreca │
[rank0]: │                                                                              │
[rank0]: │ /opt/app-root/lib64/python3.12/site-packages/liger_kernel/transformers/model │
[rank0]: │ /gemma.py:11 in <module>                                                     │
[rank0]: │                                                                              │
[rank0]: │     8 from torch.nn import CrossEntropyLoss                                  │
[rank0]: │     9 from transformers.cache_utils import Cache                             │
[rank0]: │    10 from transformers.modeling_outputs import CausalLMOutputWithPast       │
[rank0]: │ ❱  11 from transformers.models.gemma.modeling_gemma import _CONFIG_FOR_DOC   │
[rank0]: │    12 from transformers.models.gemma.modeling_gemma import GEMMA_INPUTS_DOCS │
[rank0]: │    13 from transformers.utils import add_start_docstrings_to_model_forward   │
[rank0]: │    14 from transformers.utils import replace_return_docstrings               │
[rank0]: ╰──────────────────────────────────────────────────────────────────────────────╯
[rank0]: ImportError: cannot import name '_CONFIG_FOR_DOC' from 
[rank0]: 'transformers.models.gemma.modeling_gemma' 
[rank0]: (/opt/app-root/lib64/python3.12/site-packages/transformers/models/gemma/modeling
[rank0]: _gemma.py)
[rank0]:[W919 14:11:25.274921520 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0919 14:11:26.582000 502 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 567) of binary: /opt/app-root/bin/python3.12
Traceback (most recent call last):
  File "/opt/app-root/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/app-root/lib64/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/app-root/lib64/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/app-root/lib64/python3.12/site-packages/mini_trainer/train.py FAILED
------------------------------------------------------------

Updating the liger-kernel dependency to 0.5.10 fixes this issue, as per huggingface/trl#3480. This is an InstructLab dependency, so I am not sure where it would be best to fix it.

Summary by CodeRabbit

  • Chores
    • Added an optional CUDA-related dependency to enable GPU-accelerated functionality when selected.
    • No functional behavior changes for users who do not opt into CUDA.
    • Users opting into GPU support may need to update their environments; installation size/time may increase.
    • Minor packaging/formatting tweak to dependency declarations (no impact on functionality).


coderabbitai bot commented Sep 22, 2025

Walkthrough

Added an optional CUDA dependency liger-kernel>=0.5.10 under [project.optional-dependencies].cuda in pyproject.toml.

Changes

Cohort / File(s) Summary
Project metadata
pyproject.toml
Added optional CUDA dependency liger-kernel>=0.5.10 under [project.optional-dependencies].cuda.
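In pyproject.toml terms, the change described above would look roughly like this (a sketch; the comment and any surrounding entries are illustrative, not taken from the repository):

```toml
[project.optional-dependencies]
cuda = [
    # Fixes the Gemma _CONFIG_FOR_DOC ImportError; see huggingface/trl#3480
    "liger-kernel>=0.5.10",
]
```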

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I nibble lines of toml tonight,
A kernel tiptoes into view,
Optional CUDA, tucked in tight,
Quiet hops of version new.
—🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "Pinning liger-kernal version" concisely and accurately reflects the primary change (pinning the liger-kernel dependency) and is relevant to the changeset, but it contains a spelling error ("kernal" instead of "kernel").
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
pyproject.toml (3)

49-49: Tighten the “pin”: use a bounded range (or exact pin) and annotate the reason.

If the intent is to lock the working fix, prefer >=0.5.10,<0.5.11 (or ==0.5.10) to avoid future breakage; also add a brief comment referencing the upstream issue.

Apply one of the following:

Option A (bounded minor):

-    "liger-kernel>=0.5.10"
+    "liger-kernel>=0.5.10,<0.5.11",  # Fix Gemma _CONFIG_FOR_DOC import error; see HF TRL issue 3480

Option B (exact pin for apps):

-    "liger-kernel>=0.5.10"
+    "liger-kernel==0.5.10",  # Fix Gemma _CONFIG_FOR_DOC import error; see HF TRL issue 3480

39-47: Duplicate dependency: filelock listed twice with different minima.

You have both filelock>=3.0 (Line 39) and filelock>=3.19.1 (Line 46). Keep one, preferably the stricter one.

Apply this diff:

-    "filelock>=3.0",
@@
-    "filelock>=3.19.1",
+    "filelock>=3.19.1",

48-50: Confirm whether liger-kernel must be core or should live under a CUDA/kernels extra.

If Training Hub flows exist that don’t require liger-kernel, consider moving it to [project.optional-dependencies].cuda (or a new [project.optional-dependencies].kernels) to reduce base install weight. Otherwise, keeping it core is fine.

I can draft the extras rewire and minimal import smoke test for CI to validate Gemma import. Want me to push that?
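If liger-kernel does move to an extra, code that uses it would typically be guarded at runtime so the base install keeps working. A hypothetical standard-library sketch (the helper name and minimum version are illustrative):

```python
from importlib.metadata import PackageNotFoundError, version


def liger_available(minimum: str = "0.5.10") -> bool:
    """Return True if liger-kernel is installed at or above `minimum`."""
    try:
        installed = version("liger-kernel")
    except PackageNotFoundError:
        # Base install without the cuda extra: fall back to plain kernels.
        return False
    # Naive numeric comparison for the sketch; a real project should use
    # packaging.version for correct PEP 440 ordering.
    as_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    return as_tuple(installed) >= as_tuple(minimum)
```

A guard like this is what makes the extras split safe: CPU-only users never import liger_kernel, so the transformers-compatibility problem above cannot reach them.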

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8164824 and 7086176.

📒 Files selected for processing (1)
  • pyproject.toml (1 hunks)

Contributor

@Maxusmusti Maxusmusti left a comment


Could you move this to the cuda dependencies, rather than the base dependencies? This package should be installed when installing via pip install training-hub[cuda].

@Fiona-Waters
Contributor Author

Could you move this to the cuda dependencies, rather than the base dependencies? This package should be installed when installing via pip install training-hub[cuda].

Updated. Thanks!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pyproject.toml (2)

39-47: Duplicate dependency: filelock listed twice with conflicting bounds.

Keep the stricter one once to avoid resolver ambiguity.

Apply this diff:

@@
-    "filelock>=3.0",
@@
-    "filelock>=3.19.1",
+    "filelock>=3.19.1",

45-45: Replace 'attr' with 'attrs' in pyproject.toml

PyPI shows 'attr' is a different project (v0.3.2); the intended runtime library is 'attrs' (latest v25.3.0). Update pyproject.toml (line 45): replace
"attr>=0.3.2"
with
"attrs>=23.2.0"

🧹 Nitpick comments (1)
pyproject.toml (1)

14-16: Remove build tools from runtime dependencies (setuptools, wheel).

They’re already in [build-system] and shouldn’t be required at runtime.

Apply this diff:

@@
-    "setuptools>=80.0",
@@
-    "wheel>=0.43",
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 870674a and d0815f2.

📒 Files selected for processing (1)
  • pyproject.toml (1 hunks)
🔇 Additional comments (3)
pyproject.toml (3)

48-49: Do not ship pytest as a runtime dependency; move to dev extra.

Keep pytest out of end-user installs; include it under [project.optional-dependencies].dev.

Apply this diff:

@@
-    "pytest>=8.0"
+    "pytest>=8.0",
@@
 dev = [
     "ipykernel",
-    "ipython"
+    "ipython",
+    "pytest>=8.0"
 ]

And remove pytest from [project].dependencies:

@@
-    "pytest>=8.0"

32-37: Constraints are satisfiable — PyPI publishes 2025+ releases

fsspec latest: 2025.9.0; regex latest: 2025.9.18 — the >=2025.0 floors are satisfiable.


61-62: LGTM: CUDA extra now pins liger-kernel to a fixed-good range.

Sandbox couldn't fetch PyPI metadata (SSL certificate verification failed), so verification couldn't be completed here — confirm instructlab/instructlab-training doesn't force a conflicting liger-kernel version and that both CPU- and CUDA-only Training Hub images resolve correctly.

Contributor

@Maxusmusti Maxusmusti left a comment


LGTM, thanks!

@Maxusmusti Maxusmusti merged commit fc2175d into Red-Hat-AI-Innovation-Team:main Sep 23, 2025
1 check passed