Thanks for your interest in contributing to Megatron-Bridge!
You can either follow the steps below to set up the environment from scratch, or use the NeMo Framework container, which provides a pre-built environment and makes these steps unnecessary.
Build and run the Docker container:

```bash
docker build \
  -f docker/Dockerfile.ci \
  -t megatron-bridge \
  .
```

To start a shell in the container to interactively run/develop:

```bash
docker run --rm -it -w /workdir -v $(pwd):/opt/Megatron-Bridge \
  --entrypoint bash \
  --gpus all \
  megatron-bridge
```

If you are using VSCode/Cursor, you can also use Dev Containers; an example `devcontainer.json` to get you started is included at the end of this document.
If you're an external contributor, you'll need to fork the repository:
1. Create a fork: Click the "Fork" button on the GitHub repository page or follow this direct link to fork.

2. Clone your fork:

   ```bash
   git clone https://github.com/YOUR-USERNAME/Megatron-Bridge megatron-bridge
   cd megatron-bridge
   ```

3. Add the upstream remote to keep your fork updated:

   ```bash
   git remote add upstream https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
   ```

4. Install pre-commit:

   ```bash
   # Requires `uv` to be installed
   uv run --group dev pre-commit install
   ```

5. Keep your fork updated before starting new work:

   ```bash
   git fetch upstream
   git checkout main
   git merge upstream/main
   git push origin main
   ```

6. Create a new branch for your changes:

   ```bash
   git checkout main
   git switch -c your-feature-name
   ```

7. Make your changes and commit them:

   ```bash
   git add .
   git commit --signoff -m "Your descriptive commit message"
   ```

   We require signing commits with `--signoff` (or `-s` for short). See Signing Your Work for details.

8. Push to your fork:

   ```bash
   git push origin your-feature-name
   ```

9. Create a pull request from your fork's branch to the main repository's `main` branch through the GitHub web interface.
If you have write access to the repository (NVIDIA contributors):
1. Clone the repository directly:

   ```bash
   git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge megatron-bridge
   cd megatron-bridge
   ```

2. Install pre-commit from the project root directory:

   ```bash
   # Requires `uv` to be installed
   uv run --group dev pre-commit install
   ```

3. Create a new branch for your changes:

   ```bash
   git switch -c your-feature-name
   ```

4. Make your changes and commit them:

   ```bash
   git add .
   git commit --signoff -m "Your descriptive commit message"
   ```

5. Push your branch to the repository:

   ```bash
   git push origin your-feature-name
   ```

6. Create a pull request from your branch to the `main` branch.
Format your commit messages and PR titles as:

```
[{areas}] {type}: {description}
```

Areas (use the most relevant ones; separate multiple with `,`):

- `model` - Model implementations and HF bridge logic
- `recipe` - Training recipes and launch configs
- `training` - Training loop, callbacks, and runtime integration
- `data` - Dataset builders, preprocessing, and samplers
- `ckpt` - Checkpoint conversion, loading, export, and save paths
- `peft` - PEFT methods (LoRA, adapters) and adapter export
- `perf` - Performance optimizations and throughput improvements
- `ci` - CI, automation, and workflow infrastructure
- `docs` - Documentation, examples, and contributor guidance
- `build` - Dependencies, packaging, and environment setup
- `misc` - Cross-cutting utilities and other changes
Types:

- `feat` - New feature
- `fix` - Bug fix
- `refactor` - Code refactoring without changing functionality
- `chore` - Maintenance tasks
- `test` - Adding or updating tests
Breaking Changes: If your PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
Examples:

```
[model] feat: Add Qwen3 model bridge
[recipe, docs] feat: Add Llama 3.1 70B recipe with documentation
[ckpt] fix: Handle missing keys in HF checkpoint conversion
[BREAKING][training] refactor: Change optimizer config structure
[ci, build] chore: Update ruff version
```
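The title convention above is mechanical enough to check with a small script. The following is a hypothetical helper, not part of the repository or its CI; the function name and regex are invented for illustration:

```python
import re

# Area and type vocabularies from the convention above.
AREAS = {"model", "recipe", "training", "data", "ckpt", "peft",
         "perf", "ci", "docs", "build", "misc"}
TYPES = {"feat", "fix", "refactor", "chore", "test"}

# Matches: optional [BREAKING], then [area] or [area1, area2], then "type: description".
TITLE_RE = re.compile(
    r"^(?:\[BREAKING\])?\[(?P<areas>[a-z]+(?:,\s*[a-z]+)*)\]\s*"
    r"(?P<type>[a-z]+):\s*(?P<desc>.+)$"
)

def is_valid_title(title: str) -> bool:
    """Return True if a PR title follows the [{areas}] {type}: {description} convention."""
    m = TITLE_RE.match(title)
    if not m:
        return False
    areas = {a.strip() for a in m.group("areas").split(",")}
    return areas <= AREAS and m.group("type") in TYPES
```

For instance, `is_valid_title("[model] feat: Add Qwen3 model bridge")` accepts, while a title with no area tag is rejected.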
Megatron Bridge uses a small governance taxonomy so maintainers, oncall, and automation can reason about issues and PRs consistently:
- New issues should start with `needs-triage` and leave triage with one `type` label plus one `area` label.
- PRs should use one primary `area:*` value in the PR template. State labels such as `needs-author`, `blocked`, and `ready-to-merge` are for routing active work, not for replacing review status or CI details.
- Release labels such as `r0.3.0`, community labels, and `needs-follow-up` are still valid, but they are orthogonal to the main governance taxonomy.
Use exactly one type label per issue or PR after triage:
| Label | Use for |
|---|---|
| `bug` | Incorrect behavior, regressions, or broken workflows |
| `feature` | New capabilities, enhancements, or enablement work |
| `support` | Questions, help requests, or user guidance gaps |
| `docs` | Documentation-only updates or documentation debt |
| `ci` | CI, automation, test queue, or workflow infrastructure work |
Use at most one state label from this set at a time:
| Label | Meaning |
|---|---|
| `needs-triage` | New item needs classification and ownership |
| `needs-review` | PR is ready for code review and waiting on a reviewer |
| `needs-author` | Author action is required before review or merge can continue |
| `needs-follow-up` | Issue or PR has finished initial triage/review and needs further follow-up |
| `blocked` | Work cannot move forward until an external dependency is cleared |
| `ready-to-merge` | PR is approved, current, and only waiting for CI to pass before merge |
Apply only when risk affects review or merge behavior:
| Label | Meaning |
|---|---|
| `breaking-change` | Public behavior or API compatibility changes |
| `high-complexity` | Harder to merge: prone to conflicts and needs additional test coverage |
| `needs-more-tests` | Requires additional test coverage; triggers both L0 and L1 CI test tiers |
Use one primary area label after triage:
| Label | Scope |
|---|---|
| `area:model` | Model implementations and HF bridge logic |
| `area:recipe` | Training recipes and launch configs |
| `area:training` | Training loop, callbacks, and runtime integration |
| `area:data` | Dataset builders, preprocessing, and samplers |
| `area:ckpt` | Checkpoint conversion, loading, export, and save paths |
| `area:peft` | PEFT methods (LoRA, adapters) and adapter export |
| `area:perf` | Performance optimizations, kernel integration, and throughput improvements |
| `area:build` | Dependencies, packaging, images, and environment setup |
| `area:misc` | Cross-cutting utilities, logging, helpers, and other changes that do not fit a primary domain |
This taxonomy does not replace every existing label:
- Keep release labels such as `r0.3.0` as independent scheduling signals.
- Keep `community-request` and other community-related labels as independent intake signals.
- Use `needs-follow-up` when an issue or PR should stay explicitly visible to the oncaller across handoffs.
- Avoid creating new status synonyms when an existing label in this taxonomy already fits.
- New issues should start with `needs-triage`.
- Issues should leave triage with one `type` label and one `area` label.
- An issue keeps `needs-triage` until a maintainer has responded or assigned it. Adding type and area labels is classification; the issue leaves `needs-triage` only when a maintainer engages (responds, assigns, or explicitly routes it).
- After a maintainer engages, transition to `needs-follow-up` (deferred work oncall should track), `needs-author` (waiting on reporter for more info), `blocked` (external dependency), or no state label (actively being worked on).
- PRs should not use `needs-triage`. Use `needs-review`, `needs-author`, `blocked`, or `ready-to-merge` only when they help route work.
- `high-complexity` starts as a manual maintainer label, not an automated heuristic.
- `needs-follow-up` should usually point to a linked issue instead of staying on a merged PR.
- `needs-follow-up` is the visibility label for deferred work that should stay on the oncall radar.
- `needs-follow-up` can be combined with `blocked` when the oncaller should keep watching a blocked item.
- If a PR is marked `breaking-change`, do not treat it as auto-mergeable even if CI is green.
These four views are the core daily queues maintainers and oncall should watch.
- Scope: open issues labeled `needs-triage`
- Goal: assign one `type` and one `area`
- Suggested query: `is:issue is:open label:"needs-triage" sort:updated-asc`

- Scope: open PRs labeled `ready-to-merge`
- Goal: surface PRs that should merge without rereading every CI detail
- Suggested query: `is:pr is:open label:"ready-to-merge" draft:false sort:updated-asc`

- Scope: open issues and PRs labeled `blocked` or `needs-follow-up`
- Goal: make blockers and deferred work visible across handoffs
- Suggested query: `is:open (label:"blocked" OR label:"needs-follow-up") sort:updated-asc`

- Scope: open PRs labeled `high-complexity`
- Goal: proactively review, rebase, and ensure adequate test coverage before conflicts waste CI and reviewer time
- Suggested query: `is:pr is:open label:"high-complexity" sort:updated-asc`
If you mirror these queues into a GitHub Project, keep the columns and sort keys small:
- item title
- primary area
- owner or assignee
- age
- last updated time
- release label
- current state
We use pytest for writing both unit and functional tests.
Unit tests aim to test functions in isolation. They generally do not depend on artifacts like Hugging Face checkpoints or larger datasets. The only exception is a small toy dataset consisting of tokenizers.
Unit tests are stored at `tests/unit_tests`. Please add your test to an existing folder or create a new one if none matches.
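A unit test in this style can be a plain pytest module with small, assertion-based test functions. The sketch below is illustrative only: the function under test is a made-up stand-in, and a real test would import from the project's own packages instead.

```python
# Minimal, self-contained unit test in the style used under tests/unit_tests/.
# clamp_grad_norm is a toy stand-in for a real utility function under test.

def clamp_grad_norm(norm: float, max_norm: float) -> float:
    """Toy stand-in: clip a gradient norm to a maximum value."""
    return min(norm, max_norm)

def test_clamp_grad_norm_passes_small_values_through():
    # Values below the threshold are returned unchanged.
    assert clamp_grad_norm(0.5, 1.0) == 0.5

def test_clamp_grad_norm_clips_large_values():
    # Values above the threshold are clipped down to it.
    assert clamp_grad_norm(5.0, 1.0) == 1.0
```

pytest discovers `test_*` functions automatically, so no runner boilerplate is needed.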
Functional tests are integration tests that perform model training or operate on larger artifacts. We use pytest for writing these as well. In some cases, it may be desirable to run your test (or parts of it) in a subprocess to avoid process contamination; we use `subprocess.run` for this inside the pytest function. Please add your test to one of the predefined folders. If none of the folders matches semantically, please reach out to @nvidia-nemo/automation in your PR for consultation.
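The subprocess pattern mentioned above can be sketched as follows. This is an illustrative example, not code from the repository: the helper name and the trivial snippet it runs are invented, and a real functional test would launch an actual training script instead.

```python
import subprocess
import sys

def run_isolated(code: str) -> subprocess.CompletedProcess:
    """Run a snippet of Python in a fresh interpreter so that state
    (e.g. CUDA or distributed initialization) cannot leak between tests."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=600,
    )

def test_training_step_runs_in_clean_process():
    # A real functional test would invoke a training entry point here;
    # a trivial print stands in for it in this sketch.
    result = run_isolated("print('step-ok')")
    assert result.returncode == 0
    assert "step-ok" in result.stdout
```

Because the child process exits after each call, repeated tests in one pytest session cannot contaminate each other's global state.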
Functional tests are placed in tiered launcher scripts inside tests/functional_tests/. Each tier runs in a separate CI job, allowing faster PR feedback while keeping thorough coverage on nightly runs.
| Tier | Prefix | Trigger | Purpose |
|---|---|---|---|
| L0 | `L0_Launch_*.sh` | Every PR, main push, schedule | Core smoke tests — must be fast and stable |
| L1 | `L1_Launch_*.sh` | Main push + schedule (not PRs) | Broader model/recipe coverage |
| L2 | `L2_Launch_*.sh` | Schedule / `workflow_dispatch` only | VL models, checkpoint conversion, heavy quantization |
When adding a new launcher script, choose the appropriate tier and also update `.github/workflows/cicd-main.yml` to include it in the corresponding `cicd-functional-tests-l{0,1,2}` job matrix:

```yaml
# Example: adding an L1 test
- script: L1_Launch_your_new_test
```

Without this step, your new launcher script will not be picked up by CI.
We use `uv` for managing dependencies. For reproducible builds, the project tracks the generated `uv.lock` file in the repository. On a weekly basis, CI attempts to update the lock file to test against upstream dependencies.

New required dependencies can be added with `uv add $DEPENDENCY`. New optional dependencies can be added with `uv add --optional --extra $EXTRA $DEPENDENCY`, where `EXTRA` is the subgroup of extra dependencies to which you're adding the new dependency. For example, to add a TRT-LLM-specific dependency, run `uv add --optional --extra trtllm $DEPENDENCY`. Alternatively, the `pyproject.toml` file can be modified directly.
Adding a new dependency will update uv's lock file. Please check this into your branch:

```bash
git add uv.lock pyproject.toml
git commit -m "build: Adding dependencies"
git push
```

We use `ruff` for linting and formatting. CI does not auto-fix linting and formatting issues, but most issues can be fixed by running the following commands:

```bash
uv run ruff check --fix .
uv run ruff format .
```

Note: If `ruff` is missing, please follow the installation guide.
Important: All new key features (e.g., enabling a new model or a new parallelism strategy) must include a documentation update (either a new doc or an update to an existing one). This documentation update should:
- Explain the motivation and purpose of the feature
- Outline the technical approach and architecture
- Provide clear usage examples and instructions for users
- Document internal implementation details where appropriate
This ensures that all significant changes are well-thought-out and properly documented for future reference. Comprehensive documentation serves two critical purposes:
- User Adoption: Helps users understand how to effectively use the library's features in their projects
- Developer Extensibility: Enables developers to understand the internal architecture and implementation details, making it easier to modify, extend, or adapt the code for their specific use cases
Quality documentation is essential for both the usability of Megatron-Bridge and its ability to be customized by the community.
- Follow the existing code style and conventions (see CODING_GUIDELINES.md)
- Write tests for new features
- Update documentation to reflect your changes
- Ensure all tests pass before submitting a PR
- Do not add arbitrary defaults for configs; be as explicit as possible
- We require that all contributors "sign off" on their commits. This certifies that the contribution is your original work, or that you have the right to submit it under the same license or a compatible license.

- Any contribution which contains commits that are not signed off will not be accepted.

- To sign off on a commit, simply use the `--signoff` (or `-s`) option when committing your changes:

  ```bash
  git commit -s -m "Add cool feature."
  ```

  This will append the following to your commit message:

  ```
  Signed-off-by: Your Name <your@email.com>
  ```

- Full text of the DCO:

  ```
  Developer Certificate of Origin
  Version 1.1

  Copyright (C) 2004, 2006 The Linux Foundation and its contributors.

  Everyone is permitted to copy and distribute verbatim copies of this
  license document, but changing it is not allowed.

  Developer's Certificate of Origin 1.1

  By making a contribution to this project, I certify that:

  (a) The contribution was created in whole or in part by me and I have
      the right to submit it under the open source license indicated in
      the file; or

  (b) The contribution is based upon previous work that, to the best of
      my knowledge, is covered under an appropriate open source license
      and I have the right under that license to submit that work with
      modifications, whether created in whole or in part by me, under
      the same open source license (unless I am permitted to submit
      under a different license), as indicated in the file; or

  (c) The contribution was provided directly to me by some other person
      who certified (a), (b) or (c) and I have not modified it.

  (d) I understand and agree that this project and the contribution are
      public and that a record of the contribution (including all
      personal information I submit with it, including my sign-off) is
      maintained indefinitely and may be redistributed consistent with
      this project or the open source license(s) involved.
  ```
There are two ways to trigger CI tests on your pull request:
If your GitHub user is configured to use signed commits, CI tests will run automatically when you push commits to your pull request.

Note: Signed commits are different from signing off on commits (which uses the `-s` flag mentioned in the Signing Your Work section).
If you don't have signed commits set up, you can still trigger CI tests manually by commenting on your pull request:

```
/ok to test <commit-SHA>
```

For example:

```
/ok to test a1b2c3d4e5f6
```
Important: You'll need to add this comment for each new commit you push to ensure CI tests run on the latest changes.
You can find the commit SHA in several ways:

- View your pull request's commit history on GitHub
- Run `git log --oneline -1` in your local repository
- Check the commit details in your Git client
Please see our documentation for a detailed guide on contributing new models.
Example `devcontainer.json` (referenced in the Docker setup above):

```jsonc
{
    "name": "megatron-bridge-dev",
    "image": "megatron-bridge:latest",
    "runArgs": [
        "--gpus", "all",
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
        "--shm-size=24g",
        "--privileged",
        "--pid=host"
    ]
    // NOTE: Here is an example of how you can set up some common mounts,
    // environment variables, and your shell. Feel free to adapt to your
    // development workflow and remember to replace the paths with your username.
    //"mounts": [
    //    {"source": "/home/yourusername", "target": "/home/yourusername", "type": "bind"},
    //    {"source": "/home/yourusername/.ssh", "target": "/root/yourusername-ssh", "type": "bind"}
    //],
    //"containerEnv": {
    //    "HF_TOKEN_PATH": "/home/yourusername/.cache/huggingface/token",
    //    "HF_HOME": "/home/yourusername/.cache/huggingface",
    //    "HF_DATASETS_CACHE": "/home/yourusername/.cache/huggingface/datasets",
    //    "WANDB_API_KEY": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    //},
    //
    // This (1) marks all directories safe (2) copies in ssh keys (3) sources user's bashrc file
    //"postStartCommand": "git config --global --add safe.directory '*' && cp -r /root/yourusername-ssh/* /root/.ssh/ && source /home/yourusername/.bashrc"
}
```