-
Notifications
You must be signed in to change notification settings - Fork 243
Updates for supporting multi-host CI #26490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
76fc5ec
to
a8e5b90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy
found issue(s) with the introduced code (1/1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds debug output and configuration updates to support multi-host CI testing. It introduces extensive logging throughout the intermesh ethernet link initialization and routing logic to help diagnose multi-host connectivity issues.
- Adds debug console output for intermesh link configuration and status
- Updates test cases to focus on specific multicast routing patterns
- Includes new configuration files and CI workflow for dual T3K testing
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.
File | Description |
---|---|
tt_metal/fabric/control_plane.cpp | Adds debug logging throughout intermesh link initialization and routing |
tests/tt_metal/multihost/fabric_tests/intermesh_routing.cpp | Simplifies test cases to focus on north/south multicast patterns |
tests/tt_metal/distributed/config/dual_t3k_rank_bindings.yaml | New configuration file for dual T3K rank bindings |
.github/workflows/multi-host-physical.yaml | New CI workflow for multi-host physical testing |
- rank: 0 | ||
mesh_id: 0 | ||
env_overrides: | ||
LD_LIBRARY_PATH: "./build/lib" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to not override LD_LIBRARY_PATH
but to extend it with .build/lib
? Also ideally add a comment why this is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer capturing this in the rank binding file. Instead we can pass this in as a top level env var to the tt-run
script.
This way the same binding file can be reused across setups and users can set the env vars they require in a custom manner.
Not sure why the multi-host Github Actions setup requires the path to be explicitly specified though we can look into this as a cleanup step.
…t-metal into akirby/multi-host-workflow
2880e55
to
9356960
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to unblock
- main | ||
push: | ||
branches: | ||
- akirby/multi-host-workflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be main
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just here temporarily, will be removed right before merging.
This allows us to trigger the workflow on the current PR, we won't be able to do so otherwise.
ttnn/ttnn/distributed/ttrun.py
Outdated
"TT_METAL_HOME": os.environ.get("TT_METAL_HOME", str(Path.home())), | ||
"PYTHONPATH": os.environ.get("PYTHONPATH", str(Path.home())), | ||
"LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", DEFAULT_LD_LIBRARY_PATH.format(home=str(Path.home()))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path.home()
? Can we leave a TODO comment for @akirby-TT to revisit why this was needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest commit removes this. Will remove this entirely if it passes. Will fix the path otherwise with a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems to be working without the path specified. Removing.
b941930
to
189c509
Compare
189c509
to
19250c1
Compare
Ticket
Link to Github Issue
Problem description
What's changed
multi-host-physical.yaml
workflow file for capturing all CI workloads running on 2 Loudboxestt-run
to propagateTT_METAL_HOME
,PYTHONPATH
andLD_LIBRARY_PATH
to child processes that may be called remotelyChecklist