
Updates for supporting multi-host CI #26490


Open. akirby-TT wants to merge 293 commits into main from akirby/multi-host-workflow

Conversation

Contributor

@akirby-TT akirby-TT commented Aug 7, 2025

Ticket

Link to Github Issue

Problem description

  • We currently have no CI testing for multi-host systems
  • As a result, distributed features are only tested in multi-process environments, which gives limited coverage

What's changed

  • Add multi-host-physical.yaml workflow file for capturing all CI workloads running on 2 Loudboxes
  • These Loudboxes have been allocated/provisioned as dedicated multi-host CI machines
  • Add dual T3K rank binding file
  • Update Intermesh Routing tests to cover more traffic patterns
  • Update tt-run to propagate TT_METAL_HOME, PYTHONPATH and LD_LIBRARY_PATH to child processes that may be called remotely
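
The last bullet can be pictured with a minimal sketch. It assumes an Open MPI style launcher, where `-x NAME=value` exports a variable to the launched processes. This thread does not show how tt-run actually forwards the variables, and `mpirun_export_args` is a hypothetical helper name.

```python
from typing import List, Mapping


def mpirun_export_args(env: Mapping[str, str]) -> List[str]:
    """Turn an environment mapping into Open MPI `-x NAME=value` arguments.

    Hypothetical helper for illustration only; not taken from tt-run.
    """
    args: List[str] = []
    # Sort by name so the generated command line is deterministic.
    for name, value in sorted(env.items()):
        args.extend(["-x", f"{name}={value}"])
    return args


# Example values are placeholders, not paths used by the PR.
example_args = mpirun_export_args(
    {"TT_METAL_HOME": "/opt/tt-metal", "PYTHONPATH": "/opt/tt-metal"}
)
```

The resulting list would be spliced into the launcher command before the program name, so that processes started on a remote host see the same values as the local shell.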

Checklist

akirby-TT and others added 2 commits August 7, 2025 23:12
 - Remove debug prints from Control Plane
 - Remove commented out tests for Dual Loudbox Configs
@tt-asaigal tt-asaigal force-pushed the akirby/multi-host-workflow branch from 76fc5ec to a8e5b90 Compare August 7, 2025 23:13
Contributor

@github-actions github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

@nsextonTT nsextonTT requested a review from Copilot August 8, 2025 05:42
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds debug output and configuration updates to support multi-host CI testing. It introduces extensive logging throughout the intermesh ethernet link initialization and routing logic to help diagnose multi-host connectivity issues.

  • Adds debug console output for intermesh link configuration and status
  • Updates test cases to focus on specific multicast routing patterns
  • Includes new configuration files and CI workflow for dual T3K testing

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| tt_metal/fabric/control_plane.cpp | Adds debug logging throughout intermesh link initialization and routing |
| tests/tt_metal/multihost/fabric_tests/intermesh_routing.cpp | Simplifies test cases to focus on north/south multicast patterns |
| tests/tt_metal/distributed/config/dual_t3k_rank_bindings.yaml | New configuration file for dual T3K rank bindings |
| .github/workflows/multi-host-physical.yaml | New CI workflow for multi-host physical testing |

- rank: 0
  mesh_id: 0
  env_overrides:
    LD_LIBRARY_PATH: "./build/lib"
Contributor

Is it possible to extend LD_LIBRARY_PATH with ./build/lib instead of overriding it? Also, ideally add a comment explaining why this is here.

Contributor

No longer capturing this in the rank binding file. Instead we can pass this in as a top level env var to the tt-run script.

This way the same binding file can be reused across setups and users can set the env vars they require in a custom manner.

Not sure why the multi-host GitHub Actions setup requires the path to be specified explicitly, though we can look into this as a cleanup step.
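
The reviewer's suggestion to extend rather than override the variable could look roughly like the sketch below. `extend_search_path` is a hypothetical helper, not part of tt-run, and the directory values are placeholders.

```python
import os
from typing import Mapping


def extend_search_path(
    environ: Mapping[str, str], new_dir: str, var: str = "LD_LIBRARY_PATH"
) -> str:
    """Return the value of `var` with `new_dir` prepended, keeping existing entries.

    Hypothetical helper for illustration only.
    """
    existing = environ.get(var, "")
    # Drop empty entries: the dynamic loader treats a zero-length entry
    # as the current working directory.
    parts = [p for p in existing.split(os.pathsep) if p]
    if new_dir in parts:
        return os.pathsep.join(parts)
    return os.pathsep.join([new_dir] + parts)


# Example: keep /usr/local/lib while adding the build output directory.
extended = extend_search_path({"LD_LIBRARY_PATH": "/usr/local/lib"}, "./build/lib")
```

The returned string can then be passed as a top-level environment variable when launching, which matches the approach described in the reply and leaves the rank binding file free of machine-specific paths.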

@tt-asaigal tt-asaigal force-pushed the akirby/multi-host-workflow branch from 2880e55 to 9356960 Compare August 9, 2025 02:00
@tt-asaigal tt-asaigal requested a review from a team as a code owner August 9, 2025 03:31
Member

@cfjchu cfjchu left a comment

Approving to unblock

      - main
  push:
    branches:
      - akirby/multi-host-workflow
Member

Should this be main?

Contributor

This is only here temporarily and will be removed right before merging.

It lets us trigger the workflow on the current PR, which we could not do otherwise.

Comment on lines 100 to 102
"TT_METAL_HOME": os.environ.get("TT_METAL_HOME", str(Path.home())),
"PYTHONPATH": os.environ.get("PYTHONPATH", str(Path.home())),
"LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", DEFAULT_LD_LIBRARY_PATH.format(home=str(Path.home()))),
Member

Path.home()? Can we leave a TODO comment for @akirby-TT to revisit why this was needed?

Contributor

The latest commit removes this. If CI passes, I will remove it entirely; otherwise I will fix the path and add a comment.

Contributor

It seems to be working without the path specified. Removing.
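
For context on why the `Path.home()` default drew a question: with that pattern, an unset variable is silently replaced by the user's home directory. One alternative, sketched here under the assumption that unset variables should simply not be forwarded, is to pass through only what the caller actually set. `resolve_passthrough` is a hypothetical name, not the function in tt-run.

```python
from typing import Dict, Iterable, Mapping


def resolve_passthrough(
    environ: Mapping[str, str], names: Iterable[str]
) -> Dict[str, str]:
    """Collect only the variables that are set, with no fallback value.

    Hypothetical helper for illustration only.
    """
    return {name: environ[name] for name in names if name in environ}


# Example: only TT_METAL_HOME is set, so it is the only variable forwarded.
forwarded = resolve_passthrough(
    {"TT_METAL_HOME": "/opt/tt-metal"},
    ["TT_METAL_HOME", "PYTHONPATH", "LD_LIBRARY_PATH"],
)
```

Leaving unset variables out means a child process falls back to its own defaults instead of receiving a home directory as its `PYTHONPATH` or library search path.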

@tt-asaigal tt-asaigal force-pushed the akirby/multi-host-workflow branch 2 times, most recently from b941930 to 189c509 Compare August 11, 2025 02:53
@tt-asaigal tt-asaigal force-pushed the akirby/multi-host-workflow branch from 189c509 to 19250c1 Compare August 11, 2025 03:01
6 participants