multi-node (beyond multi-GPU) inference with UMA 1.2 #1949

@gauthierlab

Description

Python version

3.11.15

fairchem-core version

2.19.0

pytorch version

2.8.0

cuda version

12.8

Operating system version

SUSE Linux Enterprise Server 15 SP6 (the default OS on Perlmutter)

Minimal example

import os, sys, torch
import ray
from ase.io import read, write
from ase.optimize import BFGS
from fairchem.core.units.mlip_unit.api.inference import InferenceSettings
from fairchem.core.units.mlip_unit.predict import ParallelMLIPPredictUnit
from fairchem.core.units.mlip_unit import guess_inference_settings
from fairchem.core import FAIRChemCalculator


MODEL_PATH = "/global/cfs/cdirs/m5215/jgauth32/mlip_models/uma-s-1p2.pt"

inference_settings = guess_inference_settings("turbo")
predictor = ParallelMLIPPredictUnit(
    inference_model_path=MODEL_PATH,
    device="cuda",
    inference_settings=inference_settings,
    num_workers=8,  # total GPUs requested: 2 Perlmutter nodes x 4 GPUs
    num_workers_per_node=4,  # GPUs available on one Perlmutter node
)

calc = FAIRChemCalculator(predictor, task_name="omat")

if os.path.exists("qn.traj"):
    atoms = read("qn.traj", index=-1)
else:
    atoms = read("init.traj")

atoms.calc = calc

optimizer = BFGS(atoms, trajectory="qn.traj", logfile="qn.log")
optimizer.run(fmax=0.05)

write("final_relaxed.traj", atoms)

Current behavior

Multi-GPU inference with UMA works with the script above, but only when all workers fit on a single node, which on Perlmutter means at most 4 GPUs. When more than 4 GPUs are requested (the script above asks for 8, i.e. two nodes), the job hangs while setting up the Ray cluster.
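Before suspecting Ray itself, it may be worth comparing the requested worker layout against the Slurm allocation. The sketch below is an illustration only: `nodes_required` and `check_allocation` are hypothetical helpers, not part of fairchem, and it assumes the job runs under Slurm, which exports `SLURM_JOB_NUM_NODES`.

```python
import math
import os


def nodes_required(num_workers: int, num_workers_per_node: int) -> int:
    """Number of nodes needed to host num_workers (ceiling division)."""
    if num_workers < 1 or num_workers_per_node < 1:
        raise ValueError("worker counts must be positive")
    return math.ceil(num_workers / num_workers_per_node)


def check_allocation(num_workers: int, num_workers_per_node: int, env=None) -> bool:
    """Return True if the Slurm allocation has enough nodes for the layout.

    Assumption: Slurm sets SLURM_JOB_NUM_NODES inside the job; when the
    variable is missing, the job is treated as running on a single node.
    """
    env = os.environ if env is None else env
    allocated = int(env.get("SLURM_JOB_NUM_NODES", "1"))
    return allocated >= nodes_required(num_workers, num_workers_per_node)
```

For the script above, `nodes_required(8, 4)` gives 2, so `check_allocation(8, 4)` only passes inside a job that was allocated at least two nodes.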

Expected Behavior

I expected this to work the same way across multiple nodes, but something seems to go wrong when orchestrating the Ray cluster. Claude Code attempted a complicated workaround involving a check-delay loop to make sure the Ray cluster on each node was communicating properly, but I can't imagine that is the intended usage here.
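For reference, the kind of check-delay loop mentioned above could look roughly like the sketch below. It is a workaround sketch, not fairchem's intended usage: `wait_for_gpus` is a hypothetical helper, and the resource lookup is passed in as a callable so the loop itself has no Ray dependency.

```python
import time
from typing import Callable, Mapping


def wait_for_gpus(
    get_resources: Callable[[], Mapping[str, float]],
    required_gpus: int,
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> bool:
    """Poll until the cluster reports at least required_gpus GPUs.

    Returns True once enough GPUs are visible, or False if timeout_s
    elapses first. get_resources is any callable returning a mapping
    with a "GPU" entry (hypothetical interface for this sketch).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        available = get_resources().get("GPU", 0)
        if available >= required_gpus:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With a Ray cluster already started on every node, this could be called as `wait_for_gpus(ray.cluster_resources, required_gpus=8)` after `ray.init(address="auto")`. Whether `ParallelMLIPPredictUnit` then attaches to that existing cluster is an assumption not confirmed in this issue.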

Relevant files to reproduce this bug

No response

Labels

bug
