Python version
3.11.15
fairchem-core version
2.19.0
pytorch version
2.8.0
cuda version
12.8
Operating system version
SUSE Linux Enterprise Server 15 SP6 (whatever ships on NERSC Perlmutter)
Minimal example
import os

from ase.io import read, write
from ase.optimize import BFGS

from fairchem.core import FAIRChemCalculator
from fairchem.core.units.mlip_unit import guess_inference_settings
from fairchem.core.units.mlip_unit.predict import ParallelMLIPPredictUnit

MODEL_PATH = "/global/cfs/cdirs/m5215/jgauth32/mlip_models/uma-s-1p2.pt"

# Request 8 workers across 2 Perlmutter nodes (4 GPUs each).
# The same script works when num_workers <= 4, i.e. a single node.
inference_settings = guess_inference_settings("turbo")
predictor = ParallelMLIPPredictUnit(
    inference_model_path=MODEL_PATH,
    device="cuda",
    inference_settings=inference_settings,
    num_workers=8,
    num_workers_per_node=4,
)
calc = FAIRChemCalculator(predictor, task_name="omat")

# Resume from the last step if a trajectory already exists.
if os.path.exists("qn.traj"):
    atoms = read("qn.traj", index=-1)
else:
    atoms = read("init.traj")
atoms.calc = calc

optimizer = BFGS(atoms, trajectory="qn.traj", logfile="qn.log")
optimizer.run(fmax=0.05)
write("final_relaxed.traj", atoms)
Current behavior
Multi-GPU inference with UMA works with the above script, but only when the job fits entirely within a single node. On Perlmutter, that means at most 4 GPUs. If your system needs more than 4 GPUs (the script above requests 8 workers, i.e. two nodes), the job hangs while trying to set up the Ray cluster.
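For context, a hypothetical Perlmutter submission of the failing two-node case might look like the sketch below. This is my reconstruction only; the actual batch script was not part of this report, the filename relax.py is a placeholder for the minimal example above, and the launch details are assumptions:

#!/bin/bash
#SBATCH --nodes=2            # two Perlmutter GPU nodes, 4 A100s each = 8 GPUs
#SBATCH --constraint=gpu
#SBATCH --gpus-per-node=4
#SBATCH --qos=regular
#SBATCH --time=01:00:00

# How the second node is supposed to join the Ray cluster that
# ParallelMLIPPredictUnit spins up is exactly what is unclear here.
python relax.py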
Expected Behavior
I would have expected this to work the same across multiple nodes, but something goes wrong when orchestrating the Ray cluster across node boundaries. Claude Code attempted something complicated involving a check-and-delay loop to verify that the Ray workers on each node were communicating, but I can't imagine that's the intended way to use this API.
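For reference, the kind of workaround it was converging on looks roughly like the sketch below. This is my reconstruction, not the actual code it produced; it assumes a Ray cluster has already been started outside the script (e.g. ray start --head on one node and ray start --address=<head_ip>:6379 on the other), and EXPECTED_NODES is a placeholder for the node count of the Slurm allocation:

import time
import ray

EXPECTED_NODES = 2  # placeholder: number of nodes in the Slurm job

# Attach to the externally started Ray cluster.
ray.init(address="auto")

# Poll until every node has registered with the head, or time out.
deadline = time.time() + 300
while True:
    alive = [node for node in ray.nodes() if node["Alive"]]
    if len(alive) >= EXPECTED_NODES:
        break
    if time.time() > deadline:
        raise RuntimeError(
            f"only {len(alive)} of {EXPECTED_NODES} Ray nodes joined the cluster"
        )
    time.sleep(5)

Needing this kind of manual readiness polling is presumably not the intended usage, which is why I'm filing this as a bug rather than adopting the workaround.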
Relevant files to reproduce this bug
No response