Python version
3.11.15
fairchem-core version
2.19.0
pytorch version
2.8.0
cuda version
12.8
Operating system version
SUSE Linux Enterprise Server 15 SP6 (whatever ships on NERSC Perlmutter)
Minimal example
import os

from ase.io import read, write
from ase.optimize import BFGS

from fairchem.core import FAIRChemCalculator
from fairchem.core.units.mlip_unit import guess_inference_settings
from fairchem.core.units.mlip_unit.predict import ParallelMLIPPredictUnit

MODEL_PATH = "/global/cfs/cdirs/m5215/jgauth32/mlip_models/uma-s-1p2.pt"

# Request 8 workers across 2 Perlmutter nodes (4 GPUs each).
# The same script works when num_workers <= 4, i.e. a single node.
inference_settings = guess_inference_settings("turbo")
predictor = ParallelMLIPPredictUnit(
    inference_model_path=MODEL_PATH,
    device="cuda",
    inference_settings=inference_settings,
    num_workers=8,
    num_workers_per_node=4,
)
calc = FAIRChemCalculator(predictor, task_name="omat")

# Resume from the last step if a trajectory already exists.
if os.path.exists("qn.traj"):
    atoms = read("qn.traj", index=-1)
else:
    atoms = read("init.traj")
atoms.calc = calc

optimizer = BFGS(atoms, trajectory="qn.traj", logfile="qn.log")
optimizer.run(fmax=0.05)
write("final_relaxed.traj", atoms)
Current behavior
Multi-GPU inference with UMA works with the above script, but only when the job fits entirely within a single node. On Perlmutter, that means at most 4 GPUs. If your system needs more than 4 GPUs (the script above requests 8 workers, i.e. two nodes), the job hangs while trying to set up the Ray cluster.
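For context, a hypothetical Perlmutter submission of the failing two-node case might look like the sketch below. This is my reconstruction only; the actual batch script was not part of this report, the filename relax.py is a placeholder for the minimal example above, and the launch details are assumptions:

#!/bin/bash
#SBATCH --nodes=2            # two Perlmutter GPU nodes, 4 A100s each = 8 GPUs
#SBATCH --constraint=gpu
#SBATCH --gpus-per-node=4
#SBATCH --qos=regular
#SBATCH --time=01:00:00

# How the second node is supposed to join the Ray cluster that
# ParallelMLIPPredictUnit spins up is exactly what is unclear here.
python relax.py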
Expected Behavior
I would have expected this to work the same across multiple nodes, but something goes wrong when orchestrating the Ray cluster across node boundaries. Claude Code attempted something complicated involving a check-and-delay loop to verify that the Ray workers on each node were communicating, but I can't imagine that's the intended way to use this API.
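For reference, the kind of workaround it was converging on looks roughly like the sketch below. This is my reconstruction, not the actual code it produced; it assumes a Ray cluster has already been started outside the script (e.g. ray start --head on one node and ray start --address=<head_ip>:6379 on the other), and EXPECTED_NODES is a placeholder for the node count of the Slurm allocation:

import time
import ray

EXPECTED_NODES = 2  # placeholder: number of nodes in the Slurm job

# Attach to the externally started Ray cluster.
ray.init(address="auto")

# Poll until every node has registered with the head, or time out.
deadline = time.time() + 300
while True:
    alive = [node for node in ray.nodes() if node["Alive"]]
    if len(alive) >= EXPECTED_NODES:
        break
    if time.time() > deadline:
        raise RuntimeError(
            f"only {len(alive)} of {EXPECTED_NODES} Ray nodes joined the cluster"
        )
    time.sleep(5)

Needing this kind of manual readiness polling is presumably not the intended usage, which is why I'm filing this as a bug rather than adopting the workaround.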
Relevant files to reproduce this bug
No response