
NVIDIA GPU scheduling issue with multiple models #26584

@RainbowHerbicides

Description

Nomad version

Nomad v1.9.5
BuildDate 2025-01-14T18:35:12Z
Revision 0b7bb8b60758981dae2a78a0946742e09f8316f5+CHANGES

Issue

I am not entirely sure whether this is a legitimate limitation of Nomad + the nomad-device-nvidia plugin or a genuine bug. According to the documentation at https://developer.hashicorp.com/nomad/docs/job-specification/device#multiple-nvidia-gpu, multiple GPUs are supported, but it does not say whether those GPUs have to be the same model as well as on the same node, or whether the models can differ and only same-node placement matters. In our case we have 2 NVIDIA GPUs installed and available on one node. Requesting them specifically, like:

        device "nvidia/gpu/NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition" {
          count = 1
        }

or

        device "nvidia/gpu/NVIDIA RTX 5000 Ada Generation" {
          count = 1
        }

works without any issues, same as running a container manually on the node: nvidia-smi reports that both cards are visible and can be utilised. But requesting them as:

        device "nvidia/gpu" {
          count = 2
        }

results in a placement failure.
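For context, a stripped-down sketch of a job that reproduces this for us looks roughly like the following; the job, group, and task names and the Docker image are placeholders rather than our real configuration:

job "gpu-test" {
  datacenters = ["dc1"]

  group "llm" {
    task "ollama" {
      driver = "docker"

      config {
        # placeholder image, not our real one
        image = "ollama/ollama"
      }

      resources {
        cpu    = 2000
        memory = 4096

        device "nvidia/gpu" {
          # works with count = 1, fails to place with count = 2
          count = 2
        }
      }
    }
  }
}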

Reproduction steps

Have 2 NVIDIA GPUs that are correctly fingerprinted:

❯ nomad node status -json fc9077e8 | jq '.NodeResources.Devices'
[
  {
    "Attributes": {
      "cores_clock": {
        "Int": 210,
        "Unit": "MHz"
      },
      "pci_bandwidth": {
        "Int": 32768,
        "Unit": "MB/s"
      },
      "driver_version": {
        "String": "580.65.06",
        "Unit": ""
      },
      "memory": {
        "Int": 32760,
        "Unit": "MiB"
      },
      "bar1": {
        "Int": 256,
        "Unit": "MiB"
      },
      "display_state": {
        "String": "0",
        "Unit": ""
      },
      "power": {
        "Int": 14,
        "Unit": "W"
      },
      "memory_clock": {
        "Int": 405,
        "Unit": "MHz"
      },
      "persistence_mode": {
        "String": "0",
        "Unit": ""
      }
    },
    "Instances": [
      {
        "HealthDescription": "",
        "Healthy": true,
        "ID": "GPU-45bc2781-22da-689e-59d5-f3778161164f",
        "Locality": {
          "PciBusID": "00000000:00:1B.0\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
        }
      }
    ],
    "Name": "NVIDIA RTX 5000 Ada Generation",
    "Type": "gpu",
    "Vendor": "nvidia"
  },
  {
    "Attributes": {
      "memory": {
        "Int": 97887,
        "Unit": "MiB"
      },
      "memory_clock": {
        "Int": 405,
        "Unit": "MHz"
      },
      "power": {
        "Int": 8,
        "Unit": "W"
      },
      "pci_bandwidth": {
        "Int": 49152,
        "Unit": "MB/s"
      },
      "cores_clock": {
        "Int": 180,
        "Unit": "MHz"
      },
      "bar1": {
        "Int": 256,
        "Unit": "MiB"
      },
      "persistence_mode": {
        "String": "0",
        "Unit": ""
      },
      "driver_version": {
        "String": "580.65.06",
        "Unit": ""
      },
      "display_state": {
        "String": "0",
        "Unit": ""
      }
    },
    "Instances": [
      {
        "HealthDescription": "",
        "Healthy": true,
        "ID": "GPU-fb44165c-1a4f-a9dd-aa1b-30fd8c5658e3",
        "Locality": {
          "PciBusID": "00000000:00:10.0\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
        }
      }
    ],
    "Name": "NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition",
    "Type": "gpu",
    "Vendor": "nvidia"
  }
]

Create a job whose device block names just <vendor>/<type>, just <vendor>, or just <type>, with count = 2 (the broader name forms are sketched after the snippet below):

        device "nvidia/gpu" {
          count = 2
        }
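For reference, the vendor-only and type-only name forms mentioned above look like this (each used on its own, not both at once); in our tests they behaved the same as nvidia/gpu:

        device "nvidia" {
          count = 2
        }

or

        device "gpu" {
          count = 2
        }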

Expected Result

The job is evaluated, the scheduler sees that there are 2 GPUs on the node matching the <vendor>/<type>, <vendor>, or <type> name, reserves both GPUs, and creates the container with the correct NVIDIA_VISIBLE_DEVICES env variable.
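For the node above, I would expect something roughly like the following in the container environment (the exact format and ordering of the list is an assumption on my side):

NVIDIA_VISIBLE_DEVICES=GPU-45bc2781-22da-689e-59d5-f3778161164f,GPU-fb44165c-1a4f-a9dd-aa1b-30fd8c5658e3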

Actual Result

Evaluation + placement failure

I tried adding constraints pointing at one of the fingerprinted IDs, the model name, or even the resource attributes, but all of those also resulted in a placement failure. As soon as count was reduced from 2 to 1, the job evaluated without any problem onto one of the cards. I also checked whether some process was holding a card and making it inaccessible, but no, nothing. We run a recent version of the nomad-device-nvidia plugin, 1.1.0, and we also compiled the code from the master branch (version reported as 1.2.0), with pretty much the same result.
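For illustration, one of the constraint shapes I experimented with looked roughly like this (the attribute and value here are just an example of the pattern, not a verbatim copy of our job; I also tried matching ${device.ids} and ${device.model}):

        device "nvidia/gpu" {
          count = 2

          # example device constraint; both fingerprinted cards satisfy it,
          # yet placement still failed with count = 2
          constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "16 GiB"
          }
        }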

The question is: is this a legitimate limitation, with the documentation simply not stating that "multiple GPUs" means multiple GPUs of the same model, or is this a bug? If the question is "why put different GPU models in the same container": we work heavily with ollama. A couple of our servers are configured with a more and a less power-hungry card on the same node. Because different models come in different sizes, ollama can automatically pick whichever of the available cards is currently sufficient for the current model/task.
