
Conversation


@kyounghunJang kyounghunJang commented Jul 30, 2025

Description

This PR adds support for disk I/O throttling using cgroups in Nomad task drivers.
The implementation allows users to define per-device I/O bandwidth limits through job specifications.
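
To illustrate the intent, a hypothetical job spec fragment might look like the following (block placement and attribute names are assumptions based on the accompanying docs, not a final schema):

```hcl
task "app" {
  resources {
    # Hypothetical attributes; the exact schema is what this PR proposes.
    disk_throttle {
      device     = "/dev/sda" # block device to throttle
      read_bps   = 104857600  # limit reads to 100 MB/s
      write_iops = 1000       # limit writes to 1000 operations/s
    }
  }
}
```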

Testing & Reproduction steps

Links

#26295

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@kyounghunJang kyounghunJang requested review from a team as code owners July 30, 2025 07:02

hashicorp-cla-app bot commented Jul 30, 2025

CLA assistant check
All committers have signed the CLA.


@aimeeu aimeeu left a comment


Hi @kyounghunJang - I'm the tech writer who works with the Nomad team. Thank you very much for including excellent documentation with your feature code. I left some suggestions so that the new content follows our documentation style guide. Please feel free to tag me with any documentation questions.

@@ -0,0 +1,113 @@
---
layout: docs
page_title: disk_throttles block in the job specification

Suggested change
page_title: disk_throttles block in the job specification
page_title: disk_throttle block in the job specification

changed to match the block name

layout: docs
page_title: disk_throttles block in the job specification
description: |-
Configure disk I/O throttling limits in the `disk_throttles` block of the Nomad job specification.

Suggested change
Configure disk I/O throttling limits in the `disk_throttles` block of the Nomad job specification.
Configure disk I/O throttling limits in the `disk_throttle` block of the Nomad job specification.


<Placement groups={['job', 'group', 'task', 'resources', 'disk_throttle']} />

The disk_throttle block is used to set limits on the disk I/O (Input/Output) a task can perform on a specific block device.

Suggested change
The disk_throttle block is used to set limits on the disk I/O (Input/Output) a task can perform on a specific block device.
Use the `disk_throttle` block to set limits on the disk I/O (Input/Output) a task can perform on a specific block device.

<Placement groups={['job', 'group', 'task', 'resources', 'disk_throttle']} />

The disk_throttle block is used to set limits on the disk I/O (Input/Output) a task can perform on a specific block device.
This block helps mitigate the "noisy neighbor" problem, where a single task consuming excessive disk bandwidth can negatively impact other tasks running on the same node.

Suggested change
This block helps mitigate the "noisy neighbor" problem, where a single task consuming excessive disk bandwidth can negatively impact other tasks running on the same node.
This block helps mitigate the noisy neighbor problem, where a single task consuming excessive disk bandwidth can negatively impact other tasks running on the same node.

Nit: remove quotes around noisy neighbor

The disk_throttle block is used to set limits on the disk I/O (Input/Output) a task can perform on a specific block device.
This block helps mitigate the "noisy neighbor" problem, where a single task consuming excessive disk bandwidth can negatively impact other tasks running on the same node.

When a disk_throttle block is added, Nomad will limit the task's I/O throughput in bytes per second (BPS) or I/O operations per second (IOPS).

Suggested change
When a disk_throttle block is added, Nomad will limit the task's I/O throughput in bytes per second (BPS) or I/O operations per second (IOPS).
When you add a `disk_throttle` block, Nomad limits the task's I/O throughput in bytes per second (BPS) or I/O operations per second (IOPS).

}
```

### Throttling Multiple Devices

Suggested change
### Throttling Multiple Devices
### Throttling multiple devices


```hcl
resources {
disk_throttles {

Suggested change
disk_throttles {
disk_throttle {

This block should be disk_throttle, correct?
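
The quoted example above is truncated in this view. A plausible completed sketch, with one block per device (attribute names are illustrative assumptions), might be:

```hcl
resources {
  # One disk_throttle block per device (hypothetical attribute names).
  disk_throttle {
    device    = "/dev/sda"
    read_bps  = 52428800 # 50 MB/s
    write_bps = 52428800
  }
  disk_throttle {
    device     = "/dev/sdb"
    read_iops  = 500
    write_iops = 500
  }
}
```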

@@ -109,6 +113,22 @@ resources {
}
}
```

### DiskThrottles

Suggested change
### DiskThrottles
### Disk throttle

@@ -167,3 +187,4 @@ resource utilization and considering the following suggestions:
[numa]: /nomad/docs/job-specification/numa 'Nomad NUMA Job Specification'
[`secrets/`]: /nomad/docs/reference/runtime-environment-settings#secrets
[concepts-cpu]: /nomad/docs/architecture/cpu
[disk_throttle]: /nomad/docs/job-specification/disk_throttle 'Nomad Disk_Throttle Job Specification'

Suggested change
[disk_throttle]: /nomad/docs/job-specification/disk_throttle 'Nomad Disk_Throttle Job Specification'
[disk_throttle]: /nomad/docs/job-specification/disk_throttle

You don't need to add a page title after the link. That pattern is from a much earlier version of the docs.

Remember that the `disk_throttle` block is only valid in the placements listed above.

### Limiting Bandwidth (BPS)
This example limits the read and write bandwidth of a specific device to 50 MB/s.
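
The example itself is not shown in this view; based on the description, a sketch of a 50 MB/s read/write limit (attribute names are assumptions) could look like:

```hcl
resources {
  disk_throttle {
    device    = "/dev/sdb"
    read_bps  = 52428800 # 50 MB/s = 50 * 1024 * 1024 bytes/s
    write_bps = 52428800
  }
}
```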

Thank you for adding an explanation to each of the examples!


aimeeu commented Jul 30, 2025

@aimeeu aimeeu added the theme/docs Documentation issues and enhancements label Jul 30, 2025

@tgross tgross left a comment


Hi @kyounghunJang and thanks for the PR! But I don't think we can accept this as-is.

As you can see from the resource structs you've edited, Nomad has legacy fields for managing disk IOPs and size, but they aren't currently in use. Disk IOPs tracking was never fully supported, and the field was deprecated in Nomad 0.9.0 (ref #4970). Disk capacity (DiskMB) was removed all the way back in Nomad 0.5.0 (ref #1679) and moved to the ephemeral_disk field. So there's a long history of not-quite-complete/working disk related resource tracking. 😁

The core challenge with anything to do with disk resources is that they're platform dependent. The cgroups-based approach you have here for throttling doesn't work on non-Linux OS. They're also dependent on the task driver. Setting cgroups won't work for the qemu driver or the libvirt driver. Or if in the future someone wanted to add disk space constraints, it's infeasibly expensive to do that except on filesystems that support quotas (ex. ZFS). We'll likely need to have a discussion about how to handle host volumes, dynamic volumes, and the alloc directory, all of which are bind-mounted (but only on supported task drivers!).

Also, the resource block generally represents schedulable resources. That is, resources that the scheduler should be comparing against the available resources on the node. Your implementation sets cgroups flags when the workload is placed, but there's no way for the scheduler to tell whether it's "using up" too much IOPs for a given node. And the per-device throttling option you've got only works if all the hosts have the same set of major:minor numbers on disks. We could start fingerprinting the IOPs per device and then accounting for that resource, but then that's only useful if every alloc is assigned a disk IO slice as well.

All of which is to say that this feature has a lot of complexity, and I'm not sure it's a good idea to try to work out the design incrementally in a pull request, rather than having a design discussion in the original #26295 issue. For a feature of this complexity, we'd typically have an internal Request for Comments (RFC) document and product management involvement as well.

We'd love to have your enthusiasm as part of that discussion in #26295 though!
