Disk Attach has some race conditions #1073

Following up on @plotnick's comment [here](https://github.com/oxidecomputer/omicron/pull/1068/files#r873063471), I think there may be some deeper issues with disk attaching / detaching.

# Background

The following steps attempt to roughly map out the disk attach process:

- 1\) [instance_attach_disk](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/instance.rs#L535) - in `app/instance.rs` - is invoked.
- 2\) [instance_list_disks](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/instance.rs#L558-L572) is invoked to fetch the disks currently attached to the instance.
- 3\) The number of attached disks is [compared against the maximum permitted for the instance](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/instance.rs#L574-L579).
- 4\) The [disk state is checked](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/instance.rs#L612-L648), only permitting attaching if it is in a valid state.
- 5\) The [instance state is checked](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/instance.rs#L650-L678), only permitting attaching if it is in a valid state.
- 6a) **If the instance is running...**
  - 6a-1) A request is sent through the [Sled Agent](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/instance.rs#L657-L666) to update the disk state. This [attaches the disk to a running instance](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/disk.rs#L180-L192) first...
  - 6a-2) ... then [updates the database](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/disk.rs#L196-L200).
- 6b) **If the instance is not running...**
  - 6b-1) The [database is updated](https://github.com/oxidecomputer/omicron/blob/a179f03bfe17ec890478c6f482a19b1675217194/nexus/src/app/instance.rs#L674-L677) with the new runtime.
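
To make the shape of this flow concrete, here is a minimal in-memory model of steps (2), (3), and (6). It is only a sketch: `Db`, `list_disks`, `under_limit`, `record_attach`, and the limit of 2 are hypothetical stand-ins, not the real Nexus datastore API or the real per-instance maximum.

```rust
use std::collections::HashMap;

// Illustrative limit only; the real maximum lives in Nexus.
const MAX_DISKS_PER_INSTANCE: usize = 2;

/// Hypothetical stand-in for the database.
#[derive(Default)]
struct Db {
    /// Maps a disk name to the instance it is attached to.
    attached_to: HashMap<String, String>,
}

impl Db {
    /// Step (2): list the disks currently attached to `instance`.
    fn list_disks(&self, instance: &str) -> Vec<String> {
        let mut disks: Vec<String> = self
            .attached_to
            .iter()
            .filter(|(_, inst)| inst.as_str() == instance)
            .map(|(disk, _)| disk.clone())
            .collect();
        disks.sort();
        disks
    }

    /// Step (3): compare the count against the limit.
    /// This is a read of a snapshot; it reserves nothing.
    fn under_limit(&self, instance: &str) -> bool {
        self.list_disks(instance).len() < MAX_DISKS_PER_INSTANCE
    }

    /// Step (6): record the attachment, as a separate write.
    fn record_attach(&mut self, disk: &str, instance: &str) {
        self.attached_to
            .insert(disk.to_string(), instance.to_string());
    }
}
```

In this model the check (`under_limit`) and the write (`record_attach`) are separate calls, mirroring how the steps above are separate database operations.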

# Issues

- Between (2) and (3), other disks may be attached concurrently, so the limit check can be bypassed. This is a time-of-check-to-time-of-use (TOCTTOU) race.
- Between (4) and (6a-1), the disk state may change before the sled agent request is made. This could result in the sled agent attaching a disk that has since been deleted or attached to a different instance.
- Between (5) and (6), the instance state may be modified. The instance has a `state_generation` value for [optimistic concurrency control](https://rfd.shared.oxide.computer/rfd/0192#_transactions_ctes_sagas_generation_numbers), but it is neither checked nor updated here.
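
One possible direction, sketched under heavy assumptions: fold the checks and the write into a single conditional update that only succeeds if the generation number still matches what the caller read. The types and the `attach_if_unchanged` function below are hypothetical, the limit of 2 is illustrative, and using one generation counter to cover the attached-disk list is an assumption of this model rather than how `state_generation` is currently used. In a real database this would correspond to a single statement or transaction, not in-memory code.

```rust
#[derive(Debug, PartialEq)]
enum AttachError {
    /// The record changed since the caller read it; re-read and retry.
    StaleGeneration { expected: u64, found: u64 },
    /// The instance already has the maximum number of disks.
    TooManyDisks,
}

/// Hypothetical instance record with a generation counter.
#[derive(Debug, Default)]
struct InstanceRecord {
    state_generation: u64,
    disks: Vec<String>,
}

// Illustrative limit only.
const MAX_DISKS: usize = 2;

impl InstanceRecord {
    /// Checks the generation and the disk limit, then attaches and bumps
    /// the generation, all in one step. `&mut self` gives exclusive
    /// access, standing in for the atomicity a database would provide.
    fn attach_if_unchanged(
        &mut self,
        expected_gen: u64,
        disk: &str,
    ) -> Result<u64, AttachError> {
        if self.state_generation != expected_gen {
            return Err(AttachError::StaleGeneration {
                expected: expected_gen,
                found: self.state_generation,
            });
        }
        if self.disks.len() >= MAX_DISKS {
            return Err(AttachError::TooManyDisks);
        }
        self.disks.push(disk.to_string());
        self.state_generation += 1;
        Ok(self.state_generation)
    }
}
```

With this shape, two callers that both read generation 0 cannot both succeed: the first bumps the generation to 1, and the second gets `StaleGeneration` and has to re-read before retrying, at which point the limit check sees the up-to-date disk list.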



