Closed
Description
Following up on @plotnick's comment here, I think there may be some deeper issues with disk attaching / detaching.
Background
The following steps attempt to roughly map out the disk attach process:
- 1) instance_attach_disk - in app/instance.rs - is invoked.
- 2) instance_list_disks is invoked.
- 3) The number of attached disks is compared against the maximum permitted for the instance.
- 4) The disk state is checked, only permitting attaching if it is in a valid state.
- 5) The instance state is checked, only permitting attaching if it is in a valid state.
- 6a) If the instance is running...
- 6a-1) A request is sent through the Sled Agent to update the disk state. This attaches the disk to a running instance first...
- 6a-2) ... then updates the database.
- 6b) If the instance is not running...
- 6b-1) The database is updated with the new runtime.
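To make the ordering of the checks concrete, the steps above can be sketched as a small in-memory model. This is only an illustration, not the Nexus code: `Db`, `DiskState`, `InstanceState`, `AttachError`, `attach_disk`, and the limit of 8 disks are all hypothetical, and "valid state" is assumed to mean a detached disk and a non-destroyed instance.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration; not the real Nexus types.
const MAX_DISKS_PER_INSTANCE: usize = 8; // assumed limit

#[derive(Debug, Clone, PartialEq)]
enum DiskState {
    Detached,
    Attached(u32), // id of the instance the disk is attached to
    Destroyed,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState {
    Running,
    Stopped,
    Destroyed,
}

#[derive(Debug, PartialEq)]
enum AttachError {
    TooManyDisks,
    NoSuchDisk,
    BadDiskState,
    BadInstanceState,
}

struct Db {
    disks: HashMap<u32, DiskState>,
    instance_state: InstanceState,
}

fn attach_disk(db: &mut Db, instance_id: u32, disk_id: u32) -> Result<(), AttachError> {
    // Steps 2-3: list the attached disks and compare against the limit.
    let attached = db
        .disks
        .values()
        .filter(|s| **s == DiskState::Attached(instance_id))
        .count();
    if attached >= MAX_DISKS_PER_INSTANCE {
        return Err(AttachError::TooManyDisks);
    }

    // Step 4: the disk must be in a state that permits attaching
    // (assumed here to mean "detached").
    match db.disks.get(&disk_id) {
        Some(DiskState::Detached) => {}
        Some(_) => return Err(AttachError::BadDiskState),
        None => return Err(AttachError::NoSuchDisk),
    }

    // Step 5: the instance must be in a state that permits attaching.
    match db.instance_state {
        // Step 6a-1: the sled agent request would be made here.
        InstanceState::Running => {}
        // Step 6b: nothing to do before the database update.
        InstanceState::Stopped => {}
        InstanceState::Destroyed => return Err(AttachError::BadInstanceState),
    }

    // Steps 6a-2 / 6b-1: record the new disk state.
    db.disks.insert(disk_id, DiskState::Attached(instance_id));
    Ok(())
}
```

In this model the whole function runs under one `&mut Db` borrow, so nothing can interleave. In the flow described above each step is a separate query or request, so the state read in one step may no longer hold by the time a later step acts on it.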
Issues
- Between (2) and (3), other disks may be concurrently attached, bypassing the limit check. This is a time-of-check to time-of-use (TOCTTOU) race.
- Between (4) and (6a-1), the disk state may be modified before the sled agent request is made. This could result in the sled agent attaching a disk that has been deleted or attached to a different instance.
- Between (5) and (6), the instance state may be modified. The instance has a "state_generation" value for optimistic concurrency control, but it is not being checked / modified here.
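As a rough sketch of what checking the generation could look like, the model below only applies an update if the generation is still the one the caller observed, and bumps it on success. Everything here is hypothetical (`InstanceStore`, `InstanceRecord`, `RunState`, `StaleGeneration`, `update_if_unchanged`), and a `Mutex` stands in for the database. Against a real database the same idea would presumably be a conditional update that matches on the observed "state_generation".

```rust
use std::sync::Mutex;

// Hypothetical types for illustration; not the real Nexus model.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RunState {
    Running,
    Stopped,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct InstanceRecord {
    state: RunState,
    state_generation: u64,
}

#[derive(Debug, PartialEq)]
struct StaleGeneration {
    expected: u64,
    found: u64,
}

struct InstanceStore {
    record: Mutex<InstanceRecord>,
}

impl InstanceStore {
    fn new(state: RunState) -> Self {
        let record = InstanceRecord { state, state_generation: 0 };
        InstanceStore { record: Mutex::new(record) }
    }

    // Returns a snapshot of the current record (the "check" step).
    fn read(&self) -> InstanceRecord {
        *self.record.lock().unwrap()
    }

    // Applies the update only if nobody has changed the record since the
    // caller observed `observed_generation` (the "act" step).
    fn update_if_unchanged(
        &self,
        observed_generation: u64,
        new_state: RunState,
    ) -> Result<InstanceRecord, StaleGeneration> {
        let mut rec = self.record.lock().unwrap();
        if rec.state_generation != observed_generation {
            return Err(StaleGeneration {
                expected: observed_generation,
                found: rec.state_generation,
            });
        }
        rec.state = new_state;
        rec.state_generation += 1;
        Ok(*rec)
    }
}
```

With this shape, two callers that both read generation 0 cannot both succeed: the first update moves the record to generation 1, and the second gets `StaleGeneration` and has to re-read and re-run its checks.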