Description
A CI test run for #7211 biffed it in integration_tests::disks::test_disk_move_between_instances, specifically while trying to create and stop an instance (to which it was going to try to attach the test disk). Logs from the failed run are here: https://buildomat.eng.oxide.computer/wg/0/artefact/01JFDZKBR1TVFTDN2CQC9RQRJX/LqJImvHS2nqTNJ8xklaVdZRzeVBxPnvzKkpfU6G10PeGtRSO/01JFE0RWTP7DP7FRV6S6M8CNVT/01JFE41AJ8FS8Y7XDQQ7SEXAEC/test_all-e39a2a7c7389546b-test_disk_move_between_instances.109403.0.log?format=x-bunyan
I think this is a bug in the simulated sled agent. First, we see the new (stopping) instance's VMM go to the Destroyed state:
[10409] 2024-12-18T23:42:01.825Z INFO test_disk_move_between_instances (dropshot_internal): received new VMM runtime state from sled agent
actor_id = 001de000-05e4-4000-8000-000000000002
authenticated = true
local_addr = 127.0.0.1:39975
method = PUT
migration_state = Migrations { migration_in: None, migration_out: None }
propolis_id = 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
remote_addr = 127.0.0.1:37810
req_id = 031a3f84-0295-4c40-9c05-ad193f0293b2
uri = /vmms/1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
vmm_state = VmmRuntimeState { state: Destroyed, gen: Generation(7), time_updated: 2024-12-18T23:42:01.723653214Z }
This HTTP request completes here:
[10584] 2024-12-18T23:42:02.596Z INFO test_disk_move_between_instances (dropshot_internal): request completed
latency_us = 774448
local_addr = 127.0.0.1:39975
method = PUT
remote_addr = 127.0.0.1:37810
req_id = 031a3f84-0295-4c40-9c05-ad193f0293b2
response_code = 204
uri = /vmms/1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
Between these two points, the instance watcher task swoops in and asks the simulated sled agent about the VMM. The agent replies that the VMM is gone; the watcher interprets this as a VMM that has gone missing and should be moved to the Failed state:
[10417] 2024-12-18T23:42:01.828Z INFO test_disk_move_between_instances (dropshot): request completed
error_message_external = Not Found
error_message_internal = VMM with ID 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15 not found
latency_us = 1441
local_addr = [::1]:38575
method = GET
remote_addr = [::1]:38190
req_id = 1a1447d9-9e89-41df-8845-a41fd655fd75
response_code = 404
sled_id = b6d65341-167c-41df-9b5c-41cded99c229
uri = /vmms/1d0ff3f1-e9c1-4443-bd36-efcc4200ab15/state
[10419] 2024-12-18T23:42:01.829Z INFO test_disk_move_between_instances (BackgroundTasks): sled-agent error indicates that this instance's VMM has failed!
background_task = instance_watcher
error = Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "1a1447d9-9e89-41df-8845-a41fd655fd75", "content-length": "120", "date": "Wed, 18 Dec 2024 23:42:01 GMT"}; value: Error { error_code: Some("NO_SUCH_INSTANCE"), message: "Not Found", request_id: "1a1447d9-9e89-41df-8845-a41fd655fd75" }
instance_id = 40470f10-206e-4a26-be87-96e862729da8
sled_id = b6d65341-167c-41df-9b5c-41cded99c229
vmm_id = 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
[10421] 2024-12-18T23:42:01.829Z INFO test_disk_move_between_instances (BackgroundTasks): received new VMM runtime state from sled agent
background_task = instance_watcher
instance_id = 40470f10-206e-4a26-be87-96e862729da8
migration_state = Migrations { migration_in: None, migration_out: None }
propolis_id = 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
sled_id = b6d65341-167c-41df-9b5c-41cded99c229
vmm_id = 1d0ff3f1-e9c1-4443-bd36-efcc4200ab15
vmm_state = VmmRuntimeState { state: Failed, gen: Generation(7), time_updated: 2024-12-18T23:42:01.829357055Z }
Later we see an instance updater saga dutifully move the instance to Failed:
[10833] 2024-12-18T23:42:03.157Z INFO test_disk_move_between_instances (SagaExecutor): instance update: committed update new runtime state!
did_unlock = true
instance_id = 40470f10-206e-4a26-be87-96e862729da8
new_runtime = InstanceRuntimeState { time_updated: 2024-12-18T23:42:02.806634751Z, gen: Generation(Generation(4)), propolis_id: None, dst_propolis_id: None, migration_id: None, nexus_state: Failed, time_last_auto_restarted: None }
saga_id = 6cf498e3-bf25-40a8-bc9a-51a26a00c4ae
saga_name = instance-update
Then the auto-restart task sends a new start request for the instance. Since the instance never actually comes to rest in the Stopped state, the part of the test that waits for the instance to stop times out and the test fails.
RFD 486 section 6.1 describes how sled agents are supposed to avoid this problem:
Before it can talk to a sled about a VMM, Nexus needs to read from CRDB to find the VMM and sled of interest. Nexus and sled agent must therefore be careful to avoid the following kind of time-of-check-to-time-of-use problem:
1. A user-initiated Nexus operation (e.g. instance stop) reads that the target instance is running in a VMM on some sled
2. Propolis reaches the Destroyed state and notifies sled agent
3. Sled agent observes the Destroyed state and removes the VMM from its tracking table
4. The Nexus task from step 1 calls sled agent’s "ensure instance state" API
5. Sled agent returns "instance not present"
6. Nexus concludes that the VMM has failed when it stopped normally
This can be fixed by having sled agent remove instances from its tracking table only after publishing the final Destroyed state transition to Nexus. This guarantees that if a caller observed a not-Destroyed VMM, and then calls sled agent after the VMM became Destroyed, then the caller’s prior observation has an out-of-date generation number:
1. Nexus reads that a VMM with generation 20 is running on some sled
2. Propolis reaches the Destroyed state and notifies sled agent
3. Sled agent sends updated VMM state to Nexus
4. Nexus advances the VMM record to generation 21 and records that it’s Destroyed
5. Sled agent removes the VMM from its table
6. The Nexus task from step 1 gets "instance not present" back from the sled agent
7. The same Nexus task tries to write that the VMM has Failed at generation 21; this write fails because the VMM was already advanced to that generation
If the rules had been followed here, the "state = destroyed, gen = 7" update from the sled agent would have been committed to CRDB before the agent began to return that the VMM was gone. That would have caused the watcher task's "state = failed, gen = 7" update not to be written, and the next updater task would then have moved the instance to Stopped.
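To make the generation rule concrete, here's a minimal sketch of that check; the types and function below are illustrative stand-ins, not the actual Nexus/CRDB code (the real thing is a conditional write against the vmm table), but they show why a committed "Destroyed, gen 7" record shadows a later "Failed, gen 7" write:

```rust
// Illustrative only: a stand-in for the generation-guarded state write.
#[derive(Clone, Debug, PartialEq)]
enum VmmState {
    Destroyed,
    Failed,
}

#[derive(Clone, Debug)]
struct VmmRuntimeState {
    state: VmmState,
    generation: u64,
}

/// Apply `incoming` only if it advances the generation number.
fn try_apply(current: &mut VmmRuntimeState, incoming: VmmRuntimeState) -> bool {
    if incoming.generation > current.generation {
        *current = incoming;
        true
    } else {
        false
    }
}

fn main() {
    // Sled agent's final update ("Destroyed, gen 7") is committed first...
    let mut record = VmmRuntimeState { state: VmmState::Destroyed, generation: 7 };

    // ...so the watcher's synthesized "Failed, gen 7" doesn't advance the
    // generation and is dropped instead of clobbering the terminal state.
    let rejected = !try_apply(
        &mut record,
        VmmRuntimeState { state: VmmState::Failed, generation: 7 },
    );
    assert!(rejected);
    assert_eq!(record.state, VmmState::Destroyed);
}
```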
The problem is that the simulated sled agent doesn't follow these rules. Instead, when it updates an object's state, it takes its collection lock, removes the object from the collection, and then decides whether the object is ready to be destroyed. If it is, it drops the lock and then sends a state-change notification:
omicron/sled-agent/src/sim/collection.rs, lines 238 to 274 (at commit d337333)
Not holding the collection lock while calling another service is right and just, but if sim-sled-agent is going to do this for VMMs, it needs to defer removing them from the collection until after the relevant notification has been sent.
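For illustration, here's a minimal sketch of an ordering that would satisfy the RFD, with made-up names (SimVmmState, publish_to_nexus, retire_vmm) standing in for the real sim collection types and the PUT /vmms/{propolis_id} notification: snapshot the final state, publish it with no lock held, and only then remove the object from the collection.

```rust
use std::collections::HashMap;

use tokio::sync::Mutex;
use uuid::Uuid;

// Illustrative stand-in for whatever the sim collection stores per VMM.
#[derive(Clone, Debug)]
struct SimVmmState {
    destroyed: bool,
}

// Stand-in for the PUT /vmms/{propolis_id} call to the Nexus internal API.
async fn publish_to_nexus(_vmm_id: Uuid, _state: &SimVmmState) {}

/// Retire a destroyed VMM: notify Nexus first, then forget the object.
async fn retire_vmm(objects: &Mutex<HashMap<Uuid, SimVmmState>>, vmm_id: Uuid) {
    // Snapshot the final state without removing the entry, and drop the lock
    // before making any outbound calls (as the existing code already does).
    let final_state = {
        let guard = objects.lock().await;
        guard.get(&vmm_id).cloned()
    };
    let Some(state) = final_state else { return };
    assert!(state.destroyed);

    // Publish the terminal state while the VMM is still in the collection, so
    // a concurrent GET sees "Destroyed" rather than 404.
    publish_to_nexus(vmm_id, &state).await;

    // Only now remove the VMM: any 404 this sled agent returns afterwards is
    // observed strictly after the Destroyed notification completed.
    objects.lock().await.remove(&vmm_id);
}

#[tokio::main]
async fn main() {
    let objects = Mutex::new(HashMap::new());
    let vmm_id = Uuid::new_v4();
    objects.lock().await.insert(vmm_id, SimVmmState { destroyed: true });

    retire_vmm(&objects, vmm_id).await;
    assert!(objects.lock().await.get(&vmm_id).is_none());
}
```

The only point being made is the ordering: removal happens strictly after the notification call returns, so a "VMM not found" answer from the simulated sled agent implies Nexus has already been told the VMM is Destroyed at its final generation.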