sled-agent: move instance configuration generation to Nexus #8002

gjcolombo · 2025-04-18T00:01:26Z

One of the determinations in RFD 505 is that Nexus should be the component that's in charge of determining how to configure a VM given a set of database records describing its instance (the Instance itself, its attached Disks and NetworkInterfaces, etc.). To summarize the rationale in the RFD, the hope is that this will promote two nice properties:

- Local reasoning about virtual platforms: All the logic that translates instance descriptions into VM specs now lives in a single module in Nexus. In past iterations of the code, Nexus transformed database records into an intermediate sled-agent type, and sled-agent would transform those into Propolis API types, which Propolis would then use to fill in virtual hardware details. Understanding where a VM's configuration came from required the reader to look at all these components; now all the relevant logic lives in Nexus.
- Serviceability: Putting type transformations and platform policies into sled-agent and Propolis makes them marginally more painful to update, since updating these components requires the system to migrate VMs and reboot sleds. Putting the virtual platform policy in Nexus will make it much less expensive to update in the future.

To achieve this:

- Move sled-agent's virtual platform logic (added in "ingest new Propolis VM creation API", #7211) into a new Nexus module. Sled-agent still needs to hold onto a bit of logic that inserts OPTE port names into instance specs before sending those specs to Propolis; this has to live in the agent because the agent is the component that selects those port names.
- Update the sled-agent instance registration API to take a Propolis instance spec as a parameter (and rework some other types to distinguish a bit more clearly between "Propolis VM configuration" and "sled-agent objects that need to be created to support this VM").
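
As a rough illustration of the resulting division of labor (not the code in this PR): Nexus builds the whole spec and leaves the OPTE port names blank, and sled-agent fills in only those names before forwarding the spec to Propolis. Everything below is a hypothetical stand-in for illustration, including `InstanceSpec`, `NicDevice`, `build_spec_in_nexus`, `fill_opte_port_names`, and the `net0`/`opte0` naming; the real types come from propolis-client and are considerably richer.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a network device entry in an instance spec.
#[derive(Debug, Clone, PartialEq)]
struct NicDevice {
    /// Name of the OPTE port backing this NIC. Nexus leaves this unset
    /// because only sled-agent knows which port it created.
    opte_port_name: Option<String>,
}

/// Hypothetical, heavily simplified instance spec.
#[derive(Debug, Clone, PartialEq)]
struct InstanceSpec {
    cpus: u8,
    memory_mib: u64,
    /// NICs keyed by component name.
    nics: BTreeMap<String, NicDevice>,
}

/// Nexus side: turn (simplified) database records into a complete spec,
/// except for the OPTE port names.
fn build_spec_in_nexus(cpus: u8, memory_mib: u64, nic_names: &[&str]) -> InstanceSpec {
    let nics = nic_names
        .iter()
        .map(|name| (name.to_string(), NicDevice { opte_port_name: None }))
        .collect();
    InstanceSpec { cpus, memory_mib, nics }
}

/// Sled-agent side: the only spec edit left in the agent. `ports` maps a
/// NIC's component name to the OPTE port the agent created for it.
fn fill_opte_port_names(
    spec: &mut InstanceSpec,
    ports: &BTreeMap<String, String>,
) -> Result<(), String> {
    for (name, nic) in spec.nics.iter_mut() {
        let port = ports
            .get(name)
            .ok_or_else(|| format!("no OPTE port for NIC {name}"))?;
        nic.opte_port_name = Some(port.clone());
    }
    Ok(())
}
```

In this sketch the agent never decides anything about the virtual platform itself (CPU count, memory, which devices exist); it only supplies names for objects it owns.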

The main pain point in this change is that sled-agent's API now includes types that it picked up from the propolis-client API, which caused sled-agent's OpenAPI document to balloon with "duplicate" schema descriptions it inherited from propolis-client's generated types. I'm not sure if there's a great way around this (aside from changing the generated Propolis client to replace all its generated types with their "native" counterparts); I'm open to suggestions here.

Tested by booting a VM in a dev cluster, booting a comparable VM on rack2, and comparing their instance specs (as returned by Propolis's /instance/spec API) to make sure they specified the same components with the same configuration.
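
For reference, the comparison amounts to a component-by-component diff of the two specs. The sketch below is only an illustration of that idea and is not tooling from this PR: it assumes each spec has already been reduced to a map from component name to some canonical string form of that component's configuration, and `Difference` and `diff_components` are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One way in which two specs can disagree about a named component.
#[derive(Debug, PartialEq)]
enum Difference {
    /// Component exists only in the first spec.
    OnlyInFirst(String),
    /// Component exists only in the second spec.
    OnlyInSecond(String),
    /// Component exists in both specs but with different configuration.
    ConfigMismatch(String),
}

/// Compare two specs, each flattened to `component name -> configuration`.
/// Differences are returned in sorted order of component name.
fn diff_components(
    first: &BTreeMap<String, String>,
    second: &BTreeMap<String, String>,
) -> Vec<Difference> {
    // Union of the component names from both specs, sorted and deduplicated.
    let names: BTreeSet<&String> = first.keys().chain(second.keys()).collect();
    names
        .into_iter()
        .filter_map(|name| match (first.get(name), second.get(name)) {
            (Some(a), Some(b)) if a == b => None,
            (Some(_), Some(_)) => Some(Difference::ConfigMismatch(name.clone())),
            (Some(_), None) => Some(Difference::OnlyInFirst(name.clone())),
            (None, Some(_)) => Some(Difference::OnlyInSecond(name.clone())),
            // Cannot happen: every name came from one of the two maps.
            (None, None) => None,
        })
        .collect()
}
```

An empty result corresponds to the outcome described above: both VMs specify the same components with the same configuration.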

gjcolombo

I also want to test manually that Propolis-directed region replacements still work as intended with this change (they depend on the virtual platform module having used the relevant disk record's ID as the relevant Propolis backend ID).

sled-agent/src/instance_manager.rs

gjcolombo · 2025-04-24T19:47:48Z

This will need a fresh commit hash/SHA from the Propolis repo after oxidecomputer/propolis#899 merges, but I think it is otherwise more or less ready for review (though it could probably use some unit tests of the new virtual platform logic...).

hawkw · 2025-04-25T20:37:11Z