Open
Description
Problem
There are many steps in the RSS process, but after the underlay is set up, sled agents are created, and Nexus is booted, the flow is basically the following:
1. Nexus boots, tries to access CRDB, and eventually starts an "internal API" server
2. Sled Agents asynchronously send information about: Sleds, Disks & Zpools
3. RSS tries to do "handoff to Nexus", where it transfers all knowledge from setup + control over service initialization
4. Nexus, after receiving this request, confirms that handoff is complete, and starts the "external API" server. The rack is initialized, Nexus controls the world.
This is, at least, the theory and intent behind RSS. In reality, the following often happens:
- Step (3) only succeeds if Nexus can correlate all services to known sleds, and all datasets to zpools + physical disks...
- ... so this can fail if step (2) has not completed! This causes step (3) to return 500s, as the referenced objects are not known to Nexus
- ... however, we can't (and shouldn't) block on step (2) completing because this information may change through the lifetime of the sled.
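To make the race concrete, here is a minimal sketch of the kind of correlation check step (3) depends on. The types `KnownInventory` and `HandoffRequest` and the function `missing_references` are hypothetical names made up for illustration; they are not Nexus's actual types or API.

```rust
use std::collections::HashSet;

/// What Nexus has learned so far from sled agents in step (2).
/// Hypothetical shape, for illustration only.
#[derive(Default)]
pub struct KnownInventory {
    pub sleds: HashSet<String>,
    pub zpools: HashSet<String>,
}

/// The parts of a handoff request that reference inventory.
/// Hypothetical shape, for illustration only.
pub struct HandoffRequest {
    /// (service name, sled id)
    pub services: Vec<(String, String)>,
    /// (dataset name, zpool id)
    pub datasets: Vec<(String, String)>,
}

/// Returns one message per reference that cannot be resolved yet.
/// An empty result means every service and dataset maps to known inventory.
pub fn missing_references(req: &HandoffRequest, known: &KnownInventory) -> Vec<String> {
    let mut missing = Vec::new();
    for (service, sled) in &req.services {
        if !known.sleds.contains(sled) {
            missing.push(format!("service {service} references unknown sled {sled}"));
        }
    }
    for (dataset, zpool) in &req.datasets {
        if !known.zpools.contains(zpool) {
            missing.push(format!("dataset {dataset} references unknown zpool {zpool}"));
        }
    }
    missing
}
```

If handoff arrives before the sled agent has reported its zpools, a check like this returns a non-empty list even though nothing is actually broken. The same request would pass once step (2) catches up.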
Why is this bad
- It's actually kinda slow, because the RSS handoff fails more than it needs to
- It's also a bit confusing -- 500 errors imply something has gone really wrong, but that's not the case here
Proposal
- First off, for any of these "missing in DB" errors, we should avoid returning 500s, and perhaps prefer sending 404s. These are errors that we can overcome by simply waiting for the sled agent to finish populating data.
- Secondly, we may simplify this process and reduce latency by consolidating some of these requests. Namely, rather than sending a distinct request for each "add sled + add disk + add zpool" call, we might be better off sending a single "here is the sled, with a summary of all existing hardware" request. Doing so would make it easier to add a generation number for identifying stale requests, and would simplify the error handling (as the sled will either exist and be known to Nexus, or it won't).
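Both proposals can be sketched together. Everything named here (`SledInventory`, `PutOutcome`, `LookupError`, `InventoryStore`) is hypothetical, and an in-memory map stands in for the database. The intent is only to show one consolidated report per sled, a generation check that drops stale reports, and a "not yet known" error that maps to 404 rather than 500.

```rust
use std::collections::HashMap;

/// Hypothetical consolidated report: one request describing the sled
/// and all of its hardware, instead of separate sled/disk/zpool calls.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SledInventory {
    pub sled_id: String,
    /// Incremented by the sled agent whenever its hardware view changes.
    pub generation: u64,
    pub disks: Vec<String>,
    pub zpools: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PutOutcome {
    Applied,
    /// The report was not newer than what is already stored.
    Stale { current: u64 },
}

#[derive(Debug, PartialEq, Eq)]
pub enum LookupError {
    /// The sled has not reported yet; waiting is expected to resolve this.
    NotYetKnown(String),
    /// Something actually went wrong (e.g. the database is unreachable).
    Internal(String),
}

impl LookupError {
    /// Assumed mapping: missing inventory is a 404, not a 500.
    pub fn status_code(&self) -> u16 {
        match self {
            LookupError::NotYetKnown(_) => 404,
            LookupError::Internal(_) => 500,
        }
    }
}

/// In-memory stand-in for the database (an assumption of this sketch).
#[derive(Default)]
pub struct InventoryStore {
    sleds: HashMap<String, SledInventory>,
}

impl InventoryStore {
    /// Stores the report unless an equal or newer generation is present.
    pub fn put(&mut self, report: SledInventory) -> PutOutcome {
        if let Some(existing) = self.sleds.get(&report.sled_id) {
            if report.generation <= existing.generation {
                return PutOutcome::Stale { current: existing.generation };
            }
        }
        self.sleds.insert(report.sled_id.clone(), report);
        PutOutcome::Applied
    }

    /// The sled is either fully known (with its zpools) or not known at all.
    pub fn zpools_for(&self, sled_id: &str) -> Result<&[String], LookupError> {
        self.sleds
            .get(sled_id)
            .map(|sled| sled.zpools.as_slice())
            .ok_or_else(|| LookupError::NotYetKnown(sled_id.to_string()))
    }
}
```

With this shape, a caller such as RSS could treat a 404 as "wait and retry" and reserve 500 for real failures. Whether a report with an equal generation should count as stale or as an idempotent success is a design choice left open here.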