Reduce spurious 500s during RSS #3899

Open
@smklein

Description

Problem

There are many steps in the RSS process, but after the underlay is set up, sled agents are created, and Nexus is booted, the flow is basically the following:

  1. Nexus boots, tries to access CRDB, and eventually starts an "internal API" server
  2. Sled Agents asynchronously send information about: Sleds, Disks & Zpools
  3. RSS tries to do "handoff to Nexus", where it transfers all knowledge from setup + control over service initialization
  4. Nexus, after receiving this request, confirms that handoff is complete, and starts the "external API" server. The rack is initialized, and Nexus controls the world.
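
To make the intended ordering concrete, here is a minimal sketch of the happy path as a two-phase state machine. Everything in it (`Nexus`, `NexusPhase`, `notify_sled`, `handoff`, the integer sled IDs) is a hypothetical name chosen for illustration, not the actual omicron API, and it deliberately skips the validation that handoff performs.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of the Nexus startup phases during RSS.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NexusPhase {
    /// Step 1: the internal API is serving; the external API has not started.
    InternalOnly,
    /// Step 4: handoff confirmed; the external API is serving.
    RackInitialized,
}

/// Illustrative stand-in for Nexus state (not the real type).
pub struct Nexus {
    phase: NexusPhase,
    known_sleds: BTreeSet<u32>,
}

impl Nexus {
    /// Step 1: boot with only the internal API available.
    pub fn boot() -> Self {
        Nexus { phase: NexusPhase::InternalOnly, known_sleds: BTreeSet::new() }
    }

    /// Step 2: a sled agent reports itself. This is accepted in any phase,
    /// because the information may change throughout the sled's lifetime.
    pub fn notify_sled(&mut self, sled_id: u32) {
        self.known_sleds.insert(sled_id);
    }

    /// Steps 3 and 4: RSS hands off, and Nexus starts the external API.
    /// Happy path only: no reference validation is modeled here.
    pub fn handoff(&mut self) {
        self.phase = NexusPhase::RackInitialized;
    }

    pub fn external_api_running(&self) -> bool {
        self.phase == NexusPhase::RackInitialized
    }

    pub fn knows_sled(&self, sled_id: u32) -> bool {
        self.known_sleds.contains(&sled_id)
    }
}
```

Note that `notify_sled` has no ordering relationship with `handoff` in this model, which is exactly where the trouble described next comes from.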

This is, at least, the theory and intent behind RSS. In reality, the following often happens:

  • Step (3) only succeeds if Nexus can correlate all services to known sleds, and all datasets to zpools + physical disks...
  • ... so this can fail if step (2) has not completed! This causes step (3) to return 500s, as the referenced objects are not known to Nexus
  • ... however, we can't (and shouldn't) block on step (2) completing because this information may change through the lifetime of the sled.
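
The race can be sketched as a check of the handoff request against whatever inventory Nexus has seen so far. This is an assumption-laden illustration: `Inventory`, `HandoffRequest`, `MissingRef`, and `missing_references` are hypothetical names, IDs are plain integers rather than UUIDs, and physical disks are left out for brevity.

```rust
use std::collections::BTreeSet;

/// What sled agents have reported so far (hypothetical shape).
#[derive(Default)]
pub struct Inventory {
    pub sleds: BTreeSet<u32>,
    pub zpools: BTreeSet<u32>,
}

/// What RSS sends at handoff (hypothetical shape).
pub struct HandoffRequest {
    /// (service name, sled ID the service runs on)
    pub services: Vec<(String, u32)>,
    /// (dataset name, zpool ID the dataset lives in)
    pub datasets: Vec<(String, u32)>,
}

/// An object referenced by the handoff that Nexus does not know about yet.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MissingRef {
    Sled(u32),
    Zpool(u32),
}

/// Returns every sled or zpool the request refers to that has not been
/// reported yet. An empty set means the handoff can be correlated.
pub fn missing_references(inv: &Inventory, req: &HandoffRequest) -> BTreeSet<MissingRef> {
    let mut missing = BTreeSet::new();
    for (_name, sled_id) in &req.services {
        if !inv.sleds.contains(sled_id) {
            missing.insert(MissingRef::Sled(*sled_id));
        }
    }
    for (_name, zpool_id) in &req.datasets {
        if !inv.zpools.contains(zpool_id) {
            missing.insert(MissingRef::Zpool(*zpool_id));
        }
    }
    missing
}
```

If handoff arrives before the sled agents have reported, the set is non-empty and the request fails; retrying the identical request after step (2) catches up succeeds, with nothing actually being broken in between.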

Why is this bad

  • It's actually kinda slow, because the RSS handoff fails more than it needs to
  • It's also a bit confusing -- 500 errors imply something has gone really wrong, but that's not the case here

Proposal

  • First off, for any of these "missing in DB" errors, we should avoid returning 500s, and perhaps prefer sending 404s. These are errors that we can overcome by simply waiting for the sled agent to finish populating data.
  • Secondly: We may simplify this process and reduce latency by consolidating some of these requests. Namely, rather than sending a distinct request for each "add sled + add disk + add zpool" call, we might be better off sending a single "here is the sled, with a summary of all existing hardware" request. Doing so would make it easier to add a generation number for identifying stale requests, and would simplify this error handling (as the sled will either exist and be known to Nexus, or it won't).
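
A rough sketch of how both proposals could fit together, under stated assumptions: `SledReport`, `SledTable`, `ApplyOutcome`, and `handoff_status` are hypothetical names, IDs are integers rather than UUIDs, and the choice to treat an equal generation as stale is mine, not something this issue specifies.

```rust
use std::collections::BTreeMap;

/// One consolidated "here is the sled, with all its hardware" report.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SledReport {
    pub sled_id: u32,
    /// Bumped by the sled agent whenever its hardware summary changes.
    pub generation: u64,
    pub disks: Vec<String>,
    pub zpools: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ApplyOutcome {
    Applied,
    /// The report was not newer than what is already stored.
    Stale { current: u64 },
}

#[derive(Default)]
pub struct SledTable {
    sleds: BTreeMap<u32, SledReport>,
}

impl SledTable {
    /// Stores the report unless a report with the same or a newer
    /// generation is already present (assumption: equal counts as stale).
    pub fn apply(&mut self, report: SledReport) -> ApplyOutcome {
        if let Some(existing) = self.sleds.get(&report.sled_id) {
            if existing.generation >= report.generation {
                return ApplyOutcome::Stale { current: existing.generation };
            }
        }
        self.sleds.insert(report.sled_id, report);
        ApplyOutcome::Applied
    }

    pub fn contains(&self, sled_id: u32) -> bool {
        self.sleds.contains_key(&sled_id)
    }
}

/// Status code for a handoff that references `sled_ids`: 404 when any sled
/// is still unknown (the caller can wait and retry), 200 otherwise.
pub fn handoff_status(table: &SledTable, sled_ids: &[u32]) -> u16 {
    if sled_ids.iter().all(|id| table.contains(*id)) {
        200
    } else {
        404
    }
}
```

Because disks and zpools travel inside the sled's report, there is no intermediate state where a sled is known but its zpools are not, so the "missing in DB" check collapses to a single lookup per sled.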

Metadata

Labels

  • Sled Agent: Related to the Per-Sled Configuration and Management
  • bootstrap services: For those occasions where you want the rack to turn on
  • cleanup: Code cleanliness
  • development: Bugs, paper cuts, feature requests, or other thoughts on making omicron development better
