internal hardening for availability #2414

Open
@davepacheco

Description

There are some basic things we'll want to check everywhere (e.g., Nexus, Sled Agent, DNS servers) for availability:

  • TCP KeepAlive: we want to enable this on all network connections (in both directions) so that failed peers are detected. External and internal connections should probably use different values.
  • HTTP KeepAlive: we probably want to just pick an idle timeout like 60 seconds. Consider having clients make dummy requests to keep connections open. That avoids the problem of picking a connection that's been idle for just under 60 seconds, sending a request, and having the server slam the door in your face. We ran into this with Manta, admittedly only at very large scale, since it's fairly improbable.
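
To make the HTTP KeepAlive race concrete, here is a minimal sketch of a client-side alternative to dummy requests: stop reusing a pooled connection once it has been idle for nearly as long as the server's timeout. `IdlePolicy`, its fields, and the 5-second margin used in the note below are hypothetical names and values chosen for illustration; only the 60-second figure comes from this issue. It uses nothing beyond the Rust standard library and is not taken from any existing client.

```rust
use std::time::Duration;

/// Hypothetical client-side policy for reusing idle HTTP connections.
#[derive(Debug, Clone, Copy)]
pub struct IdlePolicy {
    /// How long the server keeps an idle connection open (e.g. 60 seconds).
    pub server_idle_timeout: Duration,
    /// How far ahead of the server's deadline the client stops reusing
    /// a connection. This margin is an assumption, not a measured value.
    pub safety_margin: Duration,
}

impl IdlePolicy {
    /// Returns true if a connection that has been idle for `idle` can be
    /// reused with little risk of the server closing it mid-request.
    pub fn safe_to_reuse(&self, idle: Duration) -> bool {
        // saturating_sub avoids underflow if the margin exceeds the timeout;
        // in that case no idle connection is ever considered safe.
        idle < self.server_idle_timeout.saturating_sub(self.safety_margin)
    }
}
```

With a 60-second server timeout and a 5-second margin, a connection idle for 10 seconds would be reused, while one idle for 59 seconds would be dropped and replaced with a fresh connection. This trades a few extra connection setups for never racing the server's close.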

We'll want to review these, too. They might be more security-related (see #2184):

  • limits for bad client behavior:
    • maximum time waiting for a client to send request headers (whether on a new connection or between requests)
    • minimum flow rate for request bodies (can be fairly low -- just want to avoid clients dribbling data in as a DoS vector to keep connections open)
    • maximum number of open connections (ideally limited separately for different APIs -- e.g., external vs. internal)
    • TCP listen socket backlog
    • maximum rate of new connections created [ideally per-client]
    • maximum rate of incoming requests [per authenticated user? or IP?, as well as overall]
    • maximum number of connect-in-progress sockets
    • maximum number of TLS-session-establishment-in-progress sockets
  • size of the tokio worker thread pool and blocking thread pool
  • maximum length of time that graceful server shutdown can take
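
As one illustration of the rate limits listed above (new connections or incoming requests), here is a minimal token-bucket sketch. It assumes a single bucket; a per-client or per-user limit would keep one bucket per key. `TokenBucket` and its parameters are hypothetical, and the caller passes in the current time so the behavior is deterministic. This is one possible approach under those assumptions, not a description of how any of these services limit traffic today.

```rust
use std::time::Instant;

/// Token bucket: allows bursts of up to `capacity` admissions and a
/// sustained rate of `refill_per_sec`. Names and values are illustrative.
#[derive(Debug)]
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Creates a bucket that starts full at time `now`.
    pub fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        TokenBucket {
            capacity: f64::from(capacity),
            tokens: f64::from(capacity),
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Tries to admit one connection or request at time `now`.
    /// Returns false when the caller should reject or delay it.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Add tokens for the time elapsed since the last refill, capped
        // at the bucket capacity. A `now` in the past counts as zero.
        let elapsed = now.saturating_duration_since(self.last_refill);
        self.tokens = (self.tokens + elapsed.as_secs_f64() * self.refill_per_sec)
            .min(self.capacity);
        // Never move the refill clock backwards.
        self.last_refill = self.last_refill.max(now);

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A bucket created with capacity 2 and a refill rate of 1 per second admits two requests immediately, rejects a third at the same instant, and admits one more a second later. Separate buckets for external and internal APIs would allow different limits for each.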
