Peers with different persistence cause hydra-node to crash on startup #1937

@noonio

Description

Context & versions

Ever since networking switched to etcd, if you delete your persistence but one of your (configured) peers does not delete theirs, hydra-node fails on startup with errors like:

Apr 10 11:02:23 noon-hydra-nixos-head.c.iog-hydra.internal spinupHydra[561926]: {
    "timestamp":"2025-04-10T11:02:23.182677246Z"
  ,"threadId":68
  ,"namespace":"HydraNode-\"noon\""
  ,"message":{
      "network":{
        "contents":{
          "etcd":{
            "caller":"etcdmain/etcd.go:204"
          ,"error":"member 9140f16a1adb1a87 has already been bootstrapped"
          ,"level":>
...

i.e. "member ... has already been bootstrapped".

It's a bit inconvenient, because it means you need to wait for all your peers to delete their persistence as well before you can re-run hydra-node successfully.

Steps to reproduce

  1. Run with two peers
  2. Stop one node
  3. Delete the persistence
  4. Re-run the stopped node; the hydra-node executable won't even be able to start up.

Expected behavior

It would be great if the hydra-node executable could stay running and simply retry connecting to its peers every x period, perhaps with some backoff. That way the problem would resolve itself once everyone runs a hydra-node with this fix and has correct --peer and --advertise command line options.
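
The "retry every x period, with some backoff" idea could follow a capped exponential schedule. A minimal sketch in Python (hydra-node itself is written in Haskell, so this only illustrates the schedule); `backoff_delays` and its base, factor, and cap values are hypothetical choices, not existing hydra-node options:

```python
from itertools import islice
from typing import Iterator


def backoff_delays(
    base: float = 1.0, factor: float = 2.0, cap: float = 60.0
) -> Iterator[float]:
    """Yield an endless sequence of retry delays in seconds.

    Hypothetical helper: the defaults are illustrative, not hydra-node settings.
    """
    delay = base
    while True:
        # Never wait longer than the cap, however many attempts have failed.
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First eight delays with the illustrative defaults.
print(list(islice(backoff_delays(), 8)))
# prints [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Capping the delay keeps a long-stopped node from waiting arbitrarily long once its peers have finally reset their own state.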

Solution idea

  • Detect cluster misconfiguration errors from the internal etcd process. Probably these two:
    • "member ... has already been bootstrapped"
    • "mismatching member id"
  • Wipe the etcd/ state dir upon seeing such errors
  • Retry starting etcd (incl. initiating the cluster) after some time
  • The hydra-node should log info about this process and not stop upon seeing these errors
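
The steps above can be sketched as a small supervisor loop. This is only an illustration in Python, not hydra-node's (Haskell) implementation: `run_etcd_with_recovery`, `is_cluster_misconfiguration`, and the `start_etcd` callback are hypothetical names, and the matched substrings are taken from this issue's text, so the exact etcd wording is an assumption.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, Iterable, Optional

# Substrings taken from this issue; the exact etcd wording may differ
# between etcd versions, so treat this list as an assumption.
MISCONFIGURATION_MARKERS = (
    "has already been bootstrapped",
    "mismatching member id",
)


def is_cluster_misconfiguration(error: str) -> bool:
    """Return True if an etcd error message looks like a cluster mismatch."""
    lowered = error.lower()
    return any(marker in lowered for marker in MISCONFIGURATION_MARKERS)


def run_etcd_with_recovery(
    start_etcd: Callable[[], Optional[str]],
    state_dir: Path,
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> bool:
    """Start etcd, wiping its state and retrying on misconfiguration errors.

    `start_etcd` is a hypothetical callback that returns None on success or
    the etcd error message on failure. Returns True once etcd has started and
    False if `delays` runs out first. Other errors are raised unchanged.
    """
    for delay in delays:
        error = start_etcd()
        if error is None:
            return True
        if not is_cluster_misconfiguration(error):
            raise RuntimeError(f"etcd failed: {error}")
        # Log instead of exiting, then reset local state and wait.
        log(f"etcd cluster misconfiguration ({error}); "
            f"wiping {state_dir} and retrying in {delay}s")
        shutil.rmtree(state_dir, ignore_errors=True)
        sleep(delay)
    return False
```

Passing `sleep` and `log` as parameters keeps the loop easy to exercise without real waiting; only errors matching the known markers trigger a wipe, so unrelated etcd failures still surface instead of silently deleting state.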

The etcd clustering guide may be a useful resource; it explains how the --initial.. command line options work.

Metadata

Status

Blocked ✋

Status

Now
