Description
Context & versions
Ever since the change to etcd for networking, if you delete your persistence, but one of your (configured) peers does not, then you see errors like:
```
Apr 10 11:02:23 noon-hydra-nixos-head.c.iog-hydra.internal spinupHydra[561926]: {
  "timestamp":"2025-04-10T11:02:23.182677246Z"
  ,"threadId":68
  ,"namespace":"HydraNode-\"noon\""
  ,"message":{
    "network":{
      "contents":{
        "etcd":{
          "caller":"etcdmain/etcd.go:204"
          ,"error":"member 9140f16a1adb1a87 has already been bootstrapped"
          ,"level":>
...
```
i.e. "member ... has already been bootstrapped".
It's a bit inconvenient, because it means you need to wait for all your peers to do the same before you can re-run `hydra-node` successfully.
Steps to reproduce
- Run with two peers
- Stop one node
- Delete its persistence
- Re-run the stopped node; the `hydra-node` executable won't even be able to start up.
Expected behavior
It would be great if the `hydra-node` executable could stay running and just retry connecting to the peers periodically, perhaps with some backoff; i.e. the problem would resolve itself if everyone runs a `hydra-node` that is fixed like this and has correct `--peer` and `--advertise` command line options.
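For illustration, a minimal sketch of such a retry loop in Haskell; `retryWithBackoff`, the log message, and the delay values are hypothetical, not existing hydra-node internals:

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, catch)

-- Keep retrying an action, doubling the delay between attempts
-- (capped at 60s) instead of crashing on the first failure.
retryWithBackoff :: Int -> IO a -> IO a
retryWithBackoff delayMicros action =
  action `catch` \e -> do
    putStrLn ("peer connection failed, retrying: " <> show (e :: SomeException))
    threadDelay delayMicros
    retryWithBackoff (min (delayMicros * 2) 60000000) action
```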
Solution idea
- Detect cluster misconfiguration errors from the internal `etcd` process (see the sketch after this list). Probably these two:
  - "member has already been bootstrapped"
  - "mismatching member id"
- Wipe the `etcd/` state dir upon seeing such errors
- Retry starting `etcd` (incl. initiating the cluster) after some time
- The `hydra-node` should log info about this process and not stop upon seeing these errors
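To make the idea concrete, here is a sketch of how these steps could fit together, assuming hypothetical names (`etcdStateDir`, `superviseEtcd`) and that the errors surface on etcd's stderr; this is not the actual hydra-node code:

```haskell
import Control.Concurrent (threadDelay)
import Data.List (isInfixOf)
import System.Directory (removePathForcibly)
import System.Process (readProcessWithExitCode)

-- Hypothetical: where hydra-node keeps its etcd state.
etcdStateDir :: FilePath
etcdStateDir = "etcd"

-- The two misconfiguration errors named above.
isClusterMisconfiguration :: String -> Bool
isClusterMisconfiguration err =
  "has already been bootstrapped" `isInfixOf` err
    || "mismatching member id" `isInfixOf` err

-- Run etcd; on a misconfiguration error, wipe the state dir, wait a
-- bit, and start over. A real implementation would stream etcd's
-- output and keep logging instead of waiting for the process to exit.
superviseEtcd :: IO ()
superviseEtcd = do
  (_exitCode, _out, err) <-
    readProcessWithExitCode "etcd" ["--data-dir", etcdStateDir] ""
  if isClusterMisconfiguration err
    then do
      putStrLn "cluster misconfigured; wiping etcd state and retrying"
      removePathForcibly etcdStateDir
      threadDelay 5000000 -- 5s before re-initiating the cluster
      superviseEtcd
    else putStrLn ("etcd exited: " <> err)
```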
The clustering guide may be a useful resource explaining how the `--initial..` command line options work.