SST/IST always times out between DC2 and DC3 unless a donor is forced #654

@Sh4dow998

Description

I have a three-node MariaDB 10.11 Galera cluster running in Docker (host networking) across three sites connected by VPN:

  • DC1 (10.82.1.11)
  • DC2 (192.168.138.151)
  • DC3 (192.168.128.149)

When I bring the nodes up one at a time without specifying a donor on DC2, the second join (DC3 in the steps below) always fails its IST and never falls back to SST. The only workaround I have found is to set the following on DC2:

wsrep_sst_donor="dc1,dc3"

With that, both DC2→DC1 and DC2→DC3 joins now succeed. However, if DC1 is temporarily down, DC2→DC3 still fails.

Reproduction steps

  1. Start DC1 alone → PRIMARY/SYNCED
  2. Start DC2 without wsrep_sst_donor → joins from DC1 via SST → OK
  3. Start DC3 without wsrep_sst_donor → IST from DC2 (or DC1) → timeout → abort
  4. Add on DC2: wsrep_sst_donor="dc1,dc3"
  5. Restart DC3 → joins from DC1 (preferred) or DC2 fallback → OK
  6. Stop DC1, restart DC3 → attempts from DC1 first, fails, no fallback to DC2 → abort

Sample log when dc3 joins after dc2 (no donor set):

2025-07-28 15:43:05 [Note] WSREP: IST uuid:… f: 34, l: 35
2025-07-28 15:43:05 [Note] WSREP: requested state transfer from 'dc2,dc1'. Selected dc2 (SYNCED) as donor
2025-07-28 15:45:20 [Warning] WSREP: State transfer to dc3 failed: Operation timed out
2025-07-28 15:45:20 [ERROR] WSREP: Will never receive state. Need to abort.
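
For reference, the gap between donor selection and the timeout in the excerpt above can be read straight off the two timestamps. A minimal Python sketch, assuming log lines keep the `YYYY-MM-DD HH:MM:SS` prefix shown here; the function names are illustrative and not part of any Galera tooling:

```python
from datetime import datetime

# Two lines copied from the log excerpt in this report.
LOG = """\
2025-07-28 15:43:05 [Note] WSREP: requested state transfer from 'dc2,dc1'. Selected dc2 (SYNCED) as donor
2025-07-28 15:45:20 [Warning] WSREP: State transfer to dc3 failed: Operation timed out
"""


def timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' prefix of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def transfer_duration(log_text):
    """Seconds between the state transfer request and its failure, or None."""
    start = end = None
    for line in log_text.splitlines():
        if "requested state transfer" in line:
            start = timestamp(line)
        elif "State transfer to" in line and "failed" in line:
            end = timestamp(line)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


print(transfer_duration(LOG))  # 135.0
```

In this excerpt the joiner gives up 135 seconds after dc2 is selected as donor, which is well below the 15-minute systemd start timeout.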

My configuration (identical on each node):

[mysqld]
wsrep_on=ON
wsrep_cluster_address="gcomm://10.82.1.11,192.168.138.151,192.168.128.149"
wsrep_node_name=<dcX>
wsrep_node_address=<IP_VPN_dcX>
wsrep_sst_method=rsync
wsrep_sst_donor="dcY,dcZ"
gcache.size=1G

I have confirmed:

  • All ports (4567, 4568, 4444) are reachable pairwise over the VPN.
  • No host firewalls or Docker network restrictions.
  • rsync is installed, and wsrep_sst_auth works where it is used.
  • Systemd TimeoutStartSec increased to 15 min.
  • Enabling wsrep_debug=ON shows IST attempts but no SST fallback logs.
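
The pairwise port check in the first item can be reproduced with something like the sketch below. It is only an illustration: the helper names (`port_open`, `check_peers`) are made up for this example, the addresses are the ones listed at the top of this report, and a plain TCP connect is assumed to be a sufficient test. Note that 4568 (IST) and 4444 (rsync SST) normally only have a listener on the joiner while a transfer is in progress, so a refused connection on those ports outside a transfer does not by itself prove that the VPN blocks them.

```python
import socket

# Node addresses and ports taken from this report.
NODES = {"dc1": "10.82.1.11", "dc2": "192.168.138.151", "dc3": "192.168.128.149"}
PORTS = {4567: "group communication", 4568: "IST", 4444: "SST"}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_peers(local_name, nodes=NODES, ports=PORTS, probe=port_open):
    """Probe every port on every node except local_name.

    Returns a dict mapping (node_name, port) to True/False.
    """
    return {
        (name, port): probe(addr, port)
        for name, addr in nodes.items()
        if name != local_name
        for port in ports
    }


def report(local_name):
    """Format the probe results as one line per peer and port."""
    lines = []
    for (name, port), ok in sorted(check_peers(local_name).items()):
        state = "open" if ok else "UNREACHABLE"
        lines.append(f"{local_name} -> {name}:{port} ({PORTS[port]}): {state}")
    return "\n".join(lines)

# Example, run on the dc3 host: print(report("dc3"))
```

Running `print(report("dcX"))` on each host in turn covers all six directions between the three sites.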

I’d be very grateful for any insight into why IST never falls back to SST in this scenario, and why forcing a donor still fails when the first donor is unavailable. Thank you!
