Currently, sable_services writes its database as a single JSON file on disk. This is similar to what Atheme does, so we know it works at least at Libera.Chat's scale.
While this can easily be replicated to other services, it means sable_services going down causes an outage where people cannot log in, channel ops cannot be opped, etc. This happens on Libera.Chat from time to time.
Given Sable's distributed architecture, we can do better here. @spb's idea is to have multiple sable_services nodes, one of which would be a leader and would stream its database to the others.
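To make the "leader streams its database" idea concrete, here is a minimal sketch in Rust. It assumes the leader turns every write into a numbered update and that followers apply updates strictly in order. All names (`Update`, `Leader`, `Follower`, `updates_since`) are hypothetical and not part of Sable; the transport between nodes is left out entirely.

```rust
use std::collections::BTreeMap;

/// One replicated change, tagged with its position in the leader's log.
#[derive(Clone, Debug, PartialEq)]
struct Update {
    seq: u64, // starts at 1 and increases by 1 per write (assumption)
    key: String,
    value: Option<String>, // None means "delete this key"
}

/// Hypothetical leader: owns the database and a log of every change.
#[derive(Default)]
struct Leader {
    db: BTreeMap<String, String>,
    log: Vec<Update>,
}

impl Leader {
    /// Apply a write locally and record it so followers can replay it.
    fn write(&mut self, key: &str, value: Option<&str>) -> Update {
        let update = Update {
            seq: self.log.len() as u64 + 1,
            key: key.to_string(),
            value: value.map(str::to_string),
        };
        apply(&mut self.db, &update);
        self.log.push(update.clone());
        update
    }

    /// Every update with a sequence number greater than `seq`.
    fn updates_since(&self, seq: u64) -> &[Update] {
        let start = (seq as usize).min(self.log.len());
        &self.log[start..]
    }
}

/// Hypothetical follower: a read-only copy that replays the leader's log.
#[derive(Default)]
struct Follower {
    db: BTreeMap<String, String>,
    last_seq: u64,
}

impl Follower {
    /// Apply an update only if it is the next one in sequence.
    /// On a gap, return the last applied sequence number so the
    /// caller knows where to resume from.
    fn receive(&mut self, update: &Update) -> Result<(), u64> {
        if update.seq != self.last_seq + 1 {
            return Err(self.last_seq);
        }
        apply(&mut self.db, update);
        self.last_seq = update.seq;
        Ok(())
    }
}

/// Shared by leader and follower so both interpret updates identically.
fn apply(db: &mut BTreeMap<String, String>, update: &Update) {
    match &update.value {
        Some(v) => {
            db.insert(update.key.clone(), v.clone());
        }
        None => {
            db.remove(&update.key);
        }
    }
}
```

A follower that missed updates would call `updates_since(follower.last_seq)` on the leader and replay the result. This avoids copying the whole file on every change, but it says nothing yet about who gets to be the leader.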
The database could remain a single JSON file, but it might become a scaling concern to copy this file over and over. We see a few options to solve this:
- use a database that supports streaming replication, like PostgreSQL.
- make sable_services nodes coordinate over the Sable network, and each have their own independent database
- make sable_services nodes share a single replicated database (Cassandra, something on top of Ceph, CockroachDB, ...)
With options 1 and 2, if we want high availability, sable_services needs some form of leader election, because we can't allow writes to the same objects from multiple nodes at the same time. PostgreSQL does not provide a solution to this, and expects users to tell it when to switch between follower/leader state.
And option 3 may be unsustainable for Libera, as all solutions I'm aware of in this space require extensive specialized knowledge with that solution (maybe not CockroachDB though? I've never tried it). In particular, Cassandra and Ceph are designed to work with petabyte-scale data, which is far beyond what we need here. Additionally, they often come with constraints/caveats in what software developers can do with the database.
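For options 1 and 2, one common way to get "only one writer at a time" is a time-limited lease. The sketch below is only an illustration of that technique, not a proposal for how Sable should do it. It assumes a single `LeaseStore` that all nodes can reach and that updates atomically, plus a shared clock. Both are big assumptions, since that store would itself need to be highly available. `Lease`, `LeaseStore`, `try_acquire` and `may_write` are hypothetical names.

```rust
/// A time-limited claim on the leader role.
#[derive(Clone, Debug, PartialEq)]
struct Lease {
    holder: String,
    term: u64,       // bumped each time leadership changes hands or lapses
    expires_at: u64, // seconds on a clock all nodes agree on (assumption)
}

/// Hypothetical arbiter holding at most one lease.
#[derive(Default)]
struct LeaseStore {
    current: Option<Lease>,
}

impl LeaseStore {
    /// Try to become leader, or to renew an existing lease.
    /// Returns the lease on success, None if another node holds it.
    fn try_acquire(&mut self, node: &str, now: u64, ttl: u64) -> Option<Lease> {
        let term = match &self.current {
            // Someone else holds an unexpired lease: refuse.
            Some(l) if l.holder != node && l.expires_at > now => return None,
            // The same node renewing in time keeps its term.
            Some(l) if l.holder == node && l.expires_at > now => l.term,
            // The previous lease expired: start a new term.
            Some(l) => l.term + 1,
            // Nobody has ever been leader.
            None => 1,
        };
        let lease = Lease {
            holder: node.to_string(),
            term,
            expires_at: now + ttl,
        };
        self.current = Some(lease.clone());
        Some(lease)
    }

    /// A write is accepted only from the current, unexpired lease holder.
    fn may_write(&self, lease: &Lease, now: u64) -> bool {
        match &self.current {
            Some(l) => {
                l.holder == lease.holder && l.term == lease.term && l.expires_at > now
            }
            None => false,
        }
    }
}
```

The `term` check is what stops a former leader that has not yet noticed it lost its lease from writing. With PostgreSQL (option 1), the node that wins the lease would also be the one to tell PostgreSQL to promote its local replica.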