Summary
If every base table in ReadySet has a replication offset set, ReadySet doesn't drop and recreate the replication slot on startup; it assumes that the slot still exists. If the replication slot is dropped while ReadySet is down, ReadySet enters an infinite retry loop when it starts back up, because it keeps trying to start replication on a slot that doesn't exist.
Description
The fix here is likely to query pg_replication_slots at startup to see if the readyset slot exists. If it doesn't, we need to create the slot before we try to start replicating, and then resnapshot if our min replication offset is less than the slot's consistent point.
2023-09-15T16:30:07.081747Z WARN replicators::noria_adapter: Restarting adapter after error encountered error=Error during replication: db error: ERROR: replication slot "readyset" does not exist
2023-09-15T16:30:07.081793Z ERROR replicators: Error in replication, will retry after timeout error=Error during replication: db error: ERROR: replication slot "readyset" does not exist timeout_sec=1
This bug existed before 1eb63189f, but the log lines are different now: as of 1eb63189f, the query to pg_replication_slots in PostgresWalConnector::start_replication must return data. Here's what it looks like now:
2023-09-15T16:17:57.066691Z WARN replicators::noria_adapter: Restarting adapter after error encountered error=Error during replication: Incorrect response to query "SELECT confirmed_flush_lsn, wal_status FROM pg_replication_slots WHERE slot_name = 'readyset'" expected 2 rows, got 1
2023-09-15T16:17:57.066732Z ERROR replicators: Error in replication, will retry after timeout error=Error during replication: Incorrect response to query "SELECT confirmed_flush_lsn, wal_status FROM pg_replication_slots WHERE slot_name = 'readyset'" expected 2 rows, got 1
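One way to avoid failing on this query is to treat an empty result as "slot missing" rather than as an error. The sketch below is illustrative only and written in Python for brevity; it is not ReadySet's code. `SlotState` and `interpret_slot_rows` are hypothetical names, and the rows are assumed to arrive as `(confirmed_flush_lsn, wal_status)` tuples of text values from the query shown in the log lines above.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class SlotState(NamedTuple):
    """Hypothetical container for the two columns the query selects."""
    confirmed_flush_lsn: Optional[str]
    wal_status: Optional[str]


def interpret_slot_rows(
    rows: Sequence[Tuple[Optional[str], Optional[str]]],
) -> Optional[SlotState]:
    """Map the pg_replication_slots query result to a slot state.

    Returns None when the slot is missing, so the caller can recreate it
    instead of surfacing the empty result as a retryable error.
    """
    if len(rows) == 0:
        # No row for slot_name = 'readyset': the slot was dropped.
        return None
    if len(rows) > 1:
        # Slot names are unique, so more than one row is unexpected.
        raise ValueError(f"expected at most 1 row, got {len(rows)}")
    confirmed_flush_lsn, wal_status = rows[0]
    return SlotState(confirmed_flush_lsn, wal_status)
```

A caller would branch on `None` to create the slot, and only otherwise proceed to start replication from the existing slot.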
Expected behavior
ReadySet creates the replication slot if it doesn't exist and initiates a resnapshot.
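The expected startup flow, combined with the condition proposed in the description (resnapshot when the min replication offset is less than the new slot's consistent point), could be sketched roughly as follows. This is a Python illustration under stated assumptions, not ReadySet's implementation: `startup_steps` and the `create_slot` callback are hypothetical, and replication offsets are assumed to be Postgres LSNs in their usual `X/Y` hexadecimal text form.

```python
from typing import Callable, List


def parse_lsn(text: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def needs_resnapshot(min_replication_offset: str, consistent_point: str) -> bool:
    """True when the persisted offset is behind the new slot's start."""
    return parse_lsn(min_replication_offset) < parse_lsn(consistent_point)


def startup_steps(
    slot_exists: bool,
    min_replication_offset: str,
    create_slot: Callable[[], str],
) -> List[str]:
    """Return the ordered steps to take at startup.

    `create_slot` is a hypothetical callback that creates the slot and
    returns its consistent point as LSN text.
    """
    if slot_exists:
        return ["start_replication"]
    steps = ["create_slot"]
    consistent_point = create_slot()
    if needs_resnapshot(min_replication_offset, consistent_point):
        # Changes between the two offsets are not available from the
        # new slot, so the existing snapshot cannot be caught up.
        steps.append("resnapshot")
    steps.append("start_replication")
    return steps
```

For example, `startup_steps(False, "0/10", lambda: "0/20")` yields the create, resnapshot, and start steps in that order, while an offset at or past the consistent point skips the resnapshot.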
Actual behavior
ReadySet fails to start replicating, bubbles the error up, and retries infinitely.
Steps to reproduce
- Start ReadySet and allow snapshotting to finish
- Stop ReadySet
- Run SELECT pg_drop_replication_slot('readyset') in psql
- Start ReadySet
ReadySet version
eb0fd75b0
Upstream DB type and version
Postgres version 14
Instance Details
N/A
Deployment Details
[Docker | OSS K8s | OSS binary | RS-Cloud]
OS Information
Logs
2023-09-15T16:30:07.081747Z WARN replicators::noria_adapter: Restarting adapter after error encountered error=Error during replication: db error: ERROR: replication slot "readyset" does not exist
2023-09-15T16:30:07.081793Z ERROR replicators: Error in replication, will retry after timeout error=Error during replication: db error: ERROR: replication slot "readyset" does not exist timeout_sec=1