ReadySet enters infinite retries if replication slot is deleted #602

@ethan-readyset

Summary

If every base table has a replication offset set, ReadySet does not drop and recreate the replication slot on startup; it assumes the slot still exists. If the slot is dropped while ReadySet is down, ReadySet enters an infinite retry loop when it starts back up, because it keeps trying to start replication on a slot that doesn't exist.

Description

The fix here is likely to query pg_replication_slots on startup to check whether the readyset slot exists. If it doesn't, we need to create the slot before we try to start replicating, and resnapshot if our minimum replication offset is less than the consistent point of the newly created slot.
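A minimal sketch of that startup check, written against an in-memory stand-in rather than ReadySet's real connector. Every name here (`Upstream`, `FakeUpstream`, `StartupAction`, `ensure_slot`, and the `u64` LSN type) is hypothetical and chosen only for illustration. In a real implementation `slot_exists` would presumably be a query against pg_replication_slots and `create_slot` would create the slot upstream and return its consistent point.

```rust
/// Log sequence number. A plain integer stands in for ReadySet's own
/// replication offset type (assumption for this sketch).
type Lsn = u64;

/// What startup did, or needs to do, about the replication slot.
#[derive(Debug, PartialEq, Eq)]
enum StartupAction {
    /// The slot already existed: resume streaming from the stored offsets.
    Resume,
    /// The slot was missing and has been created; stored offsets are still usable.
    CreatedSlot,
    /// The slot was missing and has been created, but the stored offsets are
    /// behind its consistent point, so the base tables must be resnapshotted.
    CreatedSlotAndResnapshot,
}

/// Hypothetical abstraction over the upstream Postgres connection.
trait Upstream {
    /// Would be backed by a query against pg_replication_slots.
    fn slot_exists(&self, name: &str) -> bool;
    /// Creates the slot and returns its consistent point.
    fn create_slot(&mut self, name: &str) -> Lsn;
}

/// Check for the slot before starting replication, instead of assuming it exists.
fn ensure_slot<U: Upstream>(upstream: &mut U, slot: &str, min_offset: Lsn) -> StartupAction {
    if upstream.slot_exists(slot) {
        return StartupAction::Resume;
    }
    let consistent_point = upstream.create_slot(slot);
    if min_offset < consistent_point {
        // Changes between min_offset and the new consistent point were never
        // retained by a slot, so they cannot be streamed.
        StartupAction::CreatedSlotAndResnapshot
    } else {
        StartupAction::CreatedSlot
    }
}

/// In-memory fake used only to exercise the decision logic.
struct FakeUpstream {
    slots: Vec<String>,
    next_consistent_point: Lsn,
}

impl Upstream for FakeUpstream {
    fn slot_exists(&self, name: &str) -> bool {
        self.slots.iter().any(|s| s == name)
    }

    fn create_slot(&mut self, name: &str) -> Lsn {
        self.slots.push(name.to_string());
        self.next_consistent_point
    }
}
```

With a `FakeUpstream` that has no slots and a consistent point ahead of the stored minimum offset, the first `ensure_slot` call creates the slot and asks for a resnapshot, and a second call finds the slot and resumes. The point of the sketch is only that a missing slot becomes a handled startup state rather than an error that gets retried forever.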

2023-09-15T16:30:07.081747Z  WARN replicators::noria_adapter: Restarting adapter after error encountered error=Error during replication: db error: ERROR: replication slot "readyset" does not exist
2023-09-15T16:30:07.081793Z ERROR replicators: Error in replication, will retry after timeout error=Error during replication: db error: ERROR: replication slot "readyset" does not exist timeout_sec=1

This bug predates 1eb63189f, but the log lines have changed: as of 1eb63189f, the query against pg_replication_slots in PostgresWalConnector::start_replication must return data, so the failure now surfaces at that query. Here's what it looks like now:

2023-09-15T16:17:57.066691Z  WARN replicators::noria_adapter: Restarting adapter after error encountered error=Error during replication: Incorrect response to query "SELECT confirmed_flush_lsn, wal_status FROM pg_replication_slots WHERE slot_name = 'readyset'" expected 2 rows, got 1
2023-09-15T16:17:57.066732Z ERROR replicators: Error in replication, will retry after timeout error=Error during replication: Incorrect response to query "SELECT confirmed_flush_lsn, wal_status FROM pg_replication_slots WHERE slot_name = 'readyset'" expected 2 rows, got 1

Expected behavior

ReadySet creates the replication slot if it doesn't exist and initiates a resnapshot.

Actual behavior

ReadySet fails to start replicating, bubbles the error up, and retries infinitely.

Steps to reproduce

  • Start ReadySet and allow snapshotting to finish
  • Stop ReadySet
  • Run SELECT pg_drop_replication_slot('readyset') in psql
  • Start ReadySet

ReadySet version

eb0fd75b0

Upstream DB type and version

Postgres version 14

Instance Details

N/A

Deployment Details

[Docker | OSS K8s | OSS binary | RS-Cloud]

OS Information

Logs

2023-09-15T16:30:07.081747Z  WARN replicators::noria_adapter: Restarting adapter after error encountered error=Error during replication: db error: ERROR: replication slot "readyset" does not exist
2023-09-15T16:30:07.081793Z ERROR replicators: Error in replication, will retry after timeout error=Error during replication: db error: ERROR: replication slot "readyset" does not exist timeout_sec=1

Metadata

Labels

  • 1 points (Created by Linear-GitHub Sync)
  • High priority (Created by Linear-GitHub Sync)
  • bug (Something isn't working)
