
Commit f7e052f

replicators: Fix race between domain recovery and dropping table
When a domain fails, we will replace the domain in a background thread.
However, if the failure is due to a replication issue, like not being
able to find a row in the source table, we will remove the table from
readyset. Those two operations will race with each other.

We should wait for the new domain to be ready before removing the table.
This can be done by checking the replication offsets via the RPC call.
We just need to make the caller wait for the RPC to succeed instead of
returning immediately after the first error. We use the same approach
when we are booting up for the first time in the noria_adapter.

Closes: REA-5563
Fixes: #1484
Release-Note-Core: Fix a race between domain recovery and dropping a
  table after replication failure.
Change-Id: I9274eca5fcf256ce37bdd4b2b2bbfda946f2952e
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/9146
Tested-by: Buildkite CI
Reviewed-by: Jason Brown <jason.b@readyset.io>
1 parent 981f459 commit f7e052f

1 file changed: +13 -1 lines changed

replicators/src/noria_adapter.rs

Lines changed: 13 additions & 1 deletion
@@ -1159,7 +1159,19 @@ impl<'a> NoriaAdapter<'a> {
                     table = %table.display(readyset_sql::Dialect::PostgreSQL),
                     "Removing table state from readyset"
                 );
-                self.noria.replication_offsets().await?;
+
+                // In case of a domain failure, we might be replacing the failed domain.
+                // and at the same time attempting to remove a table from readyset.
+                // We need to wait for the new domain to be ready before removing the table.
+                retry_with_exponential_backoff(
+                    || async {
+                        let mut noria = self.noria.clone();
+                        noria.replication_offsets().await
+                    },
+                    5,
+                    Duration::from_millis(250),
+                )
+                .await?;
                 self.replication_offsets.tables.remove(&table);
                 self.mutator_map.remove(&table);
                 // Dropping the table cleans up any dataflow state that may have been made as well as
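
For readers unfamiliar with the pattern, the retry-until-ready idea in the hunk above can be sketched with the standard library alone. This is a simplified, synchronous stand-in, not the helper the commit calls: the real `retry_with_exponential_backoff` is async and its definition is not shown in this diff. The name `retry_with_backoff_sketch`, the doubling factor, and the reading of `5` as a maximum number of attempts (rather than retries) are all assumptions made for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical synchronous stand-in for the async helper called in the diff.
/// Calls `op` up to `max_attempts` times, doubling the delay after each failure.
/// Returns the first `Ok`, or the last `Err` once the attempts are exhausted.
fn retry_with_backoff_sketch<T, E, F>(
    mut op: F,
    max_attempts: u32,
    initial_delay: Duration,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            // Success: stop retrying and hand the value back to the caller.
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise wait, grow the delay (assumed factor of 2), and try again.
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

Under these assumptions, calling the sketch with `5` and `Duration::from_millis(250)` would sleep for at most 250 + 500 + 1000 + 2000 ms across the four waits before returning the last error, which would give a replacement domain a few seconds to come up before the table removal proceeds.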
