Room deletion (shutdown) fails in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294
Description
When using the delete room API to remove a large room (such as Matrix HQ) from the server, the purging process can sometimes enter a constant fail-retry loop if it needs more than a few seconds to finish. Each attempt fails with "unable to serialize access", because the tables involved are constantly being accessed and modified by other transactions on a running server.
Steps to reproduce
- On a moderately busy server (e.g. one that is in multiple moderately-sized federated rooms), try to purge a large federated room using the delete room API.
- Observe that the process gets stuck, with "unable to serialize access" being reported in the logs. The behavior is also visible in PgHero: the same long-running query appears again and again at a regular interval, indicating that Synapse keeps retrying it.
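The loop described above can be modelled without a database. The following is only a toy sketch under stated assumptions, not Synapse's actual retry logic: `FakeRoomTable`, `purge_txn`, `run_with_retries` and the `max_attempts` limit are hypothetical names invented for illustration. It assumes that a REPEATABLE READ attempt is rejected whenever another transaction commits a change to the same data after the attempt took its snapshot, while a READ COMMITTED attempt is not.

```python
class SerializationFailure(Exception):
    """Stand-in for the driver error raised on a serialization failure."""


class FakeRoomTable:
    """Toy table: a version counter bumped by every concurrent write."""

    def __init__(self):
        self.version = 0

    def concurrent_write(self):
        self.version += 1


def purge_txn(table, isolation_level, writes_during_txn):
    """Simulate one attempt at a long-running purge (hypothetical model)."""
    # The transaction's view of the data is fixed when it starts.
    snapshot_version = table.version

    # While the purge is still running, other transactions keep committing.
    for _ in range(writes_during_txn):
        table.concurrent_write()

    # Under REPEATABLE READ, touching data changed since the snapshot fails.
    if isolation_level == "REPEATABLE READ" and table.version != snapshot_version:
        raise SerializationFailure("serialization failure")

    # Under READ COMMITTED each statement sees the latest committed data.
    return "purged"


def run_with_retries(table, isolation_level, writes_during_txn, max_attempts=5):
    """Retry on serialization failure; return (result, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return purge_txn(table, isolation_level, writes_during_txn), attempt
        except SerializationFailure:
            continue  # same work, same outcome on a busy server
    return None, max_attempts
```

In this model, a purge that is slow enough for even one concurrent write to land never succeeds under REPEATABLE READ, no matter how often it is retried, which matches the repeated identical query seen in PgHero.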
Version information
- Version: 1.37.1
- Install method: pip
- Platform: Debian 10 "buster", in a LXC container
Notes
The error goes away if the isolation level for the room-purging transaction is lowered to READ COMMITTED. I am not sure whether this is correct, but I assume it should be fine given that we are just deleting everything related to a room.
diff --git a/synapse/storage/databases/main/purge_events.py b/synapse/storage/databases/main/purge_events.py
index 7fb7780d0..2619a6602 100644
--- a/synapse/storage/databases/main/purge_events.py
+++ b/synapse/storage/databases/main/purge_events.py
@@ -313,6 +313,7 @@ class PurgeEventsStore(StateGroupWorkerStore, CacheInvalidationWorkerStore):
         )
 
     def _purge_room_txn(self, txn, room_id: str) -> List[int]:
+        txn.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
         # First we fetch all the state groups that should be deleted, before
         # we delete that information.
         txn.execute(
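For anyone who wants to try the same idea outside of Synapse, here is a minimal sketch of the pattern used in the patch. It is not Synapse code: `begin_with_isolation`, `purge_room_sketch`, `RecordingCursor` and the table name `some_room_table` are hypothetical, and the `%s` placeholder style assumes a psycopg2-like driver. One thing to keep in mind is that PostgreSQL only accepts `SET TRANSACTION` before the first query or data-modifying statement of the transaction, so it has to be the first thing executed.

```python
# Isolation levels this sketch is willing to pass through to the database.
ALLOWED_LEVELS = ("READ COMMITTED", "REPEATABLE READ", "SERIALIZABLE")


def begin_with_isolation(txn, level):
    """Set the isolation level for the current transaction only.

    Must be called before any other statement in the transaction.
    The level is checked against a fixed list because it is interpolated
    into the SQL text and cannot be passed as a bound parameter.
    """
    if level not in ALLOWED_LEVELS:
        raise ValueError("unsupported isolation level: %r" % (level,))
    txn.execute("SET TRANSACTION ISOLATION LEVEL " + level)


def purge_room_sketch(txn, room_id):
    """Hypothetical purge: lower the isolation level, then delete."""
    begin_with_isolation(txn, "READ COMMITTED")
    # 'some_room_table' is a placeholder, not a real Synapse table.
    txn.execute("DELETE FROM some_room_table WHERE room_id = %s", (room_id,))


class RecordingCursor:
    """Fake DB-API cursor that records statements instead of running them."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=()):
        self.statements.append((sql, tuple(params)))


if __name__ == "__main__":
    cursor = RecordingCursor()
    purge_room_sketch(cursor, "!room:example.org")
    for sql, params in cursor.statements:
        print(sql, params)
```

Because the setting is scoped to one transaction, every other transaction keeps using whatever default the connection was configured with.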
On a second note, is there a reason why the isolation level is set to REPEATABLE READ globally by default? Does Synapse really need REPEATABLE READ on every transaction?