This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Room deletion (shutdown) fails in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294

Closed
@PeterCxy

Description

When the room deletion API is used to remove a large room (such as Matrix HQ) from the server, the purging process can sometimes enter a constant fail-retry loop if it needs more than a few seconds to finish. Each attempt fails with "unable to serialize access", because other transactions on a running server are constantly accessing and modifying the same tables concurrently.

Steps to reproduce

  • On a moderately busy server (e.g. one that is in multiple moderately sized federated rooms), try to purge a large federated room using the delete room API.
  • Observe that the process gets stuck, with "unable to serialize access" reported in the logs. The behavior is also visible in PgHero, where one exact long-running query appears again and again at a regular interval, indicating that Synapse keeps retrying it.
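
The fail-retry pattern described above can be illustrated with a minimal sketch. This is not Synapse's actual retry code: `SerializationFailure`, `run_with_retries`, and `make_flaky_purge` are hypothetical names introduced only for illustration, and the fake purge simply simulates a transaction that conflicts a fixed number of times.

```python
class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver's serialization error."""


def run_with_retries(txn_func, max_attempts=5):
    """Run txn_func, retrying whenever it raises SerializationFailure.

    Returns (result, attempts_used). Re-raises the error once
    max_attempts attempts have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_func(), attempt
        except SerializationFailure:
            if attempt == max_attempts:
                raise


def make_flaky_purge(conflicts):
    """Build a fake purge that fails `conflicts` times, then succeeds."""
    state = {"remaining": conflicts}

    def purge():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise SerializationFailure("unable to serialize access")
        return "purged"

    return purge
```

A purge that conflicts only occasionally succeeds after a few attempts. A purge that runs long enough to conflict on every attempt, as in this report, never succeeds no matter how often it is retried.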

Version information

  • Version: 1.37.1

  • Install method: pip

  • Platform: Debian 10 "buster", in a LXC container

Notes

The error goes away if the isolation level for the room-purging transaction is lowered to READ COMMITTED, the lowest level. I am not sure whether this is correct, but I assume it should be fine, given that we are just deleting everything related to a room.

diff --git a/synapse/storage/databases/main/purge_events.py b/synapse/storage/databases/main/purge_events.py
index 7fb7780d0..2619a6602 100644
--- a/synapse/storage/databases/main/purge_events.py
+++ b/synapse/storage/databases/main/purge_events.py
@@ -313,6 +313,7 @@ class PurgeEventsStore(StateGroupWorkerStore, CacheInvalidationWorkerStore):
         )

     def _purge_room_txn(self, txn, room_id: str) -> List[int]:
+        txn.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
         # First we fetch all the state groups that should be deleted, before
         # we delete that information.
         txn.execute(
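
The idea behind the patch can be sketched outside Synapse as follows. In PostgreSQL, `SET TRANSACTION` only takes effect if it is issued before the first query or data-modification statement of the transaction, which is why the patch places it at the very top of `_purge_room_txn`. The helper `run_purge_read_committed`, the `RecordingCursor` stand-in, and the table name `example_room_table` are all hypothetical and exist only to make the sketch runnable without a database.

```python
def run_purge_read_committed(cursor, room_id, purge_func):
    """Lower the isolation level, then run the purge in the same transaction.

    Assumes `cursor` is a DB-API style cursor at the start of a fresh
    transaction, so that SET TRANSACTION is its first statement.
    """
    cursor.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
    return purge_func(cursor, room_id)


class RecordingCursor:
    """Fake cursor that records statements instead of talking to a database."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=()):
        self.statements.append((sql, tuple(params)))


def example_purge(cursor, room_id):
    """Hypothetical purge body; the table name is made up for illustration."""
    cursor.execute(
        "DELETE FROM example_room_table WHERE room_id = %s", (room_id,)
    )
    return room_id
```

With a real connection, the same ordering applies: the isolation level statement comes first, and every delete for the room follows within that one transaction.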

On a related note, is there a reason why the isolation level is set to REPEATABLE READ globally by default? Does Synapse really need REPEATABLE READ on every transaction?

Metadata

Assignees

No one assigned

    Labels

    S-Minor: Blocks non-critical functionality, workarounds exist.
    T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
