
Add total event, unencrypted message, and e2ee event counts to stats reporting (v2) #18371


Closed

Conversation

@MadLittleMods (Contributor) commented Apr 29, 2025

Re-introduce #18260 which was reverted.

Adds total message counts to homeserver stats reporting. This is primarily to comply with the TI-Messenger spec, which requires each homeserver to report "Number of message events as an integer, cumulative".

Dev notes

poetry run trial tests.metrics.test_phone_home_stats
poetry run trial tests.storage.test_event_stats

Test with Postgres

SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres poetry run trial tests.metrics.test_phone_home_stats
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres poetry run trial tests.storage.test_event_stats

RETURNING syntax in SQLite

The RETURNING syntax has been supported by SQLite since version 3.35.0 (2021-03-12).

-- https://www.sqlite.org/lang_returning.html

Synapse supports...

The oldest supported version of SQLite is the version provided by Debian oldstable.

-- https://element-hq.github.io/synapse/latest/deprecation_policy.html

which currently is https://packages.debian.org/bullseye/sqlite3 -> 3.34.1-3+deb11u1

We have self.db_pool.engine.supports_returning to detect whether we can use RETURNING.
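For illustration, a rough sketch (not the PR's actual code; the table and column names here are made up, and it assumes the same flag is reachable via `txn.database_engine`) of how a storage function might branch on that flag:

```python
from synapse.storage.database import LoggingTransaction


def _insert_and_return_id_txn(txn: LoggingTransaction, event_id: str) -> int:
    """Insert a row and return its id, only using RETURNING when the engine supports it."""
    if txn.database_engine.supports_returning:
        # SQLite >= 3.35.0 and Postgres both understand RETURNING.
        txn.execute(
            "INSERT INTO some_table (event_id) VALUES (?) RETURNING id",
            (event_id,),
        )
        return txn.fetchone()[0]

    # Fallback for older SQLite (e.g. Debian bullseye's 3.34.x): insert, then read the id back.
    txn.execute("INSERT INTO some_table (event_id) VALUES (?)", (event_id,))
    txn.execute("SELECT id FROM some_table WHERE event_id = ?", (event_id,))
    return txn.fetchone()[0]
```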

"could not serialize access due to concurrent update"

Related:

Postgres

Context #18349

Relevant error:

psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update
CONTEXT:  SQL statement "UPDATE event_stats SET total_event_count = total_event_count + 1"
PL/pgSQL function event_stats_increment_counts() line 5 at SQL statement

The default isolation level in Postgres is READ COMMITTED, but we've set Synapse's default to IsolationLevel.REPEATABLE_READ:

self.default_isolation_level = (
    psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
)

Docs: https://www.postgresql.org/docs/current/transaction-iso.html

13.2.3. Serializable Isolation Level

However, like the Repeatable Read level, applications using this level must be prepared to retry transactions due to serialization failures.

-- https://www.postgresql.org/docs/current/transaction-iso.html#XACT-SERIALIZABLE

Postgres SERIALIZABLE isolation level is actually "Serializable Snapshot Isolation" (SSI) where it optimistically allows transactions to proceed and aborts only if guarantees are violated (the serialization errors we're seeing).
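As a standalone illustration (not part of the PR), the failure mode above can be reproduced with plain psycopg2 against a throwaway Postgres database. It assumes an `event_stats` table with a `total_event_count` column and at least one row; the DSN is a placeholder:

```python
import threading
import time

import psycopg2
import psycopg2.errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

DSN = "dbname=synapse_test user=postgres"  # placeholder
UPDATE_SQL = "UPDATE event_stats SET total_event_count = total_event_count + 1"

conn1 = psycopg2.connect(DSN)
conn2 = psycopg2.connect(DSN)
conn1.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
conn2.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)


def second_writer() -> None:
    try:
        with conn2.cursor() as cur:
            # Blocks on conn1's row lock. Once conn1 commits, Postgres notices the row
            # changed after this transaction's snapshot was taken and aborts it.
            cur.execute(UPDATE_SQL)
        conn2.commit()
    except psycopg2.errors.SerializationFailure as e:
        print("reproduced:", e)
        conn2.rollback()


with conn1.cursor() as cur:
    cur.execute(UPDATE_SQL)  # take the row lock, but don't commit yet

t = threading.Thread(target=second_writer)
t.start()
time.sleep(1)  # give the second transaction time to block on the row lock
conn1.commit()  # releasing the lock triggers the serialization failure in conn2
t.join()
```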

SQLite

In SQLite, all transactions are SERIALIZABLE by default and, because of the way it works, each write waits its turn to run in serial order, so we won't run into the same concurrency problems when incrementing a counter:

all transactions in SQLite show "serializable" isolation. SQLite implements serializable transactions by actually serializing the writes. There can only be a single writer at a time to an SQLite database. There can be multiple database connections open at the same time, and all of those database connections can write to the database file, but they have to take turns. SQLite uses locks to serialize the writes automatically; this is not something that the applications using SQLite need to worry about.

-- https://www.sqlite.org/isolation.html

Are TRIGGERs still viable?

Discussion around whether TRIGGERs are viable: #18371 (comment)
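For context, the kind of trigger under discussion looks roughly like this on the Postgres side (a sketch in the style of the schema deltas, not necessarily the PR's exact SQL):

```python
def _create_event_stats_trigger_txn(txn) -> None:
    # Hypothetical Postgres version: a row-level AFTER INSERT trigger on `events`
    # that bumps a single-row counter table. This is the pattern that runs into the
    # REPEATABLE READ serialization failures described above.
    txn.execute(
        """
        CREATE OR REPLACE FUNCTION event_stats_increment_counts() RETURNS trigger AS $$
        BEGIN
            UPDATE event_stats SET total_event_count = total_event_count + 1;
            RETURN NULL;
        END;
        $$ LANGUAGE plpgsql;
        """
    )
    txn.execute(
        """
        CREATE TRIGGER event_stats_events_insert_trigger
        AFTER INSERT ON events
        FOR EACH ROW EXECUTE FUNCTION event_stats_increment_counts();
        """
    )
```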

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

The Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems book by Martin Kleppmann is a fantastic reference for trying to understand the problem here.

Chapter 7: Transactions

Weak Isolation levels

Atomic write operations

Many databases provide atomic update operations, which remove the need to implement read-modify-write cycles in application code. They are usually the best solutions if your code can be expressed in terms of those operations. For example, the following instruction is concurrency-safe in most relational databases:

UPDATE counters SET value = value + 1 WHERE key = 'foo';

Chapter 7: Transactions

Weak Isolation levels

Automatically detecting lost updates

Atomic operations and locks are a way of preventing lost updates by forcing the read-modify-write cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle.

An advantage of this approach is that databases can perform this check efficiently in conjunction with snapshot isolation. Indeed, PostgreSQL's repeatable read, Oracle's serializable, and SQL Server's snapshot isolation levels automatically detect when a lost update has occurred and abort the offending transaction.

Chapter 7: Transactions

Serializability

Serializable Snapshot Isolation (SSI)

Today SSI is used both in single node databases (the serializable isolation level in PostgreSQL since version 9.1) [...]

Pessimistic versus optimistic concurrency control

[...]

Optimistic concurrency control is an old idea, and its advantages and disadvantages have been debated for a long time. It performs badly if there is high contention (many transactions trying to access the same objects), as this leads to a high proportion of transactions needing to abort. If the system is already close to its maximum throughput, the additional transaction load from retried transactions can make performance worse.

However, if there is enough spare capacity, and if contention between transactions is not too high, optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention can be reduced with commutative atomic operations: for example, if several transactions concurrently want to increment a counter, it doesn't matter which order the increments are applied (as long as the counter isn't read in the same transaction), so the concurrent increments can all be applied without conflicting.
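To make the "be prepared to retry" advice concrete, here's a minimal sketch (plain psycopg2, not Synapse's transaction machinery) of re-running a transaction when Postgres aborts it with a serialization failure:

```python
import psycopg2.errors


def increment_counter_with_retry(conn, max_attempts: int = 5) -> None:
    """Retry the whole read-modify-write transaction if Postgres aborts it."""
    for _attempt in range(max_attempts):
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE event_stats SET total_event_count = total_event_count + 1"
                )
            conn.commit()
            return
        except psycopg2.errors.SerializationFailure:
            # Aborted because of a concurrent update: roll back and start again.
            conn.rollback()
    raise RuntimeError("gave up after %d serialization failures" % max_attempts)
```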

Todo

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct
    (run the linters)

The `RETURNING` syntax and minimum supported SQLite version noted in the dev notes above were the original reason the PR was reverted, see #18346.

tok=user_2_token,
)

def test_concurrent_event_insert(self) -> None:
@MadLittleMods (Contributor, Author) commented Apr 30, 2025

I could use a few pointers on how best to architect this test. I'd like to start a transaction that involves the event_stats table, then insert a new event to try to reproduce the concurrency error: psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update (#18349). The problem with the current test is that the transaction just blocks everything, such that asdf1 is printed but not asdf2. I'm not sure how best to release control back to the GIL to run other tasks and continue the test.

I'm assuming the real-life scenario that caused this error happens when you have multiple stream writer workers for the events stream (multiple event_perister) since it "experimentally supports having multiple writer workers, where load is sharded between them by room ID. Each writer is called an event persister." (source). And then you just send events into different rooms at the same time that are handled by separate workers.

@MadLittleMods (Contributor, Author) commented Apr 30, 2025

I think part of the problem is that ThreadedMemoryReactorClock "doesn't really use threads" and uses make_fake_db_pool(...) which "runs db queries synchronously [...] on the test reactor's main thread".

synapse/tests/server.py

Lines 699 to 705 in d59bbd8

"""Wrapper for `make_pool` which builds a pool which runs db queries synchronously.
For more deterministic testing, we don't use a regular db connection pool: instead
we run all db queries synchronously on the test reactor's main thread. This function
is a drop-in replacement for the normal `make_pool` which builds such a connection
pool.
"""

synapse/tests/server.py

Lines 638 to 641 in d59bbd8

# [1]: we replace the threadpool backing the db connection pool with a
# mock ThreadPool which doesn't really use threads; but we still use
# reactor.callFromThread to feed results back from the db functions to the
# main thread.

So no matter how much I try to split things in threads for parallel execution, it just blocks.


Other references:

  • defer_to_thread(...) (thanks for pointing this one out @anoadragon453)
  • self.reactor.callFromThread(...)

@MadLittleMods (Contributor, Author) commented Apr 30, 2025

I still don't have a solution for this.

I've also tried using a BaseMultiWorkerStreamTestCase with sharded "event_persisters" (multiple workers for the events stream) to mimic the real-life scenario that I think could reproduce this error (see test_sharded_event_persisters), but only the first event in each room seems to send (requires more investigation). This kind of test doesn't guarantee a reproduction either, as the stars could align and events could send fine. Additionally, as BaseMultiWorkerStreamTestCase is implemented right now, I don't think it will work, as all of the workers share the same reactor, so they at best send one at a time without concurrency conflicts.

if isinstance(txn.database_engine, Sqlite3Engine):
txn.execute(
"""
CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
@MadLittleMods (Contributor, Author) commented Apr 30, 2025

I'm concerned that TRIGGERs may not be a viable option overall because of the concurrency problems (psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update) with the default IsolationLevel.READ_COMMITTED for database transactions.

When we insert events, we have pretty giant transactions to also update all of the associated tables in one go. I'm concerned that changing the isolation level of all of the transactions where we insert/delete events will unacceptably degrade performance (probably need IsolationLevel.SERIALIZABLE, need to investigate whether IsolationLevel.REPEATABLE_READ is sufficient).

Any good alternatives or words of encouragement just to try it?

A Member commented:

We can set the isolation level per transaction apparently:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- your queries here
COMMIT;

I'm not sure if we have control over that, but that may be preferable to changing the whole database over?

@MadLittleMods (Contributor, Author) commented Apr 30, 2025

@anoadragon453 We can definitely set the isolation level on a per-transaction basis, but the problem is that the triggers run as part of the transaction inserting the events. So wherever we insert/delete events, we need to change the isolation level, and the transactions involving events appear to be pretty large, so using IsolationLevel.SERIALIZABLE (where the results of the transactions come out the other side as if they were run sequentially) concerns me, as they may end up waiting on each other to finish.

Example of changing isolation level:

# This first runs the purge transaction with READ_COMMITTED isolation level,
# meaning any new rows in the tables will not trigger a serialization error.
# We then run the same purge a second time without this isolation level to
# purge any of those rows which were added during the first.
logger.info("[purge] Starting initial main purge of [1/2]")
await self.db_pool.runInteraction(
    "purge_room",
    self._purge_room_txn,
    room_id=room_id,
    isolation_level=IsolationLevel.READ_COMMITTED,
)

@erikjohnston (Member) commented May 1, 2025

FYI the default transaction isolation level in Synapse is repeatable read:

self.default_isolation_level = (
    psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
)

@MadLittleMods (Contributor, Author) commented May 2, 2025

I understand this problem a bit better now. The problem is that our default IsolationLevel.REPEATABLE_READ (also known as "snapshot isolation") always gets a consistent snapshot of the database during the transaction and prevents lost updates by aborting offending transactions. For example, if the total_event_count is 5 and one transaction runs UPDATE event_stats SET total_event_count = total_event_count + 1, it will use the value 5 from its snapshot, and a concurrent transaction will also see the value 5. If we allowed both transactions to succeed, the total_event_count would end up as 6 instead of the correct value 7. Postgres is protecting us from this "lost update" problem by aborting the second transaction.

We actually need to use a less strict isolation level like IsolationLevel.READ_COMMITTED in order to avoid the could not serialize access due to concurrent update serialization errors. This works because UPDATE event_stats SET total_event_count = total_event_count + 1 is an atomic operation (all the work happens in the database and avoids the read-modify-write cycle going back and forth from the database to application) and will just be able to read whatever value the database has at the time.

Basically, we need the TRIGGER counter logic to run with IsolationLevel.READ_COMMITTED, or have the transactions run sequentially and wait for each other (not sure how to do this). Since we can't change the isolation level mid-transaction, we would have to set the whole events insert/delete transaction to IsolationLevel.READ_COMMITTED (needs investigation into whether that's viable).

I've also tried to research best-practices for concurrency-safe atomic counters when running in REPEATABLE READ isolation but I'm not finding anything useful.

I'm sure there are other solutions as well (feel free to suggest)

One alternative for total_event_count is to just naively add the highest stream_ordering and the absolute value of the lowest stream_ordering (which will be negative for backfilled events). This doesn't work for the counts of m.room.message and m.room.encrypted though, and it also doesn't account for deleting messages from the database.

Another alternative to cover all of the bases is just to regularly tally up the events and keep track of a stream_ordering position that we last counted. Perhaps using the same daily counts we already do. Doesn't account for deleting messages from the database.
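A very rough sketch of what that second alternative could look like (illustrative table/column names; it ignores backfilled events with negative stream_ordering and deleted events):

```python
def _tally_new_events_txn(txn) -> None:
    # Read the watermark of where we last counted up to.
    txn.execute("SELECT last_counted_stream_ordering FROM event_stats")
    (last_pos,) = txn.fetchone()

    # Count events persisted since then and note the new high-water mark.
    txn.execute(
        """
        SELECT COALESCE(MAX(stream_ordering), ?), COUNT(*)
        FROM events
        WHERE stream_ordering > ?
        """,
        (last_pos, last_pos),
    )
    new_pos, new_event_count = txn.fetchone()

    txn.execute(
        """
        UPDATE event_stats
        SET total_event_count = total_event_count + ?,
            last_counted_stream_ordering = ?
        """,
        (new_event_count, new_pos),
    )
```

Since only a single periodic job would run this, the counter row isn't contended the way a per-insert trigger is.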

@MadLittleMods (Contributor, Author) commented:

Re: whether it's viable to change the insert/delete events transactions from IsolationLevel.REPEATABLE_READ to IsolationLevel.READ_COMMITTED

There has already been a desire to get away from IsolationLevel.REPEATABLE_READ where possible, see matrix-org/synapse#11567 (#11567)

We will need to audit the relevant transactions to see whether the problems we're no longer protected from with the lower isolation level will cause havoc:

  1. Is it possible to have a "lost update" scenario?
  2. Will "read skew" affect our outcome?

We have two places that insert/delete events that we need to look at, _persist_events_txn and _purge_room_txn.

_persist_events_txn

For _persist_events_txn, it's a pretty big transaction, so it's harder to be confident in a conclusion. I feel like I need a second set of eyes on this. Maybe my criteria are wrong.

  • Acceptable "lost updates" means it does some UPDATE/DELETE or read-modify-write cycles but is fine because of a stated reason (TODO), or because it's constrained to the room_id, which is fine because we will only ever process events in a room serially.
  • N/A "lost updates" means it doesn't apply to this function because it doesn't do any UPDATE or read-modify-write cycles.
  • Acceptable "read skew" means the function does read values (WHERE conditions) but it's fine to get an up-to-date value.
  • N/A "read skew" means it doesn't apply to this function because it doesn't read anything (like a function that only does INSERT)

Checklist (still have a few to go through):

  • SELECT room_version FROM rooms WHERE room_id = ? FOR SHARE
    • N/A "lost updates"
    • Acceptable "read skew" (room_version doesn't change)
  • _update_room_depths_txn
    • N/A "lost updates"
    • Acceptable "read skew" (only looks at room_id)
  • _update_outliers_txn
    • [Acceptable/N/A] "lost updates"
    • [Acceptable/N/A] "read skew"
  • _store_event_txn
    • Acceptable "lost updates" (marking a redaction event as have_censored = False isn't harmful, we can always redact again)
    • Acceptable "read skew" (marking a redaction event as have_censored = False isn't harmful, we can always redact again)
  • _update_forward_extremities_txn
    • Acceptable "lost updates" (DELETE constrained to room_id)
    • Acceptable "read skew" (only looks at room_id)
  • _persist_transaction_ids_txn
    • N/A "lost updates"
    • N/A "read skew"
  • _store_event_state_mappings_txn
    • Acceptable "lost updates" (only looks at event_id)
    • Acceptable "read skew" (only looks at event_id)
  • _persist_event_auth_chain_txn
    • Acceptable "lost updates" (it's fine to delete from event_auth_chain_to_calculate when the auth chain is calculated)
    • Acceptable "read skew" (only looks at event_id)
  • _store_rejected_events_txn
    • N/A "lost updates" (only INSERT)
    • N/A "read skew" (only INSERT)
  • _update_metadata_tables_txn
    • [Acceptable/N/A] "lost updates"
    • [Acceptable/N/A] "read skew"
  • _update_current_state_txn
    • ❌ "lost updates" (room_version never changes, mostly constrained to the room_id, there is a read-modify-write cycle between current_state_events and device_lists_remote_extremeties)
    • ❌ "read skew" (mostly constrained to the room_id, there is a read-modify-write cycle between current_state_events and device_lists_remote_extremeties)
  • _update_sliding_sync_tables_with_new_persisted_events_txn
    • Acceptable "lost updates" (constrained to room_id)
    • Acceptable "read skew" (constrained to room_id)

_purge_room_txn

We already run _purge_room_txn once with IsolationLevel.READ_COMMITTED and once with the default isolation level. It's unclear to me why the second run doesn't also use IsolationLevel.READ_COMMITTED; it feels like we could change that without consequence. In terms of why we're running this twice, there are a few notes at matrix-org/synapse#12942 (comment)

A Contributor commented:

FWIW: even putting the isolation aside, it's worth noting that under READ COMMITTED, the UPDATE query will block until the other transaction commits. So we still get some bad effect from bottlenecking on one row — not sure if we care or not though.

@@ -0,0 +1 @@
Add `total_event_count`, `total_message_count`, and `total_e2ee_event_count` fields to the homeserver usage statistics.
@MadLittleMods (Contributor, Author) commented May 6, 2025

Workaround

We already report synapse_storage_events_persisted_events and synapse_storage_events_persisted_events_sep metrics, which end up with a _total suffix when reported, so you will see synapse_storage_events_persisted_events_sep_total for example. Don't be confused by this _total suffix: it's just a Prometheus convention for counters, and it tracks cumulative increments, not the absolute event count in Synapse.

We additionally have a set of Prometheus rules to collect these metrics as synapse_storage_events_persisted_by_event_type.

These can be presented in Grafana:

Grafana panel JSON
{
  "datasource": {
    "uid": "$datasource",
    "type": "prometheus"
  },
  "fieldConfig": {
    "defaults": {
      "custom": {
        "lineWidth": 1,
        "fillOpacity": 80,
        "gradientMode": "none",
        "axisPlacement": "auto",
        "axisLabel": "",
        "axisColorMode": "text",
        "axisBorderShow": false,
        "scaleDistribution": {
          "type": "linear"
        },
        "axisCenteredZero": false,
        "hideFrom": {
          "tooltip": false,
          "viz": false,
          "legend": false
        },
        "thresholdsStyle": {
          "mode": "off"
        }
      },
      "color": {
        "mode": "palette-classic"
      },
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "red",
            "value": 80
          }
        ]
      },
      "min": 0,
      "unit": "none"
    },
    "overrides": [
      {
        "__systemRef": "hideSeriesFrom",
        "matcher": {
          "id": "byNames",
          "options": {
            "mode": "exclude",
            "names": [
              "m.room.encrypted"
            ],
            "prefix": "All except:",
            "readOnly": true
          }
        },
        "properties": [
          {
            "id": "custom.hideFrom",
            "value": {
              "viz": true,
              "legend": false,
              "tooltip": false
            }
          }
        ]
      }
    ]
  },
  "gridPos": {
    "h": 7,
    "w": 12,
    "x": 12,
    "y": 37
  },
  "id": 46,
  "options": {
    "orientation": "auto",
    "xTickLabelRotation": 0,
    "xTickLabelSpacing": 0,
    "showValue": "auto",
    "stacking": "normal",
    "groupWidth": 0.85,
    "barWidth": 0.97,
    "barRadius": 0,
    "fullHighlight": false,
    "tooltip": {
      "mode": "single",
      "sort": "none"
    },
    "legend": {
      "showLegend": true,
      "displayMode": "list",
      "placement": "bottom",
      "calcs": []
    }
  },
  "pluginVersion": "10.4.1",
  "targets": [
    {
      "datasource": {
        "uid": "$datasource"
      },
      "editorMode": "code",
      "expr": "sum by (type) (increase(synapse_storage_events_persisted_by_event_type{job=~\"$job\",index=~\"$index\",instance=\"$instance\"}[1d]))",
      "format": "time_series",
      "instant": false,
      "intervalFactor": 1,
      "legendFormat": "__auto",
      "refId": "A",
      "step": 20,
      "interval": "1d"
    }
  ],
  "title": "Events per day by Type",
  "type": "barchart"
}

[Graph in Grafana showing daily counts of sent m.room.encrypted messages]

If you retain all of your Prometheus data forever, you can go one step further and derive absolute counts from this, but it's common practice to only persist the last X months of metrics. To get absolute values, you will have to manually keep track of and sum these daily counts.
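For illustration, one way to do that summing against the Prometheus HTTP API (a sketch; the URL, metric selector, and retention window are assumptions):

```python
import time

import requests

PROM_URL = "http://localhost:9090"  # placeholder
QUERY = 'sum(increase(synapse_storage_events_persisted_by_event_type{type="m.room.message"}[1d]))'

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": time.time() - 30 * 86400,  # however far back your retention goes
        "end": time.time(),
        "step": "1d",  # one point per day, so summing the points approximates the total
    },
)
resp.raise_for_status()
series = resp.json()["data"]["result"]
total = sum(float(value) for _, value in series[0]["values"]) if series else 0
print(f"approximate m.room.message events over the window: {total:.0f}")
```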

Given that this PR would require further effort to figure out the concurrency problems, or a refactor to a different solution, we're going to opt to close it in favor of this workaround. The people who have to collate these reports manually collect data anyway, so we will just have them continue with the database queries or these metrics.

Overall, this kind of feature seems fine but turned out to be a lot more effort than originally planned and we don't want to continue if people can live without it for now.
