Add total event, unencrypted message, and e2ee event counts to stats reporting #18260
Conversation
Add total message counts to homeserver stats reporting. This is primarily to comply with [the TI-Messenger spec](https://gemspec.gematik.de/docs/gemSpec/gemSpec_Perf/gemSpec_Perf_V2.55.1/index.html#A_23119-03), which requires each homeserver to report "Number of message events as an integer, cumulative".
We didn't appear to have any tests for the phone-home stats yet, so this commit adds a unit test for every field, including the newly added ones. It works by first performing some activity (creating users, joining rooms, sending messages), then generating the stats and checking their output. Relies on newly added guest user support in the user shared-secret registration Admin API (PR'd separately).
```sql
WHERE type = 'm.room.message'
  AND state_key IS NULL
```
Drive-by: There isn't an index for `type` or `state_key` on the `events` table. This probably won't play nice with the database. Perhaps we're okay with the full table scan?

We do have similar queries for the daily counts, but they are restricted to only scan over the current day's worth of messages using `stream_ordering`.
Very good point. I just tested on matrix.org, and it takes about ~5m to run this query. That isn't the worst thing for a metrics job that runs every 3hrs, but it is concerning that a connection to the DB would be taken up for so long.

We could eliminate that concern by chunking the count query, batching on `stream_ordering` (which does have an index). But generating the metrics would still take significantly longer than it does today.
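For illustration, a batched count could look something like the sketch below. The `:low_bound` and `:batch_size` parameters and the surrounding loop are assumptions, not part of this PR:

```sql
-- Count m.room.message events in one stream_ordering chunk at a time.
-- The caller would loop, advancing :low_bound by :batch_size until it passes
-- the current maximum stream_ordering, and sum the per-chunk results.
SELECT COUNT(*) AS message_count
FROM events
WHERE stream_ordering > :low_bound
  AND stream_ordering <= :low_bound + :batch_size
  AND type = 'm.room.message'
  AND state_key IS NULL;
```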
We could also add a partial index on `WHERE state_key IS NULL AND (type = 'm.room.message' OR type = 'm.room.encrypted')`. This would take up significantly less disk space than a full index on both `type` and `state_key`, while still making the query extremely quick.
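Something like the following sketch (the index name is made up; in practice Synapse would build it via a background index update, which on Postgres amounts to `CREATE INDEX CONCURRENTLY`):

```sql
-- Partial index covering only the rows the stats query cares about.
-- Built CONCURRENTLY so it doesn't block writes to the events table.
CREATE INDEX CONCURRENTLY IF NOT EXISTS events_message_type_idx
    ON events (type)
    WHERE state_key IS NULL
      AND (type = 'm.room.message' OR type = 'm.room.encrypted');
```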
The `events` table on matrix.org right now is ~4300GB. `m.room.{encrypted,message}` make up ~86.5% of the table, with the extreme majority of those rows having `state_key = NULL`; so the index would be roughly ~750GB:
```
Partial Index Size ≈ Table Size × % of Matching Rows × Index Ratio (10-30%)
             750GB ≈ 4300GB × 0.865 × 0.20
```
A full index across both `type` and `state_key` would be 1.5 - 2.5TB. A partial index does reduce the flexibility of queries we can make, but I don't think we should add indexes in the hope of using them in the future, especially if it comes at the cost of a lot of disk space.

We'd need to add a background update to compute the index, and then presumably report these fields as `0` until the partial index has finished being added.

Does that sound like a reasonable path forward?
Feels like a waste for numbers that probably won't be looked at. Are these stats interesting at all beyond our daily counts?
We do have `daily_user_type_xxx` vs `total_users`, and `daily_active_rooms` vs `total_room_count`. So this is pretty much the equivalent for `daily_sent_messages` and `daily_sent_e2ee_messages` 👍

Overall, seems like an OK plan and a necessary evil if we want this feature.
@erikjohnston mentioned internally that such a large index would cause significant strain on performance, as you'd need to pull the 750GB index from disk each time in order to perform the count.

@reivilibre suggested that instead we keep track of the `stream_ordering` position when we scan. We'd only need to do a full scan once; after that, a very fast query can scan just the rows that have been added since the last scan. This eliminates the need for an index and makes every query other than the first very fast.

However, it doesn't account for the data becoming out of sync over time as rooms and events are deleted by users. To account for this, you could do a full rescan every so often to reset the error drift.
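A minimal sketch of that incremental approach (the `stats_progress` bookkeeping table and its column are hypothetical, purely to illustrate the idea):

```sql
-- Count only the message events persisted since the last recorded position.
-- The caller would add new_messages to the running total and store
-- new_position back into stats_progress; a periodic full rescan would
-- correct any drift from deletions/purges.
SELECT COUNT(*) AS new_messages,
       MAX(stream_ordering) AS new_position
FROM events
WHERE stream_ordering > (SELECT last_stream_ordering FROM stats_progress)
  AND type = 'm.room.message'
  AND state_key IS NULL;
```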
I am never sure how I feel about suggesting this, but you could also, in theory, use triggers. Perhaps using triggers to decrement the count when an event is deleted would not be the worst thing ever.
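A rough sketch of what that could look like on Postgres — the single-row `event_stats` table shape and the trigger/function names here are illustrative, not the final implementation:

```sql
-- Single-row table holding the running total.
CREATE TABLE IF NOT EXISTS event_stats (
    total_event_count BIGINT NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION event_stats_increment() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE event_stats SET total_event_count = total_event_count + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE event_stats SET total_event_count = total_event_count - 1;
    END IF;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

-- PostgreSQL 11+ syntax (older versions use EXECUTE PROCEDURE).
CREATE TRIGGER event_stats_events_trigger
    AFTER INSERT OR DELETE ON events
    FOR EACH ROW EXECUTE FUNCTION event_stats_increment();
```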
That would add slightly more processing time to each event insert and removal, but likely a negligible amount? And it would completely eliminate the massive query needed to count everything. I like this idea.

We'd still need a background update to initially populate the table, but that's reasonable.

I think I'll give that a shot. It's certainly much simpler than a batch job running on a timer. Thanks for suggesting it!
Updated to add a background job which adds the triggers and populates the `event_stats` table from the existing `events` ⏩
Conflicts: synapse/types/storage/__init__.py
```
sqlite3.ProgrammingError: You can only execute one statement at a time.
```
Fix Python interpreting `%` in the string as a parameter placeholder instead of passing it through to the SQL
This avoids the pitfall of events being lost in the time gap if we first populated the `event_stats` table and then tried to add the triggers. We also avoid double-counting issues from trying to keep track of where the triggers were added if that happens in the delta itself.
``` File "/home/eric/Documents/github/element/synapse/synapse/storage/databases/main/events_bg_updates.py", line 2703, in _add_triggers_txn txn.execute( File "/home/eric/Documents/github/element/synapse/synapse/storage/database.py", line 427, in execute self._do_execute(self.txn.execute, sql, parameters) File "/home/eric/Documents/github/element/synapse/synapse/storage/database.py", line 489, in _do_execute return func(sql, *args, **kwargs) psycopg2.errors.SyntaxError: syntax error at or near "EXCEPTION" LINE 8: EXCEPTION ```
```sql
WITH event_batch AS (
    SELECT *
    FROM events
    WHERE stream_ordering > ? AND stream_ordering <= ?
    ORDER BY stream_ordering ASC
    LIMIT ?
),
```
We first grab the events relevant to this batch. It's possible that there are no rows in this range.
```sql
UNION ALL

SELECT null, 0, 0, 0
WHERE NOT EXISTS (SELECT 1 FROM event_batch)
LIMIT 1
```
Even though we have `COALESCE(..., 0)` above, that whole `SELECT` will only return a row if there are rows in the `event_batch`.

So we use the `UNION ALL` with this fallback so that we return default `0` counts when there are no rows in the `event_batch`.
```diff
@@ -19,7 +19,7 @@
 #
 #

-SCHEMA_VERSION = 91  # remember to update the list below when updating
+SCHEMA_VERSION = 92  # remember to update the list below when updating
```
Using `92`, as `91` already shipped in Synapse 1.128.0rc1 (2025-04-01) (bumped from #18277).
```
Changes in SCHEMA_VERSION = 91
- TODO
```
Just marking this to fill in as a future task (not in this PR).

Relevant PR for 91: #18277
```sql
IF TG_OP = 'INSERT' THEN
    -- Always increment total_event_count
    UPDATE event_stats SET total_event_count = total_event_count + 1;
```
Not a proper review, but wanted to ensure this is considered: do we know how this will interact with transactions (particularly asking about the frequent insertion case, rather than the less frequent deletion case)?

I am a pinch worried that bottlenecking all event persistence transactions on one row could be problematic, either by adding extra latency due to blocking, or by causing some transactions to need retries due to `could not serialize access due to concurrent update`.

I guess a burn-in test on m.org would suffice.
→ looks like this is indeed troublesome: #18349 :(
LGTM on the whole. Thanks for updating this PR @MadLittleMods.
I haven't considered https://github.com/element-hq/synapse/pull/18260/files#r2042334703 yet. Testing on matrix.org would be one option... as long as we have a rapid way to pause the bg update should it become a drag on resources
```python
# We expect these values to double as the background update is being run *again*
# and will double-count the `events`.
self.assertEqual(self.get_success(self.store.count_total_events()), 48)
self.assertEqual(self.get_success(self.store.count_total_messages()), 20)
self.assertEqual(self.get_success(self.store.count_total_e2ee_events()), 10)
```
I worry slightly that the background update can't just be run again to reset the stats back to a known-good state.
I suppose guidance to anyone who has run into an inconsistency would be to clear the stats in the database, and then run the bg update?
nit: not a blocking comment, just checking my understanding.
> I suppose guidance to anyone who has run into an inconsistency would be to clear the stats in the database, and then run the bg update?
Yes, that should work 👍
They would need to clear the stats while Synapse is down for that to work.
If doing it with Synapse running, the triggers will already be installed and will continue to increase the counts.
So they would need to be removed before clearing the stats and re-running the bg update.
We could update the background update to always clear the `event_stats` table when it adds the triggers. That way, we always get the correct count regardless of how many times the background update is run.
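Roughly, the trigger-installation step could reset the counts in the same transaction — a sketch only; the exact column names beyond `total_event_count` are assumptions based on the PR title:

```sql
-- Run in the same transaction that installs the triggers, so no events are
-- missed or double-counted between the reset and the trigger creation.
BEGIN;
UPDATE event_stats
   SET total_event_count = 0,
       unencrypted_message_count = 0,
       e2ee_event_count = 0;
-- ... CREATE TRIGGER statements follow here ...
COMMIT;
```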
TODO
Co-authored-by: Andrew Morgan <[email protected]>
I have the same concerns about what kind of performance impact this is going to have. But trying it out in a RC seems like the best course of action.
The logic all looks correct.
I suggest we merge this for now and try it out. If successful, we'll tackle #18260 (comment) (and some documentation for it) in a follow-up PR.
…reporting (#18260) Co-authored-by: Eric Eastwood <[email protected]>
Adds total message counts to homeserver stats reporting. This is primarily to comply with the TI-Messenger spec, which requires each homeserver to report "Number of message events as an integer, cumulative".
Recommended to review commit-by-commit.

Dev notes
Test with Postgres
Todo
- `synapse/storage/schema/main/delta/90` / `9002` reference is updated