Add total event, unencrypted message, and e2ee event counts to stats reporting #18260

Merged
merged 36 commits into develop from anoa/export_total_message_count
Apr 15, 2025

Conversation

@anoadragon453 (Member) commented Mar 21, 2025:

Adds total message counts to homeserver stats reporting. This is primarily to comply with the TI-Messenger spec, which requires each homeserver to report "Number of message events as an integer, cumulative".

Recommended to review commit-by-commit.

Dev notes

```
poetry run trial tests.metrics.test_phone_home_stats
poetry run trial tests.storage.test_event_stats
```

Test with Postgres:

```
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres poetry run trial tests.metrics.test_phone_home_stats
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres poetry run trial tests.storage.test_event_stats
```

Todo

  • Bump delta synapse/storage/schema/main/delta/90
    • Make sure 9002 reference is updated
  • Fix background updates
  • Add background update tests if possible

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

Add total message counts to homeserver stats reporting. This is
primarily to comply with [the TI-Messenger spec](https://gemspec.gematik.de/docs/gemSpec/gemSpec_Perf/gemSpec_Perf_V2.55.1/index.html#A_23119-03), which requires each
homeserver to report "Number of message events as an integer,
cumulative".
We didn't appear to have any tests for the phone home stats yet, so this
commit adds a unit test for every field, including those added in this
PR.

It works by first performing some activity (creating users, joining
rooms, sending messages), then generating stats and checking their
output.

Relies on newly added guest user support in the user shared-secret
registration Admin API (PR'd separately).
@anoadragon453 anoadragon453 marked this pull request as ready for review March 21, 2025 19:53
@anoadragon453 anoadragon453 requested a review from a team as a code owner March 21, 2025 19:53
Comment on lines 138 to 139
WHERE type = 'm.room.message'
AND state_key IS NULL
@MadLittleMods (Contributor) commented Mar 21, 2025:

Drive-by: There isn't an index for `type` or `state_key` on the `events` table. This probably won't play nice with the database.

Perhaps we're okay with the full table scan?

We do have similar queries for the daily counts, but they are restricted to only scan over the current day's worth of messages using `stream_ordering`.

@anoadragon453 (Member, Author):

Very good point. I just tested on matrix.org, and it takes ~5 minutes to run this query. This isn't the worst thing for a metrics job that runs every 3 hours. However, it is concerning that a connection to the DB would be tied up for so long.

We could eliminate that concern by chunking the count query into batches on `stream_ordering` (which does have an index). But it would still take significantly longer to generate the metrics than it does today.

We could also add a partial index on `WHERE state_key IS NULL AND (type = 'm.room.message' OR type = 'm.room.encrypted')`. This would take up significantly less disk space than a full index on both `type` and `state_key`, while still making the query extremely quick.
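
For illustration, a minimal sketch of such a partial index (the index name is hypothetical; the PR ultimately took a different approach):

```
-- Hypothetical partial index: it only covers non-state message/encrypted
-- events, so it is far smaller than a full index on (type, state_key).
CREATE INDEX CONCURRENTLY events_message_counts_idx
    ON events (type)
    WHERE state_key IS NULL
      AND (type = 'm.room.message' OR type = 'm.room.encrypted');
```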

The events table on matrix.org right now is ~4300GB. m.room.{encrypted,message} events make up ~86.5% of the table, with the vast majority of those rows having state_key = NULL, so the index would be roughly ~750GB:

Partial Index Size ≈ Table Size × % of Matching Rows × Index Ratio (10-30%)
750GB ≈ 4300GB × 0.865 × 0.20

A full index across both type and state_key would be 1.5–2.5TB. A partial index does reduce the flexibility of the queries we can make, but I don't think we should add indexes in the hope of using them in the future, especially if it comes at the cost of a lot of disk space.

We'd need to add a background update to compute the index. And then, presumably, set these fields to 0 until the partial index has finished being added.

Does that sound like a reasonable path forward?

Contributor:

Feels like a waste for numbers that probably won't be looked at. Are these stats interesting at all beyond our daily counts?

We do have `daily_user_type_xxx` vs `total_users` and `daily_active_rooms` vs `total_room_count`. So this is pretty much the equivalent for `daily_sent_messages` and `daily_sent_e2ee_messages` 👍


Overall, this seems like an OK plan, and a necessary evil if we want this feature.

@anoadragon453 (Member, Author):

@erikjohnston mentioned internally that such a large index would cause significant strain on performance, as you'd need to pull the 750GB index from disk each time in order to perform the count.

@reivilibre suggested that we instead keep track of the stream_ordering as we scan. We'd only need to do a full scan once; subsequently, a very fast query can count just the rows added since the last scan. This eliminates the need for an index while making every query other than the first very fast.

However, it doesn't account for the data becoming out of sync over time as rooms and events are deleted by users. To account for this, you could do a full rescan every so often to reset the error drift.
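
A rough sketch of that incremental approach (the `stats_cursor` table and its column are hypothetical, purely to illustrate the idea):

```
-- Count only the events persisted since the last scan; the hypothetical
-- stats_cursor table records where the previous scan stopped.
SELECT COUNT(*) AS new_messages
FROM events
WHERE stream_ordering > (SELECT last_stream_ordering FROM stats_cursor)
  AND type = 'm.room.message'
  AND state_key IS NULL;

-- Afterwards, advance the cursor to the highest stream_ordering scanned,
-- so the next run only looks at newer rows.
```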

Contributor:

I am never sure how I feel about suggesting this, but you could also, in theory, use triggers. Perhaps using triggers to decrement the count when an event is deleted would not be the worst thing ever.
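
As a rough sketch of the trigger idea, simplified to the total count only (function and trigger names are assumptions; the PR's actual trigger, excerpted later in this thread, also tracks message and e2ee counts):

```
-- Simplified Postgres trigger keeping a running count in event_stats.
CREATE OR REPLACE FUNCTION event_stats_adjust() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE event_stats SET total_event_count = total_event_count + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE event_stats SET total_event_count = total_event_count - 1;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER event_stats_trigger
    AFTER INSERT OR DELETE ON events
    FOR EACH ROW EXECUTE FUNCTION event_stats_adjust();
```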

@anoadragon453 (Member, Author):

That would add slightly more processing time to each event insert and removal, but likely a negligible amount? And it would completely eliminate the massive query needed to count everything. I like this idea.

We'd still need a background update to initially populate the table, but that's reasonable.

I think I'll give that a shot. It's certainly much simpler than a batch job running on a timer. Thanks for suggesting it!

Contributor:

Updated to add a background update which adds the triggers and populates the `event_stats` table from the existing events.

@MadLittleMods MadLittleMods requested a review from a team March 21, 2025 23:52
@MadLittleMods MadLittleMods self-assigned this Apr 7, 2025
@MadLittleMods MadLittleMods changed the title Add total message and e2ee event counts to stats reporting Add total events, unencrypted message, and e2ee event counts to stats reporting Apr 7, 2025
Python interpreting `%` in the string instead of SQL
This avoids the pitfall of events being lost in the time
gap if we first populated the `event_stats` table and then tried
to add the triggers.

We also get to avoid double-counting issues from trying to keep track
of where the triggers were added, if that happens in the delta itself.
@MadLittleMods MadLittleMods changed the title Add total events, unencrypted message, and e2ee event counts to stats reporting Add total event, unencrypted message, and e2ee event counts to stats reporting Apr 7, 2025
```
  File "/home/eric/Documents/github/element/synapse/synapse/storage/databases/main/events_bg_updates.py", line 2703, in _add_triggers_txn
    txn.execute(
  File "/home/eric/Documents/github/element/synapse/synapse/storage/database.py", line 427, in execute
    self._do_execute(self.txn.execute, sql, parameters)
  File "/home/eric/Documents/github/element/synapse/synapse/storage/database.py", line 489, in _do_execute
    return func(sql, *args, **kwargs)
psycopg2.errors.SyntaxError: syntax error at or near "EXCEPTION"
LINE 8:                     EXCEPTION
```
Comment on lines +2774 to +2780
WITH event_batch AS (
    SELECT *
    FROM events
    WHERE stream_ordering > ? AND stream_ordering <= ?
    ORDER BY stream_ordering ASC
    LIMIT ?
),
Contributor:

We first grab the events relevant to this batch. It's possible that there are no rows in this range.

Comment on lines +2789 to +2793
UNION ALL

SELECT null, 0, 0, 0
WHERE NOT EXISTS (SELECT 1 FROM event_batch)
LIMIT 1
@MadLittleMods (Contributor) commented Apr 8, 2025:

Even though we have `COALESCE(..., 0)` above, that whole SELECT will only return a row if there are rows in `event_batch`.

So we use the `UNION ALL` with this fallback so that we return default 0 counts when there are no rows in `event_batch`.
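
As a standalone illustration of the pattern (simplified to a per-type count; the PR's real query computes several totals at once):

```
-- A grouped aggregate over an empty input produces no rows at all, so the
-- UNION ALL branch supplies a default row whenever event_batch is empty.
WITH event_batch AS (
    SELECT * FROM events
    WHERE stream_ordering > 100 AND stream_ordering <= 200
)
SELECT type, COUNT(*)
FROM event_batch
GROUP BY type

UNION ALL

SELECT null, 0
WHERE NOT EXISTS (SELECT 1 FROM event_batch);
```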

```
@@ -19,7 +19,7 @@
 #
 #

-SCHEMA_VERSION = 91  # remember to update the list below when updating
+SCHEMA_VERSION = 92  # remember to update the list below when updating
```
Contributor:

Using 92, as 91 already shipped in Synapse 1.128.0rc1 (2025-04-01) (91 was bumped in #18277).

Comment on lines +166 to +167
Changes in SCHEMA_VERSION = 91
- TODO
Contributor:

Just marking this to fill in as a future task (not in this PR).

Relevant PR for 91: #18277

@MadLittleMods MadLittleMods marked this pull request as ready for review April 8, 2025 21:11
Comment on lines +2657 to +2659
IF TG_OP = 'INSERT' THEN
    -- Always increment total_event_count
    UPDATE event_stats SET total_event_count = total_event_count + 1;
@reivilibre (Contributor) commented Apr 14, 2025:

Not a proper review, but I wanted to ensure this is considered: do we know how this will interact with transactions (particularly for the frequent insertion case, rather than the less frequent deletion case)?

I am a pinch worried that bottlenecking all event persistence transactions on one row could be problematic, either by adding extra latency due to blocking, or by causing some transactions to need retries due to `could not serialize access due to concurrent update`.

I guess a burn-in test on m.org would suffice.
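
To make the worry concrete, a sketch of the contention scenario under the single-row `event_stats` design (illustrative pseudo-SQL, not from the PR):

```
-- Session A:
BEGIN;
INSERT INTO events (...) VALUES (...);  -- trigger locks the event_stats row

-- Session B, concurrently:
BEGIN;
INSERT INTO events (...) VALUES (...);  -- trigger's UPDATE blocks on A's lock
                                        -- (or, under REPEATABLE READ/SERIALIZABLE,
                                        -- fails with "could not serialize access
                                        -- due to concurrent update")

-- Session A:
COMMIT;  -- only now can session B's trigger proceed
```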

@reivilibre (Contributor) commented Apr 23, 2025:

→ looks like this is indeed troublesome: #18349 :(

@reivilibre reivilibre requested a review from a team April 14, 2025 14:59
@anoadragon453 (Member, Author) left a comment:

LGTM on the whole. Thanks for updating this PR @MadLittleMods.

I haven't considered https://github.com/element-hq/synapse/pull/18260/files#r2042334703 yet. Testing on matrix.org would be one option... as long as we have a rapid way to pause the bg update should it become a drag on resources.

Comment on lines +167 to +171
# We expect these values to double as the background update is being run *again*
# and will double-count the `events`.
self.assertEqual(self.get_success(self.store.count_total_events()), 48)
self.assertEqual(self.get_success(self.store.count_total_messages()), 20)
self.assertEqual(self.get_success(self.store.count_total_e2ee_events()), 10)
@anoadragon453 (Member, Author):

I worry slightly that the background update can't just be run again to reset the stats back to a known-good state.

I suppose guidance to anyone who has run into an inconsistency would be to clear the stats in the database, and then run the bg update?

nit: not a blocking comment, just checking my understanding.

Contributor:

> I suppose guidance to anyone who has run into an inconsistency would be to clear the stats in the database, and then run the bg update?

Yes, that should work 👍

Member:

They would need to clear the stats while Synapse is down for that to work. If doing it with Synapse running, the triggers will already be installed and will continue to increase the counts, so the triggers would need to be removed before clearing the stats and re-running the bg update.
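
A sketch of that manual reset (the trigger name is hypothetical; check the actual names in your schema first):

```
-- With Synapse stopped:
DROP TRIGGER IF EXISTS event_stats_trigger ON events;  -- hypothetical name
DELETE FROM event_stats;
-- Then re-run the background update, which re-adds the triggers
-- and repopulates event_stats from the existing events.
```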

Contributor:

We could update the background update to always clear the event_stats table when it adds the triggers. That way, we always get the correct count regardless of how many times the background update is run.

TODO

@devonh (Member) left a comment:

I have the same concerns about what kind of performance impact this is going to have, but trying it out in an RC seems like the best course of action.

The logic all looks correct.

@anoadragon453 (Member, Author):

I suggest we merge this for now and try it out. If successful, we'll tackle #18260 (comment) (and some documentation for it) in a follow-up PR.

@anoadragon453 anoadragon453 merged commit a832375 into develop Apr 15, 2025
41 checks passed
@anoadragon453 anoadragon453 deleted the anoa/export_total_message_count branch April 15, 2025 14:49
devonh added a commit that referenced this pull request Apr 16, 2025
Revert "Add total event, unencrypted message, and e2ee event counts to stats reporting" (#18346)

Reverts #18260

It is causing a failure when building release debs for `debian:bullseye`
with the following error:
```
sqlite3.OperationalError: near "RETURNING": syntax error
```
MadLittleMods added a commit that referenced this pull request Apr 29, 2025
sandhose added a commit that referenced this pull request Apr 30, 2025
See #18260

This is useful for anyone who tried Synapse v1.129.0rc1 out.

Fixes #18349

To test:

- checkout v1.129.0rc1 and start
- check that the events table has the trigger (`\dS events` with postgres)
- checkout this PR and start
- check that the events table doesn't have the trigger anymore