Add total event, unencrypted message, and e2ee event counts to stats reporting (v2) #18371
Conversation
…reporting (#18260) Co-authored-by: Eric Eastwood <[email protected]>
This was the original reason the PR was reverted, see #18346

> The `RETURNING` syntax has been supported by SQLite since version 3.35.0 (2021-03-12).
>
> *-- https://www.sqlite.org/lang_returning.html*

Synapse supports...

> The oldest supported version of SQLite is the version [provided](https://packages.debian.org/bullseye/libsqlite3-0) by [Debian oldstable](https://wiki.debian.org/DebianOldStable).
>
> *-- https://element-hq.github.io/synapse/latest/deprecation_policy.html*

which currently is https://packages.debian.org/bullseye/sqlite3 -> `3.34.1-3+deb11u1`

We have `self.db_pool.engine.supports_returning` to detect whether we can use `RETURNING`.
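A rough sketch of how that flag could gate the query (hypothetical code, not the PR's actual implementation; the method name and exact SQL are invented for illustration):

```python
# Hypothetical sketch: fall back to a separate SELECT when the engine does
# not support RETURNING (e.g. SQLite older than 3.35.0).
def _bump_total_event_count_txn(self, txn) -> int:
    if self.db_pool.engine.supports_returning:
        txn.execute(
            "UPDATE event_stats"
            " SET total_event_count = total_event_count + 1"
            " RETURNING total_event_count"
        )
    else:
        txn.execute(
            "UPDATE event_stats SET total_event_count = total_event_count + 1"
        )
        txn.execute("SELECT total_event_count FROM event_stats")
    (new_count,) = txn.fetchone()
    return new_count
```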
We should populate the `event_stats` table when we add the `TRIGGER` so we always get the correct count regardless of how many times the background update is run.
tok=user_2_token,
)

def test_concurrent_event_insert(self) -> None:
I could use a few pointers on how best to architect this test. I'd like to start a transaction that involves the `event_stats` table, then try to insert a new event to try to reproduce the concurrency error `psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update` (#18349). The problem with the current test is that the transaction just blocks everything: `asdf1` is printed but `asdf2` is not. I'm not sure how best to release control back to the GIL to run other tasks and continue the test.

I'm assuming the real-life scenario that caused this error happens when you have multiple stream writer workers for the `events` stream (multiple `event_persister`s), since Synapse "experimentally supports having multiple writer workers, where load is sharded between them by room ID. Each writer is called an event persister." (source). You then send events into different rooms at the same time so they are handled by separate workers.
I think part of the problem is that `ThreadedMemoryReactorClock` "doesn't really use threads" and uses `make_fake_db_pool(...)`, which "runs db queries synchronously [...] on the test reactor's main thread".
Lines 699 to 705 in d59bbd8
"""Wrapper for `make_pool` which builds a pool which runs db queries synchronously. | |
For more deterministic testing, we don't use a regular db connection pool: instead | |
we run all db queries synchronously on the test reactor's main thread. This function | |
is a drop-in replacement for the normal `make_pool` which builds such a connection | |
pool. | |
""" |
Lines 638 to 641 in d59bbd8
# [1]: we replace the threadpool backing the db connection pool with a
# mock ThreadPool which doesn't really use threads; but we still use
# reactor.callFromThread to feed results back from the db functions to the
# main thread.
So no matter how much I try to split things in threads for parallel execution, it just blocks.
Other references:
- `defer_to_thread(...)` (thanks for pointing this one out @anoadragon453)
- `self.reactor.callFromThread(...)`
I still don't have a solution for this.
I've also tried using a `BaseMultiWorkerStreamTestCase` with sharded "event_persisters" (multiple workers for the `events` stream) to mimic the real-life scenario that I think could reproduce this error (see `test_sharded_event_persisters`), but only the first event in each room seems to send (requires more investigation). This kind of test doesn't guarantee a reproduction either, as the stars could align and events could send fine. Additionally, as `BaseMultiWorkerStreamTestCase` is implemented right now, I don't think it will work: all of the workers share the same reactor, so at best they send one at a time without concurrency conflicts.
if isinstance(txn.database_engine, Sqlite3Engine):
    txn.execute(
        """
        CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
I'm concerned that `TRIGGER`s may not be a viable option overall because of the concurrency problems (`psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update`) with the default `IsolationLevel.READ_COMMITTED` for database transactions.

When we insert events, we have pretty giant transactions that also update all of the associated tables in one go. I'm concerned that changing the isolation level of all of the transactions where we insert/delete `events` will unacceptably degrade performance (we'd probably need `IsolationLevel.SERIALIZABLE`; need to investigate whether `IsolationLevel.REPEATABLE_READ` is sufficient).
Any good alternatives or words of encouragement just to try it?
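For context, a minimal sketch of what such a counter trigger can look like on SQLite (hypothetical; not the PR's exact trigger body, and assuming an `event_stats` table holding a single row of running counters):

```python
# Hypothetical sketch of registering a counter trigger on SQLite (not the
# PR's exact trigger). The trigger runs inside whatever transaction inserts
# the event, which is why the isolation level of that transaction matters.
from synapse.storage.engines import Sqlite3Engine


def _add_event_stats_trigger_txn(txn) -> None:
    if isinstance(txn.database_engine, Sqlite3Engine):
        txn.execute(
            """
            CREATE TRIGGER IF NOT EXISTS event_stats_events_insert_trigger
            AFTER INSERT ON events
            BEGIN
                UPDATE event_stats
                SET total_event_count = total_event_count + 1;
            END;
            """
        )
```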
We can set the isolation level per transaction apparently:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- your queries here
COMMIT;
I'm not sure if we have control over that, but that may be preferable to changing the whole database over?
@anoadragon453 We can definitely set the isolation level on a per-transaction basis, but the problem is that the triggers run as part of the transaction inserting the `events`. So wherever we insert/delete `events`, we need to change the isolation level, and the transactions involving `events` appear to be pretty large, so using `IsolationLevel.SERIALIZABLE` (the results of the transactions come out the other side as if they were run sequentially) concerns me, as they may end up waiting on each other to finish.
Example of changing isolation level:
synapse/synapse/storage/databases/main/purge_events.py
Lines 345 to 356 in 5ab05e7
# This first runs the purge transaction with READ_COMMITTED isolation level,
# meaning any new rows in the tables will not trigger a serialization error.
# We then run the same purge a second time without this isolation level to
# purge any of those rows which were added during the first.
logger.info("[purge] Starting initial main purge of [1/2]")
await self.db_pool.runInteraction(
    "purge_room",
    self._purge_room_txn,
    room_id=room_id,
    isolation_level=IsolationLevel.READ_COMMITTED,
)
FYI the default transaction isolation level in Synapse is repeatable read:
synapse/synapse/storage/engines/postgres.py
Lines 73 to 75 in 2965c99
self.default_isolation_level = (
    psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ
)
I understand this problem a bit better now. The problem is that our default `IsolationLevel.REPEATABLE_READ` (also known as "snapshot isolation") always gets a consistent snapshot of the database during the transaction and prevents lost updates by aborting offending transactions. For example, if `total_event_count` is `5`, when one transaction runs `UPDATE event_stats SET total_event_count = total_event_count + 1`, it will use the `5` value from its snapshot, and if another concurrent transaction runs, it will also see the `5` value. If we allowed both transactions to succeed, `total_event_count` would end up as `6` instead of the correct value `7`. Postgres is protecting us from this "lost update" problem by aborting the second transaction.

We actually need to use a less strict isolation level like `IsolationLevel.READ_COMMITTED` in order to avoid the `could not serialize access due to concurrent update` serialization errors. This works because `UPDATE event_stats SET total_event_count = total_event_count + 1` is an atomic operation (all the work happens in the database, avoiding a read-modify-write cycle going back and forth between the database and the application) and will simply read whatever value the database has at the time.
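A minimal standalone reproduction of that behaviour (a hypothetical script, not Synapse code; assumes a local Postgres database with an `event_stats` table containing a single counter row):

```python
# Two concurrent REPEATABLE READ transactions take their snapshot while
# total_event_count = 5; the one that commits second is aborted with
# "could not serialize access due to concurrent update".
import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

conn_a = psycopg2.connect("dbname=synapse")  # assumed local database
conn_b = psycopg2.connect("dbname=synapse")
conn_a.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
conn_b.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)

with conn_a.cursor() as cur_a, conn_b.cursor() as cur_b:
    # Both transactions take their snapshot before either one writes.
    cur_a.execute("SELECT total_event_count FROM event_stats")
    cur_b.execute("SELECT total_event_count FROM event_stats")

    # Transaction A increments the counter and commits: the row is now 6.
    cur_a.execute("UPDATE event_stats SET total_event_count = total_event_count + 1")
    conn_a.commit()

    try:
        # Transaction B tries to update a row that changed after its snapshot.
        cur_b.execute("UPDATE event_stats SET total_event_count = total_event_count + 1")
        conn_b.commit()
    except errors.SerializationFailure as e:
        print(e)  # could not serialize access due to concurrent update
        conn_b.rollback()
```

Under `READ COMMITTED`, the second `UPDATE` would instead re-read the newly committed row and the counter would end up at `7`.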
Basically, we need the `TRIGGER` counter logic to run with `IsolationLevel.READ_COMMITTED`, or have the transactions run sequentially and wait for each other (not sure how to do this). Since we can't change the isolation level mid-transaction, we would have to set the whole `events` insert/delete transaction to `IsolationLevel.READ_COMMITTED` (needs investigation into whether that's viable).
I've also tried to research best practices for concurrency-safe atomic counters when running in `REPEATABLE READ` isolation but I'm not finding anything useful.
I'm sure there are other solutions as well (feel free to suggest)
One alternative for `total_event_count` is to just naively add the highest `stream_ordering` and the absolute value of the lowest `stream_ordering` (which will be negative for backfilled events). This doesn't work for the counts of `m.room.message` and `m.room.encrypted` though, and it also doesn't account for deleting messages from the database.
Another alternative, to cover all of the bases, is to regularly tally up the events and keep track of the `stream_ordering` position that we last counted, perhaps using the same `daily` counts we already do. This doesn't account for deleting messages from the database either.
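A rough sketch of that incremental tally (hypothetical; the `event_stats_progress` table, the column names, and the helper are invented for illustration and are not part of Synapse's schema):

```python
# Hypothetical sketch of the "tally since the last position" alternative.
# Uses COUNT(*) FILTER, which both Postgres and recent SQLite support.
def _tally_event_counts_txn(txn) -> None:
    # Read the stream position we last counted up to (0 if we never have).
    txn.execute("SELECT last_counted_stream_ordering FROM event_stats_progress")
    row = txn.fetchone()
    last_pos = row[0] if row else 0

    # Count the new events since that position in a single pass.
    txn.execute(
        """
        SELECT
            COUNT(*),
            COUNT(*) FILTER (WHERE type = 'm.room.message'),
            COUNT(*) FILTER (WHERE type = 'm.room.encrypted'),
            COALESCE(MAX(stream_ordering), ?)
        FROM events
        WHERE stream_ordering > ?
        """,
        (last_pos, last_pos),
    )
    new_total, new_messages, new_e2ee, new_pos = txn.fetchone()

    # Fold the new counts into the running totals and advance the position.
    txn.execute(
        """
        UPDATE event_stats SET
            total_event_count = total_event_count + ?,
            unencrypted_message_count = unencrypted_message_count + ?,
            e2ee_event_count = e2ee_event_count + ?
        """,
        (new_total, new_messages, new_e2ee),
    )
    txn.execute(
        "UPDATE event_stats_progress SET last_counted_stream_ordering = ?",
        (new_pos,),
    )
```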
Re: whether it's viable to change the insert/delete `events` transactions from `IsolationLevel.REPEATABLE_READ` to `IsolationLevel.READ_COMMITTED`

There has already been a desire to get away from `IsolationLevel.REPEATABLE_READ` where possible, see matrix-org/synapse#11567 (#11567)
We will need to audit the relevant transactions to see whether the problems we're no longer protected from with the lower isolation level will cause havoc:
- Is it possible to have a "lost update" scenario?
- Will "read skew" affect our outcome?
We have two places that insert/delete `events` that we need to look at: `_persist_events_txn` and `_purge_room_txn`.
`_persist_events_txn`

`_persist_events_txn` is a pretty big transaction, so it's harder to be confident in a conclusion. I feel like I need a second set of eyes on this. Maybe my criteria are wrong.
- Acceptable "lost updates" means it does some
UPDATE
/DELETE
or read-modify-write cycles but is fine because TODO or it's constrained to theroom_id
which means it's fine because we will only ever process events in a room serially. - N/A "lost updates" means it doesn't apply to this function because it doesn't do any
UPDATE
or read-modify-write cycles. - Acceptable "read skew" means the function does read values (
WHERE
conditions) but it's fine to get an up-to-date value. - N/A "read skew" means it doesn't apply to this function because it doesn't read anything (like a function that only does
INSERT
)
Check list (still have a few to go through):
- `SELECT room_version FROM rooms WHERE room_id = ? FOR SHARE`
  - N/A "lost updates"
  - Acceptable "read skew" (`room_version` doesn't change)
- `_update_room_depths_txn`
  - N/A "lost updates"
  - Acceptable "read skew" (only looks at `room_id`)
- `_update_outliers_txn`
  - [Acceptable/N/A] "lost updates"
  - [Acceptable/N/A] "read skew"
- `_store_event_txn`
  - Acceptable "lost updates" (marking a redaction event as `have_censored = False` isn't harmful, we can always redact again)
  - Acceptable "read skew" (marking a redaction event as `have_censored = False` isn't harmful, we can always redact again)
- `_update_forward_extremities_txn`
  - Acceptable "lost updates" (`DELETE` constrained to `room_id`)
  - Acceptable "read skew" (only looks at `room_id`)
- `_persist_transaction_ids_txn`
  - N/A "lost updates"
  - N/A "read skew"
- `_store_event_state_mappings_txn`
  - Acceptable "lost updates" (only looks at `event_id`)
  - Acceptable "read skew" (only looks at `event_id`)
- `_persist_event_auth_chain_txn`
  - Acceptable "lost updates" (it's fine to delete from `event_auth_chain_to_calculate` when the auth chain is calculated)
  - Acceptable "read skew" (only looks at `event_id`)
- `_store_rejected_events_txn`
  - N/A "lost updates" (only `INSERT`)
  - N/A "read skew" (only `INSERT`)
- `_update_metadata_tables_txn`
  - [Acceptable/N/A] "lost updates"
  - [Acceptable/N/A] "read skew"
- `_update_current_state_txn`
  - ❌ "lost updates" (`room_version` never changes, mostly constrained to the `room_id`, there is a read-modify-write cycle between `current_state_events` and `device_lists_remote_extremeties`)
  - ❌ "read skew" (mostly constrained to the `room_id`, there is a read-modify-write cycle between `current_state_events` and `device_lists_remote_extremeties`)
- `_update_sliding_sync_tables_with_new_persisted_events_txn`
  - Acceptable "lost updates" (constrained to `room_id`)
  - Acceptable "read skew" (constrained to `room_id`)
`_purge_room_txn`

We already run `_purge_room_txn` once with `IsolationLevel.READ_COMMITTED` and once with the default isolation level. It's unclear to me why the second run doesn't also use `IsolationLevel.READ_COMMITTED`; feels like we could change that without consequence. In terms of why we're running this twice, there are a few notes at matrix-org/synapse#12942 (comment)
FWIW: even putting the isolation level aside, it's worth noting that under `READ COMMITTED`, the `UPDATE` query will block until the other transaction commits. So we still get some bad effect from bottlenecking on one row; not sure if we care or not though.
@@ -0,0 +1 @@
Add `total_event_count`, `total_message_count`, and `total_e2ee_event_count` fields to the homeserver usage statistics.
Workaround
We already report `synapse_storage_events_persisted_events` and `synapse_storage_events_persisted_events_sep` metrics, which end up with a `_total` suffix when reported, so you will see `synapse_storage_events_persisted_events_sep_total` for example. Don't be confused by this `_total` suffix: it's just a Prometheus convention for counters, and it tracks cumulative increments, not the absolute event count in Synapse.

We additionally have a set of Prometheus rules to collect these metrics as `synapse_storage_events_persisted_by_event_type`.

This can be presented in Grafana:
Grafana panel JSON
{
"datasource": {
"uid": "$datasource",
"type": "prometheus"
},
"fieldConfig": {
"defaults": {
"custom": {
"lineWidth": 1,
"fillOpacity": 80,
"gradientMode": "none",
"axisPlacement": "auto",
"axisLabel": "",
"axisColorMode": "text",
"axisBorderShow": false,
"scaleDistribution": {
"type": "linear"
},
"axisCenteredZero": false,
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"thresholdsStyle": {
"mode": "off"
}
},
"color": {
"mode": "palette-classic"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"min": 0,
"unit": "none"
},
"overrides": [
{
"__systemRef": "hideSeriesFrom",
"matcher": {
"id": "byNames",
"options": {
"mode": "exclude",
"names": [
"m.room.encrypted"
],
"prefix": "All except:",
"readOnly": true
}
},
"properties": [
{
"id": "custom.hideFrom",
"value": {
"viz": true,
"legend": false,
"tooltip": false
}
}
]
}
]
},
"gridPos": {
"h": 7,
"w": 12,
"x": 12,
"y": 37
},
"id": 46,
"options": {
"orientation": "auto",
"xTickLabelRotation": 0,
"xTickLabelSpacing": 0,
"showValue": "auto",
"stacking": "normal",
"groupWidth": 0.85,
"barWidth": 0.97,
"barRadius": 0,
"fullHighlight": false,
"tooltip": {
"mode": "single",
"sort": "none"
},
"legend": {
"showLegend": true,
"displayMode": "list",
"placement": "bottom",
"calcs": []
}
},
"pluginVersion": "10.4.1",
"targets": [
{
"datasource": {
"uid": "$datasource"
},
"editorMode": "code",
"expr": "sum by (type) (increase(synapse_storage_events_persisted_by_event_type{job=~\"$job\",index=~\"$index\",instance=\"$instance\"}[1d]))",
"format": "time_series",
"instant": false,
"intervalFactor": 1,
"legendFormat": "__auto",
"refId": "A",
"step": 20,
"interval": "1d"
}
],
"title": "Events per day by Type",
"type": "barchart"
}
If you retain all of your Prometheus data forever, you can go one step further and create absolute counts from this, but it's common practice to only persist the last X months of metrics. To get absolute values, you will have to manually keep track and sum these daily counts.
Given that this PR would require further effort to figure out the concurrency problems, or a refactor to a different solution, we're going to opt to close it in favor of this workaround. The people who have to collate these reports manually collect data anyway, so we will just have them continue with the database queries or these metrics.
Overall, this kind of feature seems fine but turned out to be a lot more effort than originally planned and we don't want to continue if people can live without it for now.
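For reference, the kind of ad-hoc database queries involved look roughly like this (a hypothetical helper script, not part of Synapse; assumes direct Postgres access to the homeserver's database and its `events` table):

```python
# Hypothetical ad-hoc script for manually collecting absolute counts straight
# from the homeserver's database, matching the three fields this PR would
# have reported.
import psycopg2

QUERIES = {
    "total_event_count": "SELECT COUNT(*) FROM events",
    "total_message_count": "SELECT COUNT(*) FROM events WHERE type = 'm.room.message'",
    "total_e2ee_event_count": "SELECT COUNT(*) FROM events WHERE type = 'm.room.encrypted'",
}

# Connection string is an assumption; point it at your Synapse database.
with psycopg2.connect("dbname=synapse") as conn, conn.cursor() as cur:
    for name, sql in QUERIES.items():
        cur.execute(sql)
        (count,) = cur.fetchone()
        print(f"{name}: {count}")
```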
Re-introduce #18260 which was reverted.
Adds total message counts to homeserver stats reporting. This is primarily to comply with the TI-Messenger spec, which requires each homeserver to report "Number of message events as an integer, cumulative".
Dev notes
Test with Postgres

`RETURNING` syntax in SQLite

Synapse supports...

which currently is https://packages.debian.org/bullseye/sqlite3 -> `3.34.1-3+deb11u1`

We have `self.db_pool.engine.supports_returning` to detect whether we can use `RETURNING`.
."could not serialize access due to concurrent update"
Related:

- `REPEATABLE READ` everywhere matrix-org/synapse#11567

Postgres

Context #18349
Relevant error:
The default isolation in Postgres is `IsolationLevel.READ_COMMITTED` but we've set the default to be `IsolationLevel.REPEATABLE_READ` in Synapse:

synapse/synapse/storage/engines/postgres.py
Lines 73 to 75 in 2965c99
Docs: https://www.postgresql.org/docs/current/transaction-iso.html
Postgres' `SERIALIZABLE` isolation level is actually "Serializable Snapshot Isolation" (SSI), where it optimistically allows transactions to proceed and aborts them only if guarantees are violated (the serialization errors we're seeing).

SQLite

In SQLite, all transactions are `SERIALIZABLE` by default and, because of the way it works, each write waits its turn to run in serial order, so we won't run into the same concurrency problems when incrementing a counter:
Are `TRIGGER`s still viable?

Discussion around whether `TRIGGER`s are viable: #18371 (comment)

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
The Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems book by Martin Kleppmann is a fantastic reference for trying to understand the problem here.
Todo

- `RETURNING` not being compatible with old versions of SQLite (the original reason the PR was reverted), see Revert "Add total event, unencrypted message, and e2ee event counts to stats reporting" #18346
- Populate the `event_stats` table when we add the `TRIGGER` so we always get the correct count regardless of how many times the background update is run, see Add total event, unencrypted message, and e2ee event counts to stats reporting #18260 (comment)

Pull Request Checklist
- Pull request includes a changelog file. The entry should be a short description which makes sense to users (not e.g. "Moved X method from `EventStore` to `EventWorkerStore`.") and use markdown where necessary, mostly for `code blocks`.
- Code style is correct (run the linters)