Investigate why /messages
is slow #13356
Description
Background
For Matrix Public Archive, we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically
/messages
seems to be the slow culprit and I haven't really heard anything about optimizing it like/join
. From a clean Synapse cache,/messages
often takes 80s - 120s to respond and many times timing out itself from Synapse.Here is the average data for
#matrix:matrix.org
from thegitter.ems.host
homeserver which is sampled every 61 minutes to be outside of the state cache expiry,/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}
?limit=500
: 111.81s (492 events / 129 state events)?limit=100
: 119.504s (92 events / 50 state events)?limit=20
: 112.541s (20 events / 11 state events)After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s
What can we do to make the
/messages
requests faster? Has any thought already been put in this? For existing issues, I see that we might be backfilling when we shouldn't which aligns with the request being slow regardless how big thelimit
is, #10742Caching the response on our end in the archive doesn't make the request fast in the first place.
From the
matrix.org
Grafana, I looked into theRoomMessageListRestServlet
timing and it all appears very fast (5ms - 500ms) and not what I experience. I am guessing it's not good at catching the outliers since the metric seems to just to total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.-- @MadLittleMods, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$7ctkqN00-4HiYP5TT7bnfROfGXP5Lyt7yyWjrg8bGu4?via=matrix.org&via=element.io&via=beeper.com
Traces for reference
These include a bunch of the /messages
instrumentation added:
- 144.93s
/messages
(RoomMessageListRestServlet
) onmatrix.org
Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/dfa0dad8a32255b366461c76b55c8ecb - 135.3s
/messages
(RoomMessageListRestServlet
) onmatrix.org
Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/5b945a1733a414c61be6e393397c3f9d - 154.72s
/messages
(RoomMessageListRestServlet
) onmatrix.org
Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/d3f9c621707081755ff629000c60b3f3 - 185.9s
/messages
(RoomMessageListRestServlet
) onmatrix.org
Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761
Why is it slow?
Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11
This part is WIP. Just documenting some of the log diving I've done for /messages
being slow. Still want to do this on some more requests and hopefully get access to Jaeger to also compare and investigate that way too.
1. Backfill linearizer lock takes forever
- Room:
#matrix:matrix.org
- API URL:
https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
- Request -> response duration:
117.808s
- Logs in Kibana:
GET-6890
- For reference on adjusting the logging config for
gitter.ems.host
(setDEBUG
), see this comment.
- For reference on adjusting the logging config for
Notes:
- Why does linearizer lock take ~45s to aquire lock?
- It doesn't seem to take a lot of time but we don't need to fetch 4.6k backfill points out of the database
- For me on the outside, the request took
117.808s
. Where did the rest of the time go?- Delay in connecting to the server?
- Delay in serializing the response?
- Need to time the requests more precisely when the started and ended to compare against the Synapse logs
Timing summary from looking at the logs:
03:42:43.129
(t+0s
): Received request03:42:43.712
(t+0s
) Waiting to acquire linearizer lock 'room_backfill'03:43:26.806
(t+43s
) Acquired linearizer lock 'room_backfill'03:43:26.825
(t+43s
) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill pointsget_oldest_event_ids_with_depth_in_room
only took .12s to get the 4.6k events
03:43:28.716
(t+45s
) Asking t2bot.io for backill03:43:28.985
(t+45s
) t2bot.io responded with 100 events03:43:29.009
(t+46s
) Starting to verify content hashes03:43:29.105
(t+46s
) Done verifying content hashes03:43:29.106
(t+46s
) Processing the pulled events03:43:29.147
(t+46s
) Done processing 100 backfilled except for$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
03:43:29.348
(t+46s
) Done trying to de-outlier$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
(404/state_ids
from t2bot.io)03:43:34.026
(t+51s
)_get_state_groups_from_groups
start03:43:38.470
(t+55s
)_get_state_groups_from_groups
end03:43:41.112
(t+58s
) Response sent
This isn't a fluke, here is another request I went through the logs on (GET-196
). This time the duration matched the Synapse logs pretty well:
20:38:38.049
(t+0s
) Received request20:38:38.062
(t+0s
) Waiting to acquire linearizer lock 'room_backfill'20:39:23.622
(t+45s
) Acquired linearizer lock 'room_backfill'20:39:23.640
(t+45s
)_maybe_backfill_inner
and pulled 4.6k events out of the database as potential backfill points20:39:25.625
(t+47s
) Asking t2bot.io for backill20:39:35.262
(t+57s
) t2bot.io responded with 100 events20:39:35.283
(t+...
) Starting to verify content hashes20:39:35.382
(t+...
) Done verifying content hashes20:39:35.383
(t+...
) Processing the pulled events20:39:35.424
(t+...
) Done processing 100 backfilled events except for$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
20:39:35.651
(t+...
) Done trying to de-outlier$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
(404/state_ids
from t2bot.io)20:39:43.668
(t+65s
) Response sent
2. Loading tons of events
- Room:
#matrix:matrix.org
- API URL:
https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
- Request -> response duration: ~60s
- Logs in Kibana:
GET-5541
Notes:
- Tons of
get_aggregation_groups_for_event
ofget_recent_references_for_event
calls (2k). Why? - Why did we load 79k events?
- This lines up with the number of
state_events
in the room
- This lines up with the number of
- Why did we redact some events? Is this redacting events in our response or in general?
- Why didn't these things happen in the other calls above?
Timing summary from looking at the logs:
02:09:51.026
(t+0s
) Received request02:09:51.959
(t+1s
)_maybe_backfill_inner
backfill and pulled 4.6k events out of the database as potential backfill points02:09:52.726
(t+2s
) synapse.storage.databases.main.events_worker Loading 79277 events (why?)02:10:10.462
(t+19s
) Also fetching redaction events02:10:10.462
(t+19s
) Loaded 79277 events02:10:23.246
(t+31s
) Done redacting 105 events (why?)02:10:23.779
(t+31s
) Asking t2bot.io for backill02:10:33.290
(t+41s
) t2bot.io responded with 100 events...
(t+...
) 2k calls toget_recent_references_for_event
andget_aggregation_groups_for_event
(why?)02:10:54.261
(t+63s
) Response sent
3. Slow /state_id
requests but also really slow processing
524 timeout: 123.26s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm
524 timeout
2022-07-22 @4-43
123.26s
https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab
- 19.24s
/backfill
request - 53.45s
/state_ids
524 timeout: 117.96s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c
524 timeout
2022-07-22 @4:56
117.96s
https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9
- 921ms
/backfill
request - 47.24s
/state_ids
request
524 timeout: 115.64s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g
524 timeout
2022-7-22 @5:02:33
{
"ok": false,
"timing": 99185.70000004768,
"startTime": 1658527353492,
"endTime": 1658527452680,
"requestId": "p8c09g"
}
115.64s
https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae
- 14.49s
/backfill
request - 54.47s
/state_ids
request
200 ok: 83.51s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw
200 ok
{
"ok": true,
"status": 200,
"timing": 83771.90000009537,
"numEventsReturned": 491,
"numStateEventsReturned": 129,
"startTime": 1658527887072,
"endTime": 1658527970847,
"requestId": "cjfpw",
"traceId": "ae7c694e57113282"
}
83.51s
https://jaeger.proxy.matrix.org/trace/ae7c694e57113282
- 526ms
/backfill
request - 42.77s
/state_ids
request
200 ok: 75.7s
https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k
2022-07-22 @5-27pm
{
"ok": true,
"status": 200,
"timing": 76131.40000009537,
"numEventsReturned": 491,
"numStateEventsReturned": 129,
"startTime": 1658528812471,
"endTime": 1658528888604,
"requestId": "wg6g8k"
"traceId": "d048d9f59c20555c"
}
75.7s
https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c
- 549ms
/backfill
request - 32.58s
/state_ids
request
/messages
timing script
Every 61 minutes (to be outside of the state cache expiry), will call and time /messages
for each room defined. And will do this with ?limit=
500
, 100
, 20
to see if the amount of messages matters on the timing.
Run in the browser. Define let MATRIX_TOKEN = 'xxx';
before running the script.
matrix-messages-timing.js
let resultantTimingData = {};
async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
// via https://stackoverflow.com/a/8084248/796832
const requestId = (Math.random() + 1).toString(36).substring(7);
let res;
let timeBeforeRes;
let timeAfterRes;
let startTime;
let endTime;
try {
startTime = Date.now();
timeBeforeRes = performance.now();
res = await fetch(`https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`, {
headers: {
'Authorization': `Bearer ${MATRIX_TOKEN}`
}
});
} catch(err) {
console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
} finally {
endTime = Date.now();
timeAfterRes = performance.now();
}
const traceId = res?.headers?.['synapse-trace-id'];
let data = {};
if (res?.ok) {
data = await res.json();
}
const timingEntry = {
ok: res?.ok || false,
status: res?.status,
timing: timeAfterRes - timeBeforeRes,
numEventsReturned: data?.chunk?.length,
numStateEventsReturned: data?.state?.length,
startTime,
endTime,
requestId,
traceId
};
return timingEntry;
}
(async () => {
const ROOMS_TO_TEST = {
'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
};
const NUM_MESSAGES_TO_FETCH_MATRIX = [
500,
100,
20
];
const ONE_MINUTE_MS = 60 * 1000;
// The `_state_cache` in Synapse has 60 minute expiry so we chose something just bigger
const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;
const REPEAT = 5;
if (!MATRIX_TOKEN) {
console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
}
for (let i = 0; i < REPEAT; i++) {
for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
const roomId = ROOMS_TO_TEST[roomKey];
const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);
// Default the result map so we don't run into null-pointer exception (NPE) issues
if (!resultantTimingData[roomKey]) {
resultantTimingData[roomKey] = {};
}
if (!resultantTimingData[roomKey][numMessagesToFetch]) {
resultantTimingData[roomKey][numMessagesToFetch] = [];
}
resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);
console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
}
// Wait for the caches to clear out before running another trial against the same room
await new Promise((resolve) => {
setTimeout(resolve, DELAY_BETWEEN_TRIALS);
})
}
}
})()
function getAvg(roomId, numMessages) {
const timingEntries = data1[roomId][numMessages];
return timingEntries
.filter((timingEntry) => timingEntry.ok)
.reduce((total, timingEntry) => (total + timingEntry.timing), 0) / timingEntries.length;
}
getAvg('matrix:matrix.org', "500")