Investigate why `/messages` is slow

## Background

> For [Matrix Public Archive](https://github.com/matrix-org/matrix-public-archive), we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically `/messages` seems to be the slow culprit and I haven't really heard anything about optimizing it like `/join`. From a clean Synapse cache, `/messages` often takes 80s - 120s to respond and many times timing out itself from Synapse.
> 
> Here is the average data for `#matrix:matrix.org` from the `gitter.ems.host` homeserver which is sampled every 61 minutes to be outside of the state cache expiry, `/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}`
> 
> - `?limit=500`: 111.81s (492 events / 129 state events)
> - `?limit=100`:  119.504s (92 events / 50 state events)
> - `?limit=20`:  112.541s (20 events / 11 state events)
> 
> After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s
> 
> What can we do to make the `/messages` requests faster? Has any thought already been put in this? For existing issues, I see that we might be backfilling when we shouldn't which aligns with the request being slow regardless how big the `limit` is, https://github.com/matrix-org/synapse/issues/10742
> 
> Caching the response on our end in the archive doesn't make the request fast in the first place.
> 
> From the `matrix.org` Grafana, I looked into the [`RoomMessageListRestServlet` timing](https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&var-datasource=default&var-bucket_size=$__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_client_reader&var-index=All) and it all appears very fast (5ms - 500ms) and not what I experience. I am guessing it's not good at catching the outliers since the metric seems to just to total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.
>
> *-- @MadLittleMods, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$7ctkqN00-4HiYP5TT7bnfROfGXP5Lyt7yyWjrg8bGu4?via=matrix.org&via=element.io&via=beeper.com*


## Traces for reference

These include a bunch of the `/messages` instrumentation added:

 - 144.93s `/messages` (`RoomMessageListRestServlet`) on `matrix.org` Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/dfa0dad8a32255b366461c76b55c8ecb
 - 135.3s `/messages` (`RoomMessageListRestServlet`) on `matrix.org` Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/5b945a1733a414c61be6e393397c3f9d
 - 154.72s `/messages` (`RoomMessageListRestServlet`) on `matrix.org` Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/d3f9c621707081755ff629000c60b3f3
 - 185.9s `/messages` (`RoomMessageListRestServlet`) on `matrix.org` Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761



## Why is it slow?

**Update:** We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11


This part is WIP. Just documenting some of the log diving I've done for `/messages` being slow. Still want to do this on some more requests and hopefully get access to Jaeger to also compare and investigate that way too.



### 1. Backfill linearizer lock takes forever

 - Room: `#matrix:matrix.org`
 - API URL: `https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678`
 - Request -> response duration: `117.808s`
 - [Logs in Kibana](https://modular-euwest2-kibana.proxy.matrix.org/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2022-07-22T03:42:00.171Z',to:'2022-07-22T03:44:00.171Z'))&_a=(columns:!(message),index:'67745970-aa2d-11ea-96a7-efc84863d0ea',interval:auto,query:(language:kuery,query:'kubernetes.pod.labels.hostname:%22gitter.ems.host%22%20and%20message:%22GET-6890%22%20and%20not%20message:%22synapse.util.metrics%22'),sort:!(!('@timestamp',desc)))): `GET-6890`
    - For reference on adjusting the logging config for `gitter.ems.host` (set `DEBUG`), see [this comment](https://github.com/matrix-org/internal-config/issues/1230#issuecomment-1124399478).

**Notes:**

 - Why does linearizer lock take ~45s to aquire lock?
 - It doesn't seem to take a lot of time but we don't need to fetch 4.6k backfill points out of the database
 - For me on the outside, the request took `117.808s`. Where did the rest of the time go?
    - Delay in connecting to the server?
    - Delay in serializing the response?
    - Need to time the requests more precisely when the started and ended to compare against the Synapse logs

**Timing summary from looking at the logs:**

1. `03:42:43.129` (`t+0s `):	Received request
1. `03:42:43.712` (`t+0s `)	Waiting to acquire linearizer lock 'room_backfill'
1. `03:43:26.806` (`t+43s`)	Acquired linearizer lock 'room_backfill'
1. `03:43:26.825` (`t+43s`)	_maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points 
   - `get_oldest_event_ids_with_depth_in_room` only took .12s to get the 4.6k events
1. `03:43:28.716` (`t+45s`)	Asking t2bot.io for backill
1. `03:43:28.985` (`t+45s`)	t2bot.io responded with 100 events
1. `03:43:29.009` (`t+46s`)	Starting to verify content hashes
1. `03:43:29.105` (`t+46s`)	Done verifying content hashes
1. `03:43:29.106` (`t+46s`)	Processing the pulled events
1. `03:43:29.147` (`t+46s`)	Done processing 100 backfilled except for `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E`
1. `03:43:29.348` (`t+46s`)	Done trying to de-outlier `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E` (404 `/state_ids` from t2bot.io)
1. `03:43:34.026` (`t+51s`)	`_get_state_groups_from_groups` start
1. `03:43:38.470` (`t+55s`)	`_get_state_groups_from_groups` end
1. `03:43:41.112` (`t+58s`)	Response sent

This isn't a fluke, here is another request I went through the [logs](https://modular-euwest2-kibana.proxy.matrix.org/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2022-07-21T20:32:50.331Z',to:'2022-07-21T20:47:47.737Z'))&_a=(columns:!(message),index:'67745970-aa2d-11ea-96a7-efc84863d0ea',interval:auto,query:(language:kuery,query:'kubernetes.pod.labels.hostname:%22gitter.ems.host%22%20and%20message:%22GET-196%22%20and%20not%20message:%22synapse.util.metrics%22'),sort:!(!('@timestamp',desc)))) on (`GET-196`). This time the duration matched the Synapse logs pretty well:

1. `20:38:38.049` (`t+0s `)	Received request
1. `20:38:38.062` (`t+0s `)	Waiting to acquire linearizer lock 'room_backfill'
1. `20:39:23.622` (`t+45s`)	Acquired linearizer lock 'room_backfill'
1. `20:39:23.640` (`t+45s`)	`_maybe_backfill_inner` and pulled 4.6k events out of the database as potential backfill points
1. `20:39:25.625` (`t+47s`)	Asking t2bot.io for backill
1. `20:39:35.262` (`t+57s`)	t2bot.io responded with 100 events
1. `20:39:35.283` (`t+...`)	Starting to verify content hashes
1. `20:39:35.382` (`t+...`)	Done verifying content hashes
1. `20:39:35.383` (`t+...`)	Processing the pulled events 
1. `20:39:35.424` (`t+...`)	Done processing 100 backfilled events except for `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E`
1. `20:39:35.651` (`t+...`)	Done trying to de-outlier `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E` (404 `/state_ids` from t2bot.io)
1. `20:39:43.668` (`t+65s`)	Response sent


### 2. Loading tons of events


 - Room: `#matrix:matrix.org`
 - API URL: `https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678`
 - Request -> response duration: ~60s
 - Logs in Kibana: `GET-5541`
    - [1](https://modular-euwest2-kibana.proxy.matrix.org/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2022-07-22T02:09:35.966Z',to:'2022-07-22T02:11:15.572Z'))&_a=(columns:!(message),index:'67745970-aa2d-11ea-96a7-efc84863d0ea',interval:auto,query:(language:kuery,query:'kubernetes.pod.labels.hostname:%22gitter.ems.host%22%20and%20message:%22GET-5541%22%20and%20not%20message:%22synapse.util.metrics%22'),sort:!(!('@timestamp',desc))))
    - [2](https://modular-euwest2-kibana.proxy.matrix.org/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2022-07-22T02:09:35.966Z',to:'2022-07-22T02:11:15.572Z'))&_a=(columns:!(message),index:'67745970-aa2d-11ea-96a7-efc84863d0ea',interval:auto,query:(language:kuery,query:'kubernetes.pod.labels.hostname:%22gitter.ems.host%22%20and%20message:%22GET-5541%22%20and%20not%20message:%22synapse.util.metrics%22%20and%20not%20message:%22synapse.storage.txn%22%20and%20not%20message:%22synapse.storage.databases.main.events_worker%22'),sort:!(!('@timestamp',desc))))

**Notes:**

 - Tons of `get_aggregation_groups_for_event` of `get_recent_references_for_event` calls (2k). Why?
 - Why did we load 79k events? 
    - This lines up with the number of `state_events` in the room
 - Why did we redact some events? Is this redacting events in our response or in general?
 - Why didn't these things happen in the other calls above?

**Timing summary from looking at the logs:**

1. `02:09:51.026` (`t+0s `)		Received request
1. `02:09:51.959` (`t+1s `)		`_maybe_backfill_inner` backfill and pulled 4.6k events out of the database as potential backfill points
1. `02:09:52.726` (`t+2s`)		synapse.storage.databases.main.events_worker Loading 79277 events (why?)
1. `02:10:10.462` (`t+19s`)	Also fetching redaction events
1. `02:10:10.462` (`t+19s`)	Loaded 79277 events
1. `02:10:23.246` (`t+31s`)	Done redacting 105 events (why?)
1. `02:10:23.779` (`t+31s`)	Asking t2bot.io for backill
1. `02:10:33.290` (`t+41s`)	t2bot.io responded with 100 events
1. `...                ` (`t+...`)	2k calls to `get_recent_references_for_event` and `get_aggregation_groups_for_event` (why?)
1. `02:10:54.261` (`t+63s`)	Response sent


### 3. Slow `/state_id` requests but also really slow processing

<details>
<summary>524 timeout: 123.26s</summary>

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm`
524 timeout
2022-07-22 @4-43

123.26s
https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab

 - 19.24s `/backfill` request
 - 53.45s `/state_ids`

</details> 


<details>
<summary>524 timeout: 117.96s</summary>

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c`
524 timeout
2022-07-22 @4:56

117.96s
https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9

 - 921ms `/backfill` request
 - 47.24s `/state_ids` request

</details>


<details>
<summary>524 timeout: 115.64s</summary>

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g`
524 timeout
2022-7-22 @5:02:33

```json
{
    "ok": false,
    "timing": 99185.70000004768,
    "startTime": 1658527353492,
    "endTime": 1658527452680,
    "requestId": "p8c09g"
}
```

115.64s
https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae

 - 14.49s `/backfill` request
 - 54.47s `/state_ids` request

</details>


<details>
<summary>200 ok: 83.51s</summary>

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw`
200 ok

```json
{
    "ok": true,
    "status": 200,
    "timing": 83771.90000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658527887072,
    "endTime": 1658527970847,
    "requestId": "cjfpw",
    "traceId": "ae7c694e57113282"
}
```

83.51s
https://jaeger.proxy.matrix.org/trace/ae7c694e57113282

 - 526ms `/backfill` request
 - 42.77s `/state_ids` request


</details>


<details>
<summary>200 ok: 75.7s</summary>

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k`
2022-07-22 @5-27pm

```json
{
    "ok": true,
    "status": 200,
    "timing": 76131.40000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658528812471,
    "endTime": 1658528888604,
    "requestId": "wg6g8k"
    "traceId": "d048d9f59c20555c"
}
```

75.7s
https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c

 - 549ms `/backfill` request
 - 32.58s `/state_ids` request

</details>




## `/messages` timing script

Every 61 minutes (to be outside of the state cache expiry), will call and time `/messages` for each room defined. And will do this with `?limit=` `500`, `100`, `20` to see if the amount of messages matters on the timing.

Run in the browser. Define `let MATRIX_TOKEN = 'xxx';` before running the script.

<details>
<summary><code>matrix-messages-timing.js</code></summary>

```js
let resultantTimingData = {};
  
async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
  // via https://stackoverflow.com/a/8084248/796832
  const requestId = (Math.random() + 1).toString(36).substring(7);

  let res;
  let timeBeforeRes;
  let timeAfterRes;
  let startTime;
  let endTime;
  try {
    startTime = Date.now();
    timeBeforeRes = performance.now();
    res = await fetch(`https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`, {
      headers: {
        'Authorization': `Bearer ${MATRIX_TOKEN}`
      }
    });
  } catch(err) {
    console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
  } finally {
    endTime = Date.now();
    timeAfterRes = performance.now();
    }
  
  const traceId = res?.headers?.['synapse-trace-id'];
  let data = {};
  if (res?.ok) {
    data = await res.json();
  }

  const timingEntry = {
    ok: res?.ok || false,
    status: res?.status,
    timing: timeAfterRes - timeBeforeRes,
    numEventsReturned: data?.chunk?.length,
    numStateEventsReturned: data?.state?.length,
    startTime,
    endTime,
    requestId,
    traceId
  };

  return timingEntry;
}

(async () => {
  const ROOMS_TO_TEST = {
    'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
    'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
    'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
    'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
    'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
  };

  const NUM_MESSAGES_TO_FETCH_MATRIX = [
    500,
    100,
    20
  ];

  const ONE_MINUTE_MS = 60 * 1000;
  // The `_state_cache` in Synapse has 60 minute expiry so we chose something just bigger
  const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;

  const REPEAT = 5;

  if (!MATRIX_TOKEN) {
    console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
  }

  for (let i = 0; i < REPEAT; i++) {
    for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
      for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
        const roomId = ROOMS_TO_TEST[roomKey];

        const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);


        // Default the result map so we don't run into null-pointer exception (NPE) issues
        if (!resultantTimingData[roomKey]) {
          resultantTimingData[roomKey] = {};
        }

        if (!resultantTimingData[roomKey][numMessagesToFetch]) {
          resultantTimingData[roomKey][numMessagesToFetch] = [];
        }

        resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);

        console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
      }

      // Wait for the caches to clear out before running another trial against the same room
      await new Promise((resolve) => {
        setTimeout(resolve, DELAY_BETWEEN_TRIALS);
      })
    }
  }
})()
```

```js
function getAvg(roomId, numMessages) {
    const timingEntries = data1[roomId][numMessages];
    return timingEntries
        .filter((timingEntry) => timingEntry.ok)
        .reduce((total, timingEntry) => (total + timingEntry.timing), 0) / timingEntries.length;
}

getAvg('matrix:matrix.org', "500")
```

</details>


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Investigate why `/messages` is slow #13356

Background

Traces for reference

Why is it slow?

1. Backfill linearizer lock takes forever

2. Loading tons of events

3. Slow `/state_id` requests but also really slow processing

`/messages` timing script

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Investigate why /messages is slow #13356

Description

Background

Traces for reference

Why is it slow?

1. Backfill linearizer lock takes forever

2. Loading tons of events

3. Slow /state_id requests but also really slow processing

/messages timing script

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Investigate why `/messages` is slow #13356

3. Slow `/state_id` requests but also really slow processing

`/messages` timing script