
[server] Improve observability regarding communication with messagebus #3607


Merged: 1 commit merged into main on Apr 7, 2021

Conversation

ArthurSens
Contributor

When ready, this PR will include a counter metric that increases every time the server tries to read a topic from the messagebus, and will also log when it fails to do so.

Fixes #3578
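
For context, a minimal sketch of what such a counter could look like, assuming the server uses prom-client for metrics (the metric name matches the output shown later in this thread and the helper name matches the diff below; everything else is illustrative):

// Sketch only, not the exact code in this PR.
import { Counter } from 'prom-client';

// One counter, labelled by topic, incremented whenever the server reads a messagebus topic.
const messagebusTopicReads = new Counter({
    name: 'gitpod_server_topic_reads_total',
    help: 'Counts the number of times the server reads a messagebus topic',
    labelNames: ['topic'],
});

export function increaseMessagebusTopicReads(topic: string) {
    messagebusTopicReads.labels(topic).inc();
}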

@ArthurSens ArthurSens changed the title As/server improve o11y [server] Improve observability regarding communication with messagebus Mar 26, 2021
@ArthurSens ArthurSens force-pushed the as/server-improve-o11y branch from 48fb69e to 3f21972 Compare March 26, 2021 12:33
@ArthurSens ArthurSens requested review from csweichel and geropl March 26, 2021 12:35
@ArthurSens
Contributor Author

ArthurSens commented Mar 26, 2021

@geropl @csweichel The metric should be ready to review.

Regarding the log message: unlike Go, TypeScript functions don't seem to return an error when something fails. Any ideas on how to log an error while listening? From what I could understand, the error occurs in the messagebus code rather than in the server's.

@@ -117,6 +118,7 @@ export class MessageBusIntegration extends AbstractMessageBusIntegration {
const listener = new HeadlessWorkspaceLogListener(this.messageBusHelper, callback, workspaceID);
const cancellationTokenSource = new CancellationTokenSource()
this.listen(listener, cancellationTokenSource.token);
increaseMessagebusTopicReads(listener.topic.name)
return Disposable.create(() => cancellationTokenSource.cancel())
}

Member

Why not listen in listenForPrebuildUpdatableQueue as well? 🤔

Contributor Author

PrebuildUpdatableQueue's topic didn't have a name attribute. I had a quick talk with @csweichel last week and we agreed on removing the metric over there.

@csweichel I feel like we had another reason that I can't remember right now. Do you remember if there were more reasons to not expose a metric for this queue?

Contributor

Primarily the fact that it's a queue and not an exchange, and we didn't find a good naming scheme.

@geropl
Member

geropl commented Apr 6, 2021

/werft run

👍 started the job as gitpod-build-as-server-improve-o11y.3

@geropl
Member

geropl commented Apr 6, 2021

The metrics currently emitted:
gitpod_server_topic_reads_total{topic="topic"} 5

@ArthurSens using await listener.topic() instead of listener.topic.name should fix that.
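
For illustration, the reason the label came out as the literal string "topic": topic is an async method on the listener, so reading .name on the function object returns the method's own name (Function.prototype.name) instead of the resolved topic string. A hypothetical listener shape, just to show the behaviour:

class ExampleListener {
    async topic(): Promise<string> {
        return 'some.resolved.topic.name'; // stand-in for the real topic
    }
}

const example = new ExampleListener();
console.log(example.topic.name);           // prints "topic", i.e. what ended up as the label value
example.topic().then(t => console.log(t)); // prints the actual resolved topic string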

@@ -127,10 +129,11 @@ export class MessageBusIntegration extends AbstractMessageBusIntegration {
return Disposable.create(() => cancellationTokenSource.cancel())
}

-    listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
+    async listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
Member

@akosyakov akosyakov Apr 6, 2021

This method cannot be async; it led to memory leaks in the past because of hanging listeners when the server was gone.

Member

The same goes for the other listen methods that return a disposable object; it should happen synchronously so the listener is registered for the current server.

Contributor Author

Very good point. Changing it back to synchronous would make us lose the ability to use listener.topic(), which returns the topic name.

I guess this one won't make it into tomorrow's release then; I need to think of another way to get the name.

Member

You can do listener.topic().then(increaseMessagebusTopicReads, () => { /* no-op */ })
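
Applied to the method from the diff above, that suggestion would look roughly like this (a sketch only; WorkspaceInstanceUpdateListener is an assumed class name, and the listen call stays synchronous):

listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
    const listener = new WorkspaceInstanceUpdateListener(this.messageBusHelper, callback, userId); // assumed listener class
    const cancellationTokenSource = new CancellationTokenSource();
    this.listen(listener, cancellationTokenSource.token);
    // Fire-and-forget: resolve the topic name only for the metric and swallow failures,
    // so a failed lookup never breaks listening or delays returning the Disposable.
    listener.topic().then(increaseMessagebusTopicReads, () => { /* no-op */ });
    return Disposable.create(() => cancellationTokenSource.cancel());
}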

Member

We could also try to get rid of the async in front of topic. @ArthurSens, ping me if you find time for this today.

@csweichel csweichel mentioned this pull request Apr 6, 2021
@akosyakov
Member

Btw, is the topic name meaningful? Last time I looked directly at the messagebus everything had cryptic names; maybe we should give better names to topics/queues first?

@ArthurSens
Contributor Author

ArthurSens commented Apr 7, 2021

> Btw, is the topic name meaningful? Last time I looked directly at the messagebus everything had cryptic names; maybe we should give better names to topics/queues first?

The result of listener.topic() doesn't look good to me, I'd rather hard-code a human-readable string 😅
[screenshot of the listener.topic() output]

@ArthurSens
Contributor Author

On commit 7152f82 we have these results:

[screenshot: metrics emitted at 7152f82]

While in 49950cf we have these:
[screenshot: metrics emitted at 49950cf]

@geropl @akosyakov @csweichel which one do you prefer?

@geropl
Member

geropl commented Apr 7, 2021

I like 49950cf because it's interesting to see the number of messages per instance to detect outliers.

If that blows up our scrape-budget I'm fine with the alternative as well.

@ArthurSens
Contributor Author

ArthurSens commented Apr 7, 2021

> I like 49950cf because it's interesting to see the number of messages per instance to detect outliers.
>
> If that blows up our scrape-budget I'm fine with the alternative as well.

Is that ID the workspace instance ID? I think it can get very expensive as we scale 😬 The workspace instance ID is the highest-cardinality label we can get out of Gitpod. Besides, how do you plan to use the workspace ID in an alert or a dashboard? I'm afraid it's unactionable data...

@ArthurSens
Contributor Author

ArthurSens commented Apr 7, 2021

I'm taking a look at the TSDB whitepaper again just to get a better sense of the costs.

We have 64 bytes for each timeseries header (one timeseries per workspace ID), plus 64 bytes for the first sample of each series, resulting in at least 128 bytes per timeseries.

The compression rate of subsequent samples depends on how much each one changes compared with the previous sample. While testing the metric, I saw the counters increase by 6~10 each time I interacted with a workspace (create or stop); that's a small increase and will probably lead to less than 1 byte per scrape for each workspace.

[screenshot from the TSDB whitepaper]

  • Prometheus usually retains metrics in memory for 2~3h before dumping them to disk.
  • Assuming our scrape interval is 30s, in the worst-case scenario we would do 720 scrapes before dumping to disk.
  • Even though workspaces are ephemeral, metrics are not. That means a metric for a workspace that no longer exists will keep being exposed until the server pod is restarted. Assuming we only restart the server once a month (during our monthly updates), by the end of the month we'll be scraping one sample for each workspace that has ever been created in that month.

At the end of the month, the memory needed to run Prometheus just to collect these metrics would be:

(720 x bytes per sample (1) x # of workspaces in a month x 2 (we are introducing 2 metrics))
+
(128 x # of workspaces in a month)

Taking a look at our analytics platform, in March we hosted 209,734 workspace instances, so the results would be:

720 x 1 x 209,734 x 2 = 302,016,960 bytes
+
128 x 209,734 = 26,845,952 bytes
---
302,016,960 + 26,845,952 = 328,862,912 bytes ≅ 313.62 MB


I think 313 MB is okayish for now, but we need to keep in mind that this won't scale well as we grow our community :)

@ArthurSens
Contributor Author

ArthurSens commented Apr 7, 2021

If we don't add a workspace instance ID to this metric, the calculation would be:

# of scrapes in 3h (720) x bytes per sample (1) x # of timeseries (3)
+
128 x 3
---
(720 x 3) + (128 x 3) = 2,544 bytes ≅ 2.5 KB
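
Putting both estimates side by side (same assumptions as above: 1 byte per compressed sample, 128 bytes of per-series overhead, 720 scrapes held in memory, 209,734 workspace instances in March):

// Back-of-the-envelope check of the two scenarios discussed above.
const scrapesInHead = 720;          // scrapes kept in memory before dumping to disk
const bytesPerSample = 1;           // assumed post-compression cost per sample
const bytesPerSeries = 128;         // series header + first sample
const workspacesPerMonth = 209734;
const newMetrics = 2;

const withInstanceLabel =
    scrapesInHead * bytesPerSample * workspacesPerMonth * newMetrics
    + bytesPerSeries * workspacesPerMonth;                      // 328,862,912 bytes ≅ 313 MB

const withoutInstanceLabel =
    scrapesInHead * bytesPerSample * 3 + bytesPerSeries * 3;    // 2,544 bytes ≅ 2.5 KB

console.log({ withInstanceLabel, withoutInstanceLabel });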

@meysholdt meysholdt self-requested a review April 7, 2021 15:55
Member

@meysholdt meysholdt left a comment

@ArthurSens and I had a quick call and decided that our infra can handle the extra load of including the instance ID; if we see in self-monitoring at some point in the future that it's too much data, we can still turn it off.

…nts the amount of times that server reads messagebus topic for each workspace.

Signed-off-by: ArthurSens <[email protected]>
@ArthurSens ArthurSens force-pushed the as/server-improve-o11y branch from 7152f82 to 60febb5 Compare April 7, 2021 16:51
@ArthurSens ArthurSens merged commit a8da492 into main Apr 7, 2021
@ArthurSens ArthurSens deleted the as/server-improve-o11y branch April 7, 2021 16:55