
[server] Improve observability regarding communication with messagebus #3607


Merged: 1 commit merged into main on Apr 7, 2021

Conversation

ArthurSens
Contributor

When ready, this PR will include a counter metric that increases every time the server tries to read a topic from the messagebus, and will also log when it fails to do so.

Fixes #3578
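
For context, a minimal sketch of what such a counter could look like, assuming the server uses prom-client for metrics (the metric name matches the output shown later in this thread and the helper name matches the diff below; everything else is illustrative):

// Sketch only, not the exact code in this PR.
import { Counter } from 'prom-client';

// One counter, labelled by topic, incremented whenever the server reads a messagebus topic.
const messagebusTopicReads = new Counter({
    name: 'gitpod_server_topic_reads_total',
    help: 'Counts the number of times the server reads a messagebus topic',
    labelNames: ['topic'],
});

export function increaseMessagebusTopicReads(topic: string) {
    messagebusTopicReads.labels(topic).inc();
}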

@ArthurSens ArthurSens changed the title As/server improve o11y [server] Improve observability regarding communication with messagebus Mar 26, 2021
@ArthurSens ArthurSens force-pushed the as/server-improve-o11y branch from 48fb69e to 3f21972 Compare March 26, 2021 12:33
@ArthurSens ArthurSens requested review from csweichel and geropl March 26, 2021 12:35
@ArthurSens
Contributor Author

ArthurSens commented Mar 26, 2021

@geropl @csweichel The metric should be ready to review.

Regarding the log message: unlike Go, TypeScript functions don't seem to return an error when something fails. Any ideas on how to log an error while listening? From what I could understand, the error occurs in the messagebus code rather than in the server's.

@@ -117,6 +118,7 @@ export class MessageBusIntegration extends AbstractMessageBusIntegration {
const listener = new HeadlessWorkspaceLogListener(this.messageBusHelper, callback, workspaceID);
const cancellationTokenSource = new CancellationTokenSource()
this.listen(listener, cancellationTokenSource.token);
increaseMessagebusTopicReads(listener.topic.name)
return Disposable.create(() => cancellationTokenSource.cancel())
}

Member

Why not listen in listenForPrebuildUpdatableQueue as well? 🤔

Contributor Author

PrebuildUpdatableQueue's topic didn't have a name attribute. I had a quick talk with @csweichel last week and we agreed on removing the metric over there.

@csweichel I feel like we had another reason that I can't remember right now. Do you remember if there were more reasons to not expose a metric for this queue?

Contributor

Primarily the fact that it's a queue and not an exchange, and we didn't find a good naming scheme.

@geropl
Member

geropl commented Apr 6, 2021

/werft run

👍 started the job as gitpod-build-as-server-improve-o11y.3

@geropl
Member

geropl commented Apr 6, 2021

The metrics currently emitted:
gitpod_server_topic_reads_total{topic="topic"} 5

@ArthurSens using await listener.topic() instead of listener.topic.name should fix that.
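
For illustration, the reason the label came out as the literal string "topic": topic is an async method on the listener, so reading .name on the function object returns the method's own name (Function.prototype.name) instead of the resolved topic string. A hypothetical listener shape, just to show the behaviour:

class ExampleListener {
    async topic(): Promise<string> {
        return 'some.resolved.topic.name'; // stand-in for the real topic
    }
}

const example = new ExampleListener();
console.log(example.topic.name);           // prints "topic", i.e. what ended up as the label value
example.topic().then(t => console.log(t)); // prints the actual resolved topic string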

@@ -127,10 +129,11 @@ export class MessageBusIntegration extends AbstractMessageBusIntegration {
return Disposable.create(() => cancellationTokenSource.cancel())
}

-    listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
+    async listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
Member

@akosyakov akosyakov Apr 6, 2021

This method cannot be async; it led to memory leaks in the past because of hanging listeners when the server was gone.

Member

The same goes for the other listen methods that return a disposable object; it should happen synchronously so the listener is registered for the current server.

Contributor Author

Very good point. Changing it back to synchronous would make us lose the ability to use listener.topic(), which returns the topic name.

I guess this one won't make it into tomorrow's release then; I need to think of another way to get the name.

Member

You can do listener.topic().then(increaseMessagebusTopicReads, () => { /* no-op */ })
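
Applied to the method from the diff above, that suggestion would look roughly like this (a sketch only; WorkspaceInstanceUpdateListener is an assumed class name, and the listen call stays synchronous):

listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
    const listener = new WorkspaceInstanceUpdateListener(this.messageBusHelper, callback, userId); // assumed listener class
    const cancellationTokenSource = new CancellationTokenSource();
    this.listen(listener, cancellationTokenSource.token);
    // Fire-and-forget: resolve the topic name only for the metric and swallow failures,
    // so a failed lookup never breaks listening or delays returning the Disposable.
    listener.topic().then(increaseMessagebusTopicReads, () => { /* no-op */ });
    return Disposable.create(() => cancellationTokenSource.cancel());
}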

Member

We could also try to get rid of the async in front of topic. @ArthurSens, ping me if you find time for this today.

@csweichel csweichel mentioned this pull request Apr 6, 2021
@akosyakov
Member

Btw, is the topic name meaningful? Last time I looked directly at the messagebus everything had cryptic names; maybe we should give better names to topics/queues first?

@ArthurSens
Contributor Author

ArthurSens commented Apr 7, 2021

> Btw, is the topic name meaningful? Last time I looked directly at the messagebus everything had cryptic names; maybe we should give better names to topics/queues first?

The result of listener.topic() doesn't look good to me, I'd rather hard-code a human-readable string 😅
[screenshot of the listener.topic() output]

@ArthurSens
Contributor Author

On commit 7152f82 we have these results:

[screenshot: metrics emitted at 7152f82]

While in 49950cf we have these:
[screenshot: metrics emitted at 49950cf]

@geropl @akosyakov @csweichel which one do you prefer?

@geropl
Member

geropl commented Apr 7, 2021

I like 49950cf because it's interesting to see the number of messages per instance to detect outliers.

If that blows up our scrape-budget I'm fine with the alternative as well.

@ArthurSens
Contributor Author

ArthurSens commented Apr 7, 2021

> I like 49950cf because it's interesting to see the number of messages per instance to detect outliers.
>
> If that blows up our scrape-budget I'm fine with the alternative as well.

Is that ID the workspace instance ID? I think it can get very expensive as we scale 😬 The workspace instance ID is the highest-cardinality label we can get out of Gitpod. Besides, how do you plan to use the workspace ID in an alert or a dashboard? I'm afraid it's unactionable data...

@ArthurSens
Contributor Author

ArthurSens commented Apr 7, 2021

I'm taking a look at the TSDB whitepaper again just to get a better sense of the costs.

We have 64 bytes for each timeseries header (one timeseries per workspace ID), plus 64 bytes for the first sample of each series, resulting in at least 128 bytes per timeseries.

The compression rate of subsequent samples depends on how much each one changes compared with the previous sample. While testing the metric, I saw the counters increase by 6~10 each time I interacted with a workspace (create or stop); that's a small increase and will probably lead to less than 1 byte per scrape for each workspace.

[screenshot from the TSDB whitepaper]

  • Prometheus usually retains metrics in memory for 2~3h before dumping them to disk.
  • Assuming our scrape interval is 30s, in the worst-case scenario we would do 720 scrapes before dumping to disk.
  • Even though workspaces are ephemeral, metrics are not. That means a metric for a workspace that no longer exists will keep being exposed until the server pod is restarted. Assuming we only restart the server once a month (during our monthly updates), by the end of the month we'll be scraping one sample for each workspace that has ever been created in that month.

At the end of the month, the memory needed to run Prometheus just to collect these metrics would be:

(720 x bytes per sample (1) x # of workspaces in a month x 2 (we are introducing 2 metrics))
+
(128 x # of workspaces in a month)

Taking a look at our analytics platform, in March we hosted 209,734 workspace instances, so the results would be:

720 x 1 x 209,734 x 2 = 302,016,960 bytes
+
128 x 209,734 = 26,845,952 bytes
---
302,016,960 + 26,845,952 = 328,862,912 bytes ≅ 313.62 MB


I think 313 MB is okayish for now, but we need to keep in mind that this won't scale well as we grow our community :)

@ArthurSens
Contributor Author

ArthurSens commented Apr 7, 2021

If we don't add a workspace instance ID to this metric, the calculation would be:

# of scrapes in 3h (720) x bytes per sample (1) x # of timeseries (3)
+
128 x 3
---
(720 x 3) + (128 x 3) = 2,544 bytes ≅ 2.5 KB
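
Putting both estimates side by side (same assumptions as above: 1 byte per compressed sample, 128 bytes of per-series overhead, 720 scrapes held in memory, 209,734 workspace instances in March):

// Back-of-the-envelope check of the two scenarios discussed above.
const scrapesInHead = 720;          // scrapes kept in memory before dumping to disk
const bytesPerSample = 1;           // assumed post-compression cost per sample
const bytesPerSeries = 128;         // series header + first sample
const workspacesPerMonth = 209734;
const newMetrics = 2;

const withInstanceLabel =
    scrapesInHead * bytesPerSample * workspacesPerMonth * newMetrics
    + bytesPerSeries * workspacesPerMonth;                      // 328,862,912 bytes ≅ 313 MB

const withoutInstanceLabel =
    scrapesInHead * bytesPerSample * 3 + bytesPerSeries * 3;    // 2,544 bytes ≅ 2.5 KB

console.log({ withInstanceLabel, withoutInstanceLabel });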

@meysholdt meysholdt self-requested a review April 7, 2021 15:55
Member

@meysholdt meysholdt left a comment

@ArthurSens and I had a quick call and decided that our infra can handle the extra load of including the instance ID; if we see in self-monitoring at some point in the future that it's too much data, we can still turn it off.

…nts the amount of times that server reads messagebus topic for each workspace.

Signed-off-by: ArthurSens <[email protected]>
@ArthurSens ArthurSens force-pushed the as/server-improve-o11y branch from 7152f82 to 60febb5 Compare April 7, 2021 16:51
@ArthurSens ArthurSens merged commit a8da492 into main Apr 7, 2021
@ArthurSens ArthurSens deleted the as/server-improve-o11y branch April 7, 2021 16:55