[server] Improve observability regarding communication with messagebus #3607
Conversation
Force-pushed from 48fb69e to 3f21972
@geropl @csweichel The metric should be ready to review. Regarding the log message: unlike Go, TypeScript functions don't seem to return an error when something fails. Any ideas on how to log an error during listening? From what I could understand, the error occurs in messagebus' code rather than in server's.
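One possible direction (a sketch only, not code from this PR): if `this.listen(...)` returns a Promise, the rejection could be caught and logged without making the surrounding method async. The `log.error` signature and the assumption that `this.listen` returns a Promise are not confirmed by this thread.

```typescript
// Fragment of a listen* method (identifiers like log, Disposable and
// CancellationTokenSource are assumed to be imported at the top of the file).
const cancellationTokenSource = new CancellationTokenSource();

// Assumption: this.listen(...) returns a Promise that rejects when the underlying
// messagebus subscription fails, so the rejection can be logged here instead of
// being swallowed inside the messagebus integration code.
this.listen(listener, cancellationTokenSource.token).catch((err: Error) =>
    log.error("failed to listen on messagebus topic", err, { topic: listener.topic.name })
);

return Disposable.create(() => cancellationTokenSource.cancel());
```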
@@ -117,6 +118,7 @@ export class MessageBusIntegration extends AbstractMessageBusIntegration {
         const listener = new HeadlessWorkspaceLogListener(this.messageBusHelper, callback, workspaceID);
         const cancellationTokenSource = new CancellationTokenSource()
         this.listen(listener, cancellationTokenSource.token);
+        increaseMessagebusTopicReads(listener.topic.name)
         return Disposable.create(() => cancellationTokenSource.cancel())
     }
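For reference, a minimal sketch of what the `increaseMessagebusTopicReads` helper called above could look like, assuming server uses prom-client for metrics; the metric name, help text, and module layout are assumptions, not necessarily what this PR ships:

```typescript
// Sketch of the helper called in the diff above; assumes prom-client is the
// metrics library used by server. Metric name and help text are assumptions.
import { Counter } from "prom-client";

const messagebusTopicReads = new Counter({
    name: "gitpod_server_messagebus_topic_reads_total",
    help: "Total number of reads registered on messagebus topics, labelled by topic name",
    labelNames: ["topic"],
});

export function increaseMessagebusTopicReads(topic: string): void {
    messagebusTopicReads.inc({ topic });
}
```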
Why not listen in listenForPrebuildUpdatableQueue as well? 🤔
PrebuildUpdatableQueue's topic didn't have a name attribute. I had a quick talk with @csweichel last week and we agreed on removing the metric there.
@csweichel I feel like we had another reason that I can't remember right now. Do you remember if there were more reasons not to expose a metric for this queue?
Primarily the fact that it's a queue and not an exchange, and we didn't find a good naming scheme.
/werft run 👍 started the job as gitpod-build-as-server-improve-o11y.3
The metrics currently emitted: [screenshot]
@ArthurSens using …
@@ -127,10 +129,11 @@ export class MessageBusIntegration extends AbstractMessageBusIntegration {
         return Disposable.create(() => cancellationTokenSource.cancel())
     }

-    listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
+    async listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
This method cannot be async; it led to memory leaks in the past because of hanging listeners when the server is gone.
The same applies to the other listen methods that return a disposable object; it should happen synchronously so the listener is registered for the current server.
Very good point. Changing it to synchronous would make us lose the ability to use listener.topic(), which returns the topic name.
I guess this one won't make it into tomorrow's release then; I need to think of another way to get the name.
you can do listener.topic().then(increaseMessagebusTopicReads, () => { /* no-op*/ })
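A sketch of how that suggestion could look in one of the listen methods, assuming listener.topic() returns a Promise<string> in this iteration of the code; the listener class and constructor arguments mirror the surrounding methods and are partly assumed:

```typescript
listenForWorkspaceInstanceUpdates(userId: string | undefined, callback: (ctx: TraceContext, workspaceInstance: WorkspaceInstance) => void): Disposable {
    const listener = new WorkspaceInstanceUpdateListener(this.messageBusHelper, callback, userId);
    const cancellationTokenSource = new CancellationTokenSource();
    this.listen(listener, cancellationTokenSource.token);
    // The method stays synchronous and returns the Disposable immediately; the metric
    // is only incremented once (and if) the topic name resolves.
    listener.topic().then(increaseMessagebusTopicReads, () => { /* no-op: the metric is best-effort */ });
    return Disposable.create(() => cancellationTokenSource.cancel());
}
```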
We could also try to get rid of the async in front of topic.
@ArthurSens ping me if you find time for this today.
Btw, is the topic name meaningful? Last time I looked directly at messagebus everything had cryptic names; maybe we should give better names to topics/queues first?
On commit 7152f82 we have these results: [screenshot] While in 49950cf we have these: [screenshot] @geropl @akosyakov @csweichel, which one do you prefer?
I like 49950cf because it's interesting to see the number of messages per instance to detect outliers. If that blows up our scrape budget, I'm fine with the alternative as well.
That ID is the workspace instance ID? I think it can get very expensive as we scale 😬 The workspace instance ID is the highest-cardinality label we can get out of Gitpod. Besides, how do you plan to use the workspace ID in an alert or on a dashboard? I'm afraid it's unactionable data...
I'm taking a look at the TSDB whitepaper again just to get a better sense of the costs. We have 64 bytes for each timeseries header (one timeseries per workspace ID), then 64 bytes for the first sample of each series, resulting in at least 128 bytes per timeseries. The compression rate of the following samples depends on how much each one changes compared with the previous sample. While testing the metric, I saw increases of 6-10 on the counters each time I interacted with a workspace (create or stop); that's a small increase and will probably lead to less than 1 byte per scrape for each workspace.
At the end of the month, the memory Prometheus needs just to hold these metrics would be: [calculation]
Looking at our analytics platform, in March we hosted 209,734 workspace instances, so the result would be: [calculation]
I think 313MB is okay-ish for now, but we need to keep in mind that this won't scale well as we grow our community :)
If we don't add a workspace instance ID to this metric, the calculation would be: [calculation]
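To make the back-of-the-envelope reasoning above reproducible, here is a sketch of the formula with illustrative inputs. Only the 64-byte series header, the 64-byte first sample, the ~1 byte per extra sample, and the 209,734 instances per month come from the thread; scrape interval, instance lifetime, and topic count are assumptions.

```typescript
// Rough TSDB memory estimate per the whitepaper reasoning above.
function estimateTsdbMemoryBytes(seriesCount: number, samplesPerSeries: number): number {
    const seriesOverhead = 64 + 64; // 64 B series header + 64 B first sample
    const bytesPerExtraSample = 1;  // pessimistic bound after compression
    return seriesCount * (seriesOverhead + bytesPerExtraSample * Math.max(samplesPerSeries - 1, 0));
}

const MiB = 1024 * 1024;

// With the instance ID label: one series per workspace instance started in a month
// (March figure from the thread); samples per series only cover the instance's
// lifetime. 240 is purely illustrative, e.g. a ~1h instance scraped every 15s.
console.log((estimateTsdbMemoryBytes(209_734, 240) / MiB).toFixed(0), "MiB");

// Without the instance ID label: a handful of long-lived topic series scraped all
// month long (30 days at an assumed 15s interval ≈ 172,800 samples per series).
console.log((estimateTsdbMemoryBytes(10, 172_800) / MiB).toFixed(1), "MiB");
```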
@ArthurSens and I had a quick call and decided that our infra can handle the extra load of having the instance ID included; if we see in self-monitoring at some point in the future that it's too much data, we can still turn it off.
…nts the amount of times that server reads messagebus topic for each workspace. Signed-off-by: ArthurSens <[email protected]>
Force-pushed from 7152f82 to 60febb5
When ready, this PR will include a counter metric that increases every time server tries to read a topic from messagebus, and it will also log when it fails to do so.
Fixes #3578