Notifications for event type "monitor" are generating with wrong instance ID #1595
Comments
I see 7.2k errors/hour on prod |
Moving conversation from Slack. Question: How should the Cortex ruler, notifications service, Flux, etc. know whether an instance has been deleted or not? One option would be to use the deleted flag set in the instances table as the single source of truth for whether an instance is active. Services would use the users client (or an agreed API in the case of open source projects) to determine whether an instance is active. This can be cached on the client side. |
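A minimal sketch of the client-side caching idea, assuming a lookup function that asks the authoritative source (for example the users service) whether an instance is deleted. The names `LookupFunc`, `DeletedCache`, and `IsDeleted` are hypothetical and not part of any existing client.

```go
package main

import (
	"sync"
	"time"
)

// LookupFunc asks the authoritative source whether an instance is deleted.
// Hypothetical signature, for illustration only.
type LookupFunc func(instanceID string) (deleted bool, err error)

type cacheEntry struct {
	deleted bool
	expires time.Time
}

// DeletedCache remembers "is deleted?" answers for a fixed TTL.
type DeletedCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	lookup  LookupFunc
	entries map[string]cacheEntry
}

// NewDeletedCache builds a cache that re-asks lookup once an answer is older than ttl.
func NewDeletedCache(ttl time.Duration, lookup LookupFunc) *DeletedCache {
	return &DeletedCache{ttl: ttl, lookup: lookup, entries: map[string]cacheEntry{}}
}

// IsDeleted returns the cached answer while it is fresh, otherwise calls lookup.
// The lock is held across the lookup to keep the sketch simple.
func (c *DeletedCache) IsDeleted(instanceID string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	now := time.Now()
	if e, ok := c.entries[instanceID]; ok && now.Before(e.expires) {
		return e.deleted, nil
	}
	deleted, err := c.lookup(instanceID)
	if err != nil {
		return false, err // errors are not cached, so the next call retries
	}
	c.entries[instanceID] = cacheEntry{deleted: deleted, expires: now.Add(c.ttl)}
	return deleted, nil
}
```

The TTL bounds how long a service can keep working for an instance after it has been deleted, which is the trade-off of caching on the client side.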
Seems a bit poor to have to poll the "is it deleted" service; I would just remove the rules for deleted instances via some sync process. |
Can we do it in users service after instance deletion via cortex's API to delete configs? |
|
Presenting the options as I understand them, trying to write up the trade-offs for each fairly. (Apologies for the formality, but I don't really know how to do this without writing a design doc.)
Problem
Options
1. Provide an "is deleted?" endpoint
As described in #1595 (comment)
Pros:
Cons:
2. Projects have "instance deleted" endpoint
Pros:
Cons:
3. Central configs service
Pros:
Cons:
3.1. Combine with users service
Rather than this all-singing, all-dancing configs service being a separate service with a separate DB, it would just be part of the users service, with the configs stored in the users DB.
Pros:
3.2. Standalone service
Standalone configs service with its own database. jml can't think of advantages to this.
Common components
|
If it gets an update saying the set of rules for an instance is now empty, that should have the desired effect of stopping it doing any work. It already handles this kind of update. The instance object would disappear on next restart. |
That one seems an obvious yes to me. |
My own preferences are for 2 or 3.1. I'm considering cortexproject/cortex#620 blocked on this. If we go with 2, I want to keep its current direction. If we go with either 3 option, I'll move the new endpoints back to the configs service. |
If we do this, I suggest we use 'RefuseDataUpload' to stop rule evaluation, instead of trial expiry. |
@jml @bboreham What about a combination of 2 and 3.1?
If
Pros:
|
Seems workable. The endpoints to call should be configurable, on the expectation we will add more. |
Flux has an events history database that is amenable to 2., but rather less so to 3. |
@lelenanam's design is pretty sound I reckon. A subtlety that it nicely accounts for is that cleanup might take longer than a request timeout, so the cleaner-up service should keep asking until the cleanee gives a definitive answer. (I'd suggest a 202 Accepted is mandated unless the cleanee definitely finished cleaning stuff up). |
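The "keep asking until the cleanee gives a definitive answer" behaviour could look roughly like the sketch below. It assumes that 200 OK or 204 No Content means cleanup finished, that 202 Accepted means "still cleaning, ask again", and that any other status is a permanent failure; the thread only fixes the 202 part. The endpoint layout and the names `notifyDeleted`, `CleanupUntilDone`, and `ErrGaveUp` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// ErrGaveUp is returned when the cleanee never gave a definitive answer.
var ErrGaveUp = errors.New("cleanup not confirmed after max attempts")

// notifyDeleted POSTs to a project's "instance deleted" endpoint and returns
// the HTTP status code. The URL layout is an assumption for illustration.
func notifyDeleted(client *http.Client, endpoint, instanceID string) (int, error) {
	resp, err := client.Post(endpoint+"/"+url.PathEscape(instanceID), "application/json", nil)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// CleanupUntilDone calls attempt until it reports that cleanup finished.
// Transport errors and 202 Accepted are retried; other statuses fail at once.
func CleanupUntilDone(attempt func() (int, error), maxAttempts int, wait time.Duration) error {
	for i := 0; i < maxAttempts; i++ {
		status, err := attempt()
		switch {
		case err != nil:
			// Possibly a timeout: the cleanee may still be working, so ask again.
		case status == http.StatusOK || status == http.StatusNoContent:
			return nil // definitive answer: cleanup finished
		case status == http.StatusAccepted:
			// Cleanup accepted but not finished yet, so ask again.
		default:
			return fmt.Errorf("cleanup failed with status %d", status)
		}
		if i < maxAttempts-1 {
			time.Sleep(wait)
		}
	}
	return ErrGaveUp
}
```

A caller would wrap `notifyDeleted` in a closure for each configured endpoint and pass it as `attempt`, so the retry logic stays independent of which project is being cleaned up.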
Cool. I also like @lelenanam's design. I think we've got enough consensus. Full steam ahead! Also, this unblocks cortexproject/cortex#620, which makes me happy. |
Note we need to support both "deleting" and "undeleting", on the expectation that some people will pay up after being blocked, or call support after hitting the wrong button. |
Within a grace period of N days, after which no "undeleting" is possible? Or just forever? |
On the assumption we are going to get asymptotically-increasing numbers of customers, it should not be forever. |
Pending release of the automated solution, I would like to fix manually in prod. Check in
Proposed SQL to run in
|
LGTM |
I had to restart the |
There are a few more dead instances now:
Proposed bandaid to run in
|
Proposed bandaid to run in
|
Fixed via cortexproject/cortex#629 and cortexproject/cortex#683 |
eventmanager receives requests with event type "monitor" and a wrong instanceID.
When eventmanager tries to get the instance name from the users service:
https://github.com/weaveworks/notification/blob/4e1d40e0cba471d0393e4b657e31b6291aaa6d3f/eventmanager/manager.go#L349
it receives the error "Not found".
Maybe instance was deleted.
logs from prod eventmanager:
for dev:
Cortex shouldn't generate events with a nonexistent instanceID.