A start at a guide to running Cortex in production #1553
docs/running.md
Outdated
### Spread out ingesters

Don't run multiple ingesters on the same node, as that raises the risk
What do you think about adding a quick note on how ingesters transfer data (an additional pod is spawned while the old one is in the terminating state and transfers its data to the new pod)? Additionally, I'd mention here that an operator always wants to avoid losing or moving ingesters, as they are semi-stateful. This affects how you distribute the ingesters across zones and nodes, and which machines you use (e.g. no preemptible/spot instances).
I think hand-over would go along with basic documentation on ingesters, which we also need; it's not particularly a production thing.
Agreed on the rest.
#1560 for hand-over doc
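To illustrate the "spread out ingesters" advice discussed in this thread, here is a minimal sketch of a placement check. The function name, the pod names and the node names are all hypothetical, and the mapping is assumed to come from whatever inventory the operator already has; nothing here is part of Cortex.

```python
from collections import Counter
from typing import Dict, Mapping


def crowded_nodes(ingester_to_node: Mapping[str, str]) -> Dict[str, int]:
    """Return the nodes that host more than one ingester, with their counts.

    `ingester_to_node` maps an ingester name to the node it runs on.
    Both are plain strings; the names are illustrative only.
    """
    counts = Counter(ingester_to_node.values())
    return {node: n for node, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical placement: two ingesters share node-a.
    placement = {
        "ingester-0": "node-a",
        "ingester-1": "node-b",
        "ingester-2": "node-a",
    }
    for node, n in crowded_nodes(placement).items():
        print(f"{node} hosts {n} ingesters; losing it loses {n} replicas at once")
```

An empty result means no node failure can take out more than one ingester at a time, which is the property the section is asking for.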
Rather incomplete and somewhat high-level, but it's a start. Signed-off-by: Bryan Boreham <[email protected]>
Signed-off-by: Bryan Boreham <[email protected]>
I agree with weeco's comment about mentioning that moving/rolling ingesters is not desirable, plus a couple other small comments. Otherwise, this looks awesome and thanks a lot for putting these docs up!
docs/running.md
Outdated
### Components

Every Cortex installation will need Distributor, Ingester and Querier.
Alert-manager, Ruler and Query-frontend are optional.
Nit: Alertmanager is usually one word, happens in the next section as well.
docs/running.md
Outdated
### Ingester replication factor

The standard replication factor is three, so that we can drop one
sample and be unconcerned, as we still have two copies of the data
Did you mean 'replica' here instead of 'sample'?
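The arithmetic behind the replication-factor paragraph can be sketched as below. This assumes a simple majority quorum, which is an assumption made for illustration rather than a description of Cortex's exact behaviour; the function name is hypothetical.

```python
def tolerated_failures(replication_factor: int) -> int:
    """Number of replicas that can be lost while a majority still holds the data.

    Assumes a majority quorum: quorum = floor(RF / 2) + 1.
    """
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    quorum = replication_factor // 2 + 1
    return replication_factor - quorum


if __name__ == "__main__":
    for rf in (1, 2, 3, 5):
        print(f"replication factor {rf}: can lose {tolerated_failures(rf)} replica(s)")
```

Under that assumption a replication factor of three tolerates the loss of one replica, matching the "two copies remain" wording in the diff, while a factor of two tolerates none.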
Signed-off-by: Bryan Boreham <[email protected]>
Signed-off-by: Bryan Boreham <[email protected]>
Signed-off-by: Bryan Boreham <[email protected]>
I think I addressed all review comments.
We do not recommend configuring a liveness probe on ingesters -
killing them is a last resort and should not be left to a machine.
Since you are talking about how you want to avoid losing ingesters, what do you think about adding a note about the impact if you still lose more than sampleReplicas ingesters? That could be temporary (e.g. ingester pods have network issues, such as a zone failure, and so temporarily cannot be queried or ingested to) or permanent (e.g. one zone has a power outage).
Speaking of zones, you might want to add a short note about availability across zones and probably link that issue too: #612.
#731 makes this embarrassing to write about. Typically the pattern is I fix an issue rather than document it.
interface and humans sending queries from GUIs, supply credentials
which identify them and confirm they are authorised.

When configuring the remote_write API in Prometheus there is no way to
Implicitly, you can define headers in the remote write API in Prometheus by setting a bearer token (which will then be used in the Authorization header). This is what we use; we also embed the tenant ID inside it and sign the token on our Cortex gateway.
Is that different from "http user and password"?
username/password and bearer tokens are just different forms of data in the Authorization header. I guess to be complete you could say "The bearer_token or username and password fields can be set to convey..."
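The point made in this thread, that basic credentials and bearer tokens both end up in the same Authorization header, can be sketched as follows. The helper names and the example credentials are hypothetical; the header formats follow the standard HTTP Basic and Bearer schemes, and nothing here is Prometheus or Cortex code.

```python
import base64
from typing import Dict


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build the header a client sends for HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def bearer_auth_header(token: str) -> Dict[str, str]:
    """Build the header a client sends when configured with a bearer token."""
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # Hypothetical credentials; a gateway in front of Cortex would decide
    # how to map either form to a tenant.
    print(basic_auth_header("tenant-1", "not-a-real-password"))
    print(bearer_auth_header("not-a-real-token"))
```

Either way a single header reaches the authenticating proxy; only the scheme prefix and the encoding of the value differ.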
docs/auth.md
Outdated
which identify them and confirm they are authorised.

When configuring the remote_write API in Prometheus there is no way to
add extra headers. The http user and password fields can be user to
s/user/used
Signed-off-by: Bryan Boreham <[email protected]>
LGTM, thanks!
From #1560 we should also say what the "unhealthy" state of ingesters means and what to do about it.
I'm going to merge this even though it's very incomplete, so it gives some value.
This is something a lot of people ask about.
Probably it will take months to get into decent shape but the sooner we get started the better, so I jotted down some points.