A start at a guide to running Cortex in production #1553

Merged: 6 commits, Aug 22, 2019

@bboreham bboreham commented Aug 2, 2019

This is something a lot of people ask about.

Probably it will take months to get into decent shape but the sooner we get started the better, so I jotted down some points.

docs/running.md Outdated

### Spread out ingesters

Don't run multiple ingesters on the same node, as that raises the risk
Contributor

What do you think about adding a quick note on how ingesters transfer data (an additional pod is spawned while the old one is in the terminating state and transfers its data to the new pod)? I'd also mention here that an operator always wants to avoid losing or moving ingesters, as they are semi-stateful. This affects how you distribute the ingesters across zones and nodes, and which machines you use (e.g. no preemptible/spot instances).

Contributor Author

I think hand-over would go along with basic documentation on ingesters, which we also need; it's not particularly a production thing.
Agreed on the rest.

Contributor Author

#1560 for hand-over doc
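
As an illustration of the "Spread out ingesters" advice discussed in this thread, here is a minimal sketch of a check that flags nodes hosting more than one ingester. The function name `crowded_nodes`, the pod names and the node names are all hypothetical; a real check would read the placement from the cluster rather than from a hard-coded dictionary.

```python
from collections import defaultdict


def crowded_nodes(placement):
    """Return {node: [ingesters]} for every node hosting more than one ingester.

    `placement` maps an ingester name to the node it runs on (hypothetical input).
    """
    by_node = defaultdict(list)
    for ingester, node in placement.items():
        by_node[node].append(ingester)
    # Keep only nodes where a single node failure would take out several ingesters.
    return {node: sorted(pods) for node, pods in by_node.items() if len(pods) > 1}


# Hypothetical placement: two ingesters share node-a.
placement = {
    "ingester-0": "node-a",
    "ingester-1": "node-b",
    "ingester-2": "node-a",
}
print(crowded_nodes(placement))
```

An empty result means no node carries more than one ingester, which is the layout the excerpt recommends.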

Rather incomplete and somewhat high-level, but it's a start.

Signed-off-by: Bryan Boreham
@bboreham bboreham marked this pull request as ready for review August 6, 2019 14:11

@csmarchbanks csmarchbanks left a comment

I agree with weeco's comment about mentioning that moving/rolling ingesters is not desirable, plus a couple other small comments. Otherwise, this looks awesome and thanks a lot for putting these docs up!

docs/running.md Outdated
### Components

Every Cortex installation will need Distributor, Ingester and Querier.
Alert-manager, Ruler and Query-frontend are optional.
Contributor

Nit: Alertmanager is usually one word, happens in the next section as well.

docs/running.md Outdated
### Ingester replication factor

The standard replication factor is three, so that we can drop one
sample and be unconcerned, as we still have two copies of the data

Did you mean 'replica' here instead of 'sample'?
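
The arithmetic behind the replication-factor excerpt can be sketched as follows. This is a minimal illustration that assumes a write needs a simple majority of replicas to succeed; that quorum rule and the function name `quorum` are assumptions made for the example, not something stated in the excerpt.

```python
def quorum(replication_factor):
    """Return (min_success, max_failures) under an assumed majority-quorum rule."""
    # A simple majority of the replicas must accept the write.
    min_success = replication_factor // 2 + 1
    # Everything beyond the majority may fail without losing the write.
    max_failures = replication_factor - min_success
    return min_success, max_failures


for rf in (1, 2, 3, 5):
    ok, lost = quorum(rf)
    print(f"replication factor {rf}: need {ok} successful writes, tolerate {lost} failed")
```

Under this assumption a replication factor of three tolerates one lost replica and still leaves two copies, which matches the wording in the excerpt once "sample" is read as "replica".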

bboreham commented Aug 8, 2019

I think I addressed all review comments.

We do not recommend configuring a liveness probe on ingesters -
killing them is a last resort and should not be left to a machine.
Contributor

Since you are talking about wanting to avoid losing ingesters, what do you think about adding a note about the impact if you still lose more than sampleReplicas ingesters, either temporarily (e.g. ingester pods have network issues such as a zone failure, and so temporarily cannot be queried or written to) or permanently (e.g. one zone has a power outage)?

Speaking of zones, you might want to add a short note about availability across zones and probably link this issue too: #612.

Contributor Author

#731 makes this embarrassing to write about. Typically the pattern is I fix an issue rather than document it.

interface and humans sending queries from GUIs, supply credentials
which identify them and confirm they are authorised.

When configuring the remote_write API in Prometheus there is no way to
Contributor

You can implicitly set a header through the remote write API in Prometheus by configuring a bearer token, which is then sent in the Authorization header. This is what we use; we also embed the tenant ID inside the token and sign it on our Cortex gateway.

Contributor Author

Is that different to "http user and password"?

Contributor

username/password and bearer tokens are just different forms of data in the Authorization header. I guess to be complete you could say "The bearer_token or username and password fields can be set to convey..."
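
To make the point about the Authorization header concrete, here is a minimal sketch of how the two credential forms end up as header values, following the standard HTTP Basic and Bearer schemes. The helper names and the example credentials are hypothetical; this is not how Prometheus or Cortex build the header internally, only what the resulting value looks like.

```python
import base64


def basic_auth_header(username, password):
    """Build an Authorization value for HTTP Basic auth: base64 of "user:password"."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def bearer_auth_header(token):
    """Build an Authorization value for a bearer token, sent as-is after the scheme."""
    return "Bearer " + token


# Hypothetical credentials, for illustration only.
print("Authorization:", basic_auth_header("tenant-1", "s3cret"))
print("Authorization:", bearer_auth_header("example-token"))
```

Either way the receiving side sees one Authorization header, so a gateway in front of Cortex could derive the tenant from whichever form it is configured to accept.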

docs/auth.md Outdated
which identify them and confirm they are authorised.

When configuring the remote_write API in Prometheus there is no way to
add extra headers. The http user and password fields can be user to
Contributor

s/user/used

Signed-off-by: Bryan Boreham

@csmarchbanks csmarchbanks left a comment

LGTM, thanks!

@bboreham
Contributor Author

From #1560 we should also say what the "unhealthy" state of ingesters means and what to do about it.
Also mention that auto-scaling ingesters is, er, brave.

@bboreham
Contributor Author

I'm going to merge this even though it's very incomplete, so it gives some value.
I invite everyone to add small amounts to improve it.

@bboreham bboreham merged commit 1344e61 into master Aug 22, 2019
@bboreham bboreham deleted the prod-guide branch August 22, 2019 15:26