A start at a guide to running Cortex in production #1553

bboreham · 2019-08-02T13:48:48Z

This is something a lot of people ask about.

Probably it will take months to get into decent shape but the sooner we get started the better, so I jotted down some points.

weeco · 2019-08-02T14:36:42Z

docs/running.md

+
+### Spread out ingesters
+
+Don't run multiple ingesters on the same node, as that raises the risk


What do you think about adding a quick note on how ingesters transfer data (an additional pod is spawned while the old one is in the terminating state and transfers its data to the new pod)? I'd also mention here that an operator always wants to avoid losing or moving ingesters, as they are semi-stateful. This affects how you distribute the ingesters across zones and nodes, and which machines you use (e.g. no preemptible/spot instances).

I think hand-over would go along with basic documentation on ingesters, which we also need; it's not particularly a production thing.
Agreed on the rest.

#1560 for hand-over doc
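
As an illustration of the "spread out ingesters" advice discussed in this thread, here is a minimal sketch of a check that flags nodes hosting more than one ingester. The `placement` mapping, the pod and node names, and the `crowded_nodes` helper are all hypothetical and not part of Cortex; in practice the mapping would come from whatever scheduler or inventory you use.

```python
from collections import Counter
from typing import Dict


def crowded_nodes(placement: Dict[str, str]) -> Dict[str, int]:
    """Return nodes that host more than one ingester, with their ingester count.

    `placement` maps an ingester name to the node it runs on (hypothetical input).
    """
    counts = Counter(placement.values())
    return {node: n for node, n in counts.items() if n > 1}


# Hypothetical placement: two ingesters share node-a, so losing that one node
# would take out two ingesters at once.
placement = {
    "ingester-0": "node-a",
    "ingester-1": "node-b",
    "ingester-2": "node-a",
}
print(crowded_nodes(placement))
```

An empty result means every ingester sits on its own node, which is the layout the quoted doc line recommends.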

csmarchbanks

I agree with weeco's comment about mentioning that moving/rolling ingesters is not desirable, plus a couple other small comments. Otherwise, this looks awesome and thanks a lot for putting these docs up!

csmarchbanks · 2019-08-06T15:29:04Z

docs/running.md

+### Components
+
+Every Cortex installation will need Distributor, Ingester and Querier.
+Alert-manager, Ruler and Query-frontend are optional.


Nit: Alertmanager is usually one word, happens in the next section as well.

rade · 2019-08-06T17:28:02Z

docs/running.md

+### Ingester replication factor
+
+The standard replication factor is three, so that we can drop one
+sample and be unconcerned, as we still have two copies of the data


Did you mean 'replica' here instead of 'sample'?
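
The arithmetic behind the quoted sentence can be sketched as follows. The quoted doc only states that with a replication factor of three, losing one replica still leaves two copies; the majority-quorum rule in `tolerable_losses` is an assumption added for illustration, and both function names are hypothetical.

```python
def copies_remaining(replication_factor: int, lost_replicas: int) -> int:
    """Number of copies of the data left after losing some replicas."""
    return max(replication_factor - lost_replicas, 0)


def tolerable_losses(replication_factor: int) -> int:
    """Replicas that can be lost while a strict majority still remains.

    Assumes a majority-quorum rule (an assumption for this sketch).
    """
    quorum = replication_factor // 2 + 1
    return replication_factor - quorum


# With the standard replication factor of three, one replica can be lost
# and two copies of the data are still available.
print(copies_remaining(3, 1))
print(tolerable_losses(3))
```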

bboreham · 2019-08-08T16:05:44Z

I think I addressed all review comments.

weeco · 2019-08-08T16:35:21Z

docs/running.md

+```
+
+We do not recommend configuring a liveness probe on ingesters -
+killing them is a last resort and should not be left to a machine.


Since you are talking about how you want to avoid losing ingesters, what do you think about adding a note on the impact of still losing more than sampleReplicas ingesters? The loss could be temporary (e.g. ingester pods have network issues, such as a zone failure, and temporarily cannot be queried or ingested to) or permanent (e.g. one zone has a power outage).

Speaking of zones, you might want to add a short note about availability across zones and probably link that issue too: #612.

#731 makes this embarrassing to write about. Typically the pattern is I fix an issue rather than document it.

weeco · 2019-08-08T16:38:40Z

docs/auth.md

+interface and humans sending queries from GUIs, supply credentials
+which identify them and confirm they are authorised.
+
+When configuring the remote_write API in Prometheus there is no way to


You can implicitly define a header in the remote write API in Prometheus by setting a bearer token (which is then sent in the Authorization header). This is what we use; we also embed the tenant ID inside the token and sign it on our Cortex gateway.

Is that different from "http user and password"?

username/password and bearer tokens are just different forms of data in the Authorization header. I guess to be complete you could say "The bearer_token or username and password fields can be set to convey..."
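
To make the point above concrete, here is a minimal sketch showing that both options end up as a value of the same Authorization header. The helper names, the tenant name, and the token string are hypothetical; how a gateway maps either value to a tenant is not shown and is left as an assumption.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Authorization value for HTTP Basic auth: base64 of 'username:password'."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def bearer_auth_header(token: str) -> str:
    """Authorization value for a bearer token, which is passed through as-is."""
    return "Bearer " + token


# Hypothetical credentials; either dict could accompany a remote-write request.
headers_basic = {"Authorization": basic_auth_header("tenant-1", "example-password")}
headers_bearer = {"Authorization": bearer_auth_header("example-signed-token")}

print(headers_basic["Authorization"].split(" ", 1)[0])
print(headers_bearer["Authorization"])
```

Either way the receiving side reads a single header, so a gateway in front of Cortex can support whichever scheme suits the deployment.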

csmarchbanks · 2019-08-09T21:07:39Z

docs/auth.md

+which identify them and confirm they are authorised.
+
+When configuring the remote_write API in Prometheus there is no way to
+add extra headers. The http user and password fields can be user to


s/user/used

csmarchbanks · 2019-08-09T21:12:38Z

docs/auth.md

+interface and humans sending queries from GUIs, supply credentials
+which identify them and confirm they are authorised.
+
+When configuring the remote_write API in Prometheus there is no way to


username/password and bearer tokens are just different forms of data in the Authorization header. I guess to be complete you could say "The bearer_token or username and password fields can be set to convey..."

csmarchbanks

LGTM, thanks!

bboreham · 2019-08-20T13:25:13Z

From #1560 we should also say what the "unhealthy" state of ingesters means and what to do about it.
Also mention that auto-scaling ingesters is, er, brave.

bboreham · 2019-08-22T15:26:31Z

I'm going to merge this even though it's very incomplete, so it gives some value.
I invite everyone to add small amounts to improve it.

weeco reviewed Aug 2, 2019

bboreham added 2 commits August 6, 2019 13:57

Docs on running Cortex in Production

2f26894

Rather incomplete and somewhat high-level, but it's a start. Signed-off-by: Bryan Boreham <[email protected]>

Add an estimate of CPU increase for compression

b99a0d3

Signed-off-by: Bryan Boreham <[email protected]>

bboreham force-pushed the prod-guide branch from f6ed86b to b99a0d3 Compare August 6, 2019 13:59

bboreham marked this pull request as ready for review August 6, 2019 14:11

bboreham mentioned this pull request Aug 6, 2019

Document the ingester hand-over process #1560

Merged

csmarchbanks reviewed Aug 6, 2019

rade reviewed Aug 6, 2019

bboreham added 3 commits August 8, 2019 14:11

Add doc about authentication and authorization

3265120

Signed-off-by: Bryan Boreham <[email protected]>

Extend ingesters section

39c9fcf

Signed-off-by: Bryan Boreham <[email protected]>

Spelling of Alertmanager

aca26b5

Signed-off-by: Bryan Boreham <[email protected]>

weeco reviewed Aug 8, 2019

weeco mentioned this pull request Aug 8, 2019

Multi-tenancy documentation #1517

Closed

csmarchbanks reviewed Aug 9, 2019

Mention Bearer token for auth

c182aef

Signed-off-by: Bryan Boreham <[email protected]>

csmarchbanks approved these changes Aug 16, 2019

bboreham merged commit 1344e61 into master Aug 22, 2019

bboreham deleted the prod-guide branch August 22, 2019 15:26

