
out-of-order labelset in compactor #5419

Closed
nschad opened this issue Jun 21, 2023 · 11 comments

@nschad
Contributor

nschad commented Jun 21, 2023

out-of-order Labelset in compactor

We have noticed that, for the past few days, blocks sometimes appear with "out-of-order" errors in the compactor log (example below). We had different and seemingly random errors, affecting different labelsets and different metrics. Sometimes labels were duplicated, literally appearing twice in the same set with different values. In this example you can see that what is supposed to be beta_kubernetes_io_instance_type is completely corrupted. Because of this corruption the required sorting of the label names in the set is violated, hence the error. This is also the only occurrence in the block; running tsdb analyze shows that other metrics do not have this buggy label.

msg="out-of-order label set: known bug in Prometheus 2.8.0 and below" labelset="{__name__=\"container_memory_usage_bytes\", beta_kubernetes_io_arch=\"amd64\", beta_kuber\u0000(\ufffd@\u0010\ufffdౡ\ufffd1stance_type=\"c1.2\", beta_kubernetes_io_os=\"linux\", ....}

Additionally, the Grafana Metric Browser dashboard sometimes shows labels like these, which is interesting to me since the labels there are fetched through the Prometheus /api/v1/labels API, right? To me this means the problem already exists in the ingesters, and we can rule out data corruption at the S3 layer.
[screenshot: corrupted label names shown in the Grafana Metric Browser]
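
(To double-check that on the query path, here is a small sketch that pulls /api/v1/labels and flags label names containing non-printable characters; the address and path prefix below are hypothetical and depend on your deployment:)

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"unicode"
)

// labelsResponse matches the shape of the Prometheus-compatible
// /api/v1/labels response.
type labelsResponse struct {
	Status string   `json:"status"`
	Data   []string `json:"data"`
}

func main() {
	// Hypothetical query-frontend address and path prefix; adjust to your setup.
	resp, err := http.Get("http://cortex-query-frontend:8080/prometheus/api/v1/labels")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var lr labelsResponse
	if err := json.NewDecoder(resp.Body).Decode(&lr); err != nil {
		panic(err)
	}

	// Flag label names containing non-printable or replacement characters,
	// like the corrupted beta_kubernetes_io_instance_type above.
	for _, name := range lr.Data {
		for _, r := range name {
			if !unicode.IsPrint(r) || r == unicode.ReplacementChar {
				fmt.Printf("suspicious label name: %q\n", name)
				break
			}
		}
	}
}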

To Reproduce
Steps to reproduce the behavior:

  1. Unknown

Expected behavior
Non-corrupted data.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm
  • Cortex Version: 1.15.1

Additional Context
Also, we don't run any Prometheus version below 2.35.

@friedrichg
Member

"out-of-order label set: known bug in Prometheus 2.8.0 and below",

@friedrichg
Member

It's important to know whether the block having the issue was generated by an ingester or is a new block created by the compactor.
If the problem is in a block that an ingester generated, can you check the logs for that ingester?

@nschad
Contributor Author

nschad commented Jun 21, 2023

It's important to know whether the block having the issue was generated by an ingester or is a new block created by the compactor. If the problem is in a block that an ingester generated, can you check the logs for that ingester?

The meta.json for the block in s3 says

{
	"ulid": "01H3A9M6PP20DT65VK52NE1FST",
	"minTime": 1687183200000,
	"maxTime": 1687190400000,
	"stats": {
		"numSamples": 232175131,
		"numSeries": 1614280,
		"numChunks": 1957317
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01H3A9M6PP20DT65VK52NE1FST"
		]
	},

Is that already after the vertical compaction, or is level 1 the starting point? I'll see if I can find something in the ingester logs.

@nschad
Contributor Author

nschad commented Jun 21, 2023

According to this

Level 1 is the starting point, so no compaction by the compactor has been done yet. So basically the block was ingester-generated, @friedrichg, which goes hand in hand with my assessment that something is already wrong with the ingester data.

Unless I'm mistaken.

@alanprot
Member

Hi @nschad, do you have compression enabled on your gRPC client? If so, which one?

@alanprot
Member

OK...

I'm not sure, but this may be related to #5193 not playing well with grpc/grpc-go#6355 in cases of timeout.

I will create a PR to not reuse the request in case of error, just in case, and release 1.15.2.
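
(To illustrate the direction of that fix with a sketch only, using made-up types rather than the actual Cortex code: build a fresh request for each retry instead of reusing one whose memory might still be referenced by the transport after a timeout.)

package main

import (
	"context"
	"errors"
	"time"
)

// pushRequest stands in for the protobuf message sent over gRPC.
type pushRequest struct {
	payload []byte
}

// send stands in for the gRPC call; here it always pretends to time out.
func send(ctx context.Context, req *pushRequest) error {
	return errors.New("context deadline exceeded")
}

// pushWithRetry rebuilds the request on every attempt instead of reusing the
// previous (and possibly still-referenced) message.
func pushWithRetry(ctx context.Context, build func() *pushRequest, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		req := build()
		if err = send(ctx, req); err == nil {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return err
}

func main() {
	_ = pushWithRetry(context.Background(), func() *pushRequest {
		return &pushRequest{payload: []byte("samples...")}
	}, 3)
}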

@alanprot
Member

Hi @nschad,

I cut the v1.15.3 release. Can you update and see if this problem is still happening?

@nschad
Contributor Author

nschad commented Jun 23, 2023

Hi @nschad,

I cut the v1.15.3 release. Can you update and see if this problem is still happening?

Cool, I'll try that right now. It should also be noted that we do get a lot of timeouts from time to time due to some other, unrelated underlying problem, so this very well could be the case.

@nschad
Contributor Author

nschad commented Jun 23, 2023

Hi @nschad, do you have compression enabled on your gRPC client? If so, which one?

gRPC compression is disabled. However, api.response_compression_enabled is enabled, but that should be unrelated.

@nschad
Contributor Author

nschad commented Jun 27, 2023

@alanprot We have been running 1.15.3 since Friday and haven't experienced the issue yet.

@alanprot
Member

@nschad Thanks. Feel free to close the issue when you are confident that this was the problem.
