
out-of-order labelset in compactor #5419

Closed
nschad opened this issue Jun 21, 2023 · 11 comments

@nschad
Contributor

nschad commented Jun 21, 2023

out-of-order Labelset in compactor

We have noticed that, for the past few days, blocks sometimes appear with "out-of-order" errors in the compactor log (example below). We had different and seemingly random errors, affecting different labelsets and different metrics. Sometimes labels were duplicated, literally appearing twice in the same set with different values. In this example you can see that what is supposed to be beta_kubernetes_io_instance_type is completely corrupted. Because of this corruption the required sorting of the label names in the set is violated, hence the error. This is also the only occurrence in the block; running tsdb analyze shows that other metrics do not have this buggy label.

msg="out-of-order label set: known bug in Prometheus 2.8.0 and below" labelset="{__name__=\"container_memory_usage_bytes\", beta_kubernetes_io_arch=\"amd64\", beta_kuber\u0000(\ufffd@\u0010\ufffdౡ\ufffd1stance_type=\"c1.2\", beta_kubernetes_io_os=\"linux\", ....}

Additionally, the Grafana Metric Browser dashboard sometimes shows labels like these, which is interesting to me since the labels there are fetched through the Prometheus /api/v1/labels API, right? To me this means the problem already exists in the ingesters, and we can rule out data corruption at the S3 layer.
[screenshot: corrupted label names shown in the Grafana Metric Browser]
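
(To double-check that on the query path, here is a small sketch that pulls /api/v1/labels and flags label names containing non-printable characters; the address and path prefix below are hypothetical and depend on your deployment:)

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"unicode"
)

// labelsResponse matches the shape of the Prometheus-compatible
// /api/v1/labels response.
type labelsResponse struct {
	Status string   `json:"status"`
	Data   []string `json:"data"`
}

func main() {
	// Hypothetical query-frontend address and path prefix; adjust to your setup.
	resp, err := http.Get("http://cortex-query-frontend:8080/prometheus/api/v1/labels")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var lr labelsResponse
	if err := json.NewDecoder(resp.Body).Decode(&lr); err != nil {
		panic(err)
	}

	// Flag label names containing non-printable or replacement characters,
	// like the corrupted beta_kubernetes_io_instance_type above.
	for _, name := range lr.Data {
		for _, r := range name {
			if !unicode.IsPrint(r) || r == unicode.ReplacementChar {
				fmt.Printf("suspicious label name: %q\n", name)
				break
			}
		}
	}
}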

To Reproduce
Steps to reproduce the behavior:

  1. Unknown

Expected behavior
Non-corrupted data.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm
  • Cortex Version: 1.15.1

Additional Context
Also, we don't run any Prometheus version below 2.35.

@friedrichg
Member

"out-of-order label set: known bug in Prometheus 2.8.0 and below",

@friedrichg
Member

It's important to know whether the block having the issue was generated by an ingester or is a new block created by the compactor.
If the problem is in a block that an ingester generated, can you check the logs for that ingester?

@nschad
Contributor Author

nschad commented Jun 21, 2023

It's important to know whether the block having the issue was generated by an ingester or is a new block created by the compactor. If the problem is in a block that an ingester generated, can you check the logs for that ingester?

The meta.json for the block in s3 says

{
	"ulid": "01H3A9M6PP20DT65VK52NE1FST",
	"minTime": 1687183200000,
	"maxTime": 1687190400000,
	"stats": {
		"numSamples": 232175131,
		"numSeries": 1614280,
		"numChunks": 1957317
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01H3A9M6PP20DT65VK52NE1FST"
		]
	},

Is that already after the vertical compaction, or is level 1 the starting point? I'll see if I can find something in the ingester logs.

@nschad
Contributor Author

nschad commented Jun 21, 2023

According to this

Level 1 is the starting point, so no compaction by the compactor has been done yet. So basically the block was ingester-generated, @friedrichg, which goes hand in hand with my assessment that something is already wrong with the ingester data.

Unless I'm mistaken.

@alanprot
Member

Hi @nschad, do you have compression enabled on your gRPC client? If so, which one?

@alanprot
Member

OK...

I'm not sure, but this may be related to #5193 not playing well with grpc/grpc-go#6355 in cases of timeout.

I will create a PR to not reuse the request in case of error, just in case, and release 1.15.2.
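
(To illustrate the direction of that fix with a sketch only, using made-up types rather than the actual Cortex code: build a fresh request for each retry instead of reusing one whose memory might still be referenced by the transport after a timeout.)

package main

import (
	"context"
	"errors"
	"time"
)

// pushRequest stands in for the protobuf message sent over gRPC.
type pushRequest struct {
	payload []byte
}

// send stands in for the gRPC call; here it always pretends to time out.
func send(ctx context.Context, req *pushRequest) error {
	return errors.New("context deadline exceeded")
}

// pushWithRetry rebuilds the request on every attempt instead of reusing the
// previous (and possibly still-referenced) message.
func pushWithRetry(ctx context.Context, build func() *pushRequest, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		req := build()
		if err = send(ctx, req); err == nil {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return err
}

func main() {
	_ = pushWithRetry(context.Background(), func() *pushRequest {
		return &pushRequest{payload: []byte("samples...")}
	}, 3)
}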

@alanprot
Member

Hi @nschad,

I cut the v1.15.3 release. Can you update and see if this problem is still happening?

@nschad
Contributor Author

nschad commented Jun 23, 2023

Hi @nschad,

I cut the v1.15.3 release. Can you update and see if this problem is still happening?

Cool, I'll try that right now. It should also be noted that we do get a lot of timeouts from time to time due to some other, unrelated underlying problem, so this very well could be the case.

@nschad
Contributor Author

nschad commented Jun 23, 2023

Hi @nschad, do you have compression enabled on your gRPC client? If so, which one?

gRPC compression is disabled. However, api.response_compression_enabled is enabled, but that should be unrelated.

@nschad
Contributor Author

nschad commented Jun 27, 2023

@alanprot We have been running 1.15.3 since Friday and haven't experienced the issue yet.

@alanprot
Member

@nschad Thanks. Feel free to close the issue when you are confident that this was the problem.
