Port Cortex to use Prometheus 2 packages #583
Conversation
(will take a look at the remaining test failures)
Hmm, tests were "just" flaky, not sure if related to my changes or not.
Did you forget to
@rade Indeed, thanks. Still new to
Actually, according to the docs one is meant to run. Apologies for the mis-guidance; I haven't built cortex for a while.
...though it looks like that would also retain the BUILD.bazel files of removed dependencies, which doesn't seem right.
Thanks! I'll get to testing this.
Comments are mostly questions to help me understand rather than suggestions for changing things.
engine,
metadataQueryable,
querier.DummyTargetRetriever{},
querier.DummyAlertmanagerRetriever{},
What does it mean to give the API dummy retrievers?
The normal Prometheus API can provide a list of targets (e.g. http://demo.robustperception.io:9090/api/v1/targets) and configured Alertmanagers (http://demo.robustperception.io:9090/api/v1/alertmanagers). We don't need/have this information for Cortex, so we pass in dummy implementations that just give empty lists.
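For illustration, a "dummy" retriever is just an implementation whose methods report nothing, so the API ends up rendering empty lists. Below is a minimal self-contained sketch of that idea; the interface shown is a stand-in rather than the exact one the Prometheus v1 API package defines, so treat the method name and signature as assumptions:

```go
package main

import (
	"fmt"
	"net/url"
)

// alertmanagerRetriever is a stand-in for the interface the Prometheus API
// expects; the real interface lives in the upstream v1 API package and its
// exact shape is an assumption here.
type alertmanagerRetriever interface {
	Alertmanagers() []*url.URL
}

// dummyAlertmanagerRetriever always answers "none configured", which the API
// then renders as an empty list.
type dummyAlertmanagerRetriever struct{}

func (dummyAlertmanagerRetriever) Alertmanagers() []*url.URL { return nil }

func main() {
	var r alertmanagerRetriever = dummyAlertmanagerRetriever{}
	fmt.Printf("configured alertmanagers: %d\n", len(r.Alertmanagers())) // 0
}
```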
pkg/chunk/chunk_store.go
orgID, err := user.ExtractOrgID(ctx)
if err != nil {
	return nil, err
// TODO(prom2): Does this contraption over-match?
How would we test this?
This is something I still want to look at more closely before merging. Mainly I extracted this code from the lazy iterators and moved it here (for eager loading). I just need to reason through it all one more time before giving a better answer :)
So to elaborate:
The question I have around this code is not related to this PR really, as I just moved it here to be eager-loading instead of lazy-loading. The same question applies to the current master.
Somehow it seems to actually select only the correct series, although if I run it in my head there should be a bug: the code gets all series where the metric name matcher applies, then filters them down to only the series where the other matchers also apply. From those matching series, it builds a set of new matchers that are then used to look up the final series chunks. But those matchers should then match too many series in some cases.
So e.g. I thought this would over-match in the following scenario:
- TSDB has both foo{a="b"} and foo{a="b", c="d"}
- Query is for {__name__=~"foo", c!="d"}
- (correctly this should return foo{a="b"}, but not foo{a="b", c="d"})
What I would expect to happen in the code is then:
- it finds and matches both foo series in the metric name index
- it filters out the foo{a="b", c="d"} series because it doesn't match the c!="d" matcher
- it builds new matchers out of the remaining series's labels (foo{a="b"})
- it uses those new matchers to look up all series that match them, which should also match foo{a="b", c="d"}, since it is a superset of labels
But somehow, it does not return this wrong series and the query result is correct. But why? Do you see what I'm missing?
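To make the superset concern concrete, here is a tiny self-contained sketch (using a stand-in matcher type, not Cortex's actual matcher code) of why matchers rebuilt from the labels of foo{a="b"} would also accept foo{a="b", c="d"}:

```go
package main

import "fmt"

// matcher is a stand-in equality matcher, only for illustrating the superset
// problem described above; it is not Cortex's matcher type.
type matcher struct{ name, value string }

// matchesAll reports whether every matcher is satisfied by the series' labels.
func matchesAll(ms []matcher, series map[string]string) bool {
	for _, m := range ms {
		if series[m.name] != m.value {
			return false
		}
	}
	return true
}

func main() {
	// Matchers rebuilt from the labels of the surviving series foo{a="b"}.
	rebuilt := []matcher{{"__name__", "foo"}, {"a", "b"}}

	// The series that was already filtered out still satisfies every rebuilt
	// matcher, because its label set is a superset of foo{a="b"}.
	excluded := map[string]string{"__name__": "foo", "a": "b", "c": "d"}
	fmt.Println(matchesAll(rebuilt, excluded)) // true -> would be over-selected
}
```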
So of course I couldn't just sleep with this mystery unsolved. Many debug Printf's later, I found out the following:
- The returned series were only correct because they came from the ingesters only, and not from the chunk store.
- The chunk store actually didn't select any data for the above sample scenario because of a bug I introduced (fixed in latest push) that passed in __name__ instead of the metric name into getMetricNameMatrix(). Now that that's fixed, it does select too many series, as outlined in my example scenario.
- The current master is actually worse: rather than selecting too many series, the lazy iterator just crashes with a nil pointer dereference in the above scenario, as soon as there are relevant chunks in S3. This is currently happening in production as well.

So: master is currently broken in a crashy way (queries fail completely), whereas this PR branch is broken in that it selects too many series in an edge case. That problem would also occur in master, if it didn't crash before that. Thus, I am almost tending to not try and fix this unrelated bug in this PR. @jml What do you think?
pkg/chunk/chunk_store_test.go
}

// TestChunkStore_Get tests iterators are returned correctly depending on the type of query
func TestChunkStore_Get_concrete(t *testing.T) {
	// TODO(prom2): reintroduce tests that were part of TestChunkStore_Get_lazy
Given that our next step is testing, it would be good to actually reintroduce these.
Ack.
@@ -8,6 +8,7 @@ import (
	"net/url"
	"time"

	gklog "github.com/go-kit/kit/log"
Why use go-kit's logger?
Prometheus 2.0 packages use go-kit's logger interfaces everywhere now, so we need to interface with that. See this in the same file below:
// TODO(prom2): Wrap our logger into a gokit logger.
rule = rules.NewAlertingRule(r.Name, r.Expr, r.Duration, r.Labels, r.Annotations, gklog.NewNopLogger())
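Since go-kit's log.Logger is a single-method interface (Log(keyvals ...interface{}) error), wrapping an existing logger only needs a small forwarding type. Here is a rough sketch of that TODO; it forwards to the standard library logger purely for illustration, and the adapter type name is made up (the real adapter would forward to Cortex's own logging setup):

```go
package main

import (
	stdlog "log"

	gklog "github.com/go-kit/kit/log"
)

// stdlibAdapter satisfies go-kit's log.Logger interface by forwarding the
// key/value pairs to the standard library logger. Illustration only.
type stdlibAdapter struct{}

func (stdlibAdapter) Log(keyvals ...interface{}) error {
	stdlog.Println(keyvals...)
	return nil
}

func main() {
	// Anything implementing Log(keyvals ...interface{}) error can be handed
	// to code (like rules.NewAlertingRule above) that expects a go-kit logger.
	var logger gklog.Logger = stdlibAdapter{}
	logger.Log("msg", "evaluating rule", "rule", "ExampleAlert")
}
```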
pkg/querier/series_set.go
	return e.err
}

type metadataSeriesSet struct {
I don't understand what this is for.
Whoops, I forgot to remove that type. Removed.
@rade Hm yeah, update-vendor:
dep ensure && dep prune
git status | grep BUILD.bazel | cut -d' ' -f 5 | xargs git checkout HEAD

I guess preserving those BUILD.bazel files would only work now if they hadn't already been removed in previous commits. And that also means we have to manually write Bazel files for new dependencies? Or is there an automation for that? /cc @tomwilkie
Ok, so I re-added the Bazel build files by running:
It builds with Bazel now :)
(force-pushed from 473008a to 44a2337)
@jml I finished all remaining items. Most importantly, the over-selection edge case in the chunk store is now fixed, and there are tests to cover that and other cases. The new test cases also uncovered another bug that is fixed now. I also simplified the test scenario code a bit. I think this is ready for final testing and review.
New commits LGTM. Thanks especially for the tests!
Besides testing the alert notifications end-to-end one more time in dev, does anyone see any other things that we should ensure before merging? I at least have not found more issues so far, and beginning of the week would generally be good timing for deploying to prod and then dealing with any potential fallout.
Just chatted w/ Marcus about this in person.
- end-to-end alert notifications still need to be tested
- need to clarify migration plan for rules/alert configs
We want to have an alert correctness job, but we shouldn't block the prom2
merge/prod-deploy on that.
@leth – tagging re above message.
Note that this PR doesn't change the rule format yet, so there's no
migration needed for this yet.
I'll do the (still manual) alert testing tonight.
(force-pushed from 73a82ba to 95f4254)
Our conviction is like an arrow already in flight.
The updated `go vet` complains about more errors than the old one.
The upstream prometheus/common/log changed and doesn't automatically register flags anymore.
The laziness was actually never used.
Only the v8 schema supports omitting the metric name in a query.
- Fix over-selection bug when the metric name was non-equals-matched and one selects a series that has a subset of labels of another.
- Regex matches against the metric name were broken because the metric name was equals-compared to the matcher regex instead of applying it as a regex (a short illustration follows after these commit messages).
- Unneeded sorting of label matchers has been removed.
The ingester only needs the MaxChunkAge from the SchemaConfig, and the ingester.Config is actually a better place to store and configure that information authoritatively.
The deep copy that dumped it as JSON and loaded it again stumbled over the fact that the JSON marshaling renders Secret fields as <secret> and thus loses the original secret field contents when reloading it from JSON.
Unfortunately this caused table-manager to see a zero value for
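As an aside on the regex-matching fix mentioned above, here is a small illustrative sketch (not the actual chunk store code) of the difference between equals-comparing a metric name to the matcher's regex string and applying it as an anchored regex, the way Prometheus matchers are evaluated:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	metricName := "foo_requests_total"
	pattern := "foo.*" // value of a hypothetical __name__=~"foo.*" matcher

	// Buggy behaviour described in the commit message: comparing the name to
	// the regex string itself, which only matches if the name literally
	// equals the pattern.
	fmt.Println(metricName == pattern) // false

	// Intended behaviour: apply the pattern as a fully anchored regex.
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	fmt.Println(re.MatchString(metricName)) // true
}
```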
This is the main part of #555
As promised, this is a crazy mega change. Unfortunately, when swapping out vendored packages and making everything work again, a smaller change wasn't really possible.
Here's a high-level overview:
- I copied the Prometheus 1.x packages that Cortex still needs into pkg/prom1/... and rewrote all the imports for them.
- The new prometheus/common/log does not automatically configure log level flags anymore, so there's extra code to do that in Cortex now.
- Fixed the additional errors that the updated go vet complains about.
- In Prom 2, the Select() method is used for both metadata and sample querying, but a remote implementation would want to know beforehand whether it needs to get full sample data or only metadata. That's why I allow parametrizing the Queryable (which produces a Querier) to tell it whether it should produce full sample or only metadata queriers, and then hand in the right kind of Queryable to the using packages.
- The new prometheus/common/route package doesn't have a context function argument in its constructor anymore, but we also didn't need that anymore, so that dummy function argument is just removed.
- I moved from golang.org/x/net/context to native context in many places (was required for some things), but not everywhere yet.
- The LazySeriesIterators were never actually used lazily, and would have made Prom 2 usage much harder, so I removed them in favor of eagerly evaluating even queries that fuzzy-match the metric name.
- The chunk store's Get() and Query() functions now return model.Matrix instead of iterators, since the above-mentioned eager eval obviated the need for iterators again and this made things simpler.
- I copied the concreteSeries / concreteSeriesSet / errorSeriesSet implementations from Prometheus's remote storage code to be able to return series from Select() that are backed by already present sample data (or errors).
- pkg/ruler/compat.go fixes "Don't create one ingester write request per rule-generated sample" (#569) as a side-effect by batching up all samples from a given rule evaluation and only sending them upon Commit() (called at the end of a rule evaluation). A rough sketch of this batching idea follows at the end of this description.
- A lot of dep dependency vendoring wrangling was necessary to make all the different package versions and conflicting dependencies work. Hopefully, everything fits well enough together now (at least it compiles!).

User-visible changes due to this PR:

- All /api/v1 functionality should be available (there weren't many (any?) changes). Note that /api/v2 isn't available yet, but only contains some extra endpoints which we don't want to implement anyway so far.
- The rule format is unchanged and still the Prom 1.x one (when we later move to the Prom 2.0 format, users can use promtool to do conversions). This is great, as it allows us to keep at least one major user-visible change out of this PR, and it is easy to change the code to expect the Prom 2.0 rule format in a subsequent change.

How to review this change? Let's be honest: you probably can't :) Instead, help test it! All the pieces should be roughly there, but I expect bugs. I tested basic ingestion/querying/alerting rules/AM UI, but no edge cases or other very thorough manual testing yet.
Please help test everything you can think of in this current state and let me know what is still broken! It would be good to get this merged after we are reasonably confident that it is working, and do other code cleanups and features in later changes, since probably a lot of other stuff is blocked on this one large code change.
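For reviewers curious about the ruler change mentioned in the overview above, here is a rough, self-contained sketch of the batch-on-Commit() idea (not the actual pkg/ruler/compat.go code; the type and interface names are made up): samples appended during one rule evaluation are buffered and only turned into a single write request when Commit() is called at the end of the evaluation.

```go
package main

import "fmt"

// sample is a minimal stand-in for a rule-generated sample.
type sample struct {
	labels    map[string]string
	timestamp int64
	value     float64
}

// pusher is a stand-in for "something that sends a write request"; the real
// code would go through Cortex's distributor/ingester path.
type pusher interface {
	Push(samples []sample) error
}

// batchingAppender buffers samples instead of sending one request per sample.
type batchingAppender struct {
	pusher  pusher
	pending []sample
}

func (a *batchingAppender) Add(labels map[string]string, ts int64, v float64) {
	a.pending = append(a.pending, sample{labels, ts, v})
}

// Commit flushes the whole batch as a single request and resets the buffer.
func (a *batchingAppender) Commit() error {
	defer func() { a.pending = nil }()
	return a.pusher.Push(a.pending)
}

// printPusher just reports how many samples one request would carry.
type printPusher struct{}

func (printPusher) Push(samples []sample) error {
	fmt.Printf("sending 1 write request with %d samples\n", len(samples))
	return nil
}

func main() {
	app := &batchingAppender{pusher: printPusher{}}
	app.Add(map[string]string{"__name__": "up"}, 1000, 1)
	app.Add(map[string]string{"__name__": "up"}, 2000, 0)
	app.Commit() // -> sending 1 write request with 2 samples
}
```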