Skip to content

Refactor query frontend to return prometheus error response #5811

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

yeya24
Copy link
Contributor

@yeya24 yeya24 commented Mar 13, 2024

What this PR does:

Change QFE to return Prometheus type error response so that the response can be parsed directly by Grafana.

Which issue(s) this PR fixes:
Fixes #5324

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@yeya24 yeya24 force-pushed the refactor-query-frontend-api-response branch 3 times, most recently from 7e750b4 to 5fed352 Compare March 13, 2024 23:18
@yeya24
Copy link
Contributor Author

yeya24 commented Mar 14, 2024

@alanprot @friedrichg Can you help review the pr and see if the error response code makes sense?

@emanlodovice @rajagopalanand Can you help review the ruler part since this pr refactors the Ruler response API?

@yeya24 yeya24 force-pushed the refactor-query-frontend-api-response branch from 5fed352 to b94715c Compare March 14, 2024 06:56
@@ -163,19 +117,19 @@ func (a *API) PrometheusRules(w http.ResponseWriter, req *http.Request) {
userID, err := tenant.TenantID(req.Context())
if err != nil || userID == "" {
level.Error(logger).Log("msg", "error extracting org id from context", "err", err)
respondError(logger, w, "no valid org id found")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error response got changed from 5xx to 4xx, is it ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you run cortex without an auth service in front this change would make sense because its the client's fault for not setting the header, but if there is another service that sets the header then we will be returning 4xx when that service is not working fine. My vote is keeping the 5xx just because I think its more common that it's another service that sets the header and not the client itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only return 5xx if it is a Cortex server side issue. The case you mentioned is Proxy deployed together with Cortex server. However, Proxy can be something deployed at Client side, too. But both ways, the proxy is still a client to Cortex.

Anyway, I think this code is only used when Cortex Server Auth is not enabled. When auth is enabled, there is an auth middleware before which returns 401 when there is no valid Org ID. https://github.com/cortexproject/cortex/blob/master/vendor/github.com/weaveworks/common/middleware/http_auth.go#L12

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These errors catchs are weird, why would it not have the tenant id here?

I think the thing to remember here is that rulers can also send requests to other rulers, so the question is what happens if one ruler returns 4xx to another, instead of 5xx.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When one Ruler calls other Rulers to fetch rules for example, those are gRPC calls and does not go thru API layer. It calls Rules(ctx context.Context, in *RulesRequest). So it is OK I think

@@ -275,7 +229,7 @@ func (a *API) PrometheusAlerts(w http.ResponseWriter, req *http.Request) {
userID, err := tenant.TenantID(req.Context())
if err != nil || userID == "" {
level.Error(logger).Log("msg", "error extracting org id from context", "err", err)
respondError(logger, w, "no valid org id found")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error response got changed from 5xx to 4xx, is it ok?

@@ -433,7 +387,7 @@ func (a *API) ListRules(w http.ResponseWriter, req *http.Request) {

userID, namespace, _, err := parseRequest(req, false, false)
if err != nil {
respondError(logger, w, err.Error())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error response got changed from 5xx to 4xx, is it ok?

@@ -466,7 +420,7 @@ func (a *API) GetRuleGroup(w http.ResponseWriter, req *http.Request) {
logger := util_log.WithContext(req.Context(), a.logger)
userID, namespace, groupName, err := parseRequest(req, true, true)
if err != nil {
respondError(logger, w, err.Error())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error response got changed from 5xx to 4xx, is it ok?

@@ -488,7 +442,7 @@ func (a *API) CreateRuleGroup(w http.ResponseWriter, req *http.Request) {
logger := util_log.WithContext(req.Context(), a.logger)
userID, namespace, _, err := parseRequest(req, true, false)
if err != nil {
respondError(logger, w, err.Error())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error response got changed from 5xx to 4xx, is it ok?

@@ -570,7 +524,7 @@ func (a *API) DeleteNamespace(w http.ResponseWriter, req *http.Request) {

userID, namespace, _, err := parseRequest(req, true, false)
if err != nil {
respondError(logger, w, err.Error())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error response got changed from 5xx to 4xx, is it ok?

@@ -592,7 +546,7 @@ func (a *API) DeleteRuleGroup(w http.ResponseWriter, req *http.Request) {

userID, namespace, groupName, err := parseRequest(req, true, true)
if err != nil {
respondError(logger, w, err.Error())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error response got changed from 5xx to 4xx, is it ok?

@yeya24 yeya24 force-pushed the refactor-query-frontend-api-response branch 2 times, most recently from ba3687e to 57d4a0a Compare March 14, 2024 07:21
Data interface{} `json:"data,omitempty"`
ErrorType v1.ErrorType `json:"errorType,omitempty"`
Error string `json:"error,omitempty"`
Warnings []string `json:"warnings,omitempty"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having omitempty to be consistent with Prometheus API behavior.

@yeya24 yeya24 force-pushed the refactor-query-frontend-api-response branch from 57d4a0a to 254527f Compare March 14, 2024 07:26
@rajagopalanand
Copy link
Contributor

LGTM

Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query frontend looks fine, not sure about ruler

@@ -163,19 +117,19 @@ func (a *API) PrometheusRules(w http.ResponseWriter, req *http.Request) {
userID, err := tenant.TenantID(req.Context())
if err != nil || userID == "" {
level.Error(logger).Log("msg", "error extracting org id from context", "err", err)
respondError(logger, w, "no valid org id found")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These errors catchs are weird, why would it not have the tenant id here?

I think the thing to remember here is that rulers can also send requests to other rulers, so the question is what happens if one ruler returns 4xx to another, instead of 5xx.

Comment on lines 459 to 484
headers := w.Header()
for k, values := range additionalHeaders {
for _, value := range values {
headers.Set(k, value)
}
}
resp, ok := httpgrpc.HTTPResponseFromError(err)
if ok {
for k, values := range additionalHeaders {
resp.Headers = append(resp.Headers, &httpgrpc.Header{Key: k, Values: values})
code := int(resp.Code)
var errTyp v1.ErrorType
switch resp.Code {
case http.StatusBadRequest, http.StatusRequestEntityTooLarge:
errTyp = v1.ErrBadData
case StatusClientClosedRequest:
errTyp = v1.ErrCanceled
case http.StatusGatewayTimeout:
errTyp = v1.ErrTimeout
case http.StatusUnprocessableEntity:
errTyp = v1.ErrExec
case int32(codes.PermissionDenied):
// Convert gRPC status code to HTTP status code.
code = http.StatusUnprocessableEntity
errTyp = v1.ErrBadData
default:
errTyp = v1.ErrServer
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some places that seems the response is already in the prometheus format? (i think when is returned by the querier itself, not the query frontend?) https://github.com/cortexproject/cortex/blob/master/pkg/querier/tripperware/queryrange/query_range.go#L269
In this case u think make sense to just "pipe" it?

Im afraid the message in this case is another prometheus response serialized. haha

Nit: Does it make sense to move this method now to be in the util/response.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move the method to util.

In this case u think make sense to just "pipe" it?

Do you have any suggestion, how I can get the error and know it is already Prometheus formatted? I have to deserialize the error in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit


type Response struct {
Status string `json:"status"`
Data interface{} `json:"data,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could unexpected break some consumer code because now all 5xx responses will not have a data key. Maybe we can move this on a separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. Sure I will remove this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit

@yeya24 yeya24 force-pushed the refactor-query-frontend-api-response branch from 254527f to 0ba10db Compare April 10, 2024 02:51
@yeya24 yeya24 force-pushed the refactor-query-frontend-api-response branch from 736e4fd to fa57317 Compare April 11, 2024 06:15
Signed-off-by: Ben Ye <[email protected]>
Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good tests

@alanprot alanprot merged commit 6bca2d5 into cortexproject:master Apr 22, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

QueryFrontend error cannot be parsed correctly in Grafana
5 participants