-
Notifications
You must be signed in to change notification settings - Fork 818
Refactor query frontend to return prometheus error response #5811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor query frontend to return prometheus error response #5811
Conversation
7e750b4
to
5fed352
Compare
@alanprot @friedrichg Can you help review the pr and see if the error response code makes sense? @emanlodovice @rajagopalanand Can you help review the ruler part since this pr refactors the Ruler response API? |
5fed352
to
b94715c
Compare
@@ -163,19 +117,19 @@ func (a *API) PrometheusRules(w http.ResponseWriter, req *http.Request) { | |||
userID, err := tenant.TenantID(req.Context()) | |||
if err != nil || userID == "" { | |||
level.Error(logger).Log("msg", "error extracting org id from context", "err", err) | |||
respondError(logger, w, "no valid org id found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error response got changed from 5xx to 4xx, is it ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you run cortex without an auth service in front this change would make sense because its the client's fault for not setting the header, but if there is another service that sets the header then we will be returning 4xx when that service is not working fine. My vote is keeping the 5xx just because I think its more common that it's another service that sets the header and not the client itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only return 5xx if it is a Cortex server side issue. The case you mentioned is Proxy deployed together with Cortex server. However, Proxy can be something deployed at Client side, too. But both ways, the proxy is still a client to Cortex.
Anyway, I think this code is only used when Cortex Server Auth is not enabled. When auth is enabled, there is an auth middleware before which returns 401 when there is no valid Org ID. https://github.com/cortexproject/cortex/blob/master/vendor/github.com/weaveworks/common/middleware/http_auth.go#L12
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These errors catchs are weird, why would it not have the tenant id here?
I think the thing to remember here is that rulers can also send requests to other rulers, so the question is what happens if one ruler returns 4xx to another, instead of 5xx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When one Ruler calls other Rulers to fetch rules for example, those are gRPC calls and does not go thru API layer. It calls Rules(ctx context.Context, in *RulesRequest)
. So it is OK I think
@@ -275,7 +229,7 @@ func (a *API) PrometheusAlerts(w http.ResponseWriter, req *http.Request) { | |||
userID, err := tenant.TenantID(req.Context()) | |||
if err != nil || userID == "" { | |||
level.Error(logger).Log("msg", "error extracting org id from context", "err", err) | |||
respondError(logger, w, "no valid org id found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error response got changed from 5xx to 4xx, is it ok?
@@ -433,7 +387,7 @@ func (a *API) ListRules(w http.ResponseWriter, req *http.Request) { | |||
|
|||
userID, namespace, _, err := parseRequest(req, false, false) | |||
if err != nil { | |||
respondError(logger, w, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error response got changed from 5xx to 4xx, is it ok?
@@ -466,7 +420,7 @@ func (a *API) GetRuleGroup(w http.ResponseWriter, req *http.Request) { | |||
logger := util_log.WithContext(req.Context(), a.logger) | |||
userID, namespace, groupName, err := parseRequest(req, true, true) | |||
if err != nil { | |||
respondError(logger, w, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error response got changed from 5xx to 4xx, is it ok?
@@ -488,7 +442,7 @@ func (a *API) CreateRuleGroup(w http.ResponseWriter, req *http.Request) { | |||
logger := util_log.WithContext(req.Context(), a.logger) | |||
userID, namespace, _, err := parseRequest(req, true, false) | |||
if err != nil { | |||
respondError(logger, w, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error response got changed from 5xx to 4xx, is it ok?
@@ -570,7 +524,7 @@ func (a *API) DeleteNamespace(w http.ResponseWriter, req *http.Request) { | |||
|
|||
userID, namespace, _, err := parseRequest(req, true, false) | |||
if err != nil { | |||
respondError(logger, w, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error response got changed from 5xx to 4xx, is it ok?
@@ -592,7 +546,7 @@ func (a *API) DeleteRuleGroup(w http.ResponseWriter, req *http.Request) { | |||
|
|||
userID, namespace, groupName, err := parseRequest(req, true, true) | |||
if err != nil { | |||
respondError(logger, w, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error response got changed from 5xx to 4xx, is it ok?
ba3687e
to
57d4a0a
Compare
Data interface{} `json:"data,omitempty"` | ||
ErrorType v1.ErrorType `json:"errorType,omitempty"` | ||
Error string `json:"error,omitempty"` | ||
Warnings []string `json:"warnings,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having omitempty
to be consistent with Prometheus API behavior.
57d4a0a
to
254527f
Compare
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query frontend looks fine, not sure about ruler
@@ -163,19 +117,19 @@ func (a *API) PrometheusRules(w http.ResponseWriter, req *http.Request) { | |||
userID, err := tenant.TenantID(req.Context()) | |||
if err != nil || userID == "" { | |||
level.Error(logger).Log("msg", "error extracting org id from context", "err", err) | |||
respondError(logger, w, "no valid org id found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These errors catchs are weird, why would it not have the tenant id here?
I think the thing to remember here is that rulers can also send requests to other rulers, so the question is what happens if one ruler returns 4xx to another, instead of 5xx.
pkg/frontend/transport/handler.go
Outdated
headers := w.Header() | ||
for k, values := range additionalHeaders { | ||
for _, value := range values { | ||
headers.Set(k, value) | ||
} | ||
} | ||
resp, ok := httpgrpc.HTTPResponseFromError(err) | ||
if ok { | ||
for k, values := range additionalHeaders { | ||
resp.Headers = append(resp.Headers, &httpgrpc.Header{Key: k, Values: values}) | ||
code := int(resp.Code) | ||
var errTyp v1.ErrorType | ||
switch resp.Code { | ||
case http.StatusBadRequest, http.StatusRequestEntityTooLarge: | ||
errTyp = v1.ErrBadData | ||
case StatusClientClosedRequest: | ||
errTyp = v1.ErrCanceled | ||
case http.StatusGatewayTimeout: | ||
errTyp = v1.ErrTimeout | ||
case http.StatusUnprocessableEntity: | ||
errTyp = v1.ErrExec | ||
case int32(codes.PermissionDenied): | ||
// Convert gRPC status code to HTTP status code. | ||
code = http.StatusUnprocessableEntity | ||
errTyp = v1.ErrBadData | ||
default: | ||
errTyp = v1.ErrServer | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some places that seems the response is already in the prometheus format? (i think when is returned by the querier itself, not the query frontend?) https://github.com/cortexproject/cortex/blob/master/pkg/querier/tripperware/queryrange/query_range.go#L269
In this case u think make sense to just "pipe" it?
Im afraid the message in this case is another prometheus response serialized. haha
Nit: Does it make sense to move this method now to be in the util/response.go
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move the method to util.
In this case u think make sense to just "pipe" it?
Do you have any suggestion, how I can get the error and know it is already Prometheus formatted? I have to deserialize the error in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in latest commit
pkg/util/api/response.go
Outdated
|
||
type Response struct { | ||
Status string `json:"status"` | ||
Data interface{} `json:"data,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could unexpected break some consumer code because now all 5xx responses will not have a data
key. Maybe we can move this on a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. Sure I will remove this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in latest commit
254527f
to
0ba10db
Compare
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
736e4fd
to
fa57317
Compare
Signed-off-by: Ben Ye <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good tests
What this PR does:
Change QFE to return Prometheus type error response so that the response can be parsed directly by Grafana.
Which issue(s) this PR fixes:
Fixes #5324
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]