Description
Is this a security vulnerability?
No.
Issue
The time that is spent on Gorouter retries is unclear / invisible in the gorouter 'access log' and requires manual correlation with other logs.
The response_time metric does not break down further into components that would be useful for error triage, latency analysis and customer support.
Affected Versions
All.
Context
Gorouter retries requests that can safely be retried on other route endpoints (if available). Such retries can take time. Depending on the type of error, this time can be much higher than the regular response time.
Each retry is logged explicitly as log message, indicating the request attempt and reason for error, but the number of attempts or time taken for all unsuccessful attempts is not recorded in the final log summarizing the request.
Other times, e.g. the dial time to particular endpoints, is also not recorded and would help to fine-tune the endpoint_dial_timeout
property that would allow reducing the time spent on retries due to unreachable endpoint that ends in an i/o timeout
error.
The 'access log' is the gorouter log that includes the URL, response code, etc. and is logged at the end of the negroni middleware chain.
Proposal
We propose to:
- capture the time taken on a particular attempt and log it in the retry error log
- log the cumulative time spent on unsuccessful attempts, and the number of attempts it took to complete the request in the 'access log'
- capture the dial time as part of the
Dialer
to collect metrics, which would allow allow finding the 'worst case scenario' for currently deployed backends and adjust a lowerendpoint_dial_timeout
accordingly. - log the dial time for each attempt.
The result would be an access log that contains:
response_time
-> the complete time the request has taken from arriving at gorouter to responding to the client (unchanged)gorouter_time
-> the time gorouter required for internal processing, including route lookup, etc.retry_time
-> the time gorouter spent on unsuccessful attempts that needed to be retried.dial_time
-> the dial_time of the successful response
The app response time for the successful response could then be computed as follows: response_time - gorouter_time - retry_time
The same logic could apply for failed requests, i.e. when none of the gorouter retries would lead to a successful backend response.