Skip to content

Capture and log information about gorouter retries #304

Closed
cloudfoundry/gorouter
#346
@peanball

Description

@peanball

Is this a security vulnerability?

No.

Issue

The time that is spent on Gorouter retries is unclear / invisible in the gorouter 'access log' and requires manual correlation with other logs.

The response_time metric does not break down further into components that would be useful for error triage, latency analysis and customer support.

Affected Versions

All.

Context

Gorouter retries requests that can safely be retried on other route endpoints (if available). Such retries can take time. Depending on the type of error, this time can be much higher than the regular response time.

Each retry is logged explicitly as log message, indicating the request attempt and reason for error, but the number of attempts or time taken for all unsuccessful attempts is not recorded in the final log summarizing the request.

Other times, e.g. the dial time to particular endpoints, is also not recorded and would help to fine-tune the endpoint_dial_timeout property that would allow reducing the time spent on retries due to unreachable endpoint that ends in an i/o timeout error.

The 'access log' is the gorouter log that includes the URL, response code, etc. and is logged at the end of the negroni middleware chain.

Proposal

We propose to:

  1. capture the time taken on a particular attempt and log it in the retry error log
  2. log the cumulative time spent on unsuccessful attempts, and the number of attempts it took to complete the request in the 'access log'
  3. capture the dial time as part of the Dialer to collect metrics, which would allow allow finding the 'worst case scenario' for currently deployed backends and adjust a lower endpoint_dial_timeout accordingly.
  4. log the dial time for each attempt.

The result would be an access log that contains:

  • response_time -> the complete time the request has taken from arriving at gorouter to responding to the client (unchanged)
  • gorouter_time -> the time gorouter required for internal processing, including route lookup, etc.
  • retry_time -> the time gorouter spent on unsuccessful attempts that needed to be retried.
  • dial_time -> the dial_time of the successful response

The app response time for the successful response could then be computed as follows: response_time - gorouter_time - retry_time

The same logic could apply for failed requests, i.e. when none of the gorouter retries would lead to a successful backend response.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions