Refresh token flow breaks during DB server downtime #1928

@struhtanov

Describe the bug

Sometimes I start receiving "The refresh token has not been found: : not_found" during the refresh token flow. After that I can no longer refresh the token, and the end user has to go through the OAuth flow again.

Reproducing the bug

I haven't spent enough time to reproduce it on a local setup; my first attempts failed. Even on the production setup it doesn't reproduce every time, so I'll describe the steps on the production setup instead.

1.) I have a client that refreshes the OAuth token indefinitely. Here is a Node.js snippet of an Express app that handles the OAuth flow and then starts the endless token refresh loop:

const args = require("./args");
const express = require("express");
const ClientOAuth2 = require("client-oauth2");

const fiberyAuth = new ClientOAuth2({
  clientId: args.authClientId,
  clientSecret: args.authClientSecret,
  accessTokenUri: `${args.authUrl}/oauth2/token`,
  authorizationUri: `${args.authUrl}/oauth2/auth`,
  redirectUri: args.redirectUri,
  scopes: ["openid", "offline"],
  state: "asdfasfdsafd",
});

const app = express();

app.get("/", function (req, res, next) {
  var uri = fiberyAuth.code.getUri();
  res.redirect(uri)
});

// logger, sendToSlack and getErrorMeta are app-specific helpers defined elsewhere in the project.
const refresh = async (token) => {
  const oldToken = fiberyAuth.createToken(token.data.access_token, token.data.refresh_token);
  const time = Date.now();
  try {
    const refreshed = await oldToken.refresh();
    logger.info(`Token refreshed.`, {elapsedTime: Date.now() - time});
    return refreshed;
  } catch (e) {
    sendToSlack(`Token was not refreshed`, getErrorMeta(e));
    logger.error("Token was not refreshed", {...getErrorMeta(e), oldToken: oldToken.accessToken});
    // Fall back to the old token so the refresh loop keeps running.
    return oldToken;
  }
};

const repeatRefresh = async (token) => {
  const newToken = await refresh(token);
  setTimeout(() => repeatRefresh(newToken), args.refreshInterval);
};

app.get("/callback", async function (req, res, next) {
  try {
    const token = await fiberyAuth.code.getToken(args.redirectUri.includes("localhost") ? req.originalUrl : `/api/authCheck${req.originalUrl}`);
    logger.info(token);
    repeatRefresh(token);

    res.sendStatus(200);
  } catch(e){
    res.sendStatus(401);
    logger.info("Failed to get token", getErrorMeta(e));
  }
});

module.exports = {app};

2.) In my Kubernetes cluster I run Hydra v1.4.5-alpine, deployed with the Helm chart of the same version (though I don't believe the chart matters). replicaCount is set to 3 and PostgreSQL is used as the database.

3.) At some point during the day my PostgreSQL database experiences downtime, which is related to the current cloud provider.

4.) After that, sometimes my client cannot refresh the token any more. Here is part of the Hydra logs:

{"debug":"failed to connect to `host=stolon-proxy-service.postgres.svc.cluster.local user=hydra database=hydra`: dial error (dial tcp 100.75.226.174:5432: operation was canceled)","description":"Client authentication failed (e.g., unknown client, no client authentication included, or unsupported authentication method)","error":"invalid_client","level":"error","msg":"An error occurred","time":"2020-06-25T09:04:36Z"}
{"debug":"failed to connect to `host=stolon-proxy-service.postgres.svc.cluster.local user=hydra database=hydra`: dial error (dial tcp 100.75.226.174:5432: operation was canceled)","description":"Client authentication failed (e.g., unknown client, no client authentication included, or unsupported authentication method)","error":"invalid_client","level":"error","msg":"An error occurred","time":"2020-06-25T09:04:40Z"}
{"error":"context canceled","level":"error","msg":"An error occurred","time":"2020-06-25T09:08:54Z"}
{"debug":"The refresh token has not been found: : not_found","description":"The provided authorization grant (e.g., authorization code, resource owner credentials) or refresh token is invalid, expired, revoked, does not match the redirection URI used in the authorization request, or was issued to another client","error":"invalid_grant","level":"error","msg":"An error occurred","time":"2020-06-25T09:08:54Z"}

I removed log records with similar content to keep the posted logs short.
After 09:08:54 every attempt to refresh the token gets the same not_found error.
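
For reference, client-oauth2's refresh() boils down to a standard OAuth2 refresh_token grant against Hydra's public token endpoint. Below is a minimal sketch of the equivalent raw request; it assumes a fetch implementation is available (global fetch in newer Node versions or node-fetch), the same args object as in the snippet above, and HTTP Basic client authentication, which I believe client-oauth2 uses by default:

const refreshRaw = async (refreshToken) => {
  // Equivalent of oldToken.refresh(): POST the refresh_token grant to Hydra's
  // public token endpoint, authenticating the client via HTTP Basic.
  const res = await fetch(`${args.authUrl}/oauth2/token`, {
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      Authorization: "Basic " + Buffer.from(`${args.authClientId}:${args.authClientSecret}`).toString("base64"),
    },
    body: new URLSearchParams({
      grant_type: "refresh_token",
      refresh_token: refreshToken,
    }).toString(),
  });
  // After the incident this always comes back with error=invalid_grant
  // ("The refresh token has not been found"), no matter how often it is retried.
  return res.json();
};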

Server configuration

# Number of ORY Hydra members
replicaCount: 3

image:
  # ORY Hydra image
  repository: oryd/hydra
  # ORY Hydra version
  tag: v1.4.5-alpine
  # Image pull policy
  pullPolicy: IfNotPresent

# Image pull secrets
imagePullSecrets: []
# Chart name override
nameOverride: ""
# Full chart name override
fullnameOverride: ""

# Configures the Kubernetes service
service:
  # Configures the Kubernetes service for the proxy port.
  public:
    # En-/disable the service
    enabled: true
    # The service type
    type: ClusterIP
    # The service port
    port: 4444
    # If you do want to specify annotations, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'annotations:'.
    annotations:
      kong/request-host: ""
      kong/request-path: "/"
      kong/preserve-host: "true"
      kong/strip-request-path: "true"

    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  # Configures the Kubernetes service for the api port.
  admin:
    # En-/disable the service
    enabled: true
    # The service type
    type: ClusterIP
    # The service port
    port: 4445
    # If you do want to specify annotations, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'annotations:'.
    annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"

# Configure ingress
ingress:
  # Configure ingress for the proxy port.
  public:
    # En-/Disable the proxy ingress.
    enabled: false

  admin:
    # En-/Disable the api ingress.
    enabled: false

# Configure ORY Hydra itself
hydra:
  # The ORY Hydra configuration. For a full list of available settings, check:
  #   https://github.com/ory/hydra/blob/master/docs/config.yaml
  config:
    dsn: "postgres://{{ .Secrets.HydraPgUser }}:{{ .Secrets.HydraPgPassword }}@{{ .Secrets.PgHostname }}:5432/{{ .Secrets.HydraPgDatabase }}?sslmode=disable"
    log:
      level: "error"
      format: "json"
    serve:
      public:
        port: 4444
      admin:
        port: 4445
      tls:
        allow_termination_from:
          - 0.0.0.0/0
    secrets:
      system: "{{ .Secrets.HydraSystemSecret }}"
      cookie: ""
    urls:
      self:
        issuer: "https://auth.fibery.io"
      login: "https://fibery.io/oauth-login"
      consent: "https://fibery.io/oauth-consent"
    ttl:
      access_token: 1h
    strategies:
      access_token: "jwt"

  autoMigrate: true
  dangerousForceHttp: true
  dangerousAllowInsecureRedirectUrls: false

deployment:
  resources:
  #  We usually recommend not to specify default resources and to leave this as a conscious
  #  choice for the user. This also increases chances charts run on environments with little
  #  resources, such as Minikube. If you do want to specify resources, uncomment the following
  #  lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    limits:
      cpu: 500m
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 128Mi

  labels: {}


  annotations: {}


  # Node labels for pod assignment.
  nodeSelector: {}


  # Configure node tolerations.
  tolerations: []

# Configure node affinity
affinity: {}

# Configures controller setup
maester:
  enabled: false
 

Expected behavior

The refresh flow does not break.

Environment

I believe it is described in the Helm chart above.

Additional context

I was able to reproduce the issue locally on an older version of Hydra, but didn't find precise steps. What I tried was turning the DB on/off and restarting the Hydra container while the endless token refresh was running. After upgrading to the latest Hydra version I was no longer able to reproduce the issue locally and hoped everything would go well, but it didn't.

I fully understand that the details described above are not precise and may not be enough; I will probably try to find exact steps. My only alternative is to move to some other OAuth server, which is of course time-consuming.
Maybe someone can suggest a workaround for this issue. At the moment I have several client apps, like Zapier and Slack integrations, and customers are not happy about unexpectedly losing their auth info.
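
One client-side mitigation I'm considering (only a sketch, it does not fix the server-side loss of the refresh token): distinguish transient failures (network errors, DB downtime behind Hydra) from a definitive invalid_grant, retry the former with backoff while keeping the same not-yet-rotated refresh token, and only force re-authentication on the latter. The sketch assumes the logger/getErrorMeta helpers from the snippet above; isTransient is a hypothetical check whose exact shape depends on how client-oauth2 exposes the OAuth error body:

// Sketch of a client-side mitigation, not a fix for the server-side issue.
// isTransient is a hypothetical predicate: how the OAuth error body is exposed
// depends on the client library, so the check below is illustrative only.
const isTransient = (e) => !(e.body && e.body.error === "invalid_grant");

const refreshWithRetry = async (token, attempts = 5) => {
  const oldToken = fiberyAuth.createToken(token.data.access_token, token.data.refresh_token);
  for (let i = 0; i < attempts; i++) {
    try {
      return await oldToken.refresh();
    } catch (e) {
      if (!isTransient(e)) {
        // invalid_grant: the refresh token is gone on the server; retrying will not help
        // and the end user has to go through the OAuth flow again.
        logger.error("Refresh token rejected, re-authentication required", getErrorMeta(e));
        throw e;
      }
      // Transient failure: back off and retry with the same, still-unrotated refresh token.
      await new Promise((resolve) => setTimeout(resolve, 2 ** i * 1000));
    }
  }
  return oldToken;
};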

Labels

bug (Something is not working), package/persistence/sql (Affects a SQL statement, schema or component)
