
Ringpop issues with docker swarm mode #442

@tiagoad


Expected Behavior

I start the following stack on Docker Swarm, and Temporal "just works":

version: '3.5'

services:
  mysql:
    image: mysql:5.7

    networks:
      - temporal

    volumes:
      - mysql:/var/lib/mysql

    environment:
      MYSQL_ROOT_PASSWORD: "${TEMPORAL_MYSQL_PASSWORD}"

    deploy:
      replicas: 1

  temporal:
    image: temporalio/auto-setup:0.23.1
    hostname: temporal_engine

    ports:
      - 7233:7233

    networks:
      - temporal

    environment:
      DB: mysql
      MYSQL_USER: root
      MYSQL_PWD: "${TEMPORAL_MYSQL_PASSWORD}"
      MYSQL_SEEDS: mysql
      STATSD_ENDPOINT: "telegraf:8125"
      DYNAMIC_CONFIG_FILE_PATH: config/dynamicconfig/development.yaml

    deploy:
      replicas: 1

Actual Behavior

Temporal fails with the following error:

{"level":"fatal","ts":"2020-06-10T11:47:29.797Z","msg":"unable to resolve broadcast address","service":"history","error":"broadcastAddress required when listening on all interfaces (0.0.0.0/[::])","logging-call-at":"rpMonitor.go:104","stacktrace":"github.com/temporalio/temporal/common/log/loggerimpl.(*loggerImpl).Fatal\n\t/temporal/common/log/loggerimpl/logger.go:144\ngithub.com/temporalio/temporal/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:104\ngithub.com/temporalio/temporal/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:370\ngithub.com/temporalio/temporal/service/history.(*Service).Start\n\t/temporal/service/history/service.go:494\ngithub.com/temporalio/temporal/cmd/server/temporal.execute\n\t/temporal/cmd/server/temporal/server.go:356"}

I then add the following to the temporal service in the stack, and try again:

command: "/bin/bash -c 'export TEMPORAL_BROADCAST_ADDRESS=$$(hostname -i) && export BIND_ON_IP=0.0.0.0 && /start.sh'"
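For reference, a minimal sketch of where that line would sit in the stack file. The placement directly under the `temporal` service is an assumption (the issue only says the line was added to the stack); the other keys are copied from the stack above and the rest of the service definition is omitted.

```yaml
services:
  temporal:
    image: temporalio/auto-setup:0.23.1
    hostname: temporal_engine

    # Assumed placement of the override from this issue.
    # "$$" is Compose's escape for a literal "$", so the shell inside the
    # container (not Compose variable interpolation) evaluates $(hostname -i).
    command: "/bin/bash -c 'export TEMPORAL_BROADCAST_ADDRESS=$$(hostname -i) && export BIND_ON_IP=0.0.0.0 && /start.sh'"
```

With this in place the broadcast address is computed at container start, so it follows whatever IP the container is assigned on that start.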

The container starts up and correctly sets up the fresh MySQL database. It then runs properly for a few minutes, hours, or days. After I restart it (or let it run for a long time), I get the following errors:

{"level":"error","ts":"2020-06-10T11:51:50.826Z","msg":"Internal service error","service":"frontend","error":"Not enough hosts to serve the request","logging-call-at":"workflowHandler.go:3348","stacktrace":"github.com/temporalio/temporal/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngithub.com/temporalio/temporal/service/frontend.(*WorkflowHandler).error\n\t/temporal/service/frontend/workflowHandler.go:3348\ngithub.com/temporalio/temporal/service/frontend.(*WorkflowHandler).StartWorkflowExecution\n\t/temporal/service/frontend/workflowHandler.go:494\ngithub.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:1114\ngithub.com/temporalio/temporal/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:116\ngithub.com/temporalio/temporal/service/frontend.(*DCRedirectionHandlerImpl).StartWorkflowExecution\n\t/temporal/service/frontend/dcRedirectionHandler.go:1110\ngithub.com/temporalio/temporal/service/frontend.(*AccessControlledWorkflowHandler).StartWorkflowExecution\n\t/temporal/service/frontend/accessControlledHandler.go:702\ngithub.com/temporalio/temporal/service/frontend.(*WorkflowNilCheckHandler).StartWorkflowExecution\n\t/temporal/service/frontend/workflowNilCheckHandler.go:112\ngo.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler.func1\n\t/go/pkg/mod/go.temporal.io/[email protected]/workflowservice/service.pb.go:1015\ngithub.com/temporalio/temporal/service/frontend.interceptor\n\t/temporal/service/frontend/service.go:316\ngo.temporal.io/temporal-proto/workflowservice._WorkflowService_StartWorkflowExecution_Handler\n\t/go/pkg/mod/go.temporal.io/[email protected]/workflowservice/service.pb.go:1017\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1082\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1405\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:746"}
{"level":"fatal","ts":"2020-06-10T11:51:53.700Z","msg":"unable to bootstrap ringpop","service":"history","error":"join duration of 49.070856456s exceeded max 30s","logging-call-at":"ringpop.go:82","stacktrace":"github.com/temporalio/temporal/common/log/loggerimpl.(*loggerImpl).Fatal\n\t/temporal/common/log/loggerimpl/logger.go:144\ngithub.com/temporalio/temporal/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:82\ngithub.com/temporalio/temporal/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:115\ngithub.com/temporalio/temporal/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:370\ngithub.com/temporalio/temporal/service/history.(*Service).Start\n\t/temporal/service/history/service.go:494\ngithub.com/temporalio/temporal/cmd/server/temporal.execute\n\t/temporal/cmd/server/temporal/server.go:356"}

I also get a lot of these (with other operations):

Error: Operation DescribeNamespace failed.
Error Details: context deadline exceeded

I also get more and more rows in the cluster_membership table every time the engine restarts, and the row count never stabilizes.
I think this is partly because the Docker containers get a random IP address on every start, but I don't think that explains it fully.

Here's the full first boot log: https://gist.github.com/tiagoad/51534f3b08b1407610dc50eb9fb166f0
Here's the full restart log: https://gist.github.com/tiagoad/47a958801c769934bb3a4c2ed3b4f2ad

Specifications

  • Version: 0.23.1
  • Platform: Linux x64

EDIT: I am aware I posted my mysql password. I have changed it already.
