CASSANDRA_SEEDS in Swarm Mode #94

Closed
william-kerr opened this issue Feb 1, 2017 · 19 comments
@william-kerr

william-kerr commented Feb 1, 2017

I would like to bring up a stack by running:

docker stack deploy --compose-file docker-compose-stack.yaml cassandra

docker-compose-stack.yaml contents:

version: '3'
services:
  cassandra:
    image: cassandra
    environment:
      - CASSANDRA_SEEDS=10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6
    ports:
      - 7000:7000
    deploy:
      mode: global

However, the Cassandra nodes don't know about each other unless I manually specify, in CASSANDRA_SEEDS, IP addresses that might stop being used, which is not ideal. How would I use a load-balanced IP for CASSANDRA_SEEDS? I tried using CASSANDRA_SEEDS=10.0.0.2 and CASSANDRA_SEEDS=cassandra, but neither worked. Also, how should I handle the situation where the load-balanced IP ends up pointing to the same Cassandra node instead of another one?

@yosifkit
Member

yosifkit commented Feb 1, 2017

According to the documentation, seeds have to be IP addresses. But you only need to give each new node one or more IPs of nodes already in the cluster. What I would do to scale is start three seed nodes that point to each other (maybe with a bash script that resolves their DNS names to the set of IP addresses on startup and fills the seeds environment variable). Then, to scale out the rest, I would point them at those IP addresses within that overlay network (or use the same script) and constrain them to nodes that aren't running the seed nodes.
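
A rough sketch of such a startup script, assuming three hypothetical seed services named cassandra-seed-1/2/3 on the same overlay network (the service names and the tasks.<service> lookup are illustrative, not from this comment):

#!/bin/bash
# Sketch only: resolve the other seed services' names to IP addresses on the
# overlay network and hand them to the entrypoint as CASSANDRA_SEEDS.
# "tasks.<service>" resolves to the task IPs rather than the service VIP.
SEED_HOSTS="cassandra-seed-1 cassandra-seed-2 cassandra-seed-3"   # hypothetical service names
SEED_IPS=""
for h in $SEED_HOSTS; do
  ip=$(getent hosts "tasks.$h" | awk '{print $1}' | head -n1)
  [ -n "$ip" ] && SEED_IPS="${SEED_IPS:+$SEED_IPS,}$ip"
done
export CASSANDRA_SEEDS="$SEED_IPS"
exec /docker-entrypoint.sh cassandra -f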

I would not recommend using a load balanced IP for the internal connections between Cassandra nodes; the seeds list should be the IP addresses that the nodes advertise via broadcast address.

For more info: http://stackoverflow.com/a/32183684

@william-kerr
Author

https://apache.googlesource.com/cassandra/+/cassandra-3.9/src/java/org/apache/cassandra/locator/SimpleSeedProvider.java

seeds.add(InetAddress.getByName(host.trim()));

InetAddress.getByName() returns the IP address of a host's name. If an IP is passed in, "only the validity of the address format is checked".

So, a Docker Swarm Mode service name like cassandra should work. I really feel like this would be the correct way to approach this problem. Using a service name like cassandra that resolves to a load-balancing IP would make for an elegant seed. As long as we retry if the seed points to itself, this would be great!

@yosifkit
Member

yosifkit commented Feb 3, 2017

I guess all their documentation about seeds needing to be IP addresses is wrong (that same function is used even back in 1.2). As nice as using a load-balanced hostname for seeds would be, it is too prone to race conditions. How does the first node know when to stop resolving that hostname while trying to get an IP address that is not itself? What if you start 3 at once on 3 different machines and they all decide that they are the first node and thus their own seed?

This would either require a change upstream in cassandra itself or something in the entrypoint script. If we implement something for this in the entrypoint script, would we resolve the hostname and only pass an IP to cassandra? Having an IP that is not us does not even guarantee that the other node is a valid seed node. This really seems better suited to something with full service discovery, so that each node can register whether it has just started or is already part of the cluster. I know consul is often mentioned for this purpose.

@Richard-Mathie

You can get a list of nodes from nslookup or getent hosts tasks.cassandra.

You would then need to remove the output of hostname -i (the node's own IP) from that list of IP addresses to prevent the node from bootstrapping off itself and forming an orphaned cluster.
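
A minimal version of that lookup might look like this (assuming the service is named cassandra and the container has a single IP):

SELF=$(hostname -i)
CASSANDRA_SEEDS=$(getent hosts tasks.cassandra | awk '{print $1}' | grep -v "^${SELF}$" | paste -d, -s -)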

I have attempted to add to the scripting so that you can do that here: https://github.com/amey-sam/cassandra

in the same fashion that wurstmeister/kafka does this:

docker service create --name cassandra \
  --mode global \
  --network my_swarm_net \
  --constraint 'node.labels.network == private' \
  -e "SEEDS_COMMAND=getent hosts tasks.cassandra.my_swarm_net | awk '{print \$1}'  | paste -d, -s -" \
  -e 'CASSANDRA_SEEDS=auto' \
  -e 'CASSANDRA_BROADCAST_ADDRESS=auto' \
  -e 'CASSANDRA_LISTEN_ADDRESS=auto' \
  --publish '9042:9042' \
  --mount type=bind,dst=/var/lib/cassandra,src=/home/cassandra \
  webscam/cassandra:3.9

The problem with the above is that the node's own IP isn't removed from the seeds, so auto-bootstrapping does not work for added nodes (i.e. failover and scaling :( ). Also, I am not sure what happens if you remove the output of 'hostname -i' (or whatever is appropriate) from the seed IP list when bootstrapping a cluster. Presumably there wouldn't be any seeds?

At the end of the day I think the Cassandra model for doing this is a bit broken. Why have special seed nodes which become a point of failure? That is, failover and scaling of those nodes are not automatic, which is a bit of a letdown given the hype about how fantastic and easy to use Cassandra is supposed to be... (read: it's great, except for dev ops, buy our support... please).

@flybyray

flybyray commented Jun 29, 2017

@Richard-Mathie thx for your brilliant input.
I managed to get it to work with the original image using this script:

nrOfTasks=`getent hosts tasks.cassandra | wc -l` ; many=`getent hosts tasks.cassandra | awk '{print $1}' | sed "/$(hostname --ip-address)/d" | paste -d, -s -` ; printf '%s' $( [ ${nrOfTasks} -gt 1 ] && echo ${many} || echo "$(hostname --ip-address)" )
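
Unrolled for readability, that one-liner does roughly the following:

# Same logic as the one-liner above, spelled out step by step:
nrOfTasks=$(getent hosts tasks.cassandra | wc -l)   # number of task IPs the service currently has
self=$(hostname --ip-address)                       # this container's own IP
many=$(getent hosts tasks.cassandra | awk '{print $1}' | sed "/$self/d" | paste -d, -s -)

if [ "$nrOfTasks" -gt 1 ]; then
  CASSANDRA_SEEDS=$many   # other tasks exist: seed from them, excluding our own IP
else
  CASSANDRA_SEEDS=$self   # first (only) task: seed from ourselves so the cluster can form
fi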

A sample compose file for docker swarm would be:

version: "3.1"
services:
  cassandra:
    deploy:
      replicas: 1
      resources:
        limits:
          memory: 190M
        reservations:
          memory: 76M
    entrypoint:
    - "sh"
    - "-c"
    - export CASSANDRA_SEEDS=$$(nrOfTasks=`getent hosts tasks.cassandra | wc -l` ;
      many=`getent hosts tasks.cassandra | awk '{print $$1}' | sed "/$$(hostname --ip-address)/d"
      | paste -d, -s -` ; printf '%s' $$( [ $${nrOfTasks} -gt 1 ] && echo $${many} ||
      echo "$$(hostname --ip-address)" )) ; /docker-entrypoint.sh cassandra -f
    environment:
      HEAP_NEWSIZE: 12M
      MAX_HEAP_SIZE: 64M
    image: cassandra
    networks:
      backend:

networks:
  backend:

I just start my cluster with one replica:

docker stack deploy -c docker-compose.cassandra.yml mycluster

Then you can scale:

docker service scale mycluster_cassandra=2

I did not test heavy scaling all at once, just one node at a time after checking nodetool status.
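
For example, a small helper (hypothetical names, assuming a task of the service runs on the host where this is executed) could scale one replica at a time and wait until every node reports Up/Normal:

# Hypothetical scale-one-at-a-time helper.
for n in 2 3 4; do
  docker service scale mycluster_cassandra=$n
  cid=$(docker ps -q -f name=mycluster_cassandra | head -n1)   # any local task of the service
  until [ "$(docker exec "$cid" nodetool status | grep -c '^UN')" -eq "$n" ]; do
    sleep 10   # wait until all $n nodes are Up/Normal before scaling further
  done
done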

I hope the official image will eventually include Swarm support, even with downscaling.

@JnMik

JnMik commented Jul 5, 2017

@flybyray Thanks for pointing out the "tasks.service_name" lookup to get the container IPs inside a service, I wasn't aware of that.
You're an illuminati!
Much appreciated
xD

@JnMik

JnMik commented Jul 5, 2017

Here's how I dealt with dynamic Swarm Cassandra seeds.

I have a custom boot-node.sh which is used when Cassandra boots:

#!/bin/bash

echo Pausing for 5s, give some time for the docker service to be reachable in case we are the first replica
sleep 5

echo LAUNCH NODETOOL REPAIR IN BACKGROUND, SCRIPT WILL WAIT FOR CASSANDRA TO BE FULLY BOOTED
nohup sh etc/cassandra/node-repair-after-full-boot.sh > /dev/stdout 2>&1 &

serviceAddress=$(getent hosts gateway_cluster_cassandra_peers | awk '{print $1}')
cassandraBroadcastAddress=$(ip a | grep inet | grep eth0 | grep -v $serviceAddress | awk '{print $2}' | sed 's|/16||g')
export CASSANDRA_BROADCAST_ADDRESS=$cassandraBroadcastAddress
echo Broadcast address will be $CASSANDRA_BROADCAST_ADDRESS

# method to implode different cassandra nodes in a string
function join_by { local IFS="$1"; shift; echo "$*"; }
export CASSANDRA_SEEDS=$(join_by , $(getent hosts tasks.gateway_cluster_cassandra_peers | awk '{print $1}'))

echo Cassandra seeds will be $CASSANDRA_SEEDS

echo LAUNCH CASSANDRA
/docker-entrypoint.sh cassandra -f

Also, you will have to deal with containers scaling down and up: when they scale up with the same IP as before, they exit with an error because they need -Dcassandra.replace_address while booting to take back their seat inside the cluster. I don't know why Cassandra forces a manual intervention here. However, I found a workaround by adding some code at the end of cassandra-env.sh:


# Get current broadcast address
broadcastAddress=$(getent hosts $CASSANDRA_BROADCAST_ADDRESS | awk '{print $1}')

seeds=$(echo $CASSANDRA_SEEDS | tr "," "\n")

for seed in $seeds
do
    echo Trying to reach $seed

    ping -c 1 $seed >/dev/null 2>/dev/null
    PingResult=$?

    if [ "$PingResult" -eq 0 ]; then
        if [ $CASSANDRA_BROADCAST_ADDRESS = $seed ];
        then
            echo Current node match seed to evaluate, skip !
            continue
        fi

        echo $seed found, connecting to database to check if current node needs --replace_address

        # Connect to seed to investigate node status
        QUERY_RESPONSE=$(cqlsh $seed -e "select peer, host_id, rpc_address from system.peers where peer='$broadcastAddress';")
        echo $QUERY_RESPONSE

        NODE_FOUND=`echo $QUERY_RESPONSE | grep -c "1 rows"`

        if [ $NODE_FOUND = 0 ]; then
            echo Current node IP NOT FOUND in cluster, node will bootstrap and join normally
        else
            echo Current node ip FOUND in cluster, node will bootstrap with replace_address option and then join the cluster
            JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=$broadcastAddress"

        fi

        break

    elif [ "$PingResult" -eq 1 ]; then
        echo $seed not reachable, NEXT
    elif [ "$PingResult" -eq 2 ]; then
        echo $seed not reachable, service not activated yet, NEXT
    else
        echo Unknown status, NEXT
    fi

done

I still have one scenario to cover: when containers fail and rejoin the cluster with another IP, they leave a dead row inside the database. I need something to clean up the old dead containers' IPs from system.peers. Has anyone written something for this? If not, I'll probably try some stuff soon.
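
A rough, untested sketch of such a cleanup would be to run it from a live node and drop every peer that nodetool reports as Down/Normal via nodetool removenode, rather than editing system.peers directly:

#!/bin/bash
# Untested sketch: remove every dead ("DN") peer by its Host ID.
# Assumes the default `nodetool status` column layout, where the Host ID is column 7.
nodetool status | awk '/^DN/ {print $7}' | while read -r host_id; do
  echo "Removing dead node $host_id"
  nodetool removenode "$host_id"
done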

cheers

@flybyray

flybyray commented Jul 5, 2017

@JnMik credits go to @Richard-Mathie; he uses

DNS with a special query <tasks.SERVICE-NAME> to find the IP addresses

as documented in version 1.13 of the Docker docs: https://docs.docker.com/v1.13/engine/swarm/networking/

I don't know why that documentation was removed in later versions:
https://docs.docker.com/engine/swarm/networking/

Maybe a merge result from https://docs.docker.com/hackathon/

@Richard-Mathie

I put this all together here: https://github.com/amey-sam/cassandra/tree/auto_scale
There is a docker image if you want to test: webscam/cassandra:swarm_test

Run:

docker service create -d \
  --name cassandra \
  --network mercury \
  -e HEAP_NEWSIZE=12M \
  -e MAX_HEAP_SIZE=64M \
  webscam/cassandra:swarm_test

docker service scale cassandra=2

etc.

It doesn't seem to like it if you scale by more than one node at a time (Cassandra complains if you try to join while other nodes are bootstrapping). Though you will eventually get to a stable state, there will be a lot of containers failing and restarting.

Also the cluster does not seem to like it if you lose the first seed node.

Thanks @flybyray and @JnMik for the help. If you want to be added to that repo, give us a shout.

@Richard-Mathie

FYI, when downscaling the cluster you may have to call nodetool removenode:

http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsRemoveNode.html

It's pretty manual. I guess when it matters, you should just pay DataStax or Instaclustr to manage your DB as a service and forget about all this pain.
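
For reference, the manual step looks roughly like this (the Host ID comes from the nodetool status output):

nodetool status                   # note the Host ID of the node that was scaled away
nodetool removenode "$HOST_ID"    # remove it from the ring; HOST_ID is a placeholder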

@blop

blop commented Aug 9, 2017

You got a typo in https://github.com/amey-sam/cassandra/blob/auto_scale/docker-entrypoint.sh

: ${SERVICE_NAME='cassadra'}

@blop

blop commented Aug 9, 2017

Also, not sure this is right:

  if [ -n "${CASSANDRA_NAME:+1}" ]; then
    : ${CASSANDRA_SEEDS="cassandra"}
  fi

What should CASSANDRA_NAME contain?

@Richard-Mathie

@blop CASSANDRA_NAME comes from ebbf163

so I think it must have been a thing when using links etc.; perhaps we can remove it?

Thanks for spotting the typo; I'm a bit confused now how this even worked with that...
Ah, it's hardcoded in the docker image still.

@blop

blop commented Aug 15, 2017

I tried your image using a simple stack like this:

version: '3.3'

services:
    cassandra:
        image: webscam/cassandra:swarm_test
        deploy:
            mode: replicated
            replicas: 3

and deploying it inside a swarm using:
docker stack deploy tk --compose-file cassandra.yml

I have this image:
webscam/cassandra swarm_test a0b4f14b1b73 3 days ago 400MB

It seems something's not working properly; it keeps respawning new containers and the cluster never comes up properly. Did you try that?

@man4j

man4j commented Oct 13, 2017

What about the case when I reboot a Docker host and Swarm recreates the container with a different IP address? nodetool shows the new Cassandra node, and I remove the old Cassandra node manually. I tried to automate it with -Dcassandra.replace_address=OLD_IP_ADDRESS, but Cassandra says: "Cannot replace address with a node that is already bootstrapped". What I mean is that the container is recreated on a new host, the previous data volume no longer exists, and Cassandra generates a new hostId. If the data volume still exists and the hostId has not changed, changing the IP address is not a problem even without replace_address. Please help!

@Richard-Mathie

@blop

You would have to set CASSANDRA_NAME to <project_name>_cassandra, as docker-compose sticks the working directory or the COMPOSE_PROJECT_NAME env var in front of all service host names.

Maybe you can set the hostname of the service with docker-compose? I don't know, as I stopped using it.

@Danissss

@flybyray Hi, I tried your yml file, and it successfully created two containers on different instances. However, I couldn't get them to see each other: when I run nodetool status on each of the containers, I can only see one node:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.1.7  103.69 KiB  256     100.0%            84961df9-ff31-4ce4-ac65-9cd1949096d0  rack1

Could you give me some suggestions?

@doruchiulan

There is an issue with CASSANDRA_LISTEN_ADDRESS: port 7000 cannot bind to 0.0.0.0, so I am having issues connecting to Cassandra on port 7000 from another container in an overlay network.

I will open an issue with steps to reproduce.

@tianon
Member

tianon commented Oct 3, 2018

Closing given that there are now several workarounds documented in this thread, in addition to this being a fundamental issue with Cassandra itself (not an issue with how we're packaging it) -- setting up a cluster automatically is always going to be somewhat fragile, and is out of scope for what this image provides (which is an attempt at providing a faithful "upstream" Cassandra experience -- warts and all). Building something like that on top of this image remains the best solution we can offer. 👍
