Additional information about scaling a service #178


Merged: 1 commit merged into docker:master on Oct 20, 2016
Conversation

mdlinville

Adding additional information as a follow-on to #148. Related to #105.

[scale](../reference/commandline/service_scale/) the service.
### Re-balancing a service after joining a new or previously failed node

When you add a new node to a swarm, or a node re-joins after it has been
Member

Would "reconnect" instead of "re-join" be an option?

Author

I'm not in love with that, because the verb in the CLI is 'join' rather than 'connect'. In fact, I should probably change the 'add' to 'join' for consistency. I could probably just change 're-join' to 'join'. WDYT?

Member

Hm, actually, that is the issue, because the node never "left" the swarm, it was only unreachable. When it becomes reachable again, it doesn't "join", it just erm, "dunnowhattocallit".

joining a swarm generates a new cryptographic identity, which isn't the case here.


Contributor

@thaJeztah is correct. "reconnect" sounds right to me. Otherwise, "...or a node returns..."?

Contributor

reconnect or re-register, but I am not sure if we use that terminology consistently.

unavailable, the new node does not automatically get a workload if the service
is already running at the desired scale. Notably, when a failed node recovers
and re-joins a swarm, the workloads it was previously running have been
reassigned to other nodes, and it does not automatically take them back.
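For illustration (these commands and names are not part of the PR; `web` and `worker2` are hypothetical), the behavior described above can be observed like this:

```shell
# Hypothetical 3-node swarm running service 'web' at scale 3.
# Suppose node 'worker2' fails and later recovers.

docker node ls         # 'worker2' shows Ready/Active again after recovery
docker service ps web  # its former tasks are listed as Shutdown on 'worker2';
                       # the replacement tasks run on the surviving nodes and
                       # are not moved back automatically
```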
Contributor

Might be good to include the reasoning behind why this is the case, as I've found this behavior confuses people.

@aluzzardi can give you more, but we don't place workload on the new server, to prevent healthy services from being interrupted and to avoid dog-piling newly joined nodes.

Contributor

SwarmKit follows a pretty simple rule: no healthy container is ever disrupted unless it absolutely must be.

- A machine goes down? We move the containers to other machines; they were down anyway.
- A container crashes? We move it to another machine.

However - a new machine comes up? There's no reason for SwarmKit to kill a perfectly fine production mysql container and move it to this new machine just for the sake of rebalancing.

If that container crashes on its own, then SwarmKit will consider redeploying it to the brand new machine in order to rebalance the cluster.

In all of the examples above, SwarmKit has never caused disruption to healthy containers.

In the future, we are planning to provide a flag to users so they can signal that a service can be Preempted - that is, killed even if perfectly healthy in order to make room for another service or to rebalance the cluster. We'll never do this without user permission though.

/cc @aaronlehmann

@mdlinville
Author

Thanks all, this is fantastic info and I will add it to the doc here.


@mdlinville
Author

OK, I tried to capture the feedback given by @aaronlehmann, @stevvooe, and @aluzzardi. PTAL, thanks!

@stevvooe
Contributor

LGTM


If you are concerned about an even balance of load and don't mind disrupting
running tasks, you can force your swarm to re-balance by temporarily scaling
the service upward.
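A sketch of the temporary scale-up trick described above, assuming a hypothetical service `web` that normally runs at 3 replicas:

```shell
# Briefly over-scale so the scheduler places new tasks on the
# under-loaded node, then scale back down. Note: this disrupts
# running tasks, and the scheduler chooses which replicas to remove
# on the way back down, so the result is only a rough rebalance.
docker service scale web=6
docker service scale web=3
```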
Member

Note that this PR, if accepted, would probably allow rebalancing without having to change the scale: moby/swarmkit#1664
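For context: that SwarmKit proposal was later merged, and it surfaces in the Docker CLI as the `--force` flag on `docker service update` (Docker 1.13 and later), which redeploys a service's tasks without changing its scale:

```shell
# Redeploy all tasks of the (hypothetical) service 'web' without
# changing its replica count; tasks may land on newly joined nodes.
docker service update --force web
```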

Author

But I can't talk about it until / unless it is. :)

Member

I know; it was just a heads-up 😄


See also
[`docker service scale`](../reference/commandline/service_scale/) and
[`docker service ps`](../reference/commandline/service_ps/).
Member

Oh! Not introduced in this change, but should these links point to the .md file, so that they will work both on GitHub and on docs.docker.com?

Author

Done, and also took the file portion out of the in-file links earlier up in the file.

@mdlinville
Author

With the +1 from @stevvooe I'm going to merge.

@mdlinville mdlinville merged commit 9540159 into docker:master Oct 20, 2016
@mdlinville mdlinville deleted the swarm_scale_clarifications branch October 20, 2016 18:05
joaofnfernandes pushed a commit to joaofnfernandes/docker.github.io that referenced this pull request Aug 16, 2017

* Update workflow and add screenshots
* Add screenshots