Additional information about scaling a service #178


Merged: 1 commit merged into docker:master on Oct 20, 2016
Conversation

mdlinville

Adding additional information as a follow-on to #148. Related to #105.

[scale](../reference/commandline/service_scale/) the service.
### Re-balancing a service after joining a new or previously failed node

When you add a new node to a swarm, or a node re-joins after it has been
Member

Would "reconnect" instead of "re-join" be an option?

Author

I'm not in love with that, because the verb in the CLI is 'join' rather than 'connect'. In fact, I should probably change the 'add' to 'join' for consistency. I could probably just change 're-join' to 'join'. WDYT?

Member

Hm, actually, that is the issue, because the node never "left" the swarm, it was only unreachable. When it becomes reachable again, it doesn't "join", it just erm, "dunnowhattocallit".

joining a swarm generates a new cryptographic identity, which isn't the case here.


Contributor

@thaJeztah is correct. "reconnect" sounds right to me. Otherwise, "...or a node returns..."?

Contributor

reconnect or re-register, but I am not sure if we use that terminology consistently.

unavailable, the new node does not automatically get a workload if the service
is already running at the desired scale. Notably, when a failed node recovers
and re-joins a swarm, the workloads it was previously running have been
reassigned to other nodes, and it does not automatically take them back.
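For illustration (these commands and names are not part of the PR; `web` and `worker2` are hypothetical), the behavior described above can be observed like this:

```shell
# Hypothetical 3-node swarm running service 'web' at scale 3.
# Suppose node 'worker2' fails and later recovers.

docker node ls         # 'worker2' shows Ready/Active again after recovery
docker service ps web  # its former tasks are listed as Shutdown on 'worker2';
                       # the replacement tasks run on the surviving nodes and
                       # are not moved back automatically
```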
Contributor

Might be good to include the reasoning behind why this is the case, as I've found this behavior confuses people.

@aluzzardi can give you more, but we don't place workload on the new server, to prevent healthy services from being interrupted and to avoid dog-piling newly joined nodes.

Contributor

SwarmKit follows a pretty simple rule: no healthy container is ever disrupted unless it absolutely must be.

- A machine goes down? We move the containers to other machines; they were down anyway.
- A container crashes? We move it to another machine.

However - a new machine comes up? There's no reason for SwarmKit to kill a perfectly fine production mysql container and move it to this new machine just for the sake of rebalancing.

If that container crashes on its own, then SwarmKit will consider redeploying it to the brand new machine in order to rebalance the cluster.

In all of the examples above, SwarmKit has never caused disruption to healthy containers.

In the future, we are planning to provide a flag to users so they can signal that a service can be Preempted - that is, killed even if perfectly healthy in order to make room for another service or to rebalance the cluster. We'll never do this without user permission though.

/cc @aaronlehmann

@mdlinville
Author

Thanks all, this is fantastic info and I will add it to the doc here.


@mdlinville
Author

OK, I tried to capture the feedback given by @aaronlehmann, @stevvooe, and @aluzzardi. PTAL, thanks!

@stevvooe
Contributor

LGTM


If you are concerned about an even balance of load and don't mind disrupting
running tasks, you can force your swarm to re-balance by temporarily scaling
the service upward.
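A sketch of the temporary scale-up trick described above, assuming a hypothetical service `web` that normally runs at 3 replicas:

```shell
# Briefly over-scale so the scheduler places new tasks on the
# under-loaded node, then scale back down. Note: this disrupts
# running tasks, and the scheduler chooses which replicas to remove
# on the way back down, so the result is only a rough rebalance.
docker service scale web=6
docker service scale web=3
```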
Member

Note that this PR, if accepted, would probably allow rebalancing without having to change the scale: moby/swarmkit#1664
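For context: that SwarmKit proposal was later merged, and it surfaces in the Docker CLI as the `--force` flag on `docker service update` (Docker 1.13 and later), which redeploys a service's tasks without changing its scale:

```shell
# Redeploy all tasks of the (hypothetical) service 'web' without
# changing its replica count; tasks may land on newly joined nodes.
docker service update --force web
```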

Author

But I can't talk about it until / unless it is. :)

Member

I know; it was just a heads-up 😄


See also
[`docker service scale`](../reference/commandline/service_scale/) and
[`docker service ps`](../reference/commandline/service_ps/).
Member

Oh! Not introduced in this change, but should these links point to the .md file, so that they will work both on GitHub and on docs.docker.com?

Author

Done, and also took the file portion out of the in-file links earlier up in the file.

@mdlinville
Author

With the +1 from @stevvooe I'm going to merge.

@mdlinville mdlinville merged commit 9540159 into docker:master Oct 20, 2016
@mdlinville mdlinville deleted the swarm_scale_clarifications branch October 20, 2016 18:05
joaofnfernandes pushed a commit to joaofnfernandes/docker.github.io that referenced this pull request Aug 16, 2017

* Update workflow and add screenshots
* Add screenshots