
Support a "drain" mode for terminating pods and servers failing readiness checks#95

Merged
jcmoraisjr merged 4 commits into jcmoraisjr:master from brianloss:drain-support
Apr 9, 2018

Conversation

@brianloss
Contributor

When a pod starts terminating or its readiness check fails, leave the pod as a server
in the load balancer, but set its weight to 0 so that HAProxy does not send ordinary
traffic to it. Traffic using persistence, however, is still directed to the terminating
pod, which allows persistent requests to keep flowing during termination. Updates that
affect only the draining state are made through the stats socket rather than forcing a
full reload of HAProxy.

This PR includes the described change, as well as a few fixes for typos I found along the way and cleanup of some warnings IntelliJ was showing (e.g., initializing slices with an empty literal instead of nil, error messages starting with a capital letter).

I have a use case where pods have a long termination period and new requests carrying a cookie need to be directed to the terminating servers during this period. I was thinking this could be useful for other people too, but I can always keep it in a fork if you don't think it fits.
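The weight change described above goes over HAProxy's stats socket (runtime API) instead of a full reload. A minimal sketch of that interaction in Go, where the socket path and the backend/server names (`be_app`, `srv1`) are made-up examples, not the controller's actual code:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// drainCommand builds the runtime-API command that removes a server
// from ordinary load balancing (weight 0) while cookie persistence
// continues to route pinned requests to it.
func drainCommand(backend, server string) string {
	return fmt.Sprintf("set weight %s/%s 0\n", backend, server)
}

// sendCommand writes a single command to the HAProxy stats socket and
// returns the reply. The socket path is an assumption for this sketch.
func sendCommand(socketPath, cmd string) (string, error) {
	conn, err := net.DialTimeout("unix", socketPath, 2*time.Second)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(cmd)); err != nil {
		return "", err
	}
	buf := make([]byte, 512)
	n, _ := conn.Read(buf)
	return string(buf[:n]), nil
}

func main() {
	cmd := drainCommand("be_app", "srv1")
	fmt.Print(cmd) // the line that would be written to the socket
	// In a real controller this would be something like:
	//   sendCommand("/var/run/haproxy-stats.sock", cmd)
}
```

Because the command only touches runtime state, HAProxy keeps its existing connections and configuration; no reload, and no dropped persistent sessions.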

@jcmoraisjr
Owner

A very specific use case but indeed an interesting feature. I'll need a few more days to have a look at the approach.

@aiharos fyi.

@aiharos
Contributor

aiharos commented Feb 22, 2018

The stats_socket.go changes seem solid, and if draining is enabled the server weight is set in the configuration template so that seems good too. +1 for ignoring Ingress status changes, that also causes some needless reloads.

@jcmoraisjr
Owner

Hi, finally, thanks for supporting this ingress controller with this amazing piece of code! Sorry about the long delay, I was on vacation these last few weeks.

Now you are adding NotReady and Terminating pods to the list of available endpoints. How do you control whether these endpoints should or shouldn't be added to the backend servers when drain-support is false?

@jcmoraisjr
Owner

Hi, only two pending issues after a few tests:

  • check that the svc namespace and pod namespace are the same in GetTerminatingServicePods
  • only add draining pods to the backend list if draining is enabled - in our use case we need terminating pods removed from load balancing as soon as possible

Moreover, I pushed a commit by mistake that conflicts with your change. Please have a look and see if you can rebase on top of master.
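The second checklist item boils down to a small decision per pod. A sketch of that selection logic, where the function name, signature, and the weight of 100 are illustrative assumptions rather than the controller's actual API:

```go
package main

import "fmt"

// includePod decides whether a pod appears in the backend server list
// and at what weight. Healthy pods are balanced normally; not-ready or
// terminating pods are kept only as weight-0 (draining) servers, and
// only when drain-support is enabled.
func includePod(drainSupport, ready, terminating bool) (include bool, weight int) {
	if ready && !terminating {
		return true, 100 // healthy pod, normal balancing weight
	}
	if drainSupport {
		return true, 0 // kept in the backend, but drains: cookie traffic only
	}
	return false, 0 // drain-support off: drop the pod from the backend ASAP
}

func main() {
	inc, w := includePod(true, false, true)
	fmt.Println(inc, w) // true 0: terminating pod kept as a draining server
	inc, _ = includePod(false, false, true)
	fmt.Println(inc) // false: removed from balancing immediately
}
```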

@brianloss
Contributor Author

brianloss commented Mar 21, 2018 via email

* validate namespace match when pulling terminating service pods
* don't include not ready or terminating pods when DrainSupport is not active
@jcmoraisjr jcmoraisjr merged commit 2000ef7 into jcmoraisjr:master Apr 9, 2018
@jcmoraisjr
Owner

Thanks!

@dcowden

dcowden commented Apr 10, 2018

@brianloss Kudos for developing this-- we need it too, and we're planning on trying it out. Just wanted to say thanks and let you know someone else indeed used it!

@brianloss
Contributor Author

No problem! Hope it works OK for you -- I've tested it for my use case, but Go isn't my primary language and I'm sure there are some parts I could have done better.

@dcowden

dcowden commented Apr 10, 2018

I'm hoping to test today -- I'll let you know how testing goes. My scenario is a Java app (JSESSIONID) using sticky sessions to a 3-node deployment. My goal is to execute a rolling update of the deployment and have the nodes finish terminating when they have no more active sessions.

To pull this off, I also need to fail the health check when there are no more active sessions, which in Java is a JMX thing. You're not using Java by any chance, are you? If so, how did you loop the active session count into the health check?

@brianloss
Contributor Author

I'm using WildFly. Typically, you would initiate a shutdown of your app and it would stop accepting new connections. Existing connections (e.g., active sessions) are left alone by both HAProxy and Kubernetes, so they could finish on their own, and then there would be no need for this feature: terminating the pod removes it from the endpoints as soon as termination begins, so no new requests go to it while active connections finish up.

For my use case, there is a stateful API where users access a query by executing a create call to the server, followed by a number of next calls to retrieve pages of results. Since these are new requests that, for us, must go back to the same web server, I need the drain mode. By setting the weight of a server to 0 as soon as it is terminating, HAProxy won't send any new traffic to it; only traffic that uses cookie persistence will go there.
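The behavior described here can be seen in a plain HAProxy config fragment (the backend and server names below are made up). With `weight 0`, srv2 receives no new balanced traffic, but a request whose SERVERID cookie already points at srv2 is still routed there:

```
backend be_app
    balance roundrobin
    cookie SERVERID insert indirect nocache
    server srv1 10.0.0.1:8080 cookie srv1 weight 100 check
    # draining: excluded from load balancing, still honors its cookie
    server srv2 10.0.0.2:8080 cookie srv2 weight 0 check
```

The PR applies the weight change at runtime through the stats socket, so the drained state takes effect without a reload.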

@dcowden

dcowden commented Apr 10, 2018

Yep, that's almost exactly our use case. How did you organize your health check so that it knows when it's OK to stop the server?

@brianloss
Contributor Author

For us, the health check isn't really involved. As I mentioned above, the weight is set to 0, so HAProxy won't send any non-persistent traffic to the terminating pod. The rest is built into the Docker image and service. When a pod is terminated, Docker sends a SIGTERM, which our entrypoint script traps and uses to invoke a curl call against our API. That call watches internally until either all active queries complete or a long timeout expires, and then tells WildFly to shut down.

@dcowden

dcowden commented Apr 10, 2018

OK, thanks, I understand now!
