
Support a "drain" mode for terminating pods and servers failing readiness checks#95

Merged
jcmoraisjr merged 4 commits into jcmoraisjr:master from brianloss:drain-support
Apr 9, 2018

Conversation

@brianloss
Contributor

When a pod starts terminating or its readiness check fails, leave the pod as a server
in the load balancer, but set its weight to 0 so that HAProxy does not send ordinary
traffic to it. Traffic using persistence, however, is still directed to the terminating
pod, which allows persistent requests to keep flowing during termination. Updates that
affect only the draining state are made through the stats socket rather than forcing a
full reload of HAProxy.

This PR includes the described change, as well as a few fixes for typos I found along the way and cleanup of some warnings IntelliJ was showing (e.g., initializing slices with an empty literal instead of nil, error messages starting with a capital letter).

I have a use case where pods have a long termination period and new requests carrying a cookie need to be directed to the terminating servers during this period. I was thinking this could be useful for other people too, but I can always keep it in a fork if you don't think it fits.
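The weight change described above goes over HAProxy's stats socket (runtime API) instead of a full reload. A minimal sketch of that interaction in Go, where the socket path and the backend/server names (`be_app`, `srv1`) are made-up examples, not the controller's actual code:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// drainCommand builds the runtime-API command that removes a server
// from ordinary load balancing (weight 0) while cookie persistence
// continues to route pinned requests to it.
func drainCommand(backend, server string) string {
	return fmt.Sprintf("set weight %s/%s 0\n", backend, server)
}

// sendCommand writes a single command to the HAProxy stats socket and
// returns the reply. The socket path is an assumption for this sketch.
func sendCommand(socketPath, cmd string) (string, error) {
	conn, err := net.DialTimeout("unix", socketPath, 2*time.Second)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(cmd)); err != nil {
		return "", err
	}
	buf := make([]byte, 512)
	n, _ := conn.Read(buf)
	return string(buf[:n]), nil
}

func main() {
	cmd := drainCommand("be_app", "srv1")
	fmt.Print(cmd) // the line that would be written to the socket
	// In a real controller this would be something like:
	//   sendCommand("/var/run/haproxy-stats.sock", cmd)
}
```

Because the command only touches runtime state, HAProxy keeps its existing connections and configuration; no reload, and no dropped persistent sessions.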

@jcmoraisjr
Owner

A very specific use case but indeed an interesting feature. I'll need a few more days to have a look at the approach.

@aiharos fyi.

@aiharos
Contributor

aiharos commented Feb 22, 2018

The stats_socket.go changes seem solid, and if draining is enabled the server weight is set in the configuration template so that seems good too. +1 for ignoring Ingress status changes, that also causes some needless reloads.

@jcmoraisjr
Owner

Hi, finally, thanks for supporting this ingress controller with this amazing piece of code! Sorry about the long delay, I was on vacation these last few weeks.

Now you are adding NotReady and Terminating pods to the list of available endpoints. How do you control whether these endpoints should or shouldn't be added to the backend servers when drain-support is false?

@jcmoraisjr
Owner

Hi, only two pending issues after a few tests:

  • check that the svc namespace and pod namespace are the same in GetTerminatingServicePods
  • only add draining pods to the backend list if draining is enabled - in our use case we need terminating pods removed from load balancing as soon as possible

Moreover, I pushed a commit by mistake that conflicts with your change. Please have a look and see if you can rebase on top of master.
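The second checklist item boils down to a small decision per pod. A sketch of that selection logic, where the function name, signature, and the weight of 100 are illustrative assumptions rather than the controller's actual API:

```go
package main

import "fmt"

// includePod decides whether a pod appears in the backend server list
// and at what weight. Healthy pods are balanced normally; not-ready or
// terminating pods are kept only as weight-0 (draining) servers, and
// only when drain-support is enabled.
func includePod(drainSupport, ready, terminating bool) (include bool, weight int) {
	if ready && !terminating {
		return true, 100 // healthy pod, normal balancing weight
	}
	if drainSupport {
		return true, 0 // kept in the backend, but drains: cookie traffic only
	}
	return false, 0 // drain-support off: drop the pod from the backend ASAP
}

func main() {
	inc, w := includePod(true, false, true)
	fmt.Println(inc, w) // true 0: terminating pod kept as a draining server
	inc, _ = includePod(false, false, true)
	fmt.Println(inc) // false: removed from balancing immediately
}
```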

@brianloss
Contributor Author

brianloss commented Mar 21, 2018 via email

* validate namespace match when pulling terminating service pods
* don't include not ready or terminating pods when DrainSupport is not active
@jcmoraisjr jcmoraisjr merged commit 2000ef7 into jcmoraisjr:master Apr 9, 2018
@jcmoraisjr
Owner

Thanks!

@dcowden

dcowden commented Apr 10, 2018

@brianloss Kudos for developing this-- we need it too, and we're planning on trying it out. Just wanted to say thanks and let you know someone else indeed used it!

@brianloss
Contributor Author

No problem! Hope it works OK for you -- I've tested it for my use case, but Go isn't my primary language and I'm sure there are some parts I could have done better.

@dcowden

dcowden commented Apr 10, 2018

I'm hoping to test today -- I'll let you know how testing goes. My scenario is a Java app (JSESSIONID) using sticky sessions to a 3-node deployment. My goal is to execute a rolling update of the deployment and have the nodes finish terminating when they have no more active sessions.

To pull this off, I also need to fail the health check when there are no more active sessions, which in Java is a JMX thing. You're not using Java by any chance, are you? If so, how did you loop the active session count into the health check?

@brianloss
Contributor Author

I'm using WildFly. Typically, you would initiate a shutdown of your app and it would stop accepting new connections. Existing connections (e.g., active sessions) are left alone by both HAProxy and Kubernetes, so they could finish on their own, and then there would be no need for this feature: terminating the pod removes it from the endpoints as soon as termination begins, so no new requests go to it while active connections finish up.

For my use case, there is a stateful API where users access a query by executing a create call to the server, followed by a number of next calls to retrieve pages of results. Since these are new requests that, for us, must go back to the same web server, I need the drain mode. By setting the weight of a server to 0 as soon as it is terminating, HAProxy won't send any new traffic to it; only traffic that uses cookie persistence will go there.
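The behavior described here can be seen in a plain HAProxy config fragment (the backend and server names below are made up). With `weight 0`, srv2 receives no new balanced traffic, but a request whose SERVERID cookie already points at srv2 is still routed there:

```
backend be_app
    balance roundrobin
    cookie SERVERID insert indirect nocache
    server srv1 10.0.0.1:8080 cookie srv1 weight 100 check
    # draining: excluded from load balancing, still honors its cookie
    server srv2 10.0.0.2:8080 cookie srv2 weight 0 check
```

The PR applies the weight change at runtime through the stats socket, so the drained state takes effect without a reload.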

@dcowden

dcowden commented Apr 10, 2018

Yep, that's almost exactly our use case. How did you organize your health check so that it knows when it's OK to stop the server?

@brianloss
Contributor Author

For us, the health check isn't really involved. As I mentioned above, the weight is set to 0, so HAProxy won't send any non-persistent traffic to the terminating pod. The rest is built into the Docker image and service. When a pod is terminated, Docker sends a SIGTERM, which our entrypoint script traps and uses to invoke a curl call against our API. That call watches internally until either all active queries complete or a long timeout expires, and then tells WildFly to shut down.

@dcowden

dcowden commented Apr 10, 2018

OK, thanks, I understand now!
