This repository was archived by the owner on Apr 17, 2019. It is now read-only.

peer-finder gets stuck in retry loop of DNS errors when no peers are found #1931

Closed
a-robinson opened this issue Oct 26, 2016 · 10 comments
Labels: lifecycle/rotten

Comments

@a-robinson (Contributor) commented Oct 26, 2016

While working on an init container to help with a potential edge case in cockroachdb's petset config (cockroachdb/cockroach#10140), I spun one up with the following config as part of a petset:

        pod.alpha.kubernetes.io/init-containers: '[
            {
                "name": "bootstrap",
                "image": "gcr.io/google_containers/peer-finder:0.1",
                "args": [
                  "-on-start=\"readarray PEERS;
                               if [ ${#PEERS[@]} -eq 0 ]; then
                                 mkdir -p /cockroach/cockroach-data && touch /cockroach/cockroach-data/cluster_exists_marker
                               fi\"",
                  "-service=cockroachdb"],
                "env": [
                  {
                      "name": "POD_NAMESPACE",
                      "valueFrom": {
                          "fieldRef": {
                              "apiVersion": "v1",
                              "fieldPath": "metadata.namespace"
                          }
                      }
                   }
                ],
                "volumeMounts": [
                    {
                        "name": "datadir",
                        "mountPath": "/cockroach/cockroach-data"
                    }
                ]
            }
        ]'
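
For readability, the escaped -on-start hook above expands to roughly the following bash script (a sketch; peer-finder pipes the newline-separated peer list to the hook on stdin):

#!/bin/bash
# peer-finder writes the discovered peer list, one FQDN per line, to stdin.
readarray PEERS
if [ ${#PEERS[@]} -eq 0 ]; then
    # No peers found: mark this node as the one that bootstraps the cluster.
    mkdir -p /cockroach/cockroach-data && touch /cockroach/cockroach-data/cluster_exists_marker
fi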

The first pet has been stuck in the init state for more than 10 minutes, and the peer-finder is clearly having a bad time, with its logs containing the same DNS error over and over:

$ kc logs cockroachdb-0 bootstrap
2016/10/26 16:49:53 lookup cockroachdb on 10.3.240.10:53: server misbehaving
2016/10/26 16:49:54 lookup cockroachdb on 10.3.240.10:53: server misbehaving
2016/10/26 16:49:55 lookup cockroachdb on 10.3.240.10:53: server misbehaving
...

It's expected that no peers would be found, but not that DNS errors would be returned. golang/go#12712 looks like a potential cause, although it was supposedly fixed in Go 1.6, if you trust the milestone attached to it.

If I open a shell in the init container and play around with similar DNS lookups, this is what I see:

root@cockroachdb-0:/# dig cockroachdb

; <<>> DiG 9.10.3-P4-Ubuntu <<>> cockroachdb
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 18282
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;cockroachdb.           IN  A

;; Query time: 1 msec
;; SERVER: 10.3.240.10#53(10.3.240.10)
;; WHEN: Wed Oct 26 16:54:24 UTC 2016
;; MSG SIZE  rcvd: 29

root@cockroachdb-0:/# nslookup cockroachdb
Server:     10.3.240.10
Address:    10.3.240.10#53

** server can't find cockroachdb: SERVFAIL

root@cockroachdb-0:/# dig cockroachdb.default.svc.cluster.local

; <<>> DiG 9.10.3-P4-Ubuntu <<>> cockroachdb.default.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 53063
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;cockroachdb.default.svc.cluster.local. IN A

;; AUTHORITY SECTION:
cluster.local.      60  IN  SOA ns.dns.cluster.local. hostmaster.cluster.local. 1477501200 28800 7200 604800 60

;; Query time: 1 msec
;; SERVER: 10.3.240.10#53(10.3.240.10)
;; WHEN: Wed Oct 26 17:12:26 UTC 2016
;; MSG SIZE  rcvd: 148

root@cockroachdb-0:/# nslookup cockroachdb.default.svc.cluster.local
Server:     10.3.240.10
Address:    10.3.240.10#53

** server can't find cockroachdb.default.svc.cluster.local: NXDOMAIN

The SERVFAIL for nslookup cockroachdb is fairly damning, considering the peer-finder doesn't qualify its lookups with the namespace/suffix (I'd be interested to know why, but that's orthogonal). The kubedns containers in the cluster don't have any relevant logs, but it might be interesting to retry with verbose logging enabled.

In case it matters, the cluster is at 1.4.0 on GKE using the alpha cluster option.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.0", GitCommit:"a16c0a7f71a6f93c7e0f222d961f4675cd97a46b", GitTreeState:"clean", BuildDate:"2016-09-26T18:16:57Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.0", GitCommit:"a16c0a7f71a6f93c7e0f222d961f4675cd97a46b", GitTreeState:"clean", BuildDate:"2016-09-26T18:10:32Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

@bprashanth - is this a known issue, or something that needs further investigation?

@bprashanth

@a-robinson (Contributor, Author)

That's not what the peer-finder is actually doing, though:

lookup() directly looks up the string passed to it:
https://github.com/kubernetes/contrib/blob/master/pets/peer-finder/peer-finder.go#L48

And main() passes it the bare service name, without qualifying it with the namespace:
https://github.com/kubernetes/contrib/blob/master/pets/peer-finder/peer-finder.go#L93

That's what I was asking about when saying "I'd be interested to know why, but that's orthogonal".

For what it's worth, qualifying it with the namespace at least turns the SERVFAIL into an NXDOMAIN:

root@cockroachdb-0:/# nslookup cockroachdb.default
Server:     10.3.240.10
Address:    10.3.240.10#53

** server can't find cockroachdb.default: NXDOMAIN
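
For comparison, the namespace-qualified SRV query that a namespace-aware peer-finder would issue looks like this (a sketch, not captured in the transcript; presumably it would return NXDOMAIN for the empty service, like the qualified A lookup above, rather than SERVFAIL):

$ nslookup -type=srv cockroachdb.default.svc.cluster.local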

@bprashanth

Yeah, the peer-finder looks up SRV records and compares them to its own FQDN, e.g.:

root@test-0:/# nslookup -type=srv test
Server:     10.0.0.10
Address:    10.0.0.10#53

test.default.svc.cluster.local  service = 10 50 0 test-1.test.default.svc.cluster.local.
test.default.svc.cluster.local  service = 10 50 0 test-0.test.default.svc.cluster.local.

@a-robinson (Contributor, Author)

So you're saying the peer-finder code needs to start including the namespace in its lookups? Specifying -type=srv still returns SERVFAIL for a service with no endpoints.

root@cockroachdb-0:/# nslookup -type=srv cockroachdb
Server:     10.3.240.10
Address:    10.3.240.10#53

** server can't find cockroachdb: SERVFAIL

While it works fine for a service that does have endpoints:

root@cockroachdb-0:/# nslookup -type=srv my-release-etcd
Server:     10.3.240.10
Address:    10.3.240.10#53

my-release-etcd.default.svc.cluster.local   service = 10 33 0 my-release-etcd-0.my-release-etcd.default.svc.cluster.local.
my-release-etcd.default.svc.cluster.local   service = 10 33 0 my-release-etcd-2.my-release-etcd.default.svc.cluster.local.
my-release-etcd.default.svc.cluster.local   service = 10 33 0 my-release-etcd-1.my-release-etcd.default.svc.cluster.local.

@bprashanth

Ah, that's why we have the tolerate-unready-endpoints annotation, so you can hang until you show up in DNS (kubernetes/kubernetes#25283). It's basically a lock on the petset.
The assumption is that database-y things manage their own readiness through internal protocols; if you want a service that does respect readiness, you can always create another one and give that out to your clients.

Does this suit your use case, or do we need to add another feature (or make the peer-finder's SRV lookup namespace-aware)?
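
For reference, applying that annotation to the headless peer-discovery service might look like this (a sketch; the service.alpha.kubernetes.io/tolerate-unready-endpoints key is the 1.4-era alpha annotation from kubernetes/kubernetes#25283):

$ kubectl annotate service cockroachdb service.alpha.kubernetes.io/tolerate-unready-endpoints="true"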

@a-robinson (Contributor, Author)

I mean, we could make things work with the tolerate-unready-endpoints annotation set, but we don't actually need it for anything and kind of prefer not using it so that nodes don't try to join themselves. If we were to start using it, we'd have to switch the join address that the nodes use to the "public" service, which does respect readiness.

Requiring the annotation for peer-finder to work seems like a strange coupling, and it also sounds in kubernetes/kubernetes#25283 like the annotation may be important for petsets in other ways? Is it effectively required for petset services?

@bprashanth

It's not required; it only matters if you want each pet to hang until it can resolve itself before allowing anyone else to start. Whether you care about that property depends on the database (e.g. etcd started with a peer config of etcd1,2,3 will just wait until they come online, while Galera will crash saying it couldn't find those peers).

If you specify a readiness probe, you need to pass it to show up in DNS, and you can't pass it without finishing your init container. There are workarounds that involve a fat entrypoint, but in the long run I think we will probably end up forking DNS so it inserts records for unready endpoints as well, while kube-proxy continues to respect only ready endpoints.

@a-robinson (Contributor, Author)

Just to provide a little closure, we talked on Slack last week and concluded that it would be best for CockroachDB to start using tolerate-unready-endpoints. That means I'm no longer blocked on this, but it may still be best to include the service's namespace in the lookup to avoid hitting DNS errors for empty result sets.

@fejta-bot

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 18, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 17, 2018