peer-finder gets stuck in retry loop of DNS errors when no peers are found #1931
Comments
Hmm,
That's not what the peer-finder is actually doing, though:
And that's what I was asking about when saying "I'd be interested to know why, but that's orthogonal". For what it's worth, qualifying it with the domain name fixes nslookup, at the very least:

root@cockroachdb-0:/# nslookup cockroachdb.default
Server: 10.3.240.10
Address: 10.3.240.10#53

** server can't find cockroachdb.default: NXDOMAIN
yeah, the peer-finder looks up SRV records and compares them to its own FQDN, i.e.:
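(For reference, a minimal sketch of that SRV-lookup-and-compare loop; this is not the peer-finder's actual code, and the service name and the way the pod's own FQDN is obtained here are illustrative assumptions:)

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// lookupPeers resolves SRV records for a headless service and returns the
// target hostnames, trimming the trailing root dot that DNS appends.
func lookupPeers(svc string) ([]string, error) {
	_, srvs, err := net.LookupSRV("", "", svc)
	if err != nil {
		return nil, err
	}
	peers := make([]string, 0, len(srvs))
	for _, srv := range srvs {
		peers = append(peers, strings.TrimSuffix(srv.Target, "."))
	}
	return peers, nil
}

func main() {
	svc := "cockroachdb" // unqualified service name, as in the transcripts above
	// Assumption: os.Hostname may return only the short name; the real
	// peer-finder derives the FQDN differently (e.g. `hostname -f`).
	self, _ := os.Hostname()

	// Poll until our own name shows up among the SRV targets.
	for {
		peers, err := lookupPeers(svc)
		if err != nil {
			// This is the error that keeps being logged while the
			// service has no endpoints yet.
			fmt.Fprintf(os.Stderr, "lookup %s: %v\n", svc, err)
		}
		for _, p := range peers {
			if p == self {
				fmt.Println("found self in DNS; peers:", peers)
				return
			}
		}
		time.Sleep(time.Second)
	}
}
```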
So you're saying the peer-finder code needs to start including the namespace in its lookups? Specifying the SRV record type still fails for the service with no endpoints:

root@cockroachdb-0:/# nslookup -type=srv cockroachdb
Server: 10.3.240.10
Address: 10.3.240.10#53

** server can't find cockroachdb: SERVFAIL

While it works fine for a service that does have endpoints:

root@cockroachdb-0:/# nslookup -type=srv my-release-etcd
Server: 10.3.240.10
Address: 10.3.240.10#53

my-release-etcd.default.svc.cluster.local service = 10 33 0 my-release-etcd-0.my-release-etcd.default.svc.cluster.local.
my-release-etcd.default.svc.cluster.local service = 10 33 0 my-release-etcd-2.my-release-etcd.default.svc.cluster.local.
my-release-etcd.default.svc.cluster.local service = 10 33 0 my-release-etcd-1.my-release-etcd.default.svc.cluster.local.
ah, that's why we have the tolerate-unready-endpoints annotation, so you can hang till you show up in DNS (kubernetes/kubernetes#25283). It's basically a lock on the petset. Does this suit your use case, or do we need to add another feature (or make the peer-finder SRV lookup namespace-aware)?
I mean, we could make things work with the tolerate-unready-endpoints annotation set, but we don't actually need it for anything and kind of prefer not to use it so that nodes don't try to join themselves. If we were to start using it, we'd have to switch the join address that the nodes use to the "public" service, which does respect readiness. Requiring the annotation for peer-finder to work seems like a strange coupling, and it also sounds from kubernetes/kubernetes#25283 like the annotation may be important for petsets in other ways? Is it effectively required for petset services?
Not required; only if you want to hang till you can resolve yourself before allowing anyone else to start. If you don't care about that property you don't need it (e.g. etcd, started with a config of peers etcd1,2,3, will just wait till they come online, while galera will crash saying it couldn't find those peers). If you specify a readiness probe, you need to pass it to show up in DNS, and you can't pass it without finishing up your init container. There are workarounds that involve a fatty entrypoint, but in the long run I think we will probably end up forking DNS so it inserts records for unready endpoints as well, and kube-proxy will continue respecting just ready endpoints.
Just to provide a little closure: we talked on Slack last week and concluded that it would be best for CockroachDB to start using tolerate-unready-endpoints. That means I'm no longer blocked on this, but it still may be best to include the service's namespace in the lookup to avoid hitting the DNS errors for empty result sets.
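(A minimal sketch of what such a namespace-aware lookup might look like; the POD_NAMESPACE variable, its downward-API wiring, and the cluster.local domain are assumptions for illustration, not existing peer-finder behavior:)

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	svc := "cockroachdb"
	// Hypothetical: namespace injected via the downward API
	// (fieldRef metadata.namespace exposed as POD_NAMESPACE).
	ns := os.Getenv("POD_NAMESPACE")
	if ns == "" {
		ns = "default"
	}
	// Fully qualifying the name sidesteps search-path expansion; per the
	// nslookup transcript above, a qualified name for an endpoint-less
	// service returned a clean NXDOMAIN rather than a SERVFAIL.
	name := fmt.Sprintf("%s.%s.svc.cluster.local", svc, ns)
	_, srvs, err := net.LookupSRV("", "", name)
	if err != nil {
		fmt.Fprintf(os.Stderr, "lookup %s: %v\n", name, err)
		os.Exit(1)
	}
	for _, srv := range srvs {
		fmt.Println(srv.Target)
	}
}
```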
Issues go stale after 30d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
While working on an init container to help with a potential edge case in cockroachdb's petset config (cockroachdb/cockroach#10140), I spun up an init container with the following config as part of a petset:
The first pet has been stuck in the init state for more than 10 minutes, and the peer-finder is clearly having a bad time, with its logs containing the same DNS error over and over:
It's expected that no peers would be found, but not that DNS errors would be returned. golang/go#12712 looks like a potential cause, although it was supposedly fixed in 1.6 if you trust the milestone attached to it.
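(For what it's worth, a sketch of a retry loop that treats temporary resolver errors as "no peers yet" rather than hard failures; whether a SERVFAIL is actually flagged as temporary depends on which Go resolver path is in use, so this is an approximation, not the peer-finder's behavior:)

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	for {
		_, srvs, err := net.LookupSRV("", "", "cockroachdb")
		if err != nil {
			// An endpoint-less headless service surfaces as a resolver
			// error rather than an empty slice, so classify the error
			// instead of giving up.
			if dnsErr, ok := err.(*net.DNSError); ok && dnsErr.IsTemporary {
				time.Sleep(time.Second) // e.g. SERVFAIL: keep polling
				continue
			}
			fmt.Println("fatal lookup error:", err)
			return
		}
		if len(srvs) > 0 {
			fmt.Println("found", len(srvs), "SRV records")
			return
		}
		time.Sleep(time.Second)
	}
}
```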
If I open a shell in the init container and play around with similar DNS lookups, this is what I see:
The SERVFAIL for nslookup cockroachdb is fairly damning, considering the peer-finder doesn't qualify its lookups with the namespace/suffix (I'd be interested to know why, but that's orthogonal). The kubedns containers in the cluster don't have any logs, but they might be interesting with verbose logging enabled? In case it matters, the cluster is at 1.4.0 on GKE using the alpha cluster option.
@bprashanth - is this a known issue, or something that needs further investigation?