Description
I've been running the scale test a lot lately, and I've found that at higher scales (more stuff deployed), it is much more common for us to time out waiting for a Service to become "Ready".
I managed to capture an instance of this at low scale (following a large scale test) and the final state was:
$ kubectl get ksvc,configuration,route -nserving-tests
NAME DOMAIN LATESTCREATED LATESTREADY READY REASON
service.serving.knative.dev/scale-00010-000-tsmelnjv scale-00010-000-tsmelnjv.serving-tests.example.com scale-00010-000-tsmelnjv-00001 scale-00010-000-tsmelnjv-00001 True
service.serving.knative.dev/scale-00010-001-vpsuewge scale-00010-001-vpsuewge.serving-tests.example.com scale-00010-001-vpsuewge-00001 scale-00010-001-vpsuewge-00001 True
service.serving.knative.dev/scale-00010-002-nohksgaj scale-00010-002-nohksgaj.serving-tests.example.com scale-00010-002-nohksgaj-00001 scale-00010-002-nohksgaj-00001 True
service.serving.knative.dev/scale-00010-003-ubtmhruc scale-00010-003-ubtmhruc.serving-tests.example.com scale-00010-003-ubtmhruc-00001 scale-00010-003-ubtmhruc-00001 True
service.serving.knative.dev/scale-00010-004-gkwltbwt scale-00010-004-gkwltbwt.serving-tests.example.com scale-00010-004-gkwltbwt-00001 scale-00010-004-gkwltbwt-00001 Unknown RevisionMissing
service.serving.knative.dev/scale-00010-005-hsqmxdhs scale-00010-005-hsqmxdhs.serving-tests.example.com scale-00010-005-hsqmxdhs-00001 scale-00010-005-hsqmxdhs-00001 True
service.serving.knative.dev/scale-00010-006-opchjfyd scale-00010-006-opchjfyd.serving-tests.example.com scale-00010-006-opchjfyd-00001 scale-00010-006-opchjfyd-00001 True
service.serving.knative.dev/scale-00010-007-ckfyvvmw scale-00010-007-ckfyvvmw.serving-tests.example.com scale-00010-007-ckfyvvmw-00001 scale-00010-007-ckfyvvmw-00001 Unknown RevisionMissing
service.serving.knative.dev/scale-00010-008-bpamlmev scale-00010-008-bpamlmev.serving-tests.example.com scale-00010-008-bpamlmev-00001 scale-00010-008-bpamlmev-00001 True
service.serving.knative.dev/scale-00010-009-byvhltnh scale-00010-009-byvhltnh.serving-tests.example.com scale-00010-009-byvhltnh-00001 scale-00010-009-byvhltnh-00001 True
NAME LATESTCREATED LATESTREADY READY REASON
configuration.serving.knative.dev/scale-00010-000-tsmelnjv scale-00010-000-tsmelnjv-00001 scale-00010-000-tsmelnjv-00001 True
configuration.serving.knative.dev/scale-00010-001-vpsuewge scale-00010-001-vpsuewge-00001 scale-00010-001-vpsuewge-00001 True
configuration.serving.knative.dev/scale-00010-002-nohksgaj scale-00010-002-nohksgaj-00001 scale-00010-002-nohksgaj-00001 True
configuration.serving.knative.dev/scale-00010-003-ubtmhruc scale-00010-003-ubtmhruc-00001 scale-00010-003-ubtmhruc-00001 True
configuration.serving.knative.dev/scale-00010-004-gkwltbwt scale-00010-004-gkwltbwt-00001 scale-00010-004-gkwltbwt-00001 True
configuration.serving.knative.dev/scale-00010-005-hsqmxdhs scale-00010-005-hsqmxdhs-00001 scale-00010-005-hsqmxdhs-00001 True
configuration.serving.knative.dev/scale-00010-006-opchjfyd scale-00010-006-opchjfyd-00001 scale-00010-006-opchjfyd-00001 True
configuration.serving.knative.dev/scale-00010-007-ckfyvvmw scale-00010-007-ckfyvvmw-00001 scale-00010-007-ckfyvvmw-00001 True
configuration.serving.knative.dev/scale-00010-008-bpamlmev scale-00010-008-bpamlmev-00001 scale-00010-008-bpamlmev-00001 True
configuration.serving.knative.dev/scale-00010-009-byvhltnh scale-00010-009-byvhltnh-00001 scale-00010-009-byvhltnh-00001 True
NAME DOMAIN READY REASON
route.serving.knative.dev/scale-00010-000-tsmelnjv scale-00010-000-tsmelnjv.serving-tests.example.com True
route.serving.knative.dev/scale-00010-001-vpsuewge scale-00010-001-vpsuewge.serving-tests.example.com True
route.serving.knative.dev/scale-00010-002-nohksgaj scale-00010-002-nohksgaj.serving-tests.example.com True
route.serving.knative.dev/scale-00010-003-ubtmhruc scale-00010-003-ubtmhruc.serving-tests.example.com True
route.serving.knative.dev/scale-00010-004-gkwltbwt scale-00010-004-gkwltbwt.serving-tests.example.com Unknown RevisionMissing
route.serving.knative.dev/scale-00010-005-hsqmxdhs scale-00010-005-hsqmxdhs.serving-tests.example.com True
route.serving.knative.dev/scale-00010-006-opchjfyd scale-00010-006-opchjfyd.serving-tests.example.com True
route.serving.knative.dev/scale-00010-007-ckfyvvmw scale-00010-007-ckfyvvmw.serving-tests.example.com Unknown RevisionMissing
route.serving.knative.dev/scale-00010-008-bpamlmev scale-00010-008-bpamlmev.serving-tests.example.com True
route.serving.knative.dev/scale-00010-009-byvhltnh scale-00010-009-byvhltnh.serving-tests.example.com True
I believe the last state of the Configuration that the Route observes is one where the Revision is missing. Before the Route calls Track on the Configuration (so that it is enqueued on future updates), the Configuration reaches its final state. Because we read the Configuration's state before we Track, and we don't requeue when Track sets up a new watch, there is a small window in which we can miss an update.
I think the simplest (idiot-proof) way to fix this would be to have us call i.cb(key) here when _, ok := l[key]; !ok. In words: when a new key starts tracking a particular ref, immediately requeue that key to ensure no updates were missed.
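To make the proposal concrete, here is a minimal sketch of the idea, not the actual tracker code. The `Tracker` type, `NewTracker`, `OnChanged`, and the string-based refs and keys are all hypothetical simplifications for illustration; the real tracker has more to it than this. The only point is the `!tracked` branch in `Track`: the first time a key starts tracking a ref, the callback fires right away, so an update that landed between reading the Configuration and calling Track still leads to a requeue.

```go
package main

import (
	"fmt"
	"sync"
)

// Tracker is a simplified, hypothetical stand-in for the tracker discussed
// above. It maps a tracked ref to the set of keys that want to be enqueued
// when that ref changes.
type Tracker struct {
	mu      sync.Mutex
	mapping map[string]map[string]struct{}
	cb      func(key string) // enqueues key for another reconcile
}

// NewTracker returns a Tracker that invokes cb for every key to requeue.
func NewTracker(cb func(key string)) *Tracker {
	return &Tracker{mapping: make(map[string]map[string]struct{}), cb: cb}
}

// Track records that key wants updates about ref. If key was not already
// tracking ref, key is requeued immediately to cover any update that was
// missed before the tracking was set up.
func (t *Tracker) Track(ref, key string) {
	t.mu.Lock()
	l, ok := t.mapping[ref]
	if !ok {
		l = make(map[string]struct{})
		t.mapping[ref] = l
	}
	_, tracked := l[key]
	l[key] = struct{}{}
	t.mu.Unlock()

	// Call the callback outside the lock so it cannot deadlock with a
	// callback that re-enters the tracker.
	if !tracked {
		t.cb(key)
	}
}

// OnChanged requeues every key that is tracking ref.
func (t *Tracker) OnChanged(ref string) {
	t.mu.Lock()
	keys := make([]string, 0, len(t.mapping[ref]))
	for k := range t.mapping[ref] {
		keys = append(keys, k)
	}
	t.mu.Unlock()

	for _, k := range keys {
		t.cb(k)
	}
}

func main() {
	var enqueued []string
	t := NewTracker(func(key string) { enqueued = append(enqueued, key) })

	// The first Track call requeues the key; the repeat call does not.
	t.Track("configuration/scale-00010-004", "route/scale-00010-004")
	t.Track("configuration/scale-00010-004", "route/scale-00010-004")

	fmt.Println(enqueued) // prints [route/scale-00010-004]
}
```

The cost is one extra reconcile per newly tracked ref, which seems like a reasonable price for closing the window.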