Race during Route readiness #2954

@mattmoor

I've been running the scale test a lot lately, and I've found that at higher scales (more stuff deployed), it is much more common for us to time out waiting for a Service to become "Ready".

I managed to capture an instance of this at low scale (following a large scale test) and the final state was:

$ kubectl get ksvc,configuration,route -nserving-tests
NAME                                                   DOMAIN                                               LATESTCREATED                    LATESTREADY                      READY     REASON
service.serving.knative.dev/scale-00010-000-tsmelnjv   scale-00010-000-tsmelnjv.serving-tests.example.com   scale-00010-000-tsmelnjv-00001   scale-00010-000-tsmelnjv-00001   True
service.serving.knative.dev/scale-00010-001-vpsuewge   scale-00010-001-vpsuewge.serving-tests.example.com   scale-00010-001-vpsuewge-00001   scale-00010-001-vpsuewge-00001   True
service.serving.knative.dev/scale-00010-002-nohksgaj   scale-00010-002-nohksgaj.serving-tests.example.com   scale-00010-002-nohksgaj-00001   scale-00010-002-nohksgaj-00001   True
service.serving.knative.dev/scale-00010-003-ubtmhruc   scale-00010-003-ubtmhruc.serving-tests.example.com   scale-00010-003-ubtmhruc-00001   scale-00010-003-ubtmhruc-00001   True
service.serving.knative.dev/scale-00010-004-gkwltbwt   scale-00010-004-gkwltbwt.serving-tests.example.com   scale-00010-004-gkwltbwt-00001   scale-00010-004-gkwltbwt-00001   Unknown   RevisionMissing
service.serving.knative.dev/scale-00010-005-hsqmxdhs   scale-00010-005-hsqmxdhs.serving-tests.example.com   scale-00010-005-hsqmxdhs-00001   scale-00010-005-hsqmxdhs-00001   True
service.serving.knative.dev/scale-00010-006-opchjfyd   scale-00010-006-opchjfyd.serving-tests.example.com   scale-00010-006-opchjfyd-00001   scale-00010-006-opchjfyd-00001   True
service.serving.knative.dev/scale-00010-007-ckfyvvmw   scale-00010-007-ckfyvvmw.serving-tests.example.com   scale-00010-007-ckfyvvmw-00001   scale-00010-007-ckfyvvmw-00001   Unknown   RevisionMissing
service.serving.knative.dev/scale-00010-008-bpamlmev   scale-00010-008-bpamlmev.serving-tests.example.com   scale-00010-008-bpamlmev-00001   scale-00010-008-bpamlmev-00001   True
service.serving.knative.dev/scale-00010-009-byvhltnh   scale-00010-009-byvhltnh.serving-tests.example.com   scale-00010-009-byvhltnh-00001   scale-00010-009-byvhltnh-00001   True

NAME                                                         LATESTCREATED                    LATESTREADY                      READY   REASON
configuration.serving.knative.dev/scale-00010-000-tsmelnjv   scale-00010-000-tsmelnjv-00001   scale-00010-000-tsmelnjv-00001   True
configuration.serving.knative.dev/scale-00010-001-vpsuewge   scale-00010-001-vpsuewge-00001   scale-00010-001-vpsuewge-00001   True
configuration.serving.knative.dev/scale-00010-002-nohksgaj   scale-00010-002-nohksgaj-00001   scale-00010-002-nohksgaj-00001   True
configuration.serving.knative.dev/scale-00010-003-ubtmhruc   scale-00010-003-ubtmhruc-00001   scale-00010-003-ubtmhruc-00001   True
configuration.serving.knative.dev/scale-00010-004-gkwltbwt   scale-00010-004-gkwltbwt-00001   scale-00010-004-gkwltbwt-00001   True
configuration.serving.knative.dev/scale-00010-005-hsqmxdhs   scale-00010-005-hsqmxdhs-00001   scale-00010-005-hsqmxdhs-00001   True
configuration.serving.knative.dev/scale-00010-006-opchjfyd   scale-00010-006-opchjfyd-00001   scale-00010-006-opchjfyd-00001   True
configuration.serving.knative.dev/scale-00010-007-ckfyvvmw   scale-00010-007-ckfyvvmw-00001   scale-00010-007-ckfyvvmw-00001   True
configuration.serving.knative.dev/scale-00010-008-bpamlmev   scale-00010-008-bpamlmev-00001   scale-00010-008-bpamlmev-00001   True
configuration.serving.knative.dev/scale-00010-009-byvhltnh   scale-00010-009-byvhltnh-00001   scale-00010-009-byvhltnh-00001   True

NAME                                                 DOMAIN                                               READY     REASON
route.serving.knative.dev/scale-00010-000-tsmelnjv   scale-00010-000-tsmelnjv.serving-tests.example.com   True
route.serving.knative.dev/scale-00010-001-vpsuewge   scale-00010-001-vpsuewge.serving-tests.example.com   True
route.serving.knative.dev/scale-00010-002-nohksgaj   scale-00010-002-nohksgaj.serving-tests.example.com   True
route.serving.knative.dev/scale-00010-003-ubtmhruc   scale-00010-003-ubtmhruc.serving-tests.example.com   True
route.serving.knative.dev/scale-00010-004-gkwltbwt   scale-00010-004-gkwltbwt.serving-tests.example.com   Unknown   RevisionMissing
route.serving.knative.dev/scale-00010-005-hsqmxdhs   scale-00010-005-hsqmxdhs.serving-tests.example.com   True
route.serving.knative.dev/scale-00010-006-opchjfyd   scale-00010-006-opchjfyd.serving-tests.example.com   True
route.serving.knative.dev/scale-00010-007-ckfyvvmw   scale-00010-007-ckfyvvmw.serving-tests.example.com   Unknown   RevisionMissing
route.serving.knative.dev/scale-00010-008-bpamlmev   scale-00010-008-bpamlmev.serving-tests.example.com   True
route.serving.knative.dev/scale-00010-009-byvhltnh   scale-00010-009-byvhltnh.serving-tests.example.com   True

I believe that the last state of the Configuration that the Route observes is one where the Revision is missing. Then, before the Route calls Track on the Configuration (so that it is enqueued on future updates), the Configuration reaches its final state. Because we read the Configuration's state before we call Track, and we don't requeue when Track sets up a new watch, there is a small window in which we can miss an update.

I think the simplest (idiot-proof) way to fix this would be to have us call i.cb(key) here when _, ok := l[key]; !ok. In words: when a new key starts tracking a particular ref, immediately requeue that key to ensure no updates were missed.
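
To make the proposal concrete, here is a minimal sketch of a tracker that requeues a key the first time it starts tracking a ref. It is not the actual Knative tracker: the Ref, Tracker, NewTracker, and OnChanged names are hypothetical, and things like lease expiry are left out. Only the cb callback and the `_, ok := l[key]; !ok` check are taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// Ref is a hypothetical stand-in for a reference to a tracked object
// (for example, a Configuration).
type Ref string

// Tracker is a simplified, hypothetical tracker: it records which keys are
// interested in which refs and invokes cb to enqueue a key.
type Tracker struct {
	mu      sync.Mutex
	cb      func(key string)
	mapping map[Ref]map[string]struct{}
}

// NewTracker returns a Tracker that calls cb whenever a key must be requeued.
func NewTracker(cb func(key string)) *Tracker {
	return &Tracker{cb: cb, mapping: make(map[Ref]map[string]struct{})}
}

// Track registers key as interested in ref. If key was not already tracking
// ref, it is requeued immediately, so an update that landed between the
// caller's read of ref and this call is not missed.
func (t *Tracker) Track(ref Ref, key string) {
	t.mu.Lock()
	l, found := t.mapping[ref]
	if !found {
		l = make(map[string]struct{})
		t.mapping[ref] = l
	}
	_, ok := l[key]
	l[key] = struct{}{}
	t.mu.Unlock()

	if !ok {
		// New key for this ref: requeue right away (outside the lock).
		t.cb(key)
	}
}

// OnChanged requeues every key that is tracking ref.
func (t *Tracker) OnChanged(ref Ref) {
	t.mu.Lock()
	keys := make([]string, 0, len(t.mapping[ref]))
	for k := range t.mapping[ref] {
		keys = append(keys, k)
	}
	t.mu.Unlock()

	for _, k := range keys {
		t.cb(k)
	}
}

func main() {
	var queue []string
	tr := NewTracker(func(key string) { queue = append(queue, key) })

	tr.Track("configuration/foo", "route/foo") // first Track: requeued
	tr.Track("configuration/foo", "route/foo") // already tracked: no requeue
	tr.OnChanged("configuration/foo")          // normal update path
	fmt.Println(queue)                         // prints [route/foo route/foo]
}
```

The cost is one extra reconcile per newly tracked (ref, key) pair, which seems cheap compared with a Route stuck at RevisionMissing until the next resync.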

Labels: area/API, area/networking, kind/bug