@nader-ziada
Member

add default conditions to PA to avoid potential race conditions
between revision and autoscaler

Fixes #16036

Proposed Changes

  • Add default conditions in PodAutoscaler to avoid any potential race conditions
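
As a rough illustration of what "default conditions" means here (this is not the code from the PR), the sketch below initializes every expected condition to Unknown when the status is first populated, so a reader of the resource never sees a missing condition. The types `Status`, `Condition` and `PAStatus`, the method names, and the literal reason string `"Pending"` are assumptions made for the example; the real definitions live in Knative's autoscaling API package.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for a Kubernetes-style condition status.
type Status string

const (
	StatusTrue    Status = "True"
	StatusFalse   Status = "False"
	StatusUnknown Status = "Unknown"
)

// Condition is a simplified, hypothetical condition record.
type Condition struct {
	Type   string
	Status Status
	Reason string
}

// PAStatus is a hypothetical, pared-down PodAutoscaler status.
type PAStatus struct {
	Conditions []Condition
}

// GetCondition returns the condition of the given type, or nil if absent.
func (s *PAStatus) GetCondition(t string) *Condition {
	for i := range s.Conditions {
		if s.Conditions[i].Type == t {
			return &s.Conditions[i]
		}
	}
	return nil
}

// InitializeConditions adds each missing condition as Unknown with an
// assumed "Pending" reason. Conditions that already exist are left alone.
func (s *PAStatus) InitializeConditions(types ...string) {
	for _, t := range types {
		if s.GetCondition(t) == nil {
			s.Conditions = append(s.Conditions, Condition{
				Type:   t,
				Status: StatusUnknown,
				Reason: "Pending",
			})
		}
	}
}

func main() {
	var s PAStatus
	fmt.Println("Ready present before defaulting:", s.GetCondition("Ready") != nil)

	s.InitializeConditions("Ready", "Active", "ScaleTargetInitialized")
	c := s.GetCondition("Ready")
	fmt.Println("Ready after defaulting:", c.Status, c.Reason)
}
```

With defaults in place, code that previously treated "condition absent" as "nothing has happened yet" has to recognize the defaulted Unknown state instead, which is what the review threads in this PR discuss.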

@knative-prow knative-prow bot requested review from dsimansk and skonto September 10, 2025 17:53
@knative-prow knative-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 10, 2025
@codecov

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 18.18182% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.06%. Comparing base (0902825) to head (1bf2b31).
⚠️ Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/reconciler/testing/v1/factory.go | 0.00% | 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16078      +/-   ##
==========================================
- Coverage   80.13%   80.06%   -0.07%     
==========================================
  Files         214      214              
  Lines       16887    16907      +20     
==========================================
+ Hits        13532    13537       +5     
- Misses       2996     3011      +15     
  Partials      359      359              

@nader-ziada nader-ziada changed the title from "add initialize conditions to MakePA to avoid potential race conditions (2nd attempt)" to "add default conditions to PA to avoid potential race conditions (2nd attempt)" Sep 10, 2025
@nader-ziada
Member Author

/retest

@nader-ziada
Member Author

/test all

@nader-ziada
Member Author

/test all

@nader-ziada
Member Author

/retest

@nader-ziada nader-ziada force-pushed the pa-race branch 2 times, most recently from f9d8138 to cc8c4c2 Compare September 12, 2025 01:40
@nader-ziada
Member Author

/retest

@nader-ziada
Member Author

/test all

@nader-ziada
Member Author

@dprotaso can you take a look? I think I fixed the issue. I ran the tests multiple times to check whether it's flaky.

@dprotaso
Member

@nader-ziada what was the issue?

@nader-ziada
Member Author

The code was not expecting a condition to be present, so it assumed a failure even though the condition was only the default pending one. Check the 2nd commit.

Comment on lines 229 to 230
scaleTargetInitializedCondReason := ps.GetCondition(autoscalingv1alpha1.PodAutoscalerConditionScaleTargetInitialized).GetReason()
if !ps.IsScaleTargetInitialized() && !resUnavailable && ps.ServiceName != "" && scaleTargetInitializedCondReason != autoscalingv1alpha1.PendingReason {
Member

Can you clarify what is happening here? Generally we're not looking at the Reason string but Status.

To get here, PA::Ready=False. Are we trying to set the right error message?

Another way to think about this: can we just look at ScaleTargetInitialized=False to simplify this? Technically, "not IsScaleTargetInitialized" is true when the condition is set to either Unknown or False.
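
For readers following the thread, the tri-state point above can be shown with a small self-contained sketch. It assumes that an "is initialized" helper is equivalent to "the condition is True", so its negation holds for both Unknown and False, whereas an explicit False check holds for only one of them. The types and the helper names `notInitialized` and `explicitlyFailed` are hypothetical and are not Knative's API.

```go
package main

import "fmt"

// Status models the three values a Kubernetes-style condition can take.
type Status string

const (
	StatusTrue    Status = "True"
	StatusFalse   Status = "False"
	StatusUnknown Status = "Unknown"
)

// Condition is a hypothetical, pared-down condition record.
type Condition struct {
	Status Status
}

// IsTrue reports whether the condition is explicitly True.
func (c *Condition) IsTrue() bool { return c != nil && c.Status == StatusTrue }

// IsFalse reports whether the condition is explicitly False.
func (c *Condition) IsFalse() bool { return c != nil && c.Status == StatusFalse }

// notInitialized mirrors a "!IsScaleTargetInitialized()" style check:
// it holds for both Unknown and False.
func notInitialized(c *Condition) bool { return !c.IsTrue() }

// explicitlyFailed holds only when the condition is False.
func explicitlyFailed(c *Condition) bool { return c.IsFalse() }

func main() {
	for _, s := range []Status{StatusTrue, StatusUnknown, StatusFalse} {
		c := &Condition{Status: s}
		fmt.Printf("%-7s notInitialized=%-5t explicitlyFailed=%t\n",
			s, notInitialized(c), explicitlyFailed(c))
	}
}
```

The two helpers disagree only for Unknown, which is exactly the state a freshly defaulted condition is in.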

Member Author

I think the new situation here is that the condition is now there on the creation of the resource with the generated defaulting, so I wasn't sure how else to differentiate between the default Unknown (with reason Pending) vs Unknown because something happened.

Member

default Unknown (with reason Pending) vs Unknown because something happened.

I would view these two scenarios as being the same

Comment on lines 275 to 279
cond := pa.Status.GetCondition(autoscalingv1alpha1.PodAutoscalerConditionReady)
if cond.IsUnknown() && cond.GetReason() == autoscalingv1alpha1.PendingReason {
	// still at default PA condition, no need to mark anything
	return
}
Member

Can you comment why we added this? It's not clear why we want to gate setting the Active condition on Ready all of a sudden

Member Author

I'm basically trying to skip the check if it's the default condition with Reason = Pending, which is now set by default when the resource is created. That is different than just Unknown, which would cause the PA to be set to Inactive.

Member

which is different than just Unknown, which would cause the PA to be set to Inactive

Unknown means it's deploying or doing whatever. The reason doesn't really matter. If we were setting the status to Inactive while it was unknown then I'd probably continue doing that.

Was this triggering some failures?

Member Author

yes, it was setting the PA ready to false which made the revision get into the "Initial scale was never achieved" error

@nader-ziada nader-ziada changed the title from "add default conditions to PA to avoid potential race conditions (2nd attempt)" to "[wip] add default conditions to PA to avoid potential race conditions (2nd attempt)" Sep 15, 2025
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 15, 2025
@nader-ziada nader-ziada force-pushed the pa-race branch 3 times, most recently from c09b0d0 to d390044 Compare September 16, 2025 22:33
@nader-ziada
Member Author

/retest

@nader-ziada
Member Author

/test all

@nader-ziada
Member Author

/retest

@nader-ziada
Member Author

/test all

@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 25, 2025
@nader-ziada
Member Author

/retest

@nader-ziada
Member Author

/test all

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 26, 2025
@nader-ziada
Member Author

/retest

@nader-ziada
Member Author

/test istio-latest-no-mesh

@nader-ziada nader-ziada changed the title from "[wip] add default conditions to PA to avoid potential race conditions (2nd attempt)" to "add default conditions to PA to avoid potential race conditions (2nd attempt)" Sep 26, 2025
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2025
@nader-ziada
Member Author

@dprotaso @dsimansk I think this is ready for review now; it passed the last 4 test runs.

Comment on lines 913 to 917
func withRevisionConditionsGivenPADefault(r *v1.Revision) {
	WithInitRevConditions(r)
	r.Status.MarkActiveUnknown("Deploying", "")
}

Member

This looks identical to allUnknownConditions. Can we remove the extra function and simplify the diff?

https://github.com/nader-ziada/serving/blob/f1341d21581b613facc929cb3ed0bad197eafd53/pkg/reconciler/revision/table_test.go#L905-L909

Member Author

yeah, will remove that


- if cond == nil {
+ sksCondition := ps.GetCondition(autoscalingv1alpha1.PodAutoscalerConditionSKSReady)
+ if cond == nil || (cond.IsUnknown() && sksCondition != nil && sksCondition.IsUnknown()) {
Member

Can you explain why we need the extra clauses?

I would think we could simply have cond == nil || cond.IsUnknown()

Member Author

I believe there were cases where, now that the PA has conditions from the start, it was not waiting for the service to become ready. So I was trying to reproduce the behavior from before the PA conditions were defaulted.
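
To make the difference between the two predicates in this thread concrete, here is a minimal sketch that evaluates both forms side by side. It uses only the boolean expressions quoted above; the `Condition` type, the nil-tolerant `IsUnknown` method, and the function names `simpleCheck` and `guardedCheck` are assumptions for illustration and are not taken from the Knative code base.

```go
package main

import "fmt"

// Status models the three values a Kubernetes-style condition can take.
type Status string

const (
	StatusTrue    Status = "True"
	StatusFalse   Status = "False"
	StatusUnknown Status = "Unknown"
)

// Condition is a hypothetical, pared-down condition record.
type Condition struct {
	Status Status
}

// IsUnknown treats a missing (nil) condition as Unknown; that choice is an
// assumption of this sketch.
func (c *Condition) IsUnknown() bool { return c == nil || c.Status == StatusUnknown }

// simpleCheck is the form suggested in review: cond == nil || cond.IsUnknown().
func simpleCheck(cond *Condition) bool {
	return cond == nil || cond.IsUnknown()
}

// guardedCheck is the form from the diff: an Unknown cond only counts when
// the SKS condition is present and also Unknown.
func guardedCheck(cond, sks *Condition) bool {
	return cond == nil || (cond.IsUnknown() && sks != nil && sks.IsUnknown())
}

func main() {
	unknown := &Condition{Status: StatusUnknown}
	ready := &Condition{Status: StatusTrue}

	// Both forms agree when the SKS condition is also Unknown.
	fmt.Println(simpleCheck(unknown), guardedCheck(unknown, unknown)) // true true
	// They diverge when the SKS condition is True or absent.
	fmt.Println(simpleCheck(unknown), guardedCheck(unknown, ready)) // true false
	fmt.Println(simpleCheck(unknown), guardedCheck(unknown, nil))   // true false
}
```

The guarded form is strictly narrower: it stops treating an Unknown condition as "not yet set" once the SKS condition has either resolved or is missing.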

  // unavailable here, and have no way of recovering later.
  // If the ResourcesAvailable is already false, don't override the message.
- if !ps.IsScaleTargetInitialized() && !resUnavailable && ps.ServiceName != "" {
+ if !ps.IsScaleTargetInitialized() && !resUnavailable && ps.ServiceName != "" && !sksCondition.IsUnknown() {
Member

Can you comment on why we want to add an additional check here?

Probably worth updating the comment as well

Member Author

Same as the other comment: I believe there were cases where, now that the PA has conditions from the start, it was not waiting for the service to become ready. So I was trying to reproduce the behavior from before the PA conditions were defaulted.

between revision and autoscaler

fix revision lifecycle check for conditions
@nader-ziada
Member Author

/retest

@nader-ziada
Member Author

@dprotaso added comments

@dprotaso
Member

dprotaso commented Oct 2, 2025

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2025
@knative-prow

knative-prow bot commented Oct 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, nader-ziada

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2025
@knative-prow knative-prow bot merged commit 561f348 into knative:main Oct 2, 2025
111 of 113 checks passed
dsimansk pushed a commit to dsimansk/serving that referenced this pull request Oct 3, 2025
knative#16078)

between revision and autoscaler

fix revision lifecycle check for conditions
openshift-merge-bot bot pushed a commit to openshift-knative/serving that referenced this pull request Oct 3, 2025
#1598)

* Add default conditions to create PA to avoid potential race conditions (knative#16078)

between revision and autoscaler

fix revision lifecycle check for conditions

* Regen release files

---------

Co-authored-by: Nader Ziada <[email protected]>
dprotaso added a commit to dprotaso/serving that referenced this pull request Oct 6, 2025
knative-prow bot pushed a commit that referenced this pull request Oct 7, 2025

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Potential race condition in PodAutoscaler creation

2 participants