KEP-5229: Asynchronous API calls during scheduling #5249
base: master

Conversation
macsko commented Apr 16, 2025
- One-line PR description: Add KEP-5229
- Issue link: Asynchronous API calls during scheduling #5229
- Other comments:
/cc @dom4ha @sanposhiho I have written some possible approaches to these API calls to start the discussion. I will not be visibly active for the next three weeks, but feel free to comment.
- `nominatedNodeName` scenario support would require more effort in (1.1) or (1.2).

#### 2.2: Make the API calls queued
I feel like queueing API calls might be a good direction long term and probably won't be that hard to implement.
Great summary! I'll take time reading through it and put some comments after that.
This would require implementing advanced logic for queueing API calls in the kube-scheduler and migrating **all** pod-based API calls done during scheduling to this method,
potentially including the binding API call. The new component should be able to resolve any conflicts in the incoming API calls as well as parallelize them properly,
e.g., don't parallelize two updates of the same pod. This requires [making the API calls queued](#22-make-the-api-calls-queued) or
[sending API calls through a kube-scheduler's cache](#23-send-api-calls-through-a-kube-schedulers-cache) to be implemented.
Conceptually what you need here is something pretty similar to DeltaFIFO we use in client-go:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go
We wouldn't operate on objects, but on their updates; conceptually, though, that's exactly what you need:
- tracking all changes to a given object together
- a concept of deduping (if I already scheduled a pod but didn't yet send the failures, don't send them; if I have multiple failures not yet sent, is only the last one valid?, etc.)
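The deduping idea above could be sketched as a per-pod coalescing step over queued updates (a rough illustration, not KEP code; the `kind` strings and rules are assumptions drawn from this discussion):

```go
package main

// update is a hypothetical queued mutation for one pod; the kinds are
// illustrative stand-ins for status update, binding, and preemption.
type update struct {
	kind string // "status", "bind", "delete"
}

// coalesce applies last-wins/supersede rules to one pod's queued
// updates, mirroring the DeltaFIFO-style dedup idea:
//   - several "status" updates collapse into the last one
//   - "delete" (preemption) makes every earlier call irrelevant
//   - "bind" supersedes earlier pending "status" updates
func coalesce(queued []update) []update {
	var out []update
	for _, u := range queued {
		switch u.kind {
		case "delete":
			out = out[:0] // nothing earlier matters once the pod is gone
			out = append(out, u)
		case "bind":
			// drop pending status updates; the pod is being bound
			filtered := out[:0]
			for _, q := range out {
				if q.kind != "status" {
					filtered = append(filtered, q)
				}
			}
			out = append(filtered, u)
		case "status":
			// last-wins: replace an earlier pending status update
			replaced := false
			for i, q := range out {
				if q.kind == "status" {
					out[i] = u
					replaced = true
				}
			}
			if !replaced {
				out = append(out, u)
			}
		}
	}
	return out
}
```

Whether these exact rules are safe is precisely what the thread below debates (e.g., binding vs. a pending preemption), so treat them as a starting point, not a conclusion.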
Sounds promising, but after a brief look it's not super clear to me how it works and whether it addresses all the problems.
The logic on the scheduler side may still be quite complex depending on which type of update is pending. For instance, I suspect some of them may block re-scheduling, but some may not.
I think my comment could have been misleading.
It was supposed to be a meta-comment - I don't think we can really reuse any code from DeltaFIFO.
What I was trying to say is that it's effectively the same conceptual pattern that DeltaFIFO is using.
> The logic on the scheduler side may still be quite complex depending on which type of update is pending. For instance I suspect some of them may be blocking re-scheduling, but some not.

That's interesting - in particular:
- what are the examples where an API call should prevent further attempts to do something with a given pod? [The only one I can see is preemption, but it's not even for a given pod, so it doesn't belong to this category]
- is there any case where, if I have queued two different API calls for a given pod P, we actually want to send both, not just the last one?
> - what are the examples where an API call should prevent further attempts to do something with a given pod? [The only one I can see is preemption, but it's not even for a given pod, so it doesn't belong to this category]
> - is there any case where, if I have queued two different API calls for a given pod P, we actually want to send both, not just the last one?

I don't know the answer off the top of my head, but it's not obvious to me yet that all updates can be skipped and that we don't need to wait for updates to be persisted. That's why adding an analysis section and a categorization of the API calls is needed.
One example that comes to mind is preemption of a just-bound pod. It would be safer to let the binding call happen first before we start the preemption process, to avoid a situation where we preempt something not yet bound.
> One example comes to my mind is preemption of just-bound pod. It would be safer to let the binding call happen first before we could start preemption process, to avoid a situation we preempt something not bound yet.

That's an interesting scenario - I just don't think we should let the binding happen there. We probably need smarter deduping logic.
If we know that we will preempt the pod - there is no point in doing the binding. We need to think about what it means from the "pre-binding" perspective (how to ensure we will not leak something), but directly moving the pod to a failed state seems better than letting kubelet start it and bring it down immediately after.
> One example comes to my mind is preemption of just-bound pod. It would be safer to let the binding call happen first before we could start preemption process, to avoid a situation we preempt something not bound yet.

I guess it actually happens today, because we're running the binding cycle asynchronously.
Let's say pod-1 goes to the binding cycle (after "assume"), pod-2 triggers the preemption, and then pod-2 could preempt pod-1 before pod-1's binding cycle completes.
And what happens today (I guess) is that, because pod-1 is deleted, the binding cycle for pod-1 would just fail.
But that is a good example showing the difficulty of this KEP: possibly not only API calls for the same object, but also API calls for different objects could depend on one another...
> - what are the examples where an API call should prevent further attempts to do something with a given pod?

I haven't found any such API call. Even preemption should not be a problem.
Another thing to consider is how to update the Pod's status in the scheduler's memory. Now, since the API calls are blocking, we don't need to persist the status change before the call, because the event handlers will soon receive the Pod/Update event with the current status and update the Pod object there.

> - is there any case when if I have queued two different API calls for a given pod P - we actually want to send both, not just the last one?

We have 3 kinds of API calls for a Pod in the scheduler: deletion (preemption), binding, and status update (unschedulable or NNN). There is no case where we would want to send two of these calls for a single Pod (see the API calls categorization section).
keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/kep.yaml
Making one universal approach to handling API calls in the kube-scheduler could allow these calls to be consistent, as well as better control
the number of dispatched goroutines. Asynchronous preemption could also be migrated to this approach.

### Goals
Looking from the high level, I think we may have the following goals (although it's not clear we're able to achieve all of them):
- Make scheduling cycle free of blocking api-calls (any async option is fine)
- Skip some type of updates if they soon become irrelevant by consecutive updates
- Prioritize high importance updates (binding) over low importance ones if updates to the api-server gets throttled
Maybe we have more, but I think agreeing first on our goals is important before we can select a solution. One of my doubts here is that we may not yet know all the requirements we will have once we start designing workload-aware scheduling and reservations. More API calls may appear at that time.
+1 to it, but I think there is also a question about priorities. In my mental model:
- eliminating blocking calls is P0
- skipping soon-to-be-irrelevant updates is P1
- prioritization of updates is nice to have [I would like that, but it's not necessary if it adds a lot of complexity]
I feel like the third one is the next step, and we can just scope it out for now, at least from this KEP. Of course, we should keep it in mind when discussing this KEP, though.
I agree that all three are important; it's just that we don't have to solve all of them at once in one KEP. The third one looks complex enough to deserve being discussed in another KEP.
which seems to be roughly aligned with what I wrote above
Agree with the above; the order was also on purpose.
keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/README.md
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: macsko. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I added a section with API calls categorization and adjusted the goals. PTAL. To go further, we could probably agree on some high-level concept (queuing without blocking a pod?).
if the newest status is stored in-memory.
- API calls for non-Pod resources (5 - 10) should be further analyzed as they are not likely to interfere with the Pod-based API calls,
hence implementing those shouldn't block making (1 - 3) calls asynchronous.
Is there any relevance order between calls for different pods?
I think there isn't (i.e. we don't require NNN to be set before preempting the first victim).
But even if there isn't - it would make sense to call it out explicitly here.
> Is there any relevance order between calls for different pods?

I think there is, because we need to protect the space freed up by preempted pods for the nominated pod. Sure, pods don't disappear immediately, but we shouldn't rely on it. That obviously puts into question whether we can prioritize some updates over others... which could change the order.
Alternatively, we may still want to treat a preemption as one async operation that includes setting the nomination and removing pods in the right sequence.
> because we need to protect the space freed up by preempted pods for the nominated pod.

But we store NNN in the nominator when we update NNN, i.e., even if we make the API call async, as long as we report NNN to the nominator synchronously, the space should be reserved and not stolen by other pods' scheduling cycles.
So I believe we can simply make this NNN API call in the preemption asynchronous.
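A minimal sketch of that ordering (illustrative names only, not scheduler code): the in-memory nominator is updated synchronously, so the reservation is visible to the next scheduling cycle before the API call is even sent.

```go
package main

import "sync"

// nominator is a minimal stand-in for the scheduler's in-memory
// nomination record mentioned in the comment above.
type nominator struct {
	mu    sync.Mutex
	byPod map[string]string // pod UID -> nominated node name
}

func (n *nominator) nominate(podUID, node string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.byPod[podUID] = node
}

func (n *nominator) nominatedNode(podUID string) string {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.byPod[podUID]
}

// setNominatedNodeName records NNN in memory synchronously, so later
// scheduling cycles see the reservation immediately, then issues the
// API call asynchronously (simulated here by a goroutine and channel).
func setNominatedNodeName(n *nominator, calls chan<- string, podUID, node string) {
	n.nominate(podUID, node) // synchronous: space is reserved at once
	go func() {
		calls <- podUID + "->" + node // async: persist via the API
	}()
}
```

The point of the sketch is only the ordering guarantee: the in-memory write happens before `setNominatedNodeName` returns, while persistence may lag.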
Or, if we go with the scheduler cache idea, we don't have to worry at all about that kind of scenario. Any updates are reflected in the scheduler cache synchronously, and things that happen later can immediately refer to the update in the scheduler cache (regardless of whether the API call is done asynchronously or not).
Unless the scheduler restarts in the meantime and we lose the in-memory state.

> Alternatively we may still want to treat a preemption as one async operation that includes setting nomination and removing pods in the right sequence.

which effectively introduces the ordering.
The question is whether in-memory state is good enough, at least initially.
I referred only to the ordering of api calls, not cache updates, which in turn can become fully synchronous again (we achieve the non-blocking goal by async api calls only).
> Is there any relevance order between calls for different pods?

I'd say there isn't. Any two API calls for different pods made by the scheduler are independent, and I don't see any counter-example.

> I believe we can make this NNN API call in the preemption asynchronous, simply.

That's right.

> The question is whether in-memory state is good enough, at least initially.

WDYM by "initially" here?
- Updating Pod status (1, 2) could be less important and called if there is space for it.
It's worth considering if setting `nominatedNodeName` (3) should have the same priority or higher,
because the higher delay might affect other components like Cluster Autoscaler.
- API calls for non-Pod resources (5 - 10) could be analyzed case by case, but are likely less important than (5) and (4).
nit: 6-10
Also, I'm not sure I agree with it. As an example, volume binding is on the critical path for pod binding, so delaying volume binding delays pod binding, which affects overall pod startup. That is probably what we should actually optimize for.
+1. I actually rather think API calls for non-Pod resources are likely equally important to 4 and 5 (at least for the current default scheduler) because, like described in 6-10 above, most of them are in the binding cycles, which are on the critical path, or PostFilter (similar to preemption - making space for pending pods).
But, the difficulty here is that we cannot just say "the API calls for non pods are likely important -> let's prioritize them" because obviously there'll be exceptions especially when it comes to the scheduler with custom plugins.
So, I basically agree with the KEP saying that this is case by case actually.
I guess it leads to another question: should we consider API calls being made within binding cycles? They're already async, so can we just treat them as a single async group for multiple API calls? And, in this KEP, we only need to add a canceling functionality in the binding cycles.
> I guess it leads to another question: should we consider API calls being made within binding cycles?

Maybe I misunderstood your intention here, but if we don't consider them, we will face races due to these calls not being coordinated with the others. If the pod was earlier unschedulable but the call was queued for too long, we need to cancel it once binding starts. So we need coordination between these, and from that perspective, having a single place to handle everything sounds better.
> I guess it leads to another question: should we consider API calls being made within binding cycles?

I think we need to consider pod-based API calls (status update, binding, preemption) to be able to effectively cancel/skip them. Another aspect is the 6 - 10 calls, which I think don't need to be considered, at least initially. The only concern might be DRA's PostFilter, which can make a blocking API call (but then we would probably need to make async API calls for all ResourceClaim calls, for reasons similar to the Pod ones).
- Simplifies introducing new API calls to the kube-scheduler if the collision handling logic is configured correctly.

Cons:
- Requires implementing complex, advanced queueing logic.
I would say it's debatable how complex that really is - my subjective claim is that it's actually not that complex.
It really depends on what we're doing with this queueing logic. If it's just going to be a queue, it's obviously simple.
But, as mentioned at 2.2, if we are going to merge multiple operations into one, things will get complicated.
- Cannot be used for the `nominatedNodeName` scenario, requiring additional effort and separate handling.

#### 1.3: Use advanced queue and don't block the pod from being scheduled in the meantime
I'm personally strongly for this option - because it allows handling all calls the same way and optimizes for latency.
I think the additional up-front complexity (which, btw, will be well encapsulated in a single place) is a cost that we should be willing to pay for the fact that it becomes unified and simple to use.
I also agree that this should be the way to go. If we reach consensus here, I'd move the other options to Alternatives considered and present one proposal.
- `nominatedNodeName` scenario support would require more effort in (1.1) or (1.2).

#### 2.2: Make the API calls queued
I believe this should be the path forward - the primary reason against 2.3 is that this option allows encapsulating the logic in a single place, so the burden is not spread across different (and arguably much harder to reason about) places. And it allows avoiding races and enables optimizations (compared to 2.1).
On the other hand, how do we handle external changes? The obvious case is the disappearance of a pod that was supposed to be preempted. That's probably quite an easy case, as we can skip our update, but what if the nominated node name changes once we allow external actors to change it?
Things may change behind the scenes - so there might be races. But for a particular object, relying on optimistic concurrency should just work.
If you're worried about the scheduler preempting something and an external actor changing NNN in the meantime - that has an inherent race in it no matter what we do. We may try to minimize the window, but it will always be there, and I'm not sure I would actually try to optimize for it.
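The "optimistic concurrency should just work" point can be illustrated with a toy resourceVersion check plus a retry-on-conflict loop (a sketch; real apiserver semantics are richer, and all names here are invented for illustration):

```go
package main

import "errors"

// errConflict stands in for an apiserver 409 on a stale resourceVersion.
var errConflict = errors.New("conflict: stale resourceVersion")

// obj is a toy object with optimistic-concurrency versioning.
type obj struct {
	resourceVersion int
	nominatedNode   string
}

// store is a toy apiserver: an update succeeds only if it was computed
// against the current version of the object.
type store struct{ current obj }

func (s *store) get() obj { return s.current }

func (s *store) update(o obj) error {
	if o.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.current = o
	return nil
}

// updateWithRetry re-reads and reapplies the mutation on conflict - the
// standard optimistic-concurrency pattern the comment relies on.
func updateWithRetry(s *store, mutate func(*obj)) error {
	for i := 0; i < 5; i++ {
		o := s.get()
		mutate(&o) // reapply the intended change to the fresh object
		if err := s.update(o); err == nil {
			return nil
		} else if !errors.Is(err, errConflict) {
			return err
		}
	}
	return errors.New("gave up after repeated conflicts")
}
```

The inherent NNN race discussed above is exactly what this pattern does not remove: it only guarantees that each individual write lands on a fresh object, not that the decision behind the write is still valid.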
Update - after thinking about it I changed my mind and we may prefer 2.3 actually: #5249 (comment)
Both of the above API calls could be migrated to the new mechanism.

In-tree plugins' operations that involve non-pod API calls during scheduling and could be made asynchronous,
but not necessarily in the first place:
I think we may want to consider them from the very beginning, because when canceling pod binding, we probably want to cancel any pre-binding calls as well.
Moreover, we need to consider the changes planned in the Extended Resource KEP, so putting them together will be an additional challenge.
> because when canceling pod binding, we probably want to cancel any pre-binding calls as well

When do we cancel pod binding?
Never mind, I can now guess you meant the following, described later in this KEP:
> - Pod deletion caused by preemption (4) should cancel all Pod-based API calls for such a Pod.
> I think we may want to consider them from the very beginning, because when canceling pod binding, we probably want to cancel any pre-binding calls as well.

Yes, that could be an optimization, but it's not necessary.
Btw, today when pod Y is in the PreBind stage and pod X wants to preempt pod Y, the pod Y delete API call is sent to the apiserver. If pod Y were in WaitOnPermit, preemption would just cancel it, without sending the API call. Shouldn't we optimize similarly for PreBind, given that the number of use cases for PreBind (DRA) is growing? Do we have any reason against it?
- P0: Make the scheduling cycle free of blocking API calls, i.e., make all API calls asynchronous.
- P0: Make the solution extendable for future use cases.
- P1: Skip some types of updates if they soon become irrelevant by consecutive updates.
- Nice to have: Prioritize high-importance updates (like binding) over low-importance ones if updates to the kube-apiserver get throttled.
Note before merging this KEP: for things like the P1 and nice-to-have items, if we end up not addressing them in this KEP, we can just remove them as out of scope and track them separately, probably by creating an issue in k/k.
- Allows the pod to be scheduled again even before the API call completes.
- Simplifies introducing new API calls to the kube-scheduler if the collision handling logic is configured correctly.

Cons:
Plus, Memory consumption?
I think all approaches will somehow result in increased memory consumption. For me, it's even hard to say which option will affect it the most.
I agree that, more or less, each option needs to consume memory, and it's hard to tell how much bigger each option's memory increase is, at least until we go into more detail. But we at least need to mention it with a rough estimate, like:
- The maximum memory consumption of the queue would be the pending API calls (if we implement it very simply).
- The maximum memory consumption of the cache would just be the number of cached resources.
> - The maximum memory consumption of the cache would just be the number of cached resources.

Not really. As I brought up in another comment, we might need more information to merge the objects in the cache (when some external change happens), e.g., a delta, similar to the queue. And we might even need to buffer some call(s) when previous ones for the same pod(s) are in flight. So I don't want to model it without more details.
- Needs a clear strategy for how to update the in-memory pod object during scheduling.

#### 2.3: Send API calls through a kube-scheduler's cache
I'm on this side. It's generally close to what we're doing today: assume() mechanism in the scheduling cache, or the nominator mechanism.
Also, it's a lot easier to merge multiple updates into a single one (when they're for the same object). We always synchronously update the scheduler cache, and the cache asynchronously makes an API call under the hood. Even if there are two or more modifications made through the scheduler cache, the async worker can just make an API call with a final object state. And, the scheduler can work as if those modifications are immediately applied by referring to the scheduler cache (which, again, we're doing today.)
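A minimal sketch of that cache-backed flow (names are illustrative, not scheduler code): mutations land in the in-memory state synchronously, and an async flush later sends only the final merged state, one call per dirty pod, no matter how many modifications happened.

```go
package main

import "sync"

// cachingWriter sketches the "send API calls through the cache" idea:
// every mutation is applied to the in-memory object synchronously, and
// an async worker pushes whatever the final state is.
type cachingWriter struct {
	mu    sync.Mutex
	state map[string]string // pod UID -> latest desired status
	dirty map[string]bool   // pods with un-persisted changes
}

// apply updates the cache synchronously; later scheduling cycles can
// read the new state immediately, before any API call is made.
func (w *cachingWriter) apply(podUID, status string) {
	w.mu.Lock()
	w.state[podUID] = status
	w.dirty[podUID] = true
	w.mu.Unlock()
}

// flush is what the async worker runs: one API call per dirty pod,
// carrying only the final merged state. Returns the number of calls.
func (w *cachingWriter) flush(send func(podUID, status string)) int {
	w.mu.Lock()
	defer w.mu.Unlock()
	n := 0
	for uid := range w.dirty {
		send(uid, w.state[uid])
		delete(w.dirty, uid)
		n++
	}
	return n
}
```

This also shows the coalescing property claimed above: two `apply` calls for one pod collapse into a single outgoing call, without any explicit dedup rules on the call queue.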
OTOH, what I don't like about the queue mechanism is, first, how to merge multiple updates into one.
The KEP says that, regarding patches, we can apply the latest one. Really? Always? What if several modifications conflict? Also, if several fields are updated, should we compute a new patch diff every time?
Regarding binding, it says "Ignore status update API calls". Can we do that always for sure? For example, what if users want to add some conditions to the pods before binding phase?
Something like that. So, I don't want to make a special rule to ignore/overwrite changes on the object, based on what we have at k/k.
Also, the second thing is what if users want to operate something at some point, and want to use it in another point? NominatedNodeName is a perfect example for this. The scheduler wants to put NNN so that the following scheduling cycles can refer to it. I know the scheduler, even today, has the nominator mechanism to cache it because the event handler doesn't give a NNN update immediately after making an API call. But, if we go with this cache option, we can generally solve that problem.
Rather, the queueing option would amplify this problem, because there could be multiple modifications in flight, which are invisible to anyone until they're actually applied.
I know I raised too many "what ifs". But, again, given that this mechanism is a standard, general way to make API calls from the scheduler, we shouldn't put in any assumptions based on the in-tree plugins only.
OK - thinking more about that, I think you're right. I think the main argument is that we want to update scheduler cache with that information anyway. So my argument from #5249 (comment) doesn't really work, because we need to touch the scheduler cache anyway.
Yes, the caching approach looks cleaner. But there are also open questions around it:
The scheduler's cache is not uniform. Pending pods are stored in the scheduling queue, bound pods in the cache, and nominations in the nominator, so we would need to unify them, or the logic will be spread between multiple components in the scheduler.
We also have to figure out how to merge the pod object in the cache with the updated pod object received in the event handler. In the queueing approach, we could store the updates to the pod that we want to make and apply them to the newest object when making the API call. When using the cache, we would also have to know in which order to apply the updates, so again store the updates (besides the objects themselves) or do some magic.
Also, the second thing is what if users want to operate something at some point, and want to use it in another point? NominatedNodeName is a perfect example for this. The scheduler wants to put NNN so that the following scheduling cycles can refer to it. I know the scheduler, even today, has the nominator mechanism to cache it because the event handler doesn't give a NNN update immediately after making an API call. But, if we go with this cache option, we can generally solve that problem.
I agree the queuing is also not ideal - we won't see the newest Pod object immediately. But, is there a case where we need to see it, other than setting NNN? Currently, we could also operate on older objects, even if the calls are synchronous.
The KEP describes, regarding patch, they can apply the latest one. Really? Always? What if several modifications are conflicting?
I have to clarify that part. Obviously, we can't just ignore the previous call. The simplest example is when one call was setting the unschedulable status and the second one was setting NNN. We should merge these two updates then. How? By keeping the latest entries (fields) to update (e.g., the condition to apply and the latest NNN). If there are conflicts, we should prefer to apply the latest update, right?
Another question is, if we should consider any (potential) updates to a Pod, even if scheduler is not supposed to make such calls?
Also, if several fields are updated, should we compute a new patch diff every time?
I believe we should compute the patch when making the API call, not earlier.
Scheduler's cache is not uniform. Pending pods are stored in scheduling queue, bound pods are stored in cache, nominations in nominator, so we would need to unify it or the logic will be spread between multiple components in the scheduler.
Right, we should basically unify all the resources' caches into one.
But, I guess there'll still be some exceptions even after that; the nominator is the one because it's not a cache exactly, it's actually more like an index to pre-compute which nodes have which nominated pods.
We also have to find out how to merge the pod object in cache with updated pod object received in event handler.
As pointed out at #5249 (comment), "how to handle the conflicts between changes the scheduler wants to make (but awaiting the scheduler to actually make an API call) vs the changes coming from outside": this looks like a general problem to me, regardless of which options we take.
You said:
In queueing approach we could store the updates to the pod that we want to make and apply them on the newest object when making the API call.
but, I doubt that. Because that means the updates stored in the queue could overwrite the latest updates that the external component might make, if updates are to the same fields.
But, again, I agree, even in the scheduler cache way, "how to merge" is a hard problem, for which I haven't come up with any good solution.
I agree the queuing is also not ideal - we won't see the newest Pod object immediately. But, is there a case where we need to see it, other than setting NNN?
I was imagining something outside k/k, like, for example, a quota management with CRD at the scheduler (ElasticQuota as an example).
However,
Currently, we also could operate on an older objects, even if the calls are synchronous.
Yes, you're right. As I mentioned, it happens even today, and we had to implement the solution with NNN. ElasticQuota as well; it looks like they are computing with the internal cache.
So, I guess the point here would be more like an improvement from the current scheduler: we can generally solve such a problem, which we cannot with the queue idea.
I have to clarify that part. [...] By keeping the latest entries (fields) to update (e.g., condition to apply and the latest NNN).
Isn't that similar to the cache idea? To me, "keeping the latest entries (fields)" sounds like "keeping the latest object" like a cache.
Another question is, if we should consider any (potential) updates to a Pod, even if scheduler is not supposed to make such calls?
I guess the point is more general: should we consider any updates to any objects?
When we initially discussed only NNN several months ago, I was personally thinking that we could just focus on the current use case in the upstream default scheduler.
But, as reviewing many future enhancements lately, my feeling has been shifting: we've seen several KEP changes that might introduce a new API call from the scheduler. Also, even ourselves, we'll have to introduce some new resources to achieve the workload scheduling.
Based on that recent situation, I think we should consider this feature as a general mechanism that might be used by any kind of updates and any kind of resources, even including custom resources.
And, hence, we should try not to make this mechanism too specific to what kube-scheduler currently does.
That's the ideal goal I'm seeing. But, I'm not sure if the ideal is too far or within reach. So, that said, if that's too far to start with, we can start the design only based on what we're doing at k/k, considering only the current API calls that the scheduler makes.
do we have other fields which can be modified by different components? For me it's one of the anti-patterns
Right, on the other hand, it's just that we cannot say there's absolutely no field of any kind of objects (incl custom resources) that might want/need to do that.
As I stated lastly, if we, for now, give up a true "general" solution, and only consider the things in the default kube-scheduler only, then I think it's fine to assume that the scheduler's updating fields aren't conflict with others.
Yes, we still have to use the queue, to some extent.
I was thinking of something like a certain number of workers working behind the cache: when the object in the cache is updated, we store the object's key in the queue (so the queue's nature is different), and then the workers keep checking the queue and making API calls.
Okay, I see many similarities between the cache proposal and the queue proposal. I did some prototyping with the queue, and the main difference from the cache will be how to send the change (explicitly through the queue or implicitly through the cache) and how to store the object (or the delta) for the API call. The further part (API call execution) would be similar. I'll do a deeper analysis and add a section to the KEP next week to continue the more detailed discussion on both approaches.
I'm just not sure if that's 100% true and whether we can put in the assumption that all scheduler plugins (incl. custom ones) don't update the fields that external components might update, or that even if some fields might be updated by external components, the scheduler's update should always win. Even if the scheduler's update was computed a few tens of seconds ago.
I think without some assumptions we won't be able to provide any rational merging mechanism.
Okay, I see many similarities between cache proposal and queue proposal. I made some prototyping with the queue and the main difference with the cache will be the part how to send the change (explicitly through queue or implicitly through cache) and how to store the object (or the delta) for the API call.
Right. And the other core point (technically included in your explanation though) is whether to use the objects (that might contain not-yet-applied changes) in the scheduling flow or not.
I think without some assumptions we won't be able to provide any rational merging mechanism.
If we cannot find a good merging mechanism, another option is just to treat such a conflict scenario as a failure case. Actually that might make sense, especially since "single field is managed by several components" is not generally a typical thing in the first place, like @dom4ha mentioned.
So, in summary, we have these options for this conflict issue:
- Only support API calls from the default scheduler. (i.e., we can assume no operation from the scheduler will be conflicting, except nnn?)
- Try to support any kind of API calls of any objects, considering there might be custom plugins that want to use those. And, at the conflict, we have two options:
- Somehow merge the changes. (how-to is unclear, need an investigation/idea)
- Regard it as a failure mode.
Regarding a failure mode though, regardless of how we proceed, we should discuss how to react when the async update fails. Should we keep trying to update until it's successful (if it's a retryable error)? Should we somehow propagate the failure to the pending pod that triggered the API call and do something with it?
- no operation from the scheduler will be conflicting, except nnn?
And setting pod condition if we want to apply the newest.
- Somehow merge the changes
I believe that in this approach we wouldn't be able to merge all possible API calls, but we should allow the custom plugins to provide their own merging/conflict-resolution mechanism. Then, we support built-in scenarios natively, but leave a framework for the custom ones to extend. Ultimately, it depends on how likely it is to have serious merging conflicts that we won't be able to resolve easily.
Regarding a failure mode though, actually, with any options, we should discuss how to react when the update fails.
Yes, that's another thing worth considering (I even put it in the KEP, "How to handle asynchronous API errors?", not to forget about it). In particular, some calls might prefer different error resolution mechanisms than others.
Should we keep trying to update until it's successful?
But, if the failure is because of conflicting API calls, we won't be able to retry.
we should allow the custom plugins to provide their own merging/conflicts resolution mechanism.
This is a good abstraction idea.
Let's say it's called "merger"; we can have one merger per resource. We can implement default mergers for K8s resources that we're using at kube-scheduler, and we allow users to implement their own mergers for their resources, or if they need to update K8s default resources in a different manner, they can still implement custom mergers and disable default ones.
But, if the failure is because of conflicting API calls, we won't be able to retry.
Yes, and also we cannot keep trying it indefinitely even for retryable errors. We need some proper way to handling those.
- Requires the cache to handle and merge updates coming from both the kube-scheduler's internal actions and external API events.
- The cache currently only stores bound pods, requiring integration with the scheduling queue for pending pods.
- Complex logic is needed to handle external updates arriving while an internal update is pending or in progress.
Regarding the first and the third cons: They're mentioned as cons here. However, this is a general problem, not specific to this option since no option gracefully solves how to handle the conflicts between changes the scheduler wants to make (but awaiting the scheduler to actually make an API call) vs the changes coming from outside.
When such a conflict happens, which updates should be prioritized depends on which fields have to be updated and for what purpose. That's a problem I don't see an answer to yet.
- Needs a clear strategy for how to update the in-memory pod object during scheduling.

#### 2.3: Send API calls through a kube-scheduler's cache
By the way, this cache idea actually looks like part of "1: Where and how to handle API calls in the kube-scheduler" as well?
Maybe it's not clear what to discuss in 1 and what to discuss in 2. Because, as we see, we're describing the queue thingy in both sections.
Right, probably the topic separation is not the best (2.2 and 2.3 are interconnected with 1.3). I'm not sure how we could split it.
@sanposhiho @macsko @wojtek-t
This makes sense to me.
RBAC doesn't work per field, but per resource (subresource). That said, it's a mechanism that can allow us to restrict access to it to some extent.
This is already the case now - the other KEP is not really making it worse.
I don't see how Kueue is different from CA/Karpenter here. The exact same argument works for CA/Karpenter and potentially other out-of-tree components. This is exactly the argument we're making for why this makes sense.
I'm not sure if I understand your words correctly, though; it looks like it matches my thoughts.
Yes. The scheduler's nomination should take precedence: external components cannot overwrite an NNN unless they set it themselves, while the scheduler can overwrite it regardless of who set it. That's our plan. Please check out the latest KEP for the argument.
I don't agree generally.
Is it? NNN is not a subresource, right?
The nomination by the scheduler preemption may also become incorrect after a moment, because the cluster state keeps changing. I agree that the probability of incorrectness is higher with NNN from the external components than NNN from the scheduler, and that's the reason we state the scheduler preemption can overwrite NNN that is set by others. Also, regarding the NNN part, can you please move your suggestions to the NNN KEP PR instead of here?
I did that already and there are more details there.
This will need the PRR questionnaire, at least the alpha requirements. PRR freeze is Thursday 12th June 2025.
Both of the above API calls could be migrated to the new mechanism.
In-tree plugins' operations that involve non-pod API calls during scheduling and could be made asynchronous, |
This sentence is a little hard to follow: "... and could be made asynchronous but not necessarily in the first place". Can you clarify?
I changed it a bit now
…s with three detailed proposals
I added three proposals to the design details section to be able to select the best queueing/caching approach (all are pretty similar). PTAL. I also already moved some previous proposals to the Alternatives section.