KEP-5229: Asynchronous API calls during scheduling #5249
base: master

Conversation
macsko commented Apr 16, 2025
- One-line PR description: Add KEP-5229
- Issue link: Asynchronous API calls during scheduling #5229
- Other comments:
/cc @dom4ha @sanposhiho I have written some possible approaches to these API calls to start the discussion. I will not be visibly active for the next three weeks, but feel free to comment.
- `nominatedNodeName` scenario support would require more effort in (1.1) or (1.2).

#### 2.2: Make the API calls queued
I feel like queueing API calls might be a good direction long term and probably won't be that hard to implement.
Great summary! I'll take time reading through it and put some comments after that.
This would require implementing advanced logic for queueing API calls in the kube-scheduler and migrating **all** pod-based API calls done during scheduling to this method,
potentially including the binding API call. The new component should be able to resolve any conflicts in the incoming API calls as well as parallelize them properly,
e.g., don't parallelize two updates of the same pod. This requires [making the API calls queued](#22-make-the-api-calls-queued) or
[sending API calls through a kube-scheduler's cache](#23-send-api-calls-through-a-kube-schedulers-cache) to be implemented.
Conceptually what you need here is something pretty similar to DeltaFIFO we use in client-go:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go
We wouldn't operate on objects, but on their updates; conceptually, though, that's exactly what you need:
- tracking all changes to a given object together
- a concept of deduping (if I already scheduled a pod but didn't yet send the failures, don't send them; if I have multiple failures not yet sent, is only the last one valid?, etc.)
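The deduping idea above could be sketched as a per-pod coalescing step over queued updates (a rough illustration, not KEP code; the `kind` strings and rules are assumptions drawn from this discussion):

```go
package main

// update is a hypothetical queued mutation for one pod; the kinds are
// illustrative stand-ins for status update, binding, and preemption.
type update struct {
	kind string // "status", "bind", "delete"
}

// coalesce applies last-wins/supersede rules to one pod's queued
// updates, mirroring the DeltaFIFO-style dedup idea:
//   - several "status" updates collapse into the last one
//   - "delete" (preemption) makes every earlier call irrelevant
//   - "bind" supersedes earlier pending "status" updates
func coalesce(queued []update) []update {
	var out []update
	for _, u := range queued {
		switch u.kind {
		case "delete":
			out = out[:0] // nothing earlier matters once the pod is gone
			out = append(out, u)
		case "bind":
			// drop pending status updates; the pod is being bound
			filtered := out[:0]
			for _, q := range out {
				if q.kind != "status" {
					filtered = append(filtered, q)
				}
			}
			out = append(filtered, u)
		case "status":
			// last-wins: replace an earlier pending status update
			replaced := false
			for i, q := range out {
				if q.kind == "status" {
					out[i] = u
					replaced = true
				}
			}
			if !replaced {
				out = append(out, u)
			}
		}
	}
	return out
}
```

Whether these exact rules are safe is precisely what the thread below debates (e.g., binding vs. a pending preemption), so treat them as a starting point, not a conclusion.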
Sounds promising, but after a brief look it's not super clear to me how it works and whether it addresses all the problems.
The logic on the scheduler side may still be quite complex depending on which type of update is pending. For instance, I suspect some of them may block re-scheduling, but some may not.
I think my comment could have been misleading.
It was supposed to be a meta-comment - I don't think we can really reuse any code from DeltaFIFO.
What I was trying to say is that it's effectively the same conceptual pattern that DeltaFIFO is using.
> The logic on the scheduler side may still be quite complex depending on which type of update is pending. For instance I suspect some of them may be blocking re-scheduling, but some not.

That's interesting - in particular:
- what are the examples where an API call should prevent further attempts to do something with a given pod? [The only one I can see is preemption, but it's not even for a given pod, so it doesn't belong to this category]
- is there any case where, if I have queued two different API calls for a given pod P, we actually want to send both, not just the last one?
> - what are the examples where an API call should prevent further attempts to do something with a given pod? [The only one I can see is preemption, but it's not even for a given pod, so it doesn't belong to this category]
> - is there any case where, if I have queued two different API calls for a given pod P, we actually want to send both, not just the last one?

I don't know the answer off the top of my head, but it's not obvious to me yet that all updates can be skipped and that we don't need to wait for updates to be persisted. That's why adding an analysis section and a categorization of the API calls is needed.
One example that comes to mind is preemption of a just-bound pod. It would be safer to let the binding call happen first before we start the preemption process, to avoid a situation where we preempt something not yet bound.
> One example comes to my mind is preemption of just-bound pod. It would be safer to let the binding call happen first before we could start preemption process, to avoid a situation we preempt something not bound yet.

That's an interesting scenario - I just don't think we should let the binding happen there. We probably need smarter deduping logic.
If we know that we will preempt the pod - there is no point in doing the binding. We need to think about what it means from the "pre-binding" perspective (how to ensure we will not leak something), but directly moving the pod to a failed state seems better than letting kubelet start it and bring it down immediately after.
> One example comes to my mind is preemption of just-bound pod. It would be safer to let the binding call happen first before we could start preemption process, to avoid a situation we preempt something not bound yet.

I guess it actually happens today, because we're running the binding cycle asynchronously.
Let's say pod-1 goes to the binding cycle (after "assume"), pod-2 triggers the preemption, and then pod-2 could preempt pod-1 before pod-1's binding cycle completes.
And what happens today (I guess) is that, because pod-1 is deleted, the binding cycle for pod-1 would just fail.
But that is a good example showing the difficulty of this KEP: possibly not only API calls for the same object, but also API calls for different objects could depend on one another...
> - what are the examples where an API call should prevent further attempts to do something with a given pod?

I haven't found any such API call. Even preemption should not be a problem.
Another thing to consider is how to update the Pod's status in the scheduler's memory. Now, since the API calls are blocking, we don't need to persist the status change before the call, because the event handlers will soon receive the Pod/Update event with the current status and update the Pod object there.

> - is there any case when if I have queued two different API calls for a given pod P - we actually want to send both, not just the last one?

We have 3 kinds of API calls for a Pod in the scheduler: deletion (preemption), binding, and status update (unschedulable or NNN). There is no case where we would want to send two of these calls for a single Pod (see the API calls categorization section).
keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/kep.yaml
Making one universal approach to handling API calls in the kube-scheduler could allow these calls to be consistent, as well as better control
the number of dispatched goroutines. Asynchronous preemption could also be migrated to this approach.

### Goals
Looking from the high level, I think we may have the following goals (although it's not clear we're able to achieve all of them):
- Make scheduling cycle free of blocking api-calls (any async option is fine)
- Skip some type of updates if they soon become irrelevant by consecutive updates
- Prioritize high importance updates (binding) over low importance ones if updates to the api-server gets throttled
Maybe we have more, but I think agreeing first on our goals is important before we can select a solution. One of my doubts here is that we may not yet know all the requirements we will have once we start designing workload-aware scheduling and reservations. More API calls may appear at that time.
+1 to it, but I think there is also a question about priorities. In my mental model:
- eliminating blocking calls is P0
- skipping soon-to-be-irrelevant updates is P1
- prioritization of updates is nice to have [I would like that, but it's not necessary if it adds a lot of complexity]
I feel like the third one is the next step, and we can just scope it out for now, at least from this KEP. Of course, we should keep it in mind when discussing this KEP, though.
I agree that all three are important; it's just that we don't have to solve all of them at once in one KEP. The third one looks complex enough to deserve being discussed in another KEP.
which seems to be roughly aligned with what I wrote above
Agree with the above; the order was also on purpose.
keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/README.md
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: macsko. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I added a section with API calls categorization and adjusted the goals. PTAL. To go further, we could probably agree on some high-level concept (queuing without blocking a pod?).
if the newest status is stored in-memory.
- API calls for non-Pod resources (5 - 10) should be further analyzed as they are not likely to interfere with the Pod-based API calls,
hence implementing those shouldn't block making (1 - 3) calls asynchronous.
Is there any relevance order between calls for different pods?
I think there isn't (i.e. we don't require NNN to be set before preempting the first victim).
But even if there isn't - it would make sense to call it out explicitly here.
> Is there any relevance order between calls for different pods?

I think there is, because we need to protect the space freed up by preempted pods for the nominated pod. Sure, pods don't disappear immediately, but we shouldn't rely on it. That obviously puts into question whether we can prioritize some updates over others... which could change the order.
Alternatively, we may still want to treat a preemption as one async operation that includes setting the nomination and removing pods in the right sequence.
> because we need to protect the space freed up by preempted pods for the nominated pod.

But we store NNN in the nominator when we update NNN, i.e., even if we make the API call async, as long as we report NNN to the nominator synchronously, the space should be reserved and not stolen by other pods' scheduling cycles.
So I believe we can simply make this NNN API call in the preemption asynchronous.
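A minimal sketch of that ordering (illustrative names only, not scheduler code): the in-memory nominator is updated synchronously, so the reservation is visible to the next scheduling cycle before the API call is even sent.

```go
package main

import "sync"

// nominator is a minimal stand-in for the scheduler's in-memory
// nomination record mentioned in the comment above.
type nominator struct {
	mu    sync.Mutex
	byPod map[string]string // pod UID -> nominated node name
}

func (n *nominator) nominate(podUID, node string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.byPod[podUID] = node
}

func (n *nominator) nominatedNode(podUID string) string {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.byPod[podUID]
}

// setNominatedNodeName records NNN in memory synchronously, so later
// scheduling cycles see the reservation immediately, then issues the
// API call asynchronously (simulated here by a goroutine and channel).
func setNominatedNodeName(n *nominator, calls chan<- string, podUID, node string) {
	n.nominate(podUID, node) // synchronous: space is reserved at once
	go func() {
		calls <- podUID + "->" + node // async: persist via the API
	}()
}
```

The point of the sketch is only the ordering guarantee: the in-memory write happens before `setNominatedNodeName` returns, while persistence may lag.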
Or, if we go with the scheduler cache idea, we don't have to worry at all about that kind of scenario. Any updates are reflected in the scheduler cache synchronously, and things that happen later can immediately refer to the update in the scheduler cache (regardless of whether the API call is done asynchronously or not).
Unless the scheduler restarts in the meantime and we lose the in-memory state.

> Alternatively we may still want to treat a preemption as one async operation that includes setting nomination and removing pods in the right sequence.

which effectively introduces the ordering.
The question is whether in-memory state is good enough, at least initially.
I referred only to the ordering of api calls, not cache updates, which in turn can become fully synchronous again (we achieve the non-blocking goal by async api calls only).
> Is there any relevance order between calls for different pods?

I'd say there isn't. Any two API calls for different pods made by the scheduler are independent, and I don't see any counter-example.

> I believe we can make this NNN API call in the preemption asynchronous, simply.

That's right.

> The question is whether in-memory state is good enough, at least initially.

WDYM by "initially" here?
- Updating Pod status (1, 2) could be less important and called if there is space for it.
It's worth considering if setting `nominatedNodeName` (3) should have the same priority or higher,
because the higher delay might affect other components like Cluster Autoscaler.
- API calls for non-Pod resources (5 - 10) could be analyzed case by case, but are likely less important than (5) and (4).
nit: 6-10
Also, I'm not sure I agree with it. As an example, volume binding is on the critical path for pod binding, so delaying volume binding delays pod binding, which affects overall pod startup. That is probably what we should actually optimize for.
+1. I actually rather think API calls for non-Pod resources are likely equally important to 4 and 5 (at least for the current default scheduler) because, like described in 6-10 above, most of them are in the binding cycles, which are on the critical path, or PostFilter (similar to preemption - making space for pending pods).
But, the difficulty here is that we cannot just say "the API calls for non pods are likely important -> let's prioritize them" because obviously there'll be exceptions especially when it comes to the scheduler with custom plugins.
So, I basically agree with the KEP saying that this is case by case actually.
I guess it leads to another question: should we consider API calls being made within binding cycles? They're already async, so can we just treat them as a single async group for multiple API calls? And, in this KEP, we only need to add a canceling functionality in the binding cycles.
> I guess it leads to another question: should we consider API calls being made within binding cycles?

Maybe I misunderstood your intention here, but if we don't consider them, we will face races due to these calls not being coordinated with the others. If the pod was earlier unschedulable but the call was queued for too long, we need to cancel it once binding starts. So we need coordination between these, and from that perspective, having a single place to handle everything sounds better.
> I guess it leads to another question: should we consider API calls being made within binding cycles?

I think we need to consider pod-based API calls (status update, binding, preemption) to be able to effectively cancel/skip them. Another aspect is the 6 - 10 calls, which I think don't need to be considered, at least initially. The only concern might be DRA's PostFilter, which can make a blocking API call (but then we would probably need to make async API calls for all ResourceClaim calls, for reasons similar to the Pod ones).
- Simplifies introducing new API calls to the kube-scheduler if the collision handling logic is configured correctly.

Cons:
- Requires implementing complex, advanced queueing logic.
I would say it's debatable how complex that really is - my subjective claim is that it's actually not that complex.
It really depends on what we're doing with this queueing logic. If it's just going to be a queue, it's obviously simple.
But, as mentioned at 2.2, if we are going to merge multiple operations into one, things will get complicated.
- Cannot be used for the `nominatedNodeName` scenario, requiring additional effort and separate handling.

#### 1.3: Use advanced queue and don't block the pod from being scheduled in the meantime
I'm personally strongly for this option - because it allows handling all calls the same way and optimizes for latency.
I think the additional up-front complexity (which, btw, will be well encapsulated in a single place) is a cost that we should be willing to pay for the fact that it becomes unified and simple to use.
I also agree that this should be the way to go. If we reach consensus here, I'd move the other options to Alternatives considered and present one proposal.
- `nominatedNodeName` scenario support would require more effort in (1.1) or (1.2).

#### 2.2: Make the API calls queued
I believe this should be the path forward - the primary reason against 2.3 is that this option allows encapsulating the logic in a single place, so the burden is not spread across different (and arguably much harder to reason about) places. And it allows avoiding races and enables optimizations (compared to 2.1).
On the other hand, how do we handle external changes? The obvious case is the disappearance of a pod that was supposed to be preempted. That's probably quite an easy case, as we can skip our update, but what if the nominated node name changes once we allow external actors to change it?
Things may change behind the scenes - so there might be races. But for a particular object, relying on optimistic concurrency should just work.
If you're worried about the scheduler preempting something and an external actor changing NNN in the meantime - that has an inherent race in it no matter what we do. We may try to minimize the window, but it will always be there, and I'm not sure I would actually try to optimize for it.
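The "optimistic concurrency should just work" point can be illustrated with a toy resourceVersion check plus a retry-on-conflict loop (a sketch; real apiserver semantics are richer, and all names here are invented for illustration):

```go
package main

import "errors"

// errConflict stands in for an apiserver 409 on a stale resourceVersion.
var errConflict = errors.New("conflict: stale resourceVersion")

// obj is a toy object with optimistic-concurrency versioning.
type obj struct {
	resourceVersion int
	nominatedNode   string
}

// store is a toy apiserver: an update succeeds only if it was computed
// against the current version of the object.
type store struct{ current obj }

func (s *store) get() obj { return s.current }

func (s *store) update(o obj) error {
	if o.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.current = o
	return nil
}

// updateWithRetry re-reads and reapplies the mutation on conflict - the
// standard optimistic-concurrency pattern the comment relies on.
func updateWithRetry(s *store, mutate func(*obj)) error {
	for i := 0; i < 5; i++ {
		o := s.get()
		mutate(&o) // reapply the intended change to the fresh object
		if err := s.update(o); err == nil {
			return nil
		} else if !errors.Is(err, errConflict) {
			return err
		}
	}
	return errors.New("gave up after repeated conflicts")
}
```

The inherent NNN race discussed above is exactly what this pattern does not remove: it only guarantees that each individual write lands on a fresh object, not that the decision behind the write is still valid.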
Update - after thinking about it I changed my mind and we may prefer 2.3 actually: #5249 (comment)
Both of the above API calls could be migrated to the new mechanism.

In-tree plugins' operations that involve non-pod API calls during scheduling and could be made asynchronous,
but not necessarily in the first place:
I think we may want to consider them from the very beginning, because when canceling pod binding, we probably want to cancel any pre-binding calls as well.
Moreover, we need to consider the changes planned in the Extended Resource KEP, so putting them together will be an additional challenge.
> because when canceling pod binding, we probably want to cancel any pre-binding calls as well

When do we cancel pod binding?
Never mind, I can now guess you meant the following, described later in this KEP:
> - Pod deletion caused by preemption (4) should cancel all Pod-based API calls for such a Pod.
> I think we may want to consider them from the very beginning, because when canceling pod binding, we probably want to cancel any pre-binding calls as well.

Yes, that could be an optimization, but it's not necessary.
Btw, today when pod Y is in the PreBind stage and pod X wants to preempt pod Y, the pod Y delete API call is sent to the apiserver. If pod Y were in WaitOnPermit, preemption would just cancel it, without sending the API call. Shouldn't we optimize similarly for PreBind, given that the number of use cases for PreBind (DRA) is growing? Do we have any reason against it?
- P0: Make the scheduling cycle free of blocking API calls, i.e., make all API calls asynchronous.
- P0: Make the solution extendable for future use cases.
- P1: Skip some types of updates if they soon become irrelevant by consecutive updates.
- Nice to have: Prioritize high-importance updates (like binding) over low-importance ones if updates to the kube-apiserver get throttled.
Note before merging this KEP: for things like the P1 and nice-to-have items, if we end up not addressing them in this KEP, we can just remove them as out of scope and track them separately, probably by creating an issue in k/k.
- Allows the pod to be scheduled again even before the API call completes.
- Simplifies introducing new API calls to the kube-scheduler if the collision handling logic is configured correctly.

Cons:
Plus, Memory consumption?
I think all approaches will somehow result in increased memory consumption. For me, it's even hard to say which option will affect it the most.
I agree that, more or less, each option needs to consume memory, and it's hard to tell how much bigger each option's memory increase is, at least until we go into more detail. But we at least need to mention it with a rough estimate, like:
- The maximum memory consumption of the queue would be the pending API calls (if we implement it very simply).
- The maximum memory consumption of the cache would just be the number of cached resources.
> - The maximum memory consumption of the cache would just be the number of cached resources.

Not really. As I brought up in another comment, we might need more information to merge the objects in the cache (when some external change happens), e.g., a delta, similar to the queue. And we might even need to buffer some call(s) when previous ones for the same pod(s) are in flight. So I don't want to model it without more details.
- Needs a clear strategy for how to update the in-memory pod object during scheduling.

#### 2.3: Send API calls through a kube-scheduler's cache
I'm on this side. It's generally close to what we're doing today: assume() mechanism in the scheduling cache, or the nominator mechanism.
Also, it's a lot easier to merge multiple updates into a single one (when they're for the same object). We always synchronously update the scheduler cache, and the cache asynchronously makes an API call under the hood. Even if there are two or more modifications made through the scheduler cache, the async worker can just make an API call with a final object state. And, the scheduler can work as if those modifications are immediately applied by referring to the scheduler cache (which, again, we're doing today.)
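A minimal sketch of that cache-backed flow (names are illustrative, not scheduler code): mutations land in the in-memory state synchronously, and an async flush later sends only the final merged state, one call per dirty pod, no matter how many modifications happened.

```go
package main

import "sync"

// cachingWriter sketches the "send API calls through the cache" idea:
// every mutation is applied to the in-memory object synchronously, and
// an async worker pushes whatever the final state is.
type cachingWriter struct {
	mu    sync.Mutex
	state map[string]string // pod UID -> latest desired status
	dirty map[string]bool   // pods with un-persisted changes
}

// apply updates the cache synchronously; later scheduling cycles can
// read the new state immediately, before any API call is made.
func (w *cachingWriter) apply(podUID, status string) {
	w.mu.Lock()
	w.state[podUID] = status
	w.dirty[podUID] = true
	w.mu.Unlock()
}

// flush is what the async worker runs: one API call per dirty pod,
// carrying only the final merged state. Returns the number of calls.
func (w *cachingWriter) flush(send func(podUID, status string)) int {
	w.mu.Lock()
	defer w.mu.Unlock()
	n := 0
	for uid := range w.dirty {
		send(uid, w.state[uid])
		delete(w.dirty, uid)
		n++
	}
	return n
}
```

This also shows the coalescing property claimed above: two `apply` calls for one pod collapse into a single outgoing call, without any explicit dedup rules on the call queue.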
OTOH, what I don't like about the queue mechanism is, first, how to merge multiple updates into one.
The KEP says that, regarding patches, we can apply the latest one. Really? Always? What if several modifications conflict? Also, if several fields are updated, should we compute a new patch diff every time?
Regarding binding, it says "Ignore status update API calls". Can we do that always for sure? For example, what if users want to add some conditions to the pods before binding phase?
Something like that. So, I don't want to make a special rule to ignore/overwrite changes on the object, based on what we have at k/k.
Also, the second thing is what if users want to operate something at some point, and want to use it in another point? NominatedNodeName is a perfect example for this. The scheduler wants to put NNN so that the following scheduling cycles can refer to it. I know the scheduler, even today, has the nominator mechanism to cache it because the event handler doesn't give a NNN update immediately after making an API call. But, if we go with this cache option, we can generally solve that problem.
Rather, the queueing option would amplify this problem, because there could be multiple modifications in flight, which are invisible to anyone until they're actually applied.
I know I raised too many "what ifs". But, again, given that this mechanism is a standard, general way to make API calls from the scheduler, we shouldn't put in any assumptions based on the in-tree plugins only.
OK - thinking more about that, I think you're right. I think the main argument is that we want to update scheduler cache with that information anyway. So my argument from #5249 (comment) doesn't really work, because we need to touch the scheduler cache anyway.
Yes, the caching approach looks cleaner. But there are also open questions around it:
The scheduler's cache is not uniform. Pending pods are stored in the scheduling queue, bound pods in the cache, and nominations in the nominator, so we would need to unify them, or the logic will be spread between multiple components in the scheduler.
We also have to figure out how to merge the pod object in the cache with the updated pod object received in the event handler. In the queueing approach, we could store the updates to the pod that we want to make and apply them to the newest object when making the API call. When using the cache, we would also have to know in which order to apply the updates, so again store the updates (besides the objects themselves) or do some magic.
Also, the second thing is what if users want to operate something at some point, and want to use it in another point? NominatedNodeName is a perfect example for this. The scheduler wants to put NNN so that the following scheduling cycles can refer to it. I know the scheduler, even today, has the nominator mechanism to cache it because the event handler doesn't give a NNN update immediately after making an API call. But, if we go with this cache option, we can generally solve that problem.
I agree the queuing is also not ideal - we won't see the newest Pod object immediately. But, is there a case where we need to see it, other than setting NNN? Currently, we could also operate on older objects, even if the calls are synchronous.
The KEP describes, regarding patch, they can apply the latest one. Really? Always? What if several modifications are conflicting?
I have to clarify that part. Obviously, we can't just ignore the previous call. The simplest example is when one call was setting the unschedulable status and the second one was setting NNN. We should merge these two updates then. How? By keeping the latest entries (fields) to update (e.g., the condition to apply and the latest NNN). If there are conflicts, we should prefer to apply the latest update, right?
Another question is, if we should consider any (potential) updates to a Pod, even if scheduler is not supposed to make such calls?
Also, if several fields are updated, should we compute a new patch diff every time?
I believe we should compute the patch when making the API call, not earlier.
Scheduler's cache is not uniform. Pending pods are stored in scheduling queue, bound pods are stored in cache, nominations in nominator, so we would need to unify it or the logic will be spread between multiple components in the scheduler.
Right, we should basically unify all the resources' caches into one.
But, I guess there'll still be some exceptions even after that; the nominator is the one because it's not a cache exactly, it's actually more like an index to pre-compute which nodes have which nominated pods.
We also have to find out how to merge the pod object in cache with updated pod object received in event handler.
As pointed out at #5249 (comment), "how to handle the conflicts between changes the scheduler wants to make (but awaiting the scheduler to actually make an API call) vs the changes coming from outside": this looks like a general problem to me, regardless of which options we take.
You said:
In queueing approach we could store the updates to the pod that we want to make and apply them on the newest object when making the API call.
but, I doubt that. Because that means the updates stored in the queue could overwrite the latest updates that the external component might make, if updates are to the same fields.
But, again, I agree, even in the scheduler cache way, "how to merge" is a hard problem, for which I haven't come up with any good solution.
I agree the queuing is also not ideal - we won't see the newest Pod object immediately. But, is there a case where we need to see it, other than setting NNN?
I was imagining something outside k/k, like, for example, a quota management with CRD at the scheduler (ElasticQuota as an example).
However,
Currently, we also could operate on an older objects, even if the calls are synchronous.
Yes, you're right. As I mentioned, it happens even today, and we had to implement the solution with NNN. ElasticQuota as well; it looks like they are computing with the internal cache.
So, I guess the point here would be more like an improvement from the current scheduler: we can generally solve such a problem, which we cannot with the queue idea.
I have to clarify that part. [...] By keeping the latest entries (fields) to update (e.g., condition to apply and the latest NNN).
Isn't that similar to the cache idea? To me, "keeping the latest entries (fields)" sounds like "keeping the latest object" like a cache.
Another question is, if we should consider any (potential) updates to a Pod, even if scheduler is not supposed to make such calls?
I guess the point is more general: should we consider any updates to any objects?
When we initially discussed only NNN several months ago, I was personally thinking that we could just focus on the current use case in the upstream default scheduler.
But, as reviewing many future enhancements lately, my feeling has been shifting: we've seen several KEP changes that might introduce a new API call from the scheduler. Also, even ourselves, we'll have to introduce some new resources to achieve the workload scheduling.
Based on that recent situation, I think we should consider this feature as a general mechanism that might be used by any kind of updates and any kind of resources, even including custom resources.
And, hence, we should try not to make this mechanism too specific to what kube-scheduler currently does.
That's the ideal goal I'm seeing. But, I'm not sure if the ideal is too far or within reach. So, that said, if that's too far to start with, we can start the design only based on what we're doing at k/k, considering only the current API calls that the scheduler makes.
do we have other fields which can be modified by different components? For me it's one of the anti-patterns
Right, on the other hand, it's just that we cannot say there's absolutely no field of any kind of objects (incl custom resources) that might want/need to do that.
As I stated lastly, if we, for now, give up a true "general" solution, and only consider the things in the default kube-scheduler only, then I think it's fine to assume that the scheduler's updating fields aren't conflict with others.
Yes, we still have to use the queue, to some extent.
I was thinking of something like a certain number of workers working behind the cache: when the object in the cache is updated, we store the object's key in the queue (so the queue's nature is different), and then the workers keep checking the queue and making API calls.
Okay, I see many similarities between the cache proposal and the queue proposal. I did some prototyping with the queue, and the main difference from the cache will be how to send the change (explicitly through the queue or implicitly through the cache) and how to store the object (or the delta) for the API call. The further part (API call execution) would be similar. I'll do a deeper analysis and add a section to the KEP next week to continue the more detailed discussion on both approaches.
I'm just not sure if that's 100% true and whether we can put in the assumption that all scheduler plugins (incl. custom ones) don't update the fields that external components might update, or that even if some fields might be updated by external components, the scheduler's update should always win. Even if the scheduler's update was computed a few tens of seconds ago.
I think without some assumptions we won't be able to provide any rational merging mechanism.
Okay, I see many similarities between cache proposal and queue proposal. I made some prototyping with the queue and the main difference with the cache will be the part how to send the change (explicitly through queue or implicitly through cache) and how to store the object (or the delta) for the API call.
Right. And the other core point (technically included in your explanation though) is whether to use the objects (that might contain not-yet-applied changes) in the scheduling flow or not.
I think without some assumptions we won't be able to provide any rational merging mechanism.
If we cannot find a good merging mechanism, another option is just to treat such a conflict scenario as a failure case. Actually that might make sense, especially since "single field is managed by several components" is not generally a typical thing in the first place, like @dom4ha mentioned.
So, in summary, we have these options for this conflict issue:
- Only support API calls from the default scheduler. (i.e., we can assume no operation from the scheduler will be conflicting, except nnn?)
- Try to support any kind of API calls of any objects, considering there might be custom plugins that want to use those. And, at the conflict, we have two options:
- Somehow merge the changes. (how-to is unclear, need an investigation/idea)
- Regard it as a failure mode.
Regarding a failure mode though, regardless of how we proceed, we should discuss how to react when the async update fails. Should we keep trying to update until it's successful (if it's a retryable error)? Should we somehow propagate the failure to the pending pod that triggered the API call and do something with it?
- no operation from the scheduler will be conflicting, except nnn?
And setting pod condition if we want to apply the newest.
- Somehow merge the changes
I believe that in this approach we wouldn't be able to merge all possible API calls, but we should allow the custom plugins to provide their own merging/conflict-resolution mechanism. Then, we support built-in scenarios natively, but leave a framework for the custom ones to extend. Ultimately, it depends on how likely it is to have serious merging conflicts that we won't be able to resolve easily.
Regarding a failure mode though, actually, with any options, we should discuss how to react when the update fails.
Yes, that's another thing worth considering (I even put it in the KEP, "How to handle asynchronous API errors?", not to forget about it). In particular, some calls might prefer different error resolution mechanisms than others.
Should we keep trying to update until it's successful?
But, if the failure is because of conflicting API calls, we won't be able to retry.
we should allow the custom plugins to provide their own merging/conflicts resolution mechanism.
This is a good abstraction idea.
Let's say it's called "merger"; we can have one merger per resource. We can implement default mergers for K8s resources that we're using at kube-scheduler, and we allow users to implement their own mergers for their resources, or if they need to update K8s default resources in a different manner, they can still implement custom mergers and disable default ones.
But, if the failure is because of conflicting API calls, we won't be able to retry.
Yes, and also we cannot keep trying it indefinitely even for retryable errors. We need some proper way to handling those.
- Requires the cache to handle and merge updates coming from both the kube-scheduler's internal actions and external API events.
- The cache currently only stores bound pods, requiring integration with the scheduling queue for pending pods.
- Complex logic is needed to handle external updates arriving while an internal update is pending or in progress.
Regarding the first and the third cons: They're mentioned as cons here. However, this is a general problem, not specific to this option since no option gracefully solves how to handle the conflicts between changes the scheduler wants to make (but awaiting the scheduler to actually make an API call) vs the changes coming from outside.
When such a conflict happens, which updates should be prioritized depends on which fields have to be updated and for what purpose. That's a problem I don't see an answer to yet.
- Needs a clear strategy for how to update the in-memory pod object during scheduling.

#### 2.3: Send API calls through a kube-scheduler's cache
By the way, this cache idea actually looks like part of "1: Where and how to handle API calls in the kube-scheduler" as well?
Maybe it's not clear what to discuss in 1 and what to discuss in 2. Because, as we see, we're describing the queue thingy in both sections.
Right, probably the topic separation is not the best (2.2 and 2.3 are interconnected with 1.3). I'm not sure how we could split it.
@sanposhiho @macsko @wojtek-t
This makes sense to me.
RBAC doesn't work per field, but per resource (subresource). That said, it's a mechanism that can allow us to restrict access to it to some extent.
This is already the case now - the other KEP is not really making it worse.
I don't see how Kueue is different from CA/Karpenter here. The exact same argument works for CA/Karpenter and potentially other out-of-tree components. This is exactly the argument we're making for why this makes sense.
I'm not sure if I understand your words correctly, though; it looks like it matches my thoughts.
Yes. The scheduler's nomination should take precedence: external components cannot overwrite an NNN unless they set it themselves, while the scheduler can overwrite it regardless of who set it. That's our plan. Please check out the latest KEP for the argument.
I don't agree generally.
Is it? NNN is not a subresource, right?
The nomination by the scheduler preemption may also become incorrect after a moment, because the cluster state keeps changing. I agree that the probability of incorrectness is higher with NNN from the external components than NNN from the scheduler, and that's the reason we state the scheduler preemption can overwrite NNN that is set by others. Also, regarding the NNN part, can you please move your suggestions to the NNN KEP PR instead of here?
I did that already and there are more details there.
This will need the PRR questionnaire, at least the alpha requirements. PRR freeze is Thursday 12th June 2025.
Both of the above API calls could be migrated to the new mechanism.
In-tree plugins' operations that involve non-pod API calls during scheduling and could be made asynchronous, |
This sentence is a little hard to follow: "... and could be made asynchronous but not necessarily in the first place". Can you clarify?
I changed it a bit now
…s with three detailed proposals
I added three proposals to the design details section to be able to select the best queueing/caching approach (all are pretty similar). PTAL. I also already moved some previous proposals to the Alternatives section.