Fix/optimize Resolve the performance overhead of calling large clusters#54
Conversation
|
Welcome @AutuSnow! |
a5dd6bd to
f99b7a4
Compare
|
/assign @ffromani |
ffromani
left a comment
There was a problem hiding this comment.
thanks for the fix, this was indeed one of our TODOs.
Please verify this change is covered by tests, and feel free to add more coverage if not.
I wonder, in general, if and how we can assess the performance gains/changes of PRs.
| configs map[types.UID]PodCPUAssignments | ||
| mu sync.RWMutex | ||
| configs map[types.UID]PodCPUAssignments | ||
| sharedCPUContainers map[types.UID]struct{} |
There was a problem hiding this comment.
this is used as set, so it should be used a set (k8s.io/apimachinery/pkg/util/sets)
There was a problem hiding this comment.
I have completed the modifications and added benchmark tests
f99b7a4 to
a283610
Compare
|
@ffromani Can I take another look |
ffromani
left a comment
There was a problem hiding this comment.
thanks for the PR and sorry it took me so long to review again. Except micro-benchmarks, I'm not sure about the actual gain when operating the plugins, but the code becomes IMO a bit nicer, so we should move forward and merge. Please rebase.
I'm inclined into accept into 0.1.0, pending final review
| s.allocatedCPUs = cpuset.New() | ||
| for _, cpus := range s.resourceClaimAllocations { | ||
| s.allocatedCPUs = s.allocatedCPUs.Union(cpus) | ||
| } |
There was a problem hiding this comment.
perhaps use s.allocatedCPUs = s.allocatedCPUs.Difference(cpus) like you do below in GetSharedCPUs?
| @@ -61,10 +64,23 @@ func (s *PodConfigStore) SetContainerState(podUID types.UID, state *ContainerSta | |||
| s.mu.Lock() | |||
| defer s.mu.Unlock() | |||
|
|
|||
| // Check if there's an existing state for this container that we need to clean up from cache. | |||
There was a problem hiding this comment.
why can't we just override? what's the risk here?
There was a problem hiding this comment.
why can't we just override? what's the risk here?
When a container restarts, SetContainerState is called again with the same containerName but a new containerUID. If we just override without cleanup, the old containerUID would remain as a stale entry in the sharedCPUContainers set, causing GetContainersWithSharedCPUs() to return UIDs for containers that no longer exist. The cleanup ensures we remove the old UID before inserting the new one.
There was a problem hiding this comment.
thanks, fully makes sense. Please make sure we have a unit test which demonstrates this.
There was a problem hiding this comment.
thanks, fully makes sense. Please make sure we have a unit test which demonstrates this.
I will do it this way
a283610 to
ca351f7
Compare
|
|
ca351f7 to
01ae7a9
Compare
ffromani
left a comment
There was a problem hiding this comment.
please rebase when you resubmit
| @@ -61,10 +64,23 @@ func (s *PodConfigStore) SetContainerState(podUID types.UID, state *ContainerSta | |||
| s.mu.Lock() | |||
| defer s.mu.Unlock() | |||
|
|
|||
| // Check if there's an existing state for this container that we need to clean up from cache. | |||
There was a problem hiding this comment.
thanks, fully makes sense. Please make sure we have a unit test which demonstrates this.
01ae7a9 to
2655227
Compare
| // TODO: CoreID is only unique within a socket. On multi-socket systems where CoreIDs repeat | ||
| // across sockets, this method may return CPUs from multiple sockets for the same CoreID. | ||
| // Consider refactoring core-level methods to use CoreKey{SocketID, CoreID} for correct | ||
| // disambiguation, similar to CPUTopology.CPUsByCore. |
There was a problem hiding this comment.
correct, but this in inherited from cpumanager code. We need to decide if and how drift from cpumanager. It's a post-0.1.0 decision
ffromani
left a comment
There was a problem hiding this comment.
/approve
/lgtm
this is a TODO we had. The fix is nice, contained and was in review since long time. I'm doing an exception and merging before 0.1.0.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: AutuSnow, ffromani The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Resolve the performance overhead of calling large clusters
Note: RemoveResourceClaimAllocate() requires rebuilding the cache during deletion, with a complexity of O (n). This is because cpuset does not support efficient differencing operations (one claim may have allocated CPUs that overlap with other claims). However, due to relatively few deletion operations, and the use of Get Shared CPUs ()
Very frequent calls (called every time a container is created/deleted)
Fixes:#53