Retry device enumeration on startup to prevent empty ResourceSlices#1009
Retry device enumeration on startup to prevent empty ResourceSlices#1009kasia-kujawa wants to merge 1 commit into
Conversation
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Solid approach to the GKE-COS / slow-driver-init race — the real-life log evidence in the description is clean. Two structural concerns plus some cleanups, left inline. Requesting changes primarily on the silent empty-slice publish on timeout and the 5-minute synchronous block on startup; the rest are suggestions.
|
@ArangoGutierrez @varunrsekar Thanks for the review! ❤️ I'm not sure if I'll have time this week to address your comments, but I definitely will next week. |
|
Even if not yet fully understood, #1008 really is an intriguing and interesting problem. I don't want to undermine any work here, and please consider my feedback as just one item on the feedback shelf. My gut feeling is that if we have to do any retrying that we should do that in the init container, and not in the plugin startup code. Once we understand what would it take to detect that situation over there I think it's also easy to just "wait a little longer". Such change can be looked at as
Having to retry in the plugin startup code feels like the init container doesn't do its job. |
I initially thought about changes in the init container too but I didn't find good checks to add. For the curious, I'm going to test this kasia-kujawa#8 I'll be back with results ⌛ 🧪 |
24e31d3 to
4c4b718
Compare
✅ Deploy Preview for dra-driver-nvidia-gpu ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kasia-kujawa The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
4c4b718 to
4dba81f
Compare
|
@ArangoGutierrez @varunrsekar Please take another look. I think I addressed all review comments except Scenario 2 mentioned in this comment #1009 (comment) I think we can skip this scenario in this pull request to limit the scope of changes it introduces and either add an implementation for it in another pull request or skip it now and add it in the future if needed. I tried the approach of introducing an init container to check NVML initialization (in Go, in exactly the same way as in the kubelet plugin) but it turned out that successful initialization in the init container does not guarantee successful initialization in another container 😿 I could easily reproduce this state -> successful NVML initialization in the init container, but no GPU discovered in gpu-kubelet-plugin. It seems that there is an issue either with loading the NVML library or with the communication between the NVML library and the kernel 🤔 Most important logs from my tests with additional init container: Full logs from my experiments with additional gpu init container and its implementation is here: kasia-kujawa#9 @varunrsekar Could you make sure if this pull request doesn't break anything in Full logs from the version with the fix introduced in this pull request when the first device discovery fails: |
4dba81f to
6adf190
Compare
varunrsekar
left a comment
There was a problem hiding this comment.
Partial review. Will do a more thorough pass of the changes over time.
6adf190 to
a73cfe6
Compare
| if !state.AllocatableReady() { | ||
| driver.wg.Add(1) | ||
| go driver.backgroundInit(ctx, config) | ||
| } |
There was a problem hiding this comment.
There's no reason to do a lazy retry that reimplements driver initialization. Only device enumeration should be retried and Driver shouldn't initialize if device enumeration failed.
There was a problem hiding this comment.
it's there because of 3113277117, Primary blocker).
Once we can't block, the post-enumeration steps (MIG cleanup, health monitor, publishResources) have to run after the retry succeeds, which is what backgroundInit does.
If you had something else in mind, let me know 🙏
There was a problem hiding this comment.
Thanks for that context. Will review with it in mind.
| } | ||
| return nil, fmt.Errorf("error enumerating all possible devices: %w", err) | ||
| } | ||
| if len(perGPU.allocatablesMap) == 0 { |
There was a problem hiding this comment.
Here, we're assuming that either ALL GPUs are initialized or NONE of the GPUs are initialized. Is it possible for partial initialization? Do we care about it?
There was a problem hiding this comment.
I think we can skip the scenario when some GPUs are not initialized in this pull request to limit the scope of changes it introduces and either add an implementation for it in another pull request or skip it now and add it in the future if needed - I haven't observed the issue with partial iniitialization.
If we want to have this check we can probably check if all GPUs visible as PCI devices are also visible via nvml 🤔
db47153 to
58c2ce4
Compare
That is still somewhat frightening, and we should talk to more people and teams about that.
That is a really important question. For posterity, I've found a related discussion (Azure context): For the scenario where NVML reported at least one GPU in the init container, but zero GPUs in the main container it would be good to confirm explicitly the state of dev nodes in the main container file system. I didn't follow the exchange above in detail, so we may have already done this. (it would be important to confirm if NVML may report zero devices despite the filesystem state looking as expected -- if the filesystem state is unexpected then this may greatly facilitate finding the root cause). Thanks for the great work here! |
varunrsekar
left a comment
There was a problem hiding this comment.
@kasia-kujawa Once you address the changes, please squash the commits
| if len(perGPU.allocatablesMap) == 0 { | ||
| if checkpointHasPreparedDevices(cp) { | ||
| klog.Infof("No GPU devices discovered via NVML but the checkpoint has prepared devices, not retrying (unhealthy device state, retry won't help)") | ||
| return perGPU, nil | ||
| } | ||
| klog.Infof("No GPU devices discovered on enumeration attempt; will retry") | ||
| return nil, nil | ||
| } |
There was a problem hiding this comment.
The spirit of my original comment was that we don't need to worry about this differentiation :). If allocatable is empty, we can simply retry. Unhealthy GPUs are a problem but we dont need to solve it here.
| if len(perGPU.allocatablesMap) == 0 { | |
| if checkpointHasPreparedDevices(cp) { | |
| klog.Infof("No GPU devices discovered via NVML but the checkpoint has prepared devices, not retrying (unhealthy device state, retry won't help)") | |
| return perGPU, nil | |
| } | |
| klog.Infof("No GPU devices discovered on enumeration attempt; will retry") | |
| return nil, nil | |
| } | |
| if len(perGPU.allocatablesMap) == 0 { | |
| // Caveat: we may end up in this state due to unhealthy GPUs. This needs to be revisited in the future | |
| klog.Infof("No GPU devices discovered on enumeration attempt; will retry") | |
| return nil, nil | |
| } |
There was a problem hiding this comment.
Hope that this time I understood your idea 😅 @varunrsekar please check it
@jgehrcke some more observations - I noticed that I can easily reproduce this on NVIDIA T4, one/two retries I will see this issue, in the issue that you linked also T4 is mentioned. For example when I tested using P4 I couldn't see the issue. I was only checking the dev nodes in the init container, trying to see whether this would help us prepare a better init container — it didn’t help :D I can do one more test 🧪 and check the dev nodes in the main container when NVML reports 0 GPUs. |
Signed-off-by: Katarzyna Kujawa <katarzyna@cast.ai>
f1603fc to
eac1262
Compare
@jgehrcke I prepared one more version for debugging and I did the test - nvml found GPU in the additional init container but nvml didn't found GPU in the gpu container but GPU is visible under the most important logs: gpus container: full logs are here with the reference to the code which I used to check it: kasia-kujawa#9 (comment) |
Fixes #1008
Added a retry loop in
NewDeviceState().If the first enumeration returns 0 devices, the plugin retries every 5 seconds for up to 5 minutes before proceeding.
Errors still propagate immediately without retry.
Before the fix, no log was emitted after
Traverse GPU devicesand the emptyResourceSlicewas published silently.With the fix (
nvidiaDriverRoot: /home/kubernetes/bin/nvidia/, GKE COS, Tesla T4):Full logs with the fix from
nvidia-dra-driver-gpu-kubelet-pluginwhen GPU initialization needed slightly more time:https://gist.github.com/kasia-kujawa/1082b48357a0ae80d663f12ee665e34c