The GPU-related metrics data exposed by dcgm-exporter does not include the launcher pod with the GPU bound #11660
Comments
/cc @machadovilaca
Although I don't know in detail how we handle GPUs in KubeVirt, I think it's clear that we pass the device name through exactly as given, and it seems this is really about the NVIDIA metric collector. If you set up your VM requesting a GPU by deviceName, the resulting virt-launcher pod requests a resource with that exact name, but the NVIDIA exporter, as you mentioned, only cares about resources named exactly nvidia.com/gpu.
Thanks so much for focusing on this! Here's the YAML for KubeVirt binding to the GPU.
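A minimal sketch of this kind of binding (the original manifest isn't shown here, so the VMI name and the deviceName below are illustrative):

```yaml
# Sketch of a VirtualMachineInstance binding a GPU by deviceName.
# The deviceName is illustrative; it has to match a host-device resource
# advertised on the node (e.g. through permittedHostDevices).
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-with-gpu
spec:
  domain:
    devices:
      gpus:
        - name: gpu1
          deviceName: nvidia.com/GP102GL_TESLA_P40  # copied verbatim into the launcher pod's resources
    resources:
      requests:
        memory: 4Gi
```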
I don't need to specify which GPU card to use when binding a GPU directly to a pod; I only need to care about the number of GPUs bound to the pod, and the GPU Operator handles the rest, so the configuration is just a plain resource request (sketched below). If I set the deviceName to a card-specific resource name, dcgm-exporter does not pick the launcher pod up. Instead, I would prefer that the KubeVirt launcher pod set the PodResources name to nvidia.com/gpu. Of course, a better solution would be to push dcgm-exporter to change its policy of requiring an exact match on the resource name.
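For comparison, the direct-to-pod case needs nothing more than a GPU count under the generic nvidia.com/gpu resource, something like the following (pod name and image are illustrative):

```yaml
# Sketch of a plain pod bound to a GPU through the GPU Operator / NVIDIA device plugin.
# Only the number of GPUs is specified, under the generic nvidia.com/gpu resource,
# which is the name dcgm-exporter matches on.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload                                     # illustrative
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```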
Why is it not reasonable? Per my understanding, if you request a specific GPU for the pod, you also won't have NVIDIA monitoring there. If you ask KubeVirt to request a specific GPU, the same limitation applies.
I think you're right; it's not practical to try to fix this in KubeVirt. I have several GPU cards mounted in my cluster, and if a pod requests one of them by its card-specific resource name, the DCGM Exporter does not report it. This appears to be because the DCGM Exporter strictly follows the Kubernetes specification for determining GPU resources; see the Kubernetes device plugin documentation.
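A hypothetical illustration of the node side with passthrough (resource names and counts are invented for this example): the node advertises card-specific resource names, while dcgm-exporter only recognizes nvidia.com/gpu and the MIG-prefixed variants.

```yaml
# Hypothetical excerpt of `kubectl get node <node> -o yaml` on a node set up for passthrough.
# The card-specific resource name is what KubeVirt's deviceName refers to; dcgm-exporter
# only recognizes nvidia.com/gpu (and nvidia.com/mig-* prefixes).
status:
  allocatable:
    nvidia.com/GP102GL_TESLA_P40: "2"
    nvidia.com/gpu: "0"
```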
This issue is better suited for upstream discussion, and I'll probably cite this issue as a practical example of what I'm asking for. Thanks for the help!
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@kubevirt-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What happened:
When a VM created with KubeVirt is bound to a GPU via passthrough mode, the GPU metrics exposed by dcgm-exporter do not include the launcher pod with the GPU bound.
With a GPU bound directly to a pod (one not created by KubeVirt), dcgm-exporter is able to obtain the GPU metrics for that pod.
What you expected to happen:
A launcher pod created by KubeVirt and bound to a GPU should be picked up by dcgm-exporter.
How to reproduce it (as minimally and precisely as possible):
Additional context:
You can see from the following code that dcgm-exporter filters based on PodResources; a resource is counted only when `resourceName == nvidiaResourceName` or `strings.HasPrefix(resourceName, nvidiaMigResourcePrefix)`, where `nvidiaResourceName` is `"nvidia.com/gpu"`:

https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L142

But in KubeVirt, the GPU-related PodResources name appears to be taken from `gpu.DeviceName`, so the launcher pod's GPU never matches this filter.
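For a concrete picture, this is a hypothetical excerpt of what the generated virt-launcher pod ends up requesting when a GPU is bound via deviceName (the resource name is illustrative):

```yaml
# Hypothetical excerpt from the generated virt-launcher pod spec.
# The resource name is taken from the VMI's gpu.DeviceName rather than being
# rewritten to nvidia.com/gpu, so dcgm-exporter's PodResources filter skips it.
resources:
  limits:
    nvidia.com/GP102GL_TESLA_P40: "1"
  requests:
    nvidia.com/GP102GL_TESLA_P40: "1"
```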
Environment:
- KubeVirt version (use `virtctl version`): 1.0.0
- Kubernetes version (use `kubectl version`): N/A
- Kernel (e.g. `uname -a`): N/A