NVIDIA GPU Driver Kernel Mode Layer Allows for Denial of Service in Multi-Tenant AI Clusters
Overview
A vulnerability has been identified in the kernel mode layer of the NVIDIA GPU Display Driver for Linux. It poses a significant risk to multi-tenant GPU-accelerated environments, such as those used for training and hosting large AI models. The flaw, tracked as CVE-2024-0091, allows a local user with basic execution privileges to trigger uncontrolled resource consumption. By running a specially crafted program from within a container or a user session, an attacker can issue a sequence of API calls that drives the kernel module into a deadlock or an infinite loop, which monopolizes critical GPU resources or leaves the kernel driver unresponsive.

The immediate impact is a Denial of Service (DoS): all other processes on the host, including legitimate ML training jobs, inference services, and even the graphical user interface, lose access to the GPU. In a cloud or on-premises Kubernetes cluster where GPU resources are shared among multiple users or services, a single compromised pod could disrupt operations for every other tenant on the same physical node. The vulnerability does not allow privilege escalation, but it can cause significant operational downtime and financial loss.
Affected Systems
Testing Guide
1. **Check Driver Version**: Determine your currently installed NVIDIA driver version by running:

   ```bash
   nvidia-smi
   ```

2. **Compare with Bulletin**: Compare your installed version against the fixed versions listed in the official NVIDIA Security Bulletin for February 2024.
3. **Isolate Workloads**: If immediate patching is not possible, isolate untrusted or experimental workloads on separate physical nodes from critical production workloads.
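The first two steps can be scripted. The following is a minimal sketch, not an official tool: the function names (`fixed_version_for_branch`, `version_ge`, `check_driver`) are hypothetical, the thresholds are the versions given under Patch Details, and it assumes that a driver should be compared only against the fixed version of its own branch (the major number before the first dot). It also relies on GNU `sort -V` for version ordering. Branches not listed in this document are reported as `unknown` and should be checked against the bulletin by hand.

```bash
#!/usr/bin/env bash

# Fixed version for each driver branch, copied from the Patch Details section.
# Assumption: each installed driver is compared only against its own branch.
fixed_version_for_branch() {
    case "$1" in
        550) echo "550.40.07" ;;
        545) echo "545.29.06" ;;
        535) echo "535.154.05" ;;
        470) echo "470.223.02" ;;
        *)   return 1 ;;   # branch not covered by this document
    esac
}

# Succeeds when version $1 is greater than or equal to version $2.
# Uses GNU sort -V: if $2 sorts first (or ties), then $1 >= $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints "patched", "vulnerable", or "unknown" for a driver version string.
check_driver() {
    local installed="$1" branch fixed
    branch="${installed%%.*}"
    if ! fixed="$(fixed_version_for_branch "$branch")"; then
        echo "unknown"
        return
    fi
    if version_ge "$installed" "$fixed"; then
        echo "patched"
    else
        echo "vulnerable"
    fi
}

# Query the live driver only when nvidia-smi is available on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    echo "driver ${installed}: $(check_driver "$installed")"
fi
```

For example, `check_driver 535.86.10` prints `vulnerable`, while `check_driver 550.40.07` prints `patched`. Keeping the comparison in a pure function makes it easy to run against a list of versions collected from many nodes.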
Mitigation Steps
1. **Update Drivers**: Update NVIDIA drivers to the patched versions listed in the NVIDIA security bulletin. For production systems, use the production branch drivers.
2. **Restrict GPU Access**: In multi-tenant environments, use Kubernetes and container runtime security policies to restrict which users and pods can access GPU devices.
3. **Monitor GPU Utilization**: Implement robust monitoring of GPU metrics (memory, utilization, power). Set up alerts for anomalous patterns that could indicate a resource consumption attack.
4. **Apply Principle of Least Privilege**: Ensure that applications and containers accessing the GPU run with the minimum privileges necessary.
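As an illustration of the monitoring step, the sketch below parses `nvidia-smi` CSV output and flags GPUs whose utilization or memory use is at or above a threshold. The function name `flag_gpu_anomalies` and the 95% defaults are hypothetical, chosen only for the example. Sustained high utilization is normal for training jobs, so a real deployment would likely need per-workload baselines and a proper metrics pipeline rather than fixed thresholds.

```bash
#!/usr/bin/env bash

# Hypothetical alert thresholds in percent; tune these for your workloads.
UTIL_THRESHOLD="${UTIL_THRESHOLD:-95}"
MEM_THRESHOLD="${MEM_THRESHOLD:-95}"

# Reads lines of "index, util_percent, mem_used_MiB, mem_total_MiB" on stdin
# and prints one ALERT line for each GPU at or above either threshold.
flag_gpu_anomalies() {
    awk -F', *' -v u="$UTIL_THRESHOLD" -v m="$MEM_THRESHOLD" '
        $4 > 0 {
            mem_pct = ($3 / $4) * 100
            if ($2 + 0 >= u + 0 || mem_pct >= m + 0) {
                printf "ALERT gpu=%s util=%d%% mem=%.0f%%\n", $1, $2, mem_pct
            }
        }'
}

# Sample the live GPUs only when nvidia-smi is available on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits | flag_gpu_anomalies
fi
```

Run from cron or a systemd timer, the ALERT lines could be forwarded to whatever alerting channel the cluster already uses. Because the parser reads from stdin, it can also be exercised with recorded samples instead of a live GPU.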
Patch Details
Patches are available in driver versions 550.40.07, 545.29.06, 535.154.05, and 470.223.02, as well as in later releases.