NVIDIA DCGM Privilege Escalation via Uncontrolled Search Path
Overview
A high-severity vulnerability was identified in NVIDIA's Data Center GPU Manager (DCGM), a suite of tools for managing and monitoring NVIDIA GPUs in cluster environments. The vulnerability, tracked as CVE-2024-0084, allows a local user with basic privileges to escalate to root on the host system. The flaw is in a component of the DCGM diagnostics suite that resolves certain executables through an uncontrolled search path. An attacker who can place a malicious executable in a location that is searched before the legitimate system path (e.g., the current working directory of the DCGM process) can trick the service into running that code with elevated permissions.

In a typical multi-tenant AI/ML environment, where multiple users share GPU resources, a single compromised user account could use this vulnerability to take complete control of the node. The attacker could then access or tamper with other users' data, interfere with sensitive model training jobs, or pivot to other nodes in a Kubernetes cluster. The vulnerability shows how security-critical the infrastructure software underneath modern AI workloads is.
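The search-order problem described above can be reproduced harmlessly, without touching DCGM at all. The sketch below is only an illustration under stated assumptions: the executable name `gpu-helper`, the two directories, and the function `resolve_with_path` are hypothetical and do not correspond to any real DCGM component. It shows that when a program is invoked by bare name, whichever directory comes first in `PATH` decides which binary runs.

```shell
#!/usr/bin/env bash
# Harmless demonstration of search-path hijacking. It never invokes DCGM.
# All names here (gpu-helper, trusted/, untrusted/) are hypothetical.

workdir="$(mktemp -d)"
mkdir "$workdir/trusted" "$workdir/untrusted"

# Two different executables share the same name.
printf '#!/bin/sh\necho trusted\n'   > "$workdir/trusted/gpu-helper"
printf '#!/bin/sh\necho untrusted\n' > "$workdir/untrusted/gpu-helper"
chmod +x "$workdir/trusted/gpu-helper" "$workdir/untrusted/gpu-helper"

# resolve_with_path PATHVALUE: run gpu-helper by bare name under the given PATH.
# The subshell keeps the caller's own PATH unchanged.
resolve_with_path() {
  ( PATH="$1"; gpu-helper )
}

# Same command name, different search order, different program executed.
first="$(resolve_with_path "$workdir/trusted:$workdir/untrusted")"
second="$(resolve_with_path "$workdir/untrusted:$workdir/trusted")"

echo "trusted directory first:   $first"
echo "untrusted directory first: $second"

rm -rf "$workdir"
```

If the calling process runs as root and the first directory is writable by an ordinary user, the second case is the privilege escalation. Invoking helpers by absolute path, or with a fixed and trusted `PATH`, avoids the ambiguity.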
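Affected Systems
NVIDIA DCGM versions earlier than 3.3.5. GPU nodes where unprivileged users have local access, such as shared multi-tenant hosts, are the most exposed.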
Testing Guide
1. **Check DCGM Version:** On a GPU-enabled node, run `dcgmi --version` or check the installed package version (`dpkg -l | grep dcgm` on Debian/Ubuntu). If the version is below 3.3.5, the system is likely vulnerable.
2. **Simulate Attack (CAUTION):** As a non-privileged user, create a harmless marker script named `ls` (or another common command) in a temporary directory: `mkdir -p /tmp/test; printf '#!/bin/bash\ntouch /tmp/pwned\n' > /tmp/test/ls; chmod +x /tmp/test/ls`. The shebang must sit on its own line, otherwise the script will not execute.
3. **Manipulate Path:** Execute a vulnerable DCGM diagnostic command with the malicious directory at the front of the path: `PATH=/tmp/test:$PATH dcgmi diag -r ...`.
4. **Verify Result:** Check whether the file `/tmp/pwned` was created with root ownership (for example, with `ls -l /tmp/pwned`). If so, the system is vulnerable.

**NOTE: Only perform this test on a dedicated, non-production system.**
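The version check in step 1 can be scripted so that it is applied the same way on every node. The following is a minimal sketch, not an official NVIDIA tool: the helper names `version_lt`, `normalize_version`, and `check_dcgm_version` are hypothetical, the comparison relies on `sort -V` (available in GNU coreutils), and the version string is passed in by hand because the exact output format of `dcgmi --version` and the package name may differ between installations.

```shell
#!/usr/bin/env bash
# Sketch: compare an installed DCGM version string against the fixed release.
# Helper names are hypothetical; sort -V is assumed to be available.

FIXED_VERSION="3.3.5"   # first fixed release according to this advisory

# version_lt A B: succeed when version A sorts strictly before version B.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# normalize_version RAW: strip a Debian-style epoch ("1:") and revision ("-1").
normalize_version() {
  local v="${1#*:}"
  printf '%s\n' "${v%%-*}"
}

# check_dcgm_version RAW: report whether the given version predates the fix.
check_dcgm_version() {
  local installed
  installed="$(normalize_version "$1")"
  if version_lt "$installed" "$FIXED_VERSION"; then
    echo "DCGM $installed: likely vulnerable, upgrade to $FIXED_VERSION or later"
  else
    echo "DCGM $installed: at or above $FIXED_VERSION"
  fi
}

# Hard-coded examples; on a real node, pass the version reported by
# `dcgmi --version` or by the package manager instead.
check_dcgm_version "3.3.3"
check_dcgm_version "1:3.3.5-1"
```

A version-aware comparison matters here because a plain string comparison would rank `3.3.10` below `3.3.5`.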
Mitigation Steps
1. **Update DCGM:** Upgrade the DCGM package to version 3.3.5 or later on all GPU nodes.
2. **Restrict User Permissions:** Follow the principle of least privilege. Limit shell access on GPU nodes and ensure user processes cannot write to directories in the system's `PATH`.
3. **Harden File Permissions:** Ensure that the directories from which DCGM components are executed have secure permissions and are not writable by non-privileged users.
4. **Use Secure Base Images:** When running in containers, ensure the base images are hardened and that the DCGM diagnostics are run in a context with a minimal set of permissions.
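The permission checks in steps 2 and 3 can be partly automated. The sketch below flags search-path entries that are relative (including empty entries, which mean "current directory") or world-writable. It is an illustrative audit only: the function name `audit_path` is hypothetical, it looks at world-writable bits but not group permissions or ACLs, and it assumes a `find` that supports `-L`, `-maxdepth`, and `-perm -mode`, as GNU and BSD `find` do.

```shell
#!/usr/bin/env bash
# Sketch: flag risky entries in a PATH-style, colon-separated string.
# audit_path is a hypothetical helper, not part of DCGM.

audit_path() {
  local entries entry
  IFS=':' read -ra entries <<< "$1"
  for entry in "${entries[@]}"; do
    if [ -z "$entry" ] || [ "${entry#/}" = "$entry" ]; then
      # Empty or non-absolute entries resolve relative to the working directory.
      echo "RELATIVE: '${entry}'"
    elif [ -d "$entry" ] &&
         [ -n "$(find -L "$entry" -maxdepth 0 -perm -0002 2>/dev/null)" ]; then
      # -L follows symlinks such as /bin -> usr/bin, so the target is checked.
      echo "WORLD-WRITABLE: $entry"
    fi
  done
}

# Self-contained demonstration on throwaway directories.
demo="$(mktemp -d)"
mkdir "$demo/open" "$demo/closed"
chmod 0777 "$demo/open"
chmod 0755 "$demo/closed"

report="$(audit_path "$demo/closed:$demo/open:.")"
printf '%s\n' "$report"

rm -rf "$demo"
```

On a GPU node, `audit_path "$PATH"` can be run in the environment of the account or service that launches the DCGM diagnostics. Any line of output is worth investigating.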
Patch Details
The vulnerability is fixed in NVIDIA DCGM version 3.3.5 and in later releases, including those shipped with the corresponding driver updates.