GPU Operator v1.0.0 Release Notes

This is the first major release of the AMD GPU Operator. The AMD GPU Operator simplifies the deployment and management of AMD Instinct™ GPU accelerators within Kubernetes clusters. It enables configuration and operation of GPU-accelerated workloads, including machine learning, generative AI, and other GPU-intensive applications.

Release Highlights

Manage AMD GPU drivers at desired versions on Kubernetes cluster nodes
Customize scheduling of AMD GPU workloads within a Kubernetes cluster
Monitor metrics and statistics for AMD GPU hardware and workloads
Support specialized networking environments such as HTTP proxy or air-gapped networks

Hardware Support

New Hardware Support

AMD Instinct™ MI300
- Required driver version: ROCm 6.2+
AMD Instinct™ MI250
- Required driver version: ROCm 6.2+
AMD Instinct™ MI210
- Required driver version: ROCm 6.2+

Platform Support

New Platform Support

Kubernetes 1.29+
- Supported features:
  - Driver management
  - Workload scheduling
  - Metrics monitoring
- Requirements: Kubernetes version 1.29+

Breaking Changes

Not applicable, as this is the initial release.

New Features

Feature Category

Driver management
- Managed Driver Installations: Users can install the ROCm 6.2+ DKMS driver on Kubernetes worker nodes. Alternatively, they can choose to use the inbox driver or a driver that is already pre-installed on the worker nodes
- DeviceConfig Custom Resource: Users can create a DeviceConfig custom resource, backed by a new CRD (Custom Resource Definition), to define the driver management behavior of the GPU Operator
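As a rough illustration of the DeviceConfig resource described above, the sketch below writes a minimal manifest to a local file. The resource name `gpu-operator` and the namespace `kube-amd-gpu` match the patch command used later in these notes. The `apiVersion`, the `spec.driver` fields, the example driver version, and the `spec.selector` field are assumptions and should be checked against the CRD installed in your cluster.

```shell
# Write a minimal DeviceConfig manifest to a local file.
# ASSUMPTIONS: apiVersion, spec.driver.*, and spec.selector are illustrative
# and may differ from the CRD schema shipped with your operator version.
cat > deviceconfig.yaml <<'EOF'
apiVersion: amd.com/v1alpha1   # assumed API group/version
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true               # assumed field: let the operator manage the driver
    version: "6.2.2"           # assumed field and example version
  selector:                    # assumed field: nodes this config applies to
    feature.node.kubernetes.io/amd-gpu: "true"
EOF
```

After reviewing the file, it could be applied with `kubectl apply -f deviceconfig.yaml`. Running `kubectl explain deviceconfig.spec` shows the fields that the installed CRD actually accepts.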
GPU Workload Scheduling
- Custom Resource Allocation “amd.com/gpu”: After the GPU Operator is deployed, each GPU node advertises a new resource, amd.com/gpu, which lists the allocatable GPUs on the node against which GPU workloads can be scheduled
- Assign Multiple GPUs: Users can specify the number of AMD GPUs required by each workload in the deployment/pod spec, and the Kubernetes scheduler will automatically assign the correct GPU resources
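To make the amd.com/gpu resource request concrete, here is a minimal sketch that writes a Pod manifest asking for two GPUs. The pod name, container name, image, and command are placeholders chosen for illustration; only the `amd.com/gpu` resource name comes from these notes.

```shell
# Write a Pod manifest that requests two AMD GPUs through the amd.com/gpu resource.
# ASSUMPTIONS: the pod name, container name, image, and command are placeholders.
cat > gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-example
spec:
  restartPolicy: Never
  containers:
    - name: gpu-container
      image: rocm/rocm-terminal    # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          amd.com/gpu: 2           # number of GPUs requested for this container
EOF
```

The manifest could then be applied with `kubectl apply -f gpu-pod.yaml`. To see how many GPUs a node reports as allocatable, one option is `kubectl get node <node-name> -o jsonpath='{.status.allocatable.amd\.com/gpu}'`.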
Metrics Monitoring for GPUs and Workloads:
- Out-of-box Metrics: Users can optionally enable the AMD Device Metrics Exporter when installing the AMD GPU Operator, which provides an out-of-box monitoring solution that Prometheus can consume
- Custom Metrics Configurations: Users can use a ConfigMap to customize the configuration and behavior of the Device Metrics Exporter
Specialized Network Setups:
- Air-gapped Installation: Users can install the GPU Operator in a secure air-gapped environment where the Kubernetes cluster has no external network connectivity
- HTTP Proxy Support: The AMD GPU Operator supports usage within a Kubernetes cluster that is behind an HTTP Proxy. Support for HTTPS Proxy will be added in a future version of the GPU Operator.

Known Limitations

The GPU Operator driver installation installs only the DKMS package
- Impact: Applications that require ROCm packages must install the respective packages themselves
- Affected Configurations: All configurations
- Workaround: None, as this is the intended behavior
When the GPU Operator is used to install amdgpu 6.1.3/6.2, a reboot is required to complete the install
- Impact: The node requires a reboot when an upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in dmesg
- Affected Configurations: Nodes with driver version >= ROCm 6.2.x
- Workaround: Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+
The GPU Operator cannot install the amdgpu driver if a driver is already installed
- Impact: The driver install fails if the amdgpu in-box driver is present or another amdgpu driver is already installed
- Affected Configurations: All configurations
- Workaround: Before installing the amdgpu driver with the GPU Operator, either blacklist the in-box amdgpu driver on the worker nodes so that it is not loaded, or remove the pre-installed driver from the nodes
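One common way to blacklist a kernel module on Linux is a modprobe configuration file. The helper below is a minimal sketch of that approach, not a procedure taken from these notes; the function name and the file name `blacklist-amdgpu.conf` are arbitrary choices.

```shell
# Write a modprobe rule that prevents the in-box amdgpu module from being
# loaded automatically.
# ASSUMPTION: the file name blacklist-amdgpu.conf is arbitrary; modprobe reads
# every *.conf file in /etc/modprobe.d.
blacklist_amdgpu() {
  conf_dir="${1:-/etc/modprobe.d}"
  printf 'blacklist amdgpu\n' > "${conf_dir}/blacklist-amdgpu.conf"
}
```

Run as root on each worker node, `blacklist_amdgpu` writes the rule to `/etc/modprobe.d`; a reboot then brings the node up without the module loaded. On distributions that load amdgpu from the initramfs, regenerating the initramfs may also be necessary.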
When the GPU Operator is used in SKIP driver install mode and the amdgpu module is removed while the device plugin is installed, the device plugin does not reflect the active GPUs available on the server
- Impact: Workload scheduling is affected, because workloads may be scheduled on nodes that do not have an active GPU
- Affected Configurations: All configurations
- Workaround: Restart the device plugin pod on the affected node
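A hedged sketch of how the device plugin pod might be restarted with kubectl. It assumes the operator's pods run in the `kube-amd-gpu` namespace (the namespace used in the patch command later in these notes) and that the device plugin pod is managed by a controller that recreates it after deletion. The helper names are hypothetical, and the pod name must be read from the listing.

```shell
# ASSUMPTION: the operator's pods run in the kube-amd-gpu namespace; pass a
# different namespace as the second argument if yours differs.

# List the pods in the namespace that run on one node, to find the device
# plugin pod for that node.
pods_on_node() {
  kubectl get pods -n "${2:-kube-amd-gpu}" -o wide --field-selector "spec.nodeName=$1"
}

# Delete one pod by name so that its controller can recreate it.
restart_pod() {
  kubectl delete pod "$1" -n "${2:-kube-amd-gpu}"
}
```

For example, `pods_on_node <node-name>` lists the candidates, and `restart_pod <device-plugin-pod-name>` deletes the chosen pod.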
Worker nodes whose kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed
- Impact: Node upgrade will not proceed automatically and requires manual intervention
- Affected Configurations: All configurations
- Workaround: Manually mark the node as unschedulable by cordoning it, which prevents new pods from being scheduled on it:
```
kubectl cordon <node-name>
```
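Cordoning only stops new pods from being scheduled on the node. The sketch below shows the standard kubectl steps that usually surround node maintenance: draining running workloads before the kernel upgrade and uncordoning afterwards. These steps are not spelled out in these notes, the helper names are hypothetical, and whether uncordoning is sufficient or the node must be fully removed and rejoined depends on your cluster setup.

```shell
# Evict workloads from a node before kernel maintenance.
# DaemonSet-managed pods are left in place; emptyDir data is discarded.
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Allow scheduling on the node again once maintenance is finished.
uncordon_node() {
  kubectl uncordon "$1"
}
```

A typical sequence would be `drain_node <node-name>`, the kernel upgrade and reboot, then `uncordon_node <node-name>`.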
When the GPU Operator is installed with the Metrics Exporter enabled, driver upgrades are blocked because the exporter is actively using the amdgpu module
- Impact: Driver upgrade is blocked
- Affected Configurations: All configurations
- Workaround: Disable the Metrics Exporter on the specific node to allow the driver upgrade, as follows:
1. Label all nodes with a new label:
```
kubectl label nodes --all amd.com/device-metrics-exporter=true
```
2. Patch the DeviceConfig to include the new selectors for the metrics exporter:
```
kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' -p '{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}'
```
3. Remove the amd.com/device-metrics-exporter label for the specific node you would like to disable the exporter on:
```
kubectl label node [node-to-exclude] amd.com/device-metrics-exporter-
```
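Once the driver upgrade on the excluded node has finished, the exporter can presumably be brought back by restoring the label that the selector matches on. The sketch below wraps that step and a quick check of which nodes carry the label; the helper names are hypothetical, and only the label key and value come from the steps above.

```shell
# Re-add the label so that the metrics exporter selector matches the node again.
enable_exporter_on_node() {
  kubectl label node "$1" amd.com/device-metrics-exporter=true
}

# List the nodes that currently carry the exporter label.
exporter_nodes() {
  kubectl get nodes -l amd.com/device-metrics-exporter=true
}
```

For example, `enable_exporter_on_node <node-name>` restores the label, and `exporter_nodes` confirms which nodes are selected.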