AMD GPU Operator Documentation

AMD GPU Operator Documentation#

The AMD GPU Operator simplifies the deployment and management of AMD Instinct GPU accelerators within Kubernetes clusters. This project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, Generative AI, and other GPU-intensive applications.

Features#

Automated driver installation and management
Easy deployment of the AMD GPU device plugin
Metrics collection and export
Support for Vanilla Kubernetes
Simplified GPU resource allocation for containers
Automatic worker node labeling for GPU-enabled nodes

Compatibility#

Kubernetes: 1.29.0
Please refer to the ROCm documentation for the compatability matrix for the AMD GPU DKMS driver.

Prerequisites#

Helm v3.2.0+
kubectl CLI tool configured to access your cluster

Quick Start#

Add the Helm repository:

helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

Install the AMD GPU Operator:

helm install amd-gpu-operator rocm/gpu-operator-charts --namespace kube-amd-gpu --create-namespace

Verify the installation:

kubectl get pods -n kube-amd-gpu

Support#

For bugs and feature requests, please file an issue on our GitHub Issues page.