Operator Overview#
The AMD GPU Operator consists of several key components that work together to manage AMD GPUs in Kubernetes clusters. This document provides an overview of each component and its role in the system.
Core Components#
Controller Manager#
The AMD GPU Operator Controller Manager is the central control component that manages the operator’s custom resources. Its primary responsibilities include:
Managing the DeviceConfig custom resource
Running reconciliation loops to maintain desired state
Coordinating driver installation, upgrades, and removal
Managing the lifecycle of dependent components (device plugin, node labeller, metrics exporter)
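The reconciliation idea above can be illustrated with a small, self-contained sketch. It is not the operator's actual controller code: the `NodeState` type, the `reconcile` function, and the version strings are hypothetical names chosen for illustration. The sketch compares a desired driver version with the version observed on each node and returns the action needed to converge.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NodeState:
    """Observed driver version on one node; None means no driver is loaded."""
    driver_version: Optional[str]


def reconcile(desired_version: Optional[str],
              observed: Dict[str, NodeState]) -> Dict[str, str]:
    """Return the action each node needs to reach the desired driver version.

    Nodes that already match the desired state are left out of the result,
    so running reconcile again after convergence yields an empty dict.
    """
    actions: Dict[str, str] = {}
    for node, state in observed.items():
        if state.driver_version == desired_version:
            continue  # already converged, nothing to do
        if desired_version is None:
            actions[node] = "remove"
        elif state.driver_version is None:
            actions[node] = "install"
        else:
            # Any version mismatch is treated as an upgrade in this sketch.
            actions[node] = "upgrade"
    return actions


if __name__ == "__main__":
    # Hypothetical cluster state and version numbers.
    cluster = {
        "node-a": NodeState(None),
        "node-b": NodeState("6.2"),
        "node-c": NodeState("6.3"),
    }
    print(reconcile("6.3", cluster))
```

A real controller would read the desired state from the DeviceConfig resource and re-run this comparison whenever the resource or the cluster changes; the sketch only shows the comparison step.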
Node Feature Discovery (NFD)#
The Node Feature Discovery (NFD) component automatically detects and labels nodes with AMD GPU hardware. Key features include:
Detection of AMD GPUs using PCI vendor and device IDs
Automatic node labeling with feature.node.kubernetes.io/amd-gpu: "true"
Hardware capability discovery and reporting
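As a rough illustration of detection by PCI vendor and device ID, the sketch below derives the node label from a list of PCI device records. It does not reproduce NFD's own rule format. The `gpu_labels` function, the record layout, and the device IDs in the example are hypothetical placeholders; the only fixed value is `1002`, the PCI vendor ID assigned to AMD/ATI.

```python
from typing import Dict, Iterable, Mapping, Set

AMD_PCI_VENDOR_ID = "1002"  # PCI vendor ID assigned to AMD/ATI
GPU_LABEL = "feature.node.kubernetes.io/amd-gpu"


def gpu_labels(pci_devices: Iterable[Mapping[str, str]],
               known_device_ids: Set[str]) -> Dict[str, str]:
    """Return the AMD GPU node label if any device matches vendor and device ID.

    Each record is expected to carry hexadecimal "vendor" and "device" strings
    without a 0x prefix; known_device_ids must be given in lowercase.
    """
    for dev in pci_devices:
        vendor = dev["vendor"].lower()
        device = dev["device"].lower()
        if vendor == AMD_PCI_VENDOR_ID and device in known_device_ids:
            return {GPU_LABEL: "true"}
    return {}  # no matching device: the node gets no GPU label


if __name__ == "__main__":
    # Placeholder device IDs, not real AMD product IDs.
    supported = {"aaaa", "bbbb"}
    devices = [
        {"vendor": "8086", "device": "aaaa"},  # wrong vendor, ignored
        {"vendor": "1002", "device": "AAAA"},  # matches after lowercasing
    ]
    print(gpu_labels(devices, supported))
```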
Kernel Module Management (KMM)#
The Kernel Module Management (KMM) Operator handles the lifecycle of GPU driver kernel modules. Its responsibilities include:
Loading, upgrading, and unloading host kernel modules
Managing containerized driver operations
Coordinating with the Controller Manager for driver lifecycle events
Note
Kubernetes: Use the AMD-optimized KMM Operator provided by the GPU Operator Helm chart
Component Interaction#
The components work together in the following sequence:
NFD identifies worker nodes with AMD GPUs
Controller Manager processes DeviceConfig custom resources
KMM handles driver operations based on configuration
Device Plugin registers amd.com/gpu allocatable resources on the node
Node Labeller adds detailed GPU information to node labels
Metrics Exporter provides ongoing monitoring
Plugins and Extensions#
Device Plugin#
The AMD GPU Device Plugin enables GPU resource allocation in Kubernetes:
Implements the Kubernetes Device Plugin API
Registers AMD GPUs as allocatable resources
Enables GPU resource requests and limits in pod specifications
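To show what a GPU request looks like in a pod specification, the following sketch builds a minimal Pod manifest that asks for `amd.com/gpu` and prints it as JSON, which `kubectl apply -f -` accepts alongside YAML. The `gpu_pod_manifest` helper, the pod name, and the image name are hypothetical; only the resource name `amd.com/gpu` comes from this document.

```python
import json
from typing import Any, Dict


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> Dict[str, Any]:
    """Build a Pod manifest whose single container requests AMD GPUs.

    Only limits are set: for extended resources such as amd.com/gpu,
    Kubernetes uses the limit as the request when no request is given.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {"limits": {"amd.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod and image names.
    manifest = gpu_pod_manifest("gpu-test", "example.com/rocm-workload:latest")
    print(json.dumps(manifest, indent=2))
```

Extended resources must be requested in whole numbers, so `gpus` is an integer here.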
Node Labeller#
The Node Labeller provides detailed GPU information through node labels:
Automatically detects GPU properties
Adds detailed GPU-specific labels to nodes
Enables fine-grained pod scheduling based on GPU capabilities
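Label-based scheduling can be illustrated with the matching rule that a Kubernetes `nodeSelector` applies: a node qualifies only if it carries every requested key with exactly the requested value. The sketch below is a simplified model of that rule, not scheduler code. The label keys and values (`amd.com/gpu.family`, `amd.com/gpu.vram`) are assumptions used for illustration and may differ from the labels the Node Labeller actually emits.

```python
from typing import Dict, List


def matches(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """True if the node carries every selector key with the exact same value."""
    return all(node_labels.get(key) == value
               for key, value in node_selector.items())


def eligible_nodes(nodes: Dict[str, Dict[str, str]],
                   node_selector: Dict[str, str]) -> List[str]:
    """Return the names of nodes that satisfy the selector, sorted by name."""
    return sorted(name for name, labels in nodes.items()
                  if matches(labels, node_selector))


if __name__ == "__main__":
    # Hypothetical node labels; real keys and values may differ.
    nodes = {
        "node-a": {"amd.com/gpu.family": "AI", "amd.com/gpu.vram": "64G"},
        "node-b": {"amd.com/gpu.family": "AI", "amd.com/gpu.vram": "16G"},
        "node-c": {},  # no GPU labels at all
    }
    print(eligible_nodes(nodes, {"amd.com/gpu.vram": "64G"}))
```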
Metrics Exporter#
The Device Metrics Exporter provides monitoring capabilities:
Exports GPU metrics in Prometheus format
Monitors GPU utilization, temperature, and health
Enables integration with monitoring systems
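For readers who want to inspect exporter output without a full monitoring stack, the sketch below parses a simplified subset of the Prometheus text exposition format and filters samples by value. The metric name `gpu_temperature_celsius` and the `gpu_id` label are hypothetical and are not taken from the exporter's metric list. The parser ignores comment lines and timestamps and does not handle escaped quotes or braces inside label values.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric_name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse sample lines into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def above(samples: List[Sample], metric: str, threshold: float) -> List[Sample]:
    """Return the samples of one metric whose value exceeds the threshold."""
    return [s for s in samples if s[0] == metric and s[2] > threshold]


if __name__ == "__main__":
    # Hypothetical exporter output.
    text = (
        "# HELP gpu_temperature_celsius GPU temperature\n"
        "# TYPE gpu_temperature_celsius gauge\n"
        'gpu_temperature_celsius{gpu_id="0"} 54\n'
        'gpu_temperature_celsius{gpu_id="1"} 91.5\n'
    )
    for name, labels, value in above(parse_metrics(text), "gpu_temperature_celsius", 85.0):
        print(name, labels, value)
```

In practice a Prometheus server would scrape and store these metrics; parsing by hand is mainly useful for quick checks and tests.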