No More Idle Time: Rightsizing Tools for Efficient Kubernetes Clusters
In the traditional server world, the mantra was: “Better too much RAM than too little.” …

Integrating an NVIDIA H100 or A100 into your cluster today quickly reveals that the classic 1-to-1 allocation (one pod reserves an entire GPU) often results in massive capital waste in production. While LLM training saturates the hardware, GPUs serving inference or sitting in development environments often hover around 10% utilization.
To reduce the TCO (Total Cost of Ownership) of your AI infrastructure, we must move beyond simple allocation and delve deep into Resource Management.
To allow multiple pods to share a physical GPU without interfering with each other, there are three established technical approaches today:
Multi-Instance GPU (MIG) partitions a GPU at the hardware level into up to seven independent instances, each with its own slice of compute, cache, and VRAM.
Time-Slicing uses the classic time-sharing approach: multiple processes take turns on the GPU in extremely short intervals, with the driver switching contexts between them.
Multi-Process Service (MPS) allows multiple processes to execute kernels on the GPU simultaneously, sharing its compute resources without hardware isolation. (A configuration sketch for the first two approaches follows below.)
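To make this concrete, here is a minimal sketch of the two most common sharing modes in Kubernetes, assuming the NVIDIA GPU Operator / device plugin is installed: a ConfigMap that enables time-slicing (each physical GPU is advertised as four schedulable replicas) and a pod that requests a hardware-isolated MIG slice. The config key, the MIG profile name `nvidia.com/mig-1g.10gb`, and the container image are illustrative and depend on your GPU model, MIG strategy, and operator version.

```yaml
# Time-slicing: the device plugin advertises each physical GPU as 4 replicas.
# How this ConfigMap is referenced (e.g. via the GPU Operator's ClusterPolicy)
# depends on your installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  default: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
---
# MIG: with the "mixed" MIG strategy, each hardware slice is exposed as its own
# resource type. The 1g.10gb profile below is an H100 example.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
  - name: server
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # example image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
```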
For a long time, the Device Plugin Framework was a technical bottleneck in Kubernetes: it treated GPUs as simple countable units (integers), with no way to express their properties. With the introduction of Dynamic Resource Allocation (DRA) in newer Kubernetes versions, the game changes fundamentally.
DRA allows us to define resources much more flexibly. Instead of just saying "I need a GPU," we can specify complex requirements: "I need a GPU with at least 40 GB of VRAM and NVLink connectivity to its neighbor." This is essential for modern AI superclusters, where network latency between GPUs (RDMA/RoCE) is as crucial as raw computational power.
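As a sketch of what such a request can look like with DRA: a ResourceClaim asks for "a GPU with at least 40 GiB of VRAM," and the pod consumes that claim instead of a plain integer resource. This assumes a DRA-capable driver (e.g. NVIDIA's DRA driver) is installed; the API version, the device class name `gpu.nvidia.com`, and the capacity attribute used in the CEL selector vary by Kubernetes release and driver and are assumptions here.

```yaml
# A ResourceClaim describing the desired device, not just a count.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: large-vram-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          # Capacity/attribute names are driver-specific; this is illustrative.
          expression: "device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('40Gi')) >= 0"
---
# The pod references the claim; the scheduler picks a node with a matching device.
apiVersion: v1
kind: Pod
metadata:
  name: finetune-job
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: large-vram-gpu
  containers:
  - name: trainer
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime   # example image
    resources:
      claims:
      - name: gpu
```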
Hardware sharing is only half the battle. The other half is Queue Management.
If three teams want to train a model simultaneously but only two GPUs are available, the cluster must not simply fail with endlessly pending pods or out-of-memory errors. Here, we rely on Kueue. It acts as a job queue manager on top of Kubernetes, deciding based on priorities and resource quotas which workload gets access to the expensive hardware, and when.
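A minimal sketch of that setup: a ClusterQueue owns the GPU quota, a LocalQueue exposes it to one team's namespace, and a training Job lines up in the queue via a label. Names, namespaces, and quota values are placeholders.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}            # accept workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-gpu
      resources:
      - name: "cpu"
        nominalQuota: 32
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2            # only two GPUs to hand out
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: team-a
spec:
  clusterQueue: gpu-cluster-queue
---
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a   # hand the Job over to Kueue
spec:
  suspend: true                         # Kueue unsuspends it once quota is free
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime   # example image
        resources:
          limits:
            nvidia.com/gpu: 1
```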
In combination with Karpenter (instead of the standard Cluster Autoscaler), we can also ensure that we provision exactly the node types that are most cost-effective for the specific job – for example, spot instances for non-critical batch jobs.
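A Karpenter NodePool restricted to spot capacity for GPU batch work could look roughly like this (Karpenter v1 API on AWS assumed; the instance types and the EC2NodeClass named `default` are placeholders):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-batch-spot
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]                 # cheap, interruptible capacity only
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g5.xlarge", "g5.2xlarge", "g6.xlarge"]
      taints:
      - key: nvidia.com/gpu
        effect: NoSchedule               # keep non-GPU workloads off these nodes
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    nvidia.com/gpu: 8                    # cap total GPU capacity for this pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```

Non-critical batch jobs then tolerate the taint and land on this pool, while latency-sensitive inference stays on on-demand nodes.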
AI infrastructure in mid-sized businesses today means: Maximizing investment returns. Simply “passing through” GPUs costs too much. Only by combining hardware partitioning (MIG), modern resource scheduling (DRA), and intelligent queue management does your cluster become a true AI factory.
What is the difference between MIG and vGPU? NVIDIA vGPU is a software-based solution often used in virtualization environments (VDI) and requires licenses per user. MIG is a hardware feature of newer Tensor-Core GPUs (Ampere architecture and newer), partitioned directly on the chip, and does not incur additional licensing fees within Kubernetes.
When should I avoid GPU sharing? During large model training (e.g., fine-tuning a Llama-3-70B). Here, you need the full memory bandwidth and entire VRAM of one or more GPUs. Any partitioning would significantly slow down the process or cause it to crash.
How do I monitor actual GPU utilization? Do not rely on standard K8s metrics. You need the NVIDIA DCGM Exporter, which exports metrics such as GPU utilization, framebuffer (FB) memory usage, and even temperature directly into your Prometheus/VictoriaMetrics setup.
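The raw metric names exported by the DCGM Exporter include `DCGM_FI_DEV_GPU_UTIL` (utilization in percent), `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` (framebuffer memory in MiB), and `DCGM_FI_DEV_GPU_TEMP` (temperature). A sketch of alerting rules built on top of them, assuming the Prometheus Operator (PrometheusRule CRD) and the exporter's default labels such as `gpu` and `Hostname`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-rightsizing
  namespace: monitoring
spec:
  groups:
  - name: gpu-efficiency
    rules:
    # Flag GPUs that sit below 10% utilization for half an hour.
    - alert: GpuAllocatedButIdle
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
      for: 30m
      labels:
        severity: info
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is below 10% utilization"
    # Warn before a shared GPU runs out of framebuffer memory.
    - alert: GpuMemoryNearlyFull
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} exceeds 90% VRAM usage"
```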
Is your GPU hardware optimally utilized? The architecture determines your cloud bill. At ayedo, we analyze your workloads and implement the appropriate sharing and scheduling strategies to maximize your performance and minimize costs.