GPU Scheduling on Kubernetes: Why Quotas and Slicing End the 'First Come, First Served' Era

In many machine learning teams, an unwritten rule prevails: first come, first served. Whoever starts the first training job in the morning occupies the GPU, often for the entire day. The remaining data scientists wait, switch to slow CPU instances, or quietly rent expensive shadow-IT capacity in the public cloud.
This “Wild West scenario” in hardware usage is not only inefficient, it stifles innovation and causes costs to skyrocket. The solution lies not in more hardware, but in intelligent GPU scheduling and resource quotas.
Without central orchestration, a GPU is treated as an indivisible unit. This leads to two extreme inefficiencies:
- Underutilization: a single job blocks an entire card even when it uses only a fraction of its compute and memory, so expensive hardware sits largely idle.
- Queuing: everyone else waits for the card to free up, falls back to slow CPU instances, or buys unmanaged capacity in the cloud.
By deploying the NVIDIA GPU Operator on Kubernetes, we transform graphics cards from isolated hardware islands into a shared platform resource.
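To make this concrete, here is a minimal sketch of the baseline the operator enables: once its device plugin is running, a whole GPU is requested like any other Kubernetes resource. The pod name, image, and entrypoint below are illustrative.

```yaml
# Baseline without slicing: the pod claims one entire physical GPU.
apiVersion: v1
kind: Pod
metadata:
  name: training-job                            # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example CUDA-enabled image
      command: ["python", "train.py"]           # hypothetical entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1                     # resource exposed by the operator's device plugin
```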
Instead of always allocating a GPU as a whole, we use technologies like Multi-Instance GPU (MIG) or Multi-Process Service (MPS). This allows physical cards to be divided into logical “slices.”
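As an illustration, assuming an A100 with MIG enabled under the operator's "mixed" strategy, a small job can request a single slice instead of the whole card; 1g.5gb is one of the standard A100 MIG profiles.

```yaml
# Requests one 1g.5gb MIG slice (1/7 of an A100's compute, 5 GB memory)
# instead of blocking the full GPU.
apiVersion: v1
kind: Pod
metadata:
  name: small-experiment                        # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: notebook
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1              # MIG device name under the "mixed" strategy
```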
Not every job is equally important. In Kubernetes, we define Priority Classes, for example (sketched below):
- production-inference: customer-facing workloads, highest priority, may preempt others.
- training: scheduled training runs, medium priority.
- experiment: ad-hoc experiments, lowest priority, first to be evicted.
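The following manifest is a minimal sketch of such a tiered setup; the class names and values are illustrative, not a fixed convention:

```yaml
# Three tiers: the higher value wins when the scheduler has to choose or preempt.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-inference
value: 1000000            # highest: customer-facing workloads
globalDefault: false
description: "Customer-facing inference; may preempt lower tiers."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training
value: 100000             # medium: scheduled training runs
globalDefault: false
description: "Planned training jobs."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: experiment
value: 1000               # lowest: ad-hoc experiments, first to be evicted
globalDefault: true       # assumption: anything unlabeled lands in this tier
description: "Ad-hoc experiments; preemptible."
```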
To prevent a single project from consuming the entire budget, we set quotas at the namespace level. Each team (e.g., “Computer Vision” vs. “NLP”) receives a fixed allocation of GPU hours or slices. Once the allocation is exhausted, jobs must wait or be prioritized. This creates transparency and forces conscious resource planning.
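A sketch of such a namespace-level cap follows; the namespace name and limit are examples. Note that a plain Kubernetes ResourceQuota caps concurrently requested GPUs (or MIG slices), while hour-based budgets need an additional accounting layer on top.

```yaml
# Caps the computer-vision team's namespace at 4 concurrently requested GPUs.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-computer-vision       # assumption: one namespace per team
spec:
  hard:
    requests.nvidia.com/gpu: "4"        # pods beyond this limit stay Pending
```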
Intelligent GPU management makes the difference between a hobby project and a scalable AI department. When hardware utilization rises from 20% to 80%, the same fleet does four times the work, cutting the effective cost per experiment to roughly a quarter.
For one of our clients, this was exactly the turning point: The hardware remained the same, but the number of parallel experiments tripled—simply through fair rules and technical scheduling.
Why isn’t it enough to just buy more GPUs? Hardware is expensive and often hard to come by. Without scheduling, more hardware only leads to more unused idle time. Only through intelligent sharing (slicing) can you achieve the economics that make AI projects sustainable in the long term.
What happens if a high-priority job needs a GPU that is occupied? Kubernetes uses “preemption”: the scheduler evicts less important pods (e.g., an experiment) to free up space for the high-priority job (e.g., inference for a customer). An evicted job managed by a controller is rescheduled automatically as soon as capacity becomes available again.
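To make a workload eligible for this behavior, the pod simply references its priority class. A minimal sketch reusing the illustrative classes from above (image and name are examples):

```yaml
# A customer-facing inference pod that may preempt lower-priority GPU jobs.
apiVersion: v1
kind: Pod
metadata:
  name: inference-api                                # illustrative name
spec:
  priorityClassName: production-inference            # class defined earlier (illustrative)
  containers:
    - name: server
      image: nvcr.io/nvidia/tritonserver:24.01-py3   # example serving image
      resources:
        limits:
          nvidia.com/gpu: 1
```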
Does GPU slicing work with any graphics card? True hardware slicing (MIG) requires modern NVIDIA cards (Ampere architecture or newer, e.g., A100, H100). For older or smaller cards, we use software solutions like MPS or time-slicing to achieve similar efficiency gains.
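As a sketch of the software route, the GPU Operator's device plugin supports time-slicing via a ConfigMap along these lines (the replica count is illustrative):

```yaml
# Advertises each physical GPU as 4 schedulable replicas via time-slicing.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4        # one card now accepts up to 4 pods
```

The operator is then pointed at this ConfigMap (for example via the ClusterPolicy's devicePlugin.config settings). Unlike MIG, time-slicing shares compute without memory isolation, so it suits trusted, bursty workloads rather than hard multi-tenancy.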
Can data scientists manage their own quotas? Yes, through dashboards (e.g., in Grafana), each team can immediately see how much of their allocation has been used. This promotes self-responsibility and prevents unpleasant surprises at the end of the month.
How does ayedo support the setup of GPU clusters? We configure the entire stack: from the driver to the GPU operator to quotas and monitoring dashboards. Our goal is for your data scientists to focus on the models while we optimize the “engine room” for computing power.