
From GPU Bottlenecks to Industrial-Scale MLOps: How ayedo Led Sensoriq to a Kubernetes-Based ML Platform

machine-learning mlops predictive-maintenance edge-computing streaming-data gpu-infrastructure cloud-platform

Predictive Maintenance sounds like “train a model and you’re done.” In practice, many projects fail not because of the model, but because of what comes after: data streams, inference SLAs, reproducible experiments, and an infrastructure that scales without each new customer project triggering a new operational project.

Sensoriq develops AI-based solutions for the manufacturing industry. The software analyzes sensor data in real-time and predicts failures before they occur. The product consists of edge components at the machine, a streaming pipeline, and a cloud platform for training, inference, and visualization.

Several pilot customers were running stably. However, the leap from pilot to scalable product operation failed because of an infrastructure that had grown organically and was not designed for the next level.


Initial Situation: A Functioning Product That Couldn’t Grow Operationally

The team was well-equipped: data scientists, ML engineers, backend developers, sensor and edge specialists. What was missing was a shared, reproducible engine room for MLOps.

Training ran on a mix of local workstations with NVIDIA GPUs, a small on-prem server with two A100 cards, and occasionally booked cloud GPU instances with a hyperscaler. The streaming pipeline was built on manually managed VMs: Kafka on three machines, plus custom Python consumers. Inference endpoints ran as individual Flask processes on dedicated GPU servers.

This setup is typical for the early phase: fast, pragmatic, product-oriented. But it collapses as soon as multiple teams work in parallel, multiple customers are onboarded, and real SLAs come into play.


Where It Squeaked: GPU Bottleneck, No Reproducibility, Fragile Streaming

The first bottleneck was the GPU resources. Two A100 cards for twelve data scientists—without scheduling, without quotas, without isolation. Whoever came first occupied the GPU. Small jobs blocked large models. Large training runs blocked everything. There was no way to cleanly divide GPU capacity or allocate it fairly.

In parallel, experiments were not reproducible. Everyone had their own local environment: different CUDA versions, different Python dependencies, different framework versions. A model that ran on one notebook could not be reliably reproduced on another machine. Onboarding became a setup project.

The most critical gap, however, was between experiment and production. Models were created in Jupyter Notebooks, then manually transferred to scripts, packaged in containers, and deployed on GPU servers. This transition took weeks and was a constant source of errors. Every time a notebook was “productized,” deviations—and thus errors—occurred, which only became visible in live operation.

The streaming pipeline was also a risk. Kafka on three VMs could not be scaled elastically. When a new customer with hundreds of sensors was connected, scaling meant: order a VM, configure Kafka by hand, deploy consumers by hand. That is not a scalable process but manual labor in production operations.

And then the inference: Flask process, single server, no autoscaling, no failover. If a process crashed, the prediction failed until someone intervened manually. This is unpleasant for pilot customers. For industrial SLAs, it is unacceptable.

Added to this was a business pain point: uncontrolled cloud costs. Data scientists booked GPU instances in the cloud, forgot to shut them down after training, and no one could trace which experiment caused which costs. The monthly bill fluctuated massively.

The turning point was a major order: an automotive supplier wanted to monitor over 2,000 sensors at three locations simultaneously—with real-time inference and an SLA of under 500 ms. With the existing setup, this was not realistically feasible.


ayedo’s Approach: A Kubernetes-Based ML Platform—Seamlessly from Notebook to Inference

Our goal was not to “modernize” individual components. Our goal was a platform that covers the entire ML lifecycle: interactive development, distributed training, reproducible experiments, standardized deployment, scalable streaming processing, and observability-first operations.

The ayedo Managed Kubernetes platform became the backbone for this—not as a buzzword, but as a declarative operational model that combines scheduling, isolation, automation, and traceability.


GPU Scheduling and Partitioning: GPUs as a True Platform Resource

The biggest operational lever was GPU management. With the NVIDIA GPU Operator and NVIDIA MPS (Multi-Process Service), we established GPUs as schedulable resources in the cluster—including partitioning into smaller slices.

This way, an experimental notebook job doesn’t get “a whole GPU,” but exactly the slice it needs. At the same time, large training runs can receive dedicated full-GPU allocations, including priorities and quotas per team/namespace.

This fundamentally changes the work reality: No one waits for “free GPUs” anymore. No one blocks others through oversized reservations. And GPU utilization becomes plannable and efficient.
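The effect of quotas and scheduling can be illustrated with a small sketch. This is a hypothetical stdlib stand-in for the allocation logic only; in the real platform, the Kubernetes scheduler, ResourceQuota objects, and the NVIDIA GPU Operator enforce these limits, and the team names and slice counts here are invented for illustration.

```python
# Illustrative stand-in for quota-aware GPU slice allocation.
# Namespaces, quotas, and slice counts are hypothetical; in Kubernetes,
# the scheduler and ResourceQuota objects enforce this, not app code.

class GpuQuotaScheduler:
    def __init__(self, total_slices, quotas):
        self.free = total_slices            # free GPU slices in the cluster
        self.quotas = dict(quotas)          # max slices per namespace
        self.used = {ns: 0 for ns in quotas}

    def request(self, namespace, slices):
        """Grant an allocation only if namespace quota and capacity allow it."""
        if self.used[namespace] + slices > self.quotas[namespace]:
            return False                    # would exceed the team quota
        if slices > self.free:
            return False                    # cluster has no free capacity
        self.used[namespace] += slices
        self.free -= slices
        return True

    def release(self, namespace, slices):
        self.used[namespace] -= slices
        self.free += slices


sched = GpuQuotaScheduler(total_slices=8, quotas={"research": 4, "prod": 6})
assert sched.request("research", 2)       # small experiment: granted
assert not sched.request("research", 3)   # would exceed the 4-slice quota
assert sched.request("prod", 6)           # dedicated allocation for training
```

The point of the sketch: a small experiment no longer competes with a full training run for the same physical card, because both requests are checked against explicit budgets.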


JupyterHub as Self-Service: Reproducible Notebook Workspaces in Minutes

Instead of individual local environments, the team now works via JupyterHub on Kubernetes. Each person starts an isolated notebook environment via self-service, with pinned CUDA and framework versions, persistent storage, and configurable GPU access.

The crucial point is reproducibility: Containerized notebook images create identical conditions—regardless of who works or when a notebook is started. Onboarding thus changes from “days of setup” to “minutes of workspace.”
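Such self-service workspaces are typically defined as spawner profiles. The sketch below shows what a KubeSpawner-style profile list might look like; the image names, registry, and resource values are hypothetical, not Sensoriq's actual configuration.

```python
# Hypothetical KubeSpawner-style profile list: each entry pins a container
# image (and therefore CUDA/framework versions) and optionally a GPU.
# Image names and limits are illustrative only.
profile_list = [
    {
        "display_name": "CPU workspace (PyTorch 2.x)",
        "kubespawner_override": {
            "image": "registry.example.com/notebooks/pytorch-cpu:2.3",
        },
    },
    {
        "display_name": "GPU workspace (CUDA 12, 1 GPU)",
        "kubespawner_override": {
            "image": "registry.example.com/notebooks/pytorch-cuda12:2.3",
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]

# In a jupyterhub_config.py this list would be assigned to
# c.KubeSpawner.profile_list, so users pick a profile at spawn time.
```

Because the environment is baked into the image rather than into each laptop, every spawned workspace is identical by construction.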


MLflow: Experiment Tracking and Model Registry as a Connecting Layer

To ensure the path from experiment to production no longer relies on manual labor, we introduced MLflow as experiment tracking and model registry.

Each training run automatically logs hyperparameters, metrics, data references, and artifacts. Models are registered and receive a lifecycle status—from experiment to staging to production. This makes model management traceable and auditable, instead of disappearing into notebooks and folder structures.
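The lifecycle idea can be made concrete with a minimal stdlib stand-in for a model registry. This mirrors the stage model MLflow uses (None → Staging → Production) but deliberately avoids the real MLflow client so the sketch stays dependency-free; model names and metrics are invented.

```python
# Minimal stand-in for a model registry with a stage lifecycle, mirroring
# the MLflow pattern. Real code would use the mlflow client; this sketch
# only illustrates the state model and why transitions are auditable.

ALLOWED = {
    "None": {"Staging"},
    "Staging": {"Production", "None"},
    "Production": {"None"},          # demotion back to archived/none
}

class ModelRegistry:
    def __init__(self):
        self.models = {}             # (name, version) -> stage + metrics

    def register(self, name, version, metrics):
        self.models[(name, version)] = {"stage": "None", "metrics": metrics}

    def transition(self, name, version, target):
        entry = self.models[(name, version)]
        if target not in ALLOWED[entry["stage"]]:
            raise ValueError(f"illegal transition {entry['stage']} -> {target}")
        entry["stage"] = target


reg = ModelRegistry()
reg.register("failure-predictor", 7, {"auc": 0.93})
reg.transition("failure-predictor", 7, "Staging")
reg.transition("failure-predictor", 7, "Production")
```

Because every promotion is an explicit, validated transition, "which model version is live, and why" stops being tribal knowledge.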


KServe: Standardized Model Serving with Canary, Rollback, and Autoscaling

For inference, we introduced KServe as the serving layer. Models are deployed as versioned artifacts, not as “new Flask scripts on a server.”

KServe enables canary deployments, A/B tests, and rollbacks—with health checks and autoscaling that can respond to latency and load. This is crucial when an SLA under 500 ms must be met and load profiles fluctuate.
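The mechanics of a canary split can be sketched in a few lines. In practice the traffic split is declared in the KServe InferenceService spec; the hash-based router below is only an illustration of the property that matters (a stable, deterministic percentage split), and the version names and percentages are hypothetical.

```python
# Deterministic canary routing sketch: a stable hash of the request key
# decides whether a request hits the canary or the stable model version.
# In KServe this split is declarative; this code only shows the idea.
import hashlib

def route(request_id: str, canary_percent: int) -> str:
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "v2-canary" if bucket < canary_percent else "v1-stable"

routes = [route(f"sensor-{i}", 10) for i in range(1000)]
share = routes.count("v2-canary") / len(routes)
assert 0.05 < share < 0.15   # roughly 10% of traffic hits the canary
```

The same key always takes the same path, so a canary rollout can be observed per model version and rolled back without affecting the stable 90%.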


Real-Time Streaming on Kubernetes: Kafka with Strimzi Instead of VM Manual Labor

The sensor streaming pipeline was migrated to Kafka with the Strimzi Operator. This is not just “Kafka in Kubernetes,” but a Kubernetes-native operational model: Topics, partitions, ACLs, and configurations are managed declaratively and rolled out via GitOps.

New customers with hundreds of sensors no longer mean a manual server project, but scalable capacity in the cluster—including clear operational processes and observability.
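Why key-based partitioning makes sensor streams scale cleanly can be shown in a short sketch. Kafka's default partitioner hashes the message key (with murmur2) to pick a partition; the stdlib version below only demonstrates the resulting property, and the sensor IDs are invented.

```python
# Sketch of key-based partitioning: hashing the sensor ID to a fixed
# partition preserves per-sensor event ordering while spreading load
# across the cluster. Kafka's default partitioner works the same way.
import hashlib

def partition_for(sensor_id: str, num_partitions: int) -> int:
    digest = hashlib.md5(sensor_id.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# All events of one sensor land on the same partition, so per-sensor
# ordering holds even with many consumers working in parallel.
p = partition_for("plant-3/press-17/vibration", 12)
assert all(partition_for("plant-3/press-17/vibration", 12) == p
           for _ in range(5))
```

Adding a customer with hundreds of sensors then means adding partitions and consumer capacity, not provisioning and hand-configuring new machines.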


LLM-Serving for Reports: vLLM Productive, Ollama for Experimentation

Sensoriq uses LLMs to translate anomalies into plain-text reports, such as action recommendations for maintenance teams. For this, we run vLLM in production as a high-performance inference layer, including autoscaling and efficient GPU memory utilization.

For development and test environments, Ollama was integrated, allowing data scientists to quickly experiment with models without burdening production resources. The important point: LLM inference remains self-hosted—sensor data does not leave the European infrastructure.
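Because vLLM exposes an OpenAI-compatible API, report generation reduces to posting a structured prompt to the in-cluster endpoint. The sketch below only builds such a request; the model name, anomaly fields, and prompt wording are hypothetical, and no network call is made.

```python
# Building a request for an OpenAI-compatible endpoint such as the one
# vLLM serves at /v1/chat/completions. Model name and anomaly fields
# are illustrative; the payload would be POSTed to the in-cluster URL.
import json

def build_report_request(anomaly: dict, model: str) -> dict:
    prompt = (
        f"Sensor {anomaly['sensor_id']} shows {anomaly['pattern']} "
        f"(severity: {anomaly['severity']}). Write a short maintenance "
        "recommendation for the on-site team."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "temperature": 0.2,   # reports should be consistent, not creative
    }

payload = build_report_request(
    {"sensor_id": "press-17/vibration",
     "pattern": "rising RMS amplitude",
     "severity": "high"},
    model="mistral-7b-instruct",
)
body = json.dumps(payload)    # request body, ready to send
```

Since the endpoint lives inside the cluster, the anomaly data in the prompt never crosses an external API boundary.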


Observability: Making ML Operations Measurable

An MLOps system without observability is flying blind. We therefore integrated VictoriaMetrics, VictoriaLogs, Grafana, and Tempo as a full observability stack.

Today, Sensoriq monitors GPU utilization, inference latency, streaming throughput, error rates, and even ML-specific signals like model drift. Alerting intervenes before SLA violations occur—not only when the customer reports them.
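An SLA check of this kind boils down to evaluating a latency percentile against the target. The sketch below shows the shape of such a check with a simple nearest-rank p99; in the real setup this is a recording rule and alert in the metrics stack, and the sample values are invented.

```python
# Sketch of an SLA check as the alerting layer evaluates it: compute the
# p99 of recent inference latencies (nearest-rank) and compare it with
# the 500 ms target. Sample values are illustrative.

def p99(samples_ms):
    ordered = sorted(samples_ms)
    idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return ordered[idx]

latencies = [120, 140, 135, 180, 210, 160, 150, 490, 170, 155] * 10
assert p99(latencies) <= 500   # SLA of 500 ms currently met
```

Alerting on the percentile rather than the average is what makes the check honest: a handful of slow predictions can breach an SLA even when the mean looks healthy.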


GitOps as Operational Standard: ArgoCD and Authentik

ArgoCD manages the entire platform via GitOps—from Kafka configurations to KServe deployments to notebook templates. Changes are versioned, reviewable, and auditable.

Authentik forms the central identity and access layer: Data scientists see their workspaces and experiments, customers their dashboards, the ops team the infrastructure—cleanly separated, with one login.


Result: Industrial-Scale Operation Becomes Possible—and Plannable

With the new ML platform, Sensoriq has made the transition from pilot operation to a scalable industrial product.

GPU utilization increased significantly because partitioning and scheduling eliminate idle times and bottlenecks. Data scientists work in parallel without slowing each other down.

The path from experiment to production has been standardized: Notebook → MLflow Registry → Deployment via ArgoCD/KServe. What used to take weeks and was error-prone now happens in days—reproducibly.

The major order from the automotive industry is running stably: Real-time inference is under 200 ms with over 2,000 sensors. Autoscaling and GPU scheduling meet the SLA requirements even during peak loads.

Costs are transparent and plannable. Consumption is recorded per team/namespace, and the proliferation of forgotten cloud GPUs is gone. Overall, infrastructure costs have decreased significantly compared to the previous mix.

And last but not least: Data sovereignty is secured. LLM inference and sensor data processing run entirely on European infrastructure—a crucial selling point in the manufacturing industry.


Why This Approach Works

MLOps rarely fails due to the algorithm. It fails due to a lack of platform logic: no reproducible environments, no standardized model serving, no scheduling for GPUs, no observability for ML-specific signals.

Kubernetes is not “the new standard” here, but the tool to operate ML like software: declaratively, versioned, automated, scalable.

That’s exactly what we implemented for Sensoriq—with a platform model that accelerates the team and industrializes operations.


Call to Action

If your ML teams are fighting over GPUs, experiments are not reproducible, and the path from notebook to production takes weeks, then this is not a team problem—but a platform problem.

ayedo builds Kubernetes-based ML platforms that make GPU resources plannable, enable self-service for data scientists, and reliably operate inference under SLA conditions—including streaming pipelines, model registry, observability, and GitOps operations.

If you want to take the step from pilot project to industrial scale, let’s discuss what your target architecture looks like—and how you can operate it without infrastructure overhead.

Implement this use case?

We help you realize this use case on your own infrastructure: scalable, secure, and GDPR-compliant.
