
One of the biggest cost drivers in the video business is the gap between provisioned and actually used capacity. Video workloads are extremely “hungry”: a single HD transcoding job or a WebRTC bridge can demand multiple CPU cores. Rigid capacity planning leads either to paying for idle servers (over-provisioning) or to risking outages during peak loads (under-provisioning).
The solution is a two-tier autoscaling model that precisely adjusts the infrastructure to the application’s needs. Below, we show how to configure the mechanisms Kubernetes provides so that the result is both economically and technically sound.
Imagine you operate 50 servers for your video platform. At 8:00 PM, a major live event ends, and suddenly 200 transcoding jobs are queued.
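The gap between demand and capacity can be put into rough numbers. A minimal back-of-the-envelope sketch, assuming (purely illustratively) about 2 cores per HD transcode and 16-core worker nodes:

```python
import math

def nodes_needed(jobs: int, cores_per_job: float, cores_per_node: int,
                 reserve_frac: float = 0.1) -> int:
    """Rough estimate of how many nodes are needed to run `jobs` transcodes.
    All figures here are illustrative assumptions, not measured values."""
    usable = cores_per_node * (1 - reserve_frac)          # headroom for system daemons
    jobs_per_node = max(1, math.floor(usable / cores_per_job))
    return math.ceil(jobs / jobs_per_node)

# 200 queued HD jobs at ~2 cores each on 16-core nodes:
print(nodes_needed(200, 2, 16))  # → 29
```

With those assumptions, the 200 queued jobs would need roughly 29 nodes, far beyond a static fleet sized for the average load.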
In a modern video infrastructure, two mechanisms work hand in hand to solve this problem:
The Horizontal Pod Autoscaler (HPA) monitors the load of your video services (e.g., CPU load of the WebRTC bridges). Once a threshold is exceeded, it launches new pods.
Important for Video: Don’t rely solely on CPU metrics. For video, the number of active connections (streams) or the queue length in transcoding is often a better control metric.
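Scaling on such a stream count could look like the following sketch of an `autoscaling/v2` HPA. The deployment name and the custom metric `active_streams` are assumptions; exposing a custom pod metric additionally requires a metrics adapter (e.g., prometheus-adapter):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webrtc-bridge
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webrtc-bridge          # hypothetical deployment name
  minReplicas: 2
  maxReplicas: 50                # hard ceiling -- see the note on max limits below
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_streams   # custom metric served by a metrics adapter
        target:
          type: AverageValue
          averageValue: "80"     # scale out above ~80 streams per pod
```

The `maxReplicas` field doubles as a cost brake: even a buggy metric can never request unbounded capacity.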
Eventually, the available servers (nodes) run out of capacity. New pods can no longer be scheduled and remain in a “Pending” state. This is where the Cluster Autoscaler (CA) steps in: it detects the unschedulable pods and automatically requests new servers from the cloud provider (or from the bare-metal pool). Once the load decreases and the pods are deleted, the CA drains and removes the now-empty servers.
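The CA’s limits are set on its own deployment. A sketch of the relevant container arguments, with provider, node-group name, and bounds as illustrative assumptions:

```yaml
# Excerpt from a cluster-autoscaler Deployment spec (values illustrative).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws              # assumption: adjust for your provider
      - --nodes=2:30:video-workers        # min:max:node-group -- the max caps cost
      - --scale-down-unneeded-time=10m    # wait before removing an empty node
```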
To ensure autoscaling works in the demanding video environment, we use three specific tactics:
We define different priorities. Live-streaming pods receive the highest priority. When resources become scarce, the scheduler preempts a less critical transcoding job in favor of a live-stream pod. The evicted transcoding job is re-queued and resumes once the Cluster Autoscaler provides another server.
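In Kubernetes this is modeled with `PriorityClass` objects; pods opt in via `priorityClassName`. The class names and values below are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: live-streaming
value: 1000000            # higher value wins when the scheduler must preempt
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Live pods may evict lower-priority workloads."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-transcoding
value: 1000
preemptionPolicy: Never   # transcoding never evicts others; it only yields
globalDefault: false
description: "Restartable transcoding jobs; safe to preempt."
```

A live-stream pod then simply declares `priorityClassName: live-streaming` in its spec.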
Transcoding is ideal for “Spot Instances.” These are surplus capacities from cloud providers, up to 80% cheaper. Since our video pipeline is designed to restart interrupted jobs easily, we can save massively on costs without compromising end-user quality.
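Steering only the restartable transcoding pods onto spot capacity is done with a node selector plus a matching toleration. The label/taint key `node-type=spot` is an assumption; use your provider’s convention (e.g., a dedicated spot node pool):

```yaml
# Pod template excerpt for a transcoding job pinned to spot capacity.
spec:
  nodeSelector:
    node-type: spot        # only run on (tainted) spot nodes
  tolerations:
    - key: node-type
      operator: Equal
      value: spot
      effect: NoSchedule   # tolerate the taint that keeps other pods away
```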
Autoscaling takes time (usually 2 to 5 minutes for a new server). For a scheduled major event with 10,000 viewers, we can manually ramp up a “base capacity” via GitOps or the API before the event starts. This prevents bottlenecks during the initial viewer surge.
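With GitOps, pre-warming can be as small as a committed patch that raises the HPA’s floor shortly before the event and is reverted afterwards (names illustrative):

```yaml
# GitOps patch on the HPA: raise the replica floor before the event starts.
spec:
  minReplicas: 20   # raised from 2 -- capacity is in place before the first viewer arrives
```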
Economic scaling means that your infrastructure cost curve closely aligns with your revenue curve (or user curve). By combining pod and node autoscaling on Kubernetes, we eliminate waste. For a platform operator, this is the crucial factor to remain competitive against global US giants: A lean, highly automated infrastructure that only incurs costs when it creates value.
How quickly does autoscaling respond to a sudden load spike? Pod autoscaling (HPA) responds within seconds. However, if new servers (nodes) need to be started, this takes about 2 to 4 minutes depending on the provider. For unpredictable load spikes, we always maintain a small buffer (“over-provisioning”) in the cluster.
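One common way to hold such a buffer is a deployment of low-priority placeholder (“pause”) pods: they reserve node capacity, are evicted instantly when a real pod needs the space, and the Cluster Autoscaler then restores the headroom. The class name `overprovisioning` (a PriorityClass with a negative `value`) and the sizes are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer
spec:
  replicas: 3
  selector:
    matchLabels: {app: capacity-buffer}
  template:
    metadata:
      labels: {app: capacity-buffer}
    spec:
      priorityClassName: overprovisioning   # PriorityClass with negative value
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9  # does nothing, only reserves resources
          resources:
            requests:
              cpu: "2"                      # each placeholder holds one transcode-sized slot
```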
Can we schedule scaling? Yes, with a “Scheduled Scaler,” you can set it so that, for example, every Monday morning at 9:00 AM, capacity is ramped up for the weekly meetings before the first load spike is measured.
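One way to implement such a scheduled scaler is KEDA’s cron scaler; target name, times, and replica count below are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: meeting-rush
spec:
  scaleTargetRef:
    name: webrtc-bridge        # hypothetical deployment name
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Berlin
        start: 0 8 * * 1       # Mondays 08:00 -- warm up before the 9:00 meetings
        end: 0 12 * * 1
        desiredReplicas: "20"
```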
Does frequent scaling up and down cause instability? No, as long as the application is built “Cloud-Native” (stateless). Video engines like LiveKit are designed precisely for instances to come and go. Load balancers seamlessly distribute the load.
Is there a limit to autoscaling? You can (and should) always define “max limits.” This protects you from skyrocketing costs if, due to an error or an attack (DDoS), thousands of servers are suddenly requested.