Gateway API Inference Extension: Model-Aware Routing for Self-Hosted AI/LLM Workloads on Kubernetes
Modern generative AI and large language model (LLM) workloads present unique traffic-management challenges on Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful: a single GPU-backed model server may hold several inference sessions at once and keep token caches in memory.
Traditional load balancers that route on HTTP paths or simply round-robin across replicas lack the capabilities these workloads need: they know nothing about model identity or request criticality (e.g., interactive chat vs. batch jobs). Organizations therefore cobble together ad-hoc solutions, but a standardized approach has been missing.
Gateway API Inference Extension was developed to fill this gap by building on the existing Gateway API and adding inference-specific routing capabilities while maintaining the familiar model of Gateways and HTTPRoutes. By adding an Inference Extension to your existing Gateway, you effectively transform it into an Inference Gateway, enabling you to self-host GenAI/LLMs with a “Model-as-a-Service” mindset.
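To make this concrete, the sketch below shows roughly how an existing Gateway could route LLM traffic under this model: instead of pointing an HTTPRoute at a regular Service, its backendRef points at the extension's InferencePool resource (introduced below). This is a minimal, hypothetical example assuming the extension's alpha API group inference.networking.x-k8s.io; the Gateway and pool names are placeholders, and field details may differ in your installed version.

```yaml
# Hedged sketch: an HTTPRoute whose backend is an InferencePool rather than a Service.
# All names are placeholders; the API group assumes the alpha release of the extension.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway              # your existing Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io   # extension API group
      kind: InferencePool                     # pool of model-server pods, not a Service
      name: vllm-llama3-pool
```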
The project’s goal is to improve and standardize routing to inference workloads across the ecosystem. Key objectives include model-aware routing, criticality-based request handling, secure model deployments, and load balancing driven by real-time model metrics. Achieving these aims should reduce latency and improve accelerator (GPU) utilization for AI workloads.
The design introduces two new custom resources (CRDs) with distinct responsibilities, each aligning with a specific user persona in the AI/ML serving workflow:

InferencePool: Defines a pool of pods (model servers) running on shared compute (e.g., GPU nodes). The platform administrator configures how these pods are provisioned, scaled, and load-balanced, and the pool enforces consistent resource usage and platform-wide policies. An InferencePool is akin to a Service, but tailored to the needs of AI/ML serving and aware of the model-serving protocol.
InferenceModel: A user-facing model endpoint managed by AI/ML owners. It maps a public model name (e.g., “gpt-4-chat”) to the actual model within an InferencePool, letting workload owners specify which models (and optional fine-tunes) to expose, along with policies for traffic splitting or prioritization.
In summary, the InferenceModel API lets AI/ML owners manage what is deployed, while the InferencePool lets platform operators manage where and how it is deployed; the sketch below shows how the two fit together.
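The following is a hypothetical manifest sketch of both resources, assuming the extension's v1alpha2 API (inference.networking.x-k8s.io/v1alpha2). All names, labels, and the referenced endpoint-picker extension are placeholders, and exact field names may vary between releases.

```yaml
# Hedged sketch only: field names follow the v1alpha2 alpha API and may change.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-pool
spec:
  targetPortNumber: 8000            # port the model servers listen on
  selector:
    app: vllm-llama3-8b             # label selecting the model-server pods
  extensionRef:
    name: llm-endpoint-picker       # extension service making metric-aware routing decisions
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-model
spec:
  modelName: gpt-4-chat             # public name clients put in their requests
  criticality: Critical             # e.g. interactive chat prioritized over batch jobs
  poolRef:
    name: vllm-llama3-pool          # the pool that serves this model
  targetModels:
  - name: llama3-8b-chat-finetune   # actual (fine-tuned) model inside the pool
    weight: 100                     # traffic-splitting weight across target models
```

In this arrangement the platform operator owns the InferencePool (compute, scaling, and the routing extension), while the model owner only touches the InferenceModel that maps a public model name onto that pool.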
The Gateway API Inference Extension is thus a significant step toward overcoming the challenges of deploying AI models on Kubernetes and improving resource utilization. With ayedo as a partner in the Kubernetes space, companies can benefit from this new technology and optimize their AI workloads.
Source: Kubernetes Blog