New Approaches in AI Workload Management: Introducing JobSet
There is exciting news in the world of Kubernetes development: JobSet, an open-source API for representing and managing groups of distributed jobs, has been introduced. The goal of JobSet is to provide a unified API for distributed Machine Learning (ML) training and High-Performance Computing (HPC) workloads on Kubernetes.
Recent improvements in the Kubernetes batch ecosystem have attracted the attention of ML engineers, who have found Kubernetes well suited to the demands of distributed training workloads.
Large ML models, particularly Large Language Models (LLMs), often do not fit into the memory of a single host's GPU or TPU chips. Instead, they are distributed across tens of thousands of accelerator chips spanning thousands of hosts.
In such scenarios, the model training code is containerized and executed simultaneously on all of these hosts, performing distributed computation in which both the model parameters and the training data are sharded across the target accelerator chips. Collective communication primitives such as all-gather and all-reduce are used to perform computations and synchronize gradients between hosts, as sketched below.
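To make that collective-communication step concrete, here is a minimal, illustrative sketch of gradient synchronization with PyTorch's torch.distributed package. It is not taken from the JobSet announcement; it assumes a launcher (for example torchrun, or the indexed Pods of a Kubernetes Job) sets the usual RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT environment variables.

```python
import torch
import torch.distributed as dist


def synchronize_gradients() -> None:
    # The launcher (e.g. torchrun, or an indexed Job's Pods) provides RANK,
    # WORLD_SIZE and MASTER_ADDR/MASTER_PORT via environment variables.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU hosts

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Toy "local gradient" tensor; in real training this comes from backprop.
    local_grad = torch.full((4,), float(rank))

    # all-reduce sums the tensor across every replica, so afterwards each
    # host holds the same value; dividing by the world size yields the mean.
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    local_grad /= world_size

    print(f"rank {rank}: averaged gradient {local_grad.tolist()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    synchronize_gradients()
```

Running one copy of this script per host performs the same kind of cross-host gradient synchronization described above, just at toy scale.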
These workload characteristics make Kubernetes an excellent fit, since efficiently scheduling and managing the lifecycle of containerized applications is exactly what it does well. It is also highly extensible, allowing developers to define their own Kubernetes APIs, objects, and controllers to govern the behavior and lifecycle of those objects.
However, existing Kubernetes primitives can no longer keep up with the evolving techniques of distributed ML training. The landscape of Kubernetes APIs for orchestrating distributed training is also fragmented, and each of the existing solutions has specific limitations that make it suboptimal for this use case.
For example, the Kubeflow Training Operator defines custom APIs for various frameworks (e.g., PyTorchJob, TFJob, MPIJob), but these Job types are tailored to their respective frameworks, each with different semantics and behavior.
The Job API has closed many gaps in executing batch workloads, including the Indexed completion mode, higher scalability, Pod failure policies, and Pod restart policies. However, executing ML training and HPC workloads with the upstream Job API still requires additional orchestration to close the following gaps: support for multiple Pod templates within one workload (for example, a driver with different resource requirements than the workers), managing groups of Jobs as a single unit, inter-Pod communication (for example, via a headless Service), and startup sequencing between drivers and workers.
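As a hedged illustration of what the upstream Job API already offers on its own, here is a minimal sketch that creates an Indexed Job with a Pod failure policy via the official Kubernetes Python client. The resource name, container image, and exit-code rule are placeholders, not taken from the article.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

# A plain batch/v1 Job using Indexed completion mode and a Pod failure policy;
# the image and the exit-code rule below are illustrative placeholders.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "indexed-training-worker"},
    "spec": {
        "completionMode": "Indexed",   # each Pod gets a stable completion index
        "completions": 4,
        "parallelism": 4,
        "podFailurePolicy": {          # react to failures instead of blind retries
            "rules": [
                {
                    "action": "FailJob",
                    "onExitCodes": {"operator": "In", "values": [1]},
                }
            ]
        },
        "template": {
            "spec": {
                "restartPolicy": "Never",  # required when a podFailurePolicy is set
                "containers": [
                    {"name": "worker", "image": "example.com/trainer:latest"}
                ],
            }
        },
    },
}

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

Everything beyond this single Job, such as grouping several Jobs and wiring up their communication, is the extra orchestration mentioned above.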
JobSet aims to close these gaps by using the Job API as a building block to create a richer API for large-scale distributed HPC and ML use cases.
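To illustrate what such a richer API looks like in practice, here is a minimal sketch that creates a JobSet custom resource with the official Kubernetes Python client. It is not taken from the announcement: the jobset.x-k8s.io/v1alpha2 group and version and the replicatedJobs field reflect the upstream JobSet project at the time of writing, while the resource name, the "workers" job, the image, and the replica counts are illustrative placeholders, and the JobSet controller must already be installed in the cluster.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

# A JobSet groups replicated Jobs into a single unit; names, image and
# replica counts below are placeholders for illustration only.
jobset = {
    "apiVersion": "jobset.x-k8s.io/v1alpha2",
    "kind": "JobSet",
    "metadata": {"name": "training-demo"},
    "spec": {
        "replicatedJobs": [
            {
                "name": "workers",
                "replicas": 2,  # two child Jobs, e.g. one per accelerator slice
                "template": {   # a standard batch/v1 Job template
                    "spec": {
                        "parallelism": 4,
                        "completions": 4,
                        "template": {
                            "spec": {
                                "restartPolicy": "Never",
                                "containers": [
                                    {
                                        "name": "trainer",
                                        "image": "example.com/trainer:latest",
                                    }
                                ],
                            }
                        },
                    }
                },
            }
        ]
    },
}

# JobSet is a CustomResourceDefinition, so it is created through the generic
# custom-objects API rather than the typed BatchV1 API.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="jobset.x-k8s.io",
    version="v1alpha2",
    namespace="default",
    plural="jobsets",
    body=jobset,
)
```

According to the upstream project's documentation, the JobSet controller then creates the child Jobs and can also provision a headless Service so the Pods can reach each other for collective communication.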
The introduction of JobSet is a significant step forward for developers and DevOps teams looking to manage complex ML and HPC workloads more efficiently. At ayedo, we are proud to be Kubernetes partners and look forward to helping you implement these new possibilities.
Source: Kubernetes Blog