JobSet: The New Solution for Distributed ML and HPC Workloads in Kubernetes
ayedo Editorial Team · 3 minute read


Discover JobSet, the new open-source API for distributed jobs in Kubernetes, designed for ML training and HPC workloads.
Tags: kubernetes, kubernetes-news, api

In the world of Kubernetes development, there is exciting news: JobSet, an open-source API specifically designed for managing distributed jobs, has been introduced. Its goal is to provide a unified API for distributed ML training and HPC workloads on Kubernetes.

Why JobSet?

Recent improvements in the Kubernetes batch ecosystem have caught the attention of ML engineers who have found Kubernetes to be excellent for the demands of distributed training workloads.

Large ML models, particularly Large Language Models (LLMs), often do not fit into the memory of the GPU or TPU chips on a single host. Instead, they are distributed across tens of thousands of accelerator chips spanning thousands of hosts.

In such scenarios, the model training code is containerized and executed simultaneously on all of these hosts, performing distributed computation in which both the model parameters and the training data are sharded across the target accelerator chips. Collective communication primitives such as all-gather and all-reduce are used to perform computations and synchronize gradients between hosts.
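As a minimal illustration of the all-reduce primitive mentioned above, the following sketch uses PyTorch's torch.distributed package (an assumption made for illustration; the article does not prescribe a framework). Each process contributes a local gradient tensor, and after the collective call every rank holds the same summed result.

```python
import torch
import torch.distributed as dist


def main():
    # Rank, world size, and master address are injected by the launcher
    # (e.g. torchrun) via environment variables.
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each process holds a local "gradient"; all-reduce sums it across all
    # processes so every rank ends up with the same synchronized tensor.
    local_grad = torch.ones(4) * (rank + 1)
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)

    print(f"rank {rank}/{world_size} synchronized gradient: {local_grad.tolist()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=4 allreduce_demo.py, every rank prints the same summed tensor; in a real training loop the same call runs after the backward pass to synchronize gradients across hosts.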

These properties make Kubernetes an excellent choice for such workloads, as it efficiently manages the lifecycle of containerized applications. It is also highly extensible, allowing developers to define their own Kubernetes APIs, objects, and controllers to control the behavior and lifecycle of these objects.

However, existing Kubernetes primitives do not adequately model the evolving techniques of distributed ML training. The landscape of Kubernetes APIs for orchestrating distributed training is also fragmented, and each of the existing solutions has specific limitations that make it suboptimal for distributed ML training.

For example, the Kubeflow Training Operator defines custom APIs for various frameworks (e.g., PyTorchJob, TFJob, MPIJob, etc.), but these job types are tailored to their respective frameworks, each with different semantics and behavior.

The Job API has closed many gaps in executing batch workloads, including Indexed completion mode, higher scalability, Pod failure policies, and Pod backoff policies. However, executing ML training and HPC workloads with the upstream Job API requires additional orchestration to close the following gaps:

  • Multi-Template Pods: Most HPC or ML training jobs involve more than one pod type. These pods belong to the same workload but need to run different containers, request different resources, or have different failure policies. A common example is the driver-worker pattern.
  • Job Groups: Large-scale training workloads span multiple network topologies and run, for example, across multiple racks. Such workloads are latency-sensitive and aim to localize communication and minimize traffic crossing higher-latency network links. To enable this, the workload must be divided into groups of pods, each assigned to a network topology.
  • Inter-Pod Communication: Create and manage resources (e.g., headless Services) necessary to establish communication between the pods of a job (see the sketch after this list).
  • Startup Sequence: Some jobs require a specific startup sequence of pods; sometimes the driver is expected to start first (as with Ray or Spark), while in other cases the workers must be ready before the driver starts (as with MPI).

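To make these gaps concrete, here is a minimal sketch of the extra orchestration a plain Indexed Job needs today: a headless Service created alongside the Job so its pods can reach each other by stable DNS names. It uses the official Kubernetes Python client; the job name, Service name, image, and namespace are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

namespace = "default"
job_name = "trainer"            # hypothetical name
svc_name = "trainer-headless"   # hypothetical name

# Headless Service so the Indexed Job's pods get stable DNS entries of the
# form <job-name>-<index>.<service-name>.<namespace>.svc.
service = client.V1Service(
    api_version="v1",
    kind="Service",
    metadata=client.V1ObjectMeta(name=svc_name),
    spec=client.V1ServiceSpec(
        cluster_ip="None",
        selector={"job-name": job_name},
    ),
)

# Indexed Job: each pod gets a stable completion index and hostname.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name=job_name),
    spec=client.V1JobSpec(
        completions=4,
        parallelism=4,
        completion_mode="Indexed",
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"job-name": job_name}),
            spec=client.V1PodSpec(
                subdomain=svc_name,  # ties pod DNS entries to the headless Service
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="worker",
                        image="registry.example.com/trainer:latest",  # hypothetical image
                    )
                ],
            ),
        ),
    ),
)

client.CoreV1Api().create_namespaced_service(namespace, service)
client.BatchV1Api().create_namespaced_job(namespace, job)
```

Both objects have to be created, kept consistent, and cleaned up together by hand (or by a custom controller); this kind of wiring is precisely what JobSet is meant to take over.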
JobSet aims to close these gaps by using the Job API as a building block to create a richer API for large-scale distributed HPC and ML use cases.
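For comparison, here is a minimal sketch of how the driver-worker pattern above might be expressed with JobSet itself, again via the Kubernetes Python client. It assumes the JobSet controller is installed and serving the jobset.x-k8s.io/v1alpha2 API (check which version your cluster offers); the object name, image, and replica counts are illustrative.

```python
from kubernetes import client, config

config.load_kube_config()

# A JobSet groups several templated Jobs ("replicatedJobs") under one object,
# here a driver plus a set of indexed workers.
jobset = {
    "apiVersion": "jobset.x-k8s.io/v1alpha2",  # adjust to the version installed in your cluster
    "kind": "JobSet",
    "metadata": {"name": "distributed-training"},  # hypothetical name
    "spec": {
        "replicatedJobs": [
            {
                "name": "driver",
                "replicas": 1,
                "template": {  # an ordinary batch/v1 Job template
                    "spec": {
                        "completions": 1,
                        "parallelism": 1,
                        "template": {
                            "spec": {
                                "restartPolicy": "Never",
                                "containers": [{
                                    "name": "driver",
                                    "image": "registry.example.com/trainer:latest",  # hypothetical image
                                }],
                            }
                        },
                    }
                },
            },
            {
                "name": "workers",
                "replicas": 1,
                "template": {
                    "spec": {
                        "completions": 4,
                        "parallelism": 4,
                        "completionMode": "Indexed",
                        "template": {
                            "spec": {
                                "restartPolicy": "Never",
                                "containers": [{
                                    "name": "worker",
                                    "image": "registry.example.com/trainer:latest",
                                }],
                            }
                        },
                    }
                },
            },
        ]
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="jobset.x-k8s.io",
    version="v1alpha2",
    namespace="default",
    plural="jobsets",
    body=jobset,
)
```

Each entry under replicatedJobs is an ordinary batch/v1 Job template, which is the "Job API as a building block" idea in practice: the JobSet controller takes care of grouping the Jobs and of the inter-pod networking and startup ordering concerns listed above.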

The introduction of JobSet is a significant step forward for developers and DevOps teams looking to manage complex ML and HPC workloads more efficiently. At ayedo, we are proud to be Kubernetes partners and look forward to helping you implement these new possibilities.


Source: Kubernetes Blog
