Job Scheduling for AI Clusters: SLURM, Kubernetes, Ray, and Knowing When You Need None of Them

A scheduler decides which job runs on which GPUs, when, and at whose expense. Without one, a shared cluster degenerates into people pinging each other on Slack to ask if anyone is using node 3. With the wrong one, you spend more engineer-hours on YAML than on model work.

This article compares the schedulers people actually pick — SLURM, Kubernetes (with Kueue or Volcano), Ray and KubeRay, and the commercial options Run:ai and Determined — and is honest about the case where the right answer is "nothing, just SSH." The audience has read K01, K02, and K03 and is now deciding what to put on top of a 1–32 node cluster.

What a scheduler actually does

Three jobs, roughly in order of difficulty: place work on free resources, queue work that does not fit yet, and coordinate multi-node jobs — distributed training needs all N ranks to start at once (gang scheduling), because a PyTorch dist.barrier() blocks indefinitely if one rank never shows up (see K06).

Anything beyond a single-team cluster also needs multi-tenant quotas with borrowing, fair-share decay across teams, preemption, and proper GPU binding and accounting (so each job sees only its own devices through CUDA_VISIBLE_DEVICES and its GPU time is charged to the right account). Every scheduler below does the first three. Where they diverge is on the second list and on the kind of work they assume you are running.

The big split: batch jobs vs long-running services

This is the single most important framing. Batch jobs train for 8 hours and exit, or run a hyperparameter sweep of 200 short jobs, or process a dataset overnight: they start, end, and leave a result, and nothing answers HTTP. Long-running services are a vLLM endpoint serving inference 24/7, a scene-memory database, a monitoring stack: they are supposed to stay up indefinitely and restart on crash.

SLURM was built for batch. Kubernetes was built for long-running services. Everything else is one of the two being stretched to do the other half.

Workload	SLURM	Kubernetes
8-hour training run	Native	Needs Kueue or Volcano
200-job hyperparameter sweep	Native	Awkward
24/7 vLLM inference endpoint	Awkward	Native
Mixed: training + inference + tooling	Painful	Native (with Kueue)
Multi-node distributed training	Native (gang)	Needs gang plugin
Auto-restart crashed services	No	Native

If your cluster is doing only one of those things, the choice is easy. If it is doing both — which is increasingly normal — the question is which scheduler you are willing to pay the operational tax on.

SLURM — the HPC default that still wins for pure training

SLURM runs more than 65% of the TOP500 supercomputers and most published frontier training runs. For workloads that look like "submit a training job, wait for results, get the checkpoint," it has been the right answer for twenty years.

The mental model is simple: a partition is a named pool of nodes (gpu-5090, cpu-only, bigmem); a job is a shell script with #SBATCH directives, submitted with sbatch; GRES (Generic RESources) tracks GPUs, so --gres=gpu:rtxpro6000:2 asks for two GPUs of type rtxpro6000 on each node; all-or-nothing allocation (what this article calls gang scheduling) is built in, because a job starts only once every requested node is available.

A minimal slurm.conf for a 4-node cluster with 8× RTX Pro 6000 Blackwell per node:

ClusterName=kentino-lab
SlurmctldHost=head01

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu

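# GPU accounting: these two lines record GPU usage in slurmdbd. Fair-share charges
# for GPUs only if the partition also sets TRESBillingWeights; otherwise it runs on
# CPU-seconds and is meaningless.
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityWeightFairshare=10000
PriorityWeightAge=1000

# cgroups isolate the GPUs a job is allowed to see.
# Device isolation also needs ConstrainDevices=yes in cgroup.conf.
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

NodeName=node[01-04] CPUs=128 RealMemory=1024000 Sockets=2 \
    CoresPerSocket=64 ThreadsPerCore=1 Gres=gpu:rtxpro6000:8 State=UNKNOWN

# GRES/gpu=16.0 is an illustrative weight (128 CPUs / 8 GPUs per node); tune it.
PartitionName=train Nodes=node[01-03] Default=YES MaxTime=48:00:00 \
    TRESBillingWeights="CPU=1.0,GRES/gpu=16.0" State=UP
PartitionName=interactive Nodes=node04 MaxTime=04:00:00 \
    PriorityTier=10 State=UP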

The matching gres.conf on every node:

AutoDetect=nvml
NodeName=node[01-04] Name=gpu Type=rtxpro6000 File=/dev/nvidia[0-7]

AutoDetect=nvml queries the NVIDIA Management Library directly, picks up MIG slices automatically, and saves you hand-rolling device file paths.

A user submits:

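#!/bin/bash
#SBATCH --job-name=qwen-finetune
#SBATCH --partition=train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node; it spawns the 8 workers
#SBATCH --cpus-per-task=64         # cores for the workers and data loaders; size to your node
#SBATCH --gres=gpu:rtxpro6000:8
#SBATCH --time=12:00:00
#SBATCH --output=logs/%j.out       # logs/ must exist before submission

# The first hostname in the allocation serves as the rendezvous endpoint
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Recent SLURM releases do not pass --cpus-per-task from sbatch to srun, so repeat it
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" python -m torch.distributed.run \
    --nnodes=2 --nproc-per-node=8 \
    --rdzv-backend=c10d \
    --rdzv-endpoint="${head_node}:29500" \
    train.py --config configs/qwen72b.yaml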

PyTorch Elastic and SLURM compose cleanly: SLURM owns the allocation, srun starts the torchrun launcher, and torch.distributed.run plus the rendezvous backend handles rank assignment. On a node drop, --max-restarts on torchrun plus --requeue on #SBATCH restarts the job, which recovers from the last checkpoint provided train.py reloads it on startup. This is the most common way to run multi-node PyTorch on SLURM.
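
The rank assignment mentioned above follows a simple convention. The sketch below shows the usual node-major layout for the 2-node, 8-GPU-per-node job; rank_layout is a hypothetical helper written for illustration, not part of PyTorch. In a real run, torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to each worker, and which host receives node rank 0 is decided at rendezvous.

```python
def rank_layout(nnodes: int, nproc_per_node: int) -> dict[tuple[int, int], int]:
    """Map (node_rank, local_rank) to the global rank, in node-major order."""
    return {
        (node, local): node * nproc_per_node + local
        for node in range(nnodes)
        for local in range(nproc_per_node)
    }


layout = rank_layout(nnodes=2, nproc_per_node=8)
print(len(layout))      # world size for 2 nodes x 8 GPUs
print(layout[(1, 0)])   # global rank of the first GPU on the second node
```

The local rank is what train.py would typically use to pick its CUDA device; the global rank identifies the process within the whole job.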

What you get for free: gang scheduling, fair-share decay, backfill, accounting (every GPU-second logged to sacct), and simple semantics — no CRDs, no controller, no YAML object graph. What you do not get: long-running services with auto-restart, HTTP ingress, rolling updates, sidecar containers, declarative state. SLURM is a batch scheduler. Treating it like Kubernetes will hurt.
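
To make fair-share decay concrete, here is a minimal sketch of the arithmetic behind a 7-day half-life. It assumes the classic SLURM fair-share formula, F = 2^(-usage/shares); the default Fair Tree algorithm ranks accounts differently, so treat this as an illustration of the idea rather than a reproduction of what slurmctld computes. Function names are hypothetical.

```python
def decayed_usage(usage: float, age_days: float, half_life_days: float = 7.0) -> float:
    """Usage recorded age_days ago counts for half as much every half-life."""
    return usage * 0.5 ** (age_days / half_life_days)


def fairshare_factor(normalized_usage: float, normalized_shares: float) -> float:
    """Classic formula: 1.0 for an idle account, 0.5 when usage equals shares."""
    return 2.0 ** (-normalized_usage / normalized_shares)


# 1000 GPU-hours burned a week ago count as 500 today with PriorityDecayHalfLife=7-0.
print(decayed_usage(1000.0, age_days=7.0))
# A team entitled to half the cluster that has used exactly half of recent capacity.
print(fairshare_factor(0.5, 0.5))
```

The practical consequence: a team that monopolised the cluster last week drops in priority this week, and recovers gradually as the old usage decays.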

SLURM is right when: you run training, batch inference, or simulation; users write shell scripts; one part-time SRE can operate it.

Where SLURM hits its ceiling

Three places. Web-style services: there is no concept of "run forever, restart on death, expose port 8000." You can shoehorn vLLM into a long-running SLURM job, but health-checking, rolling updates, and load balancing become your problem. Heterogeneous workloads: SLURM's partitions carve up one pool of batch resources; a cluster running training + inference + Jupyter + monitoring + data prep wants finer-grained controls. Strict multi-tenant isolation: cgroups isolate CPU and memory, and GPUs too when ConstrainDevices=yes is set in cgroup.conf; without it, GPU isolation relies on CUDA_VISIBLE_DEVICES and the user's code respecting it. There are no namespaces, no per-job network policy, no container-grade tenancy. Fine for labs; not fine for a hosted multi-customer service.

When you hit any of those, the right move is usually adding a second layer — Kubernetes for the services — rather than torturing SLURM. CoreWeave's SUNK and 2026 managed offerings run SLURM on Kubernetes so the same cluster does both. At Kentino scale, two separate small clusters are often simpler.

Kubernetes for AI: the cloud-native path

Kubernetes was built to keep stateless web services running. For AI clusters, everything else gets bolted on. The 2026 stack: NVIDIA GPU Operator (driver, container toolkit, device plugin, DCGM exporter, MIG manager; CUDA and NCCL ship in your container images), Kueue (queue, quota, admission), Volcano (batch-aware scheduler), Kubeflow Trainer (PyTorchJob, TFJob, MPIJob CRDs), KubeRay (Ray on K8s), and Karpenter or Cluster Autoscaler for node provisioning.

Why naive Kubernetes is wrong for AI: the default kube-scheduler places pods one at a time, in priority order, with no gang semantics. It is happy to start 7 of 8 pods of a distributed-training job and leave the eighth Pending while the running 7 hold GPUs idle. This is not theoretical: it is one of the most common mistakes teams make when they try to run training on a cluster they built for inference. The fixes are Kueue (job-level admission) and Volcano (pod-level gang placement). Current best practice is both: Kueue at the top, Volcano underneath.
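
The 7-of-8 failure is easy to reproduce on paper. The sketch below is a toy model, not kube-scheduler or Volcano code: place_pod_by_pod and place_gang are hypothetical names, and every pod is assumed to need exactly one GPU.

```python
def place_pod_by_pod(free_gpus: int, pods: int) -> tuple[int, int]:
    """Start pods one at a time until GPUs run out.

    Returns (running, pending). A distributed job with pending > 0
    cannot make progress, yet its running pods still hold their GPUs.
    """
    running = min(free_gpus, pods)
    return running, pods - running


def place_gang(free_gpus: int, pods: int) -> tuple[int, int]:
    """All-or-nothing: start every pod, or start none and leave GPUs free."""
    if free_gpus >= pods:
        return pods, 0
    return 0, pods


# An 8-rank job arrives while only 7 GPUs are free.
print(place_pod_by_pod(7, 8))  # seven GPUs held, no training progress
print(place_gang(7, 8))        # job waits, GPUs stay available to other work
```

Under the gang policy the job simply waits in the queue, and the 7 free GPUs remain usable by smaller jobs in the meantime.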

Volcano

Volcano is a CNCF batch scheduler that installs alongside or instead of kube-scheduler. It adds true gang scheduling with minAvailable semantics (admit zero or N, never partial), queue priorities and preemption, fair-share / binpack / topology-aware strategies, and first-class support for PyTorchJob, TFJob, MPIJob, RayJob, and SparkApplication.

A minimal queue plus a gang-scheduled PyTorch training job:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 4
  capability:
    nvidia.com/gpu: 32
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: qwen-finetune
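spec:
  schedulerName: volcano
  minAvailable: 16          # gang: all 16 ranks or zero
  queue: training
  plugins:
    svc: []                 # headless Service so pods can resolve each other by hostname
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 16
      name: worker
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pytorch
              image: kentino/pytorch:2.5-cuda13
              resources:
                limits:
                  nvidia.com/gpu: 1
              # Rendezvous on worker 0; the svc plugin names pods <job>-<task>-<index>.<job>
              command: ["torchrun", "--nnodes=16", "--nproc-per-node=1",
                        "--rdzv-backend=c10d",
                        "--rdzv-endpoint=qwen-finetune-worker-0.qwen-finetune:29500",
                        "train.py"]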

When you do not need Volcano: a pure inference cluster running independent vLLM Deployments. Each pod owns its GPUs, no cross-pod coordination, default kube-scheduler is fine. Volcano earns its keep the moment you mix distributed training into the cluster.

Kueue and quota borrowing

Kueue handles a layer Volcano does not: who is allowed to consume how much of the cluster, and what happens when one team is idle.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: rtxpro6000
spec:
  nodeLabels:
    nvidia.com/gpu.product: "RTX-PRO-6000-Blackwell"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-research
spec:
  namespaceSelector: {}     # match all namespaces; if omitted, no namespace can use this queue
  cohort: "shared-gpu"
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: rtxpro6000
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
              borrowingLimit: 8

Assuming a second ClusterQueue with the same quotas in the shared-gpu cohort, both teams get 16 GPUs guaranteed and can borrow up to 8 from the other's idle pool. With preemption configured, high-priority work reclaims borrowed capacity. Kueue's 2026 roadmap focuses on MultiKueue (multi-cluster dispatch) and cooperative preemption, where a checkpointing job can be told "save state and yield in 60 seconds" rather than killed outright.
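
The borrowing arithmetic can be sketched in a few lines. This is a simplified model of nominalQuota and borrowingLimit, not Kueue's admission code: max_admissible is a hypothetical helper, and cohort_unused stands for the quota other queues in the cohort are not currently using.

```python
def max_admissible(nominal: int, borrowing_limit: int,
                   in_use: int, cohort_unused: int) -> int:
    """GPUs this queue could still be granted under the simplified model."""
    own_headroom = max(nominal - in_use, 0)
    already_borrowed = max(in_use - nominal, 0)
    # Borrowing is capped both by the queue's limit and by what the cohort has idle.
    can_borrow = max(min(borrowing_limit - already_borrowed, cohort_unused), 0)
    return own_headroom + can_borrow


# team-research from the example: nominalQuota 16, borrowingLimit 8.
print(max_admissible(16, 8, in_use=12, cohort_unused=16))  # 4 own + 8 borrowed
print(max_admissible(16, 8, in_use=20, cohort_unused=12))  # 4 more borrowable
print(max_admissible(16, 8, in_use=24, cohort_unused=8))   # at the borrowing cap
```

The cap matters: even with the other team fully idle, team-research tops out at 24 GPUs, so the lender always keeps some capacity it can start on immediately.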

MIG and GPU sharing

The NVIDIA device plugin defaults to whole-GPU allocation. For training and large inference that is correct; for small inference, notebooks, or dev work it is wasteful. Three sharing modes: MIG (hardware memory and fault isolation, production multi-tenant inference), MPS (cooperative, no isolation), time-slicing (no isolation, dev only).

MIG exists on H100/H200, A100, and B200 — not on any GPU in the Kentino lineup. RTX 5090, RTX 4090, RTX Pro 6000 Blackwell, L40, L4 do not support MIG. If your design requires hardware-isolated GPU partitions, that constrains your hardware to SXM/datacenter parts Kentino does not build. Time-slicing advertises N virtual GPUs per physical GPU and the scheduler has no idea they are oversubscribed — use it for dev only.
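
A quick way to see why time-slicing is dev-only is to compute what the scheduler believes it has versus what the hardware can back. The sketch below is illustrative arithmetic only; time_sliced_view is a hypothetical helper, and the 96 GiB per-GPU memory figure is an assumption to replace with your card's real capacity.

```python
def time_sliced_view(physical_gpus: int, replicas: int,
                     memory_gib_per_gpu: float) -> tuple[int, float]:
    """Return (GPUs the scheduler advertises, average memory per virtual GPU).

    Nothing enforces the memory share: one greedy pod can take the whole
    card and push its neighbours into out-of-memory errors.
    """
    advertised = physical_gpus * replicas
    average_share_gib = memory_gib_per_gpu / replicas
    return advertised, average_share_gib


# One 8-GPU node with 4 replicas per GPU; 96 GiB per card is an assumed value.
print(time_sliced_view(8, 4, 96.0))
```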

What Kubernetes is good at: long-running inference services (vLLM Deployments scaled by HPA on queue depth, rolling updates, health checks — SLURM has nothing comparable), heterogeneous workloads on one cluster, ecosystem (Prometheus, Grafana, Argo). What you pay: a working K8s + GPU Operator + Kueue + Volcano + Kubeflow stack is at minimum 20 components. Realistic estimate: Kubernetes is 5–10× the operational load of SLURM for pure batch training.

Kubernetes is the right choice when: the cluster runs inference services alongside training, you have a platform team, declarative everything matters, or you need multi-cluster portability.

Ray and KubeRay — Python-native distributed compute

Ray is a different beast. It is not really a cluster scheduler in the SLURM/K8s sense — it is a distributed Python runtime that ships with a scheduler, an object store, and an autoscaler. You write Python with @ray.remote decorators and Ray figures out where to run each task.

Where Ray fits: hyperparameter tuning (Ray Tune — hundreds of trials in parallel, pruning bad ones early; the use case Ray was built for and still the best in class), reinforcement learning (RLlib — environments and learners as actors, which maps onto Ray cleanly and onto SLURM batch jobs poorly), distributed training (Ray Train, wrapping PyTorch DDP / FSDP / DeepSpeed), model serving (Ray Serve, Python-native, custom multi-model pipelines), and data preprocessing (Ray Data). What Ray is not: a multi-tenant batch scheduler for a shared cluster. Ray assumes one application owns the cluster.
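
"Pruning bad ones early" is the core idea behind schedulers such as Ray Tune's ASHA. The sketch below is a synchronous successive-halving loop in plain Python, written to show the idea; it does not use Ray's API, and successive_halving and toy_score are hypothetical names. Each round evaluates the surviving trials on a growing budget and keeps the best fraction.

```python
import math
from typing import Callable


def successive_halving(trials: list[float],
                       evaluate: Callable[[float, int], float],
                       rounds: int,
                       keep_fraction: float = 0.5) -> list[float]:
    """Keep the top keep_fraction of trials each round, doubling the budget."""
    survivors = list(trials)
    budget = 1
    for _ in range(rounds):
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget), reverse=True)
        survivors = ranked[:max(1, int(len(ranked) * keep_fraction))]
        budget *= 2
    return survivors


def toy_score(learning_rate: float, budget: int) -> float:
    """Stand-in for a validation metric: best at lr=1e-2, budget ignored."""
    return -abs(math.log10(learning_rate) + 2)


print(successive_halving([1e-1, 1e-2, 1e-3, 1e-4], toy_score, rounds=2))
```

In a real sweep, evaluate would train for budget epochs and return a validation score, and the trials would run in parallel across the cluster rather than in a loop.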

KubeRay is the operator that runs Ray on Kubernetes. Three CRDs: RayCluster (long-lived), RayJob (one-shot), RayService (Ray Serve, rolling updates). A minimal RayCluster:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: rl-cluster
spec:
  rayVersion: '2.55.0'
  enableInTreeAutoscaling: true   # required for KubeRay to scale workers between min and max
  headGroupSpec:
    rayStartParams: { dashboard-host: '0.0.0.0' }
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.55.0-py311-cu125
            resources:
              limits: { cpu: 8, memory: 32Gi }
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 4
      minReplicas: 0
      maxReplicas: 8
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.55.0-py311-cu125
              resources:
                limits:
                  cpu: 16
                  memory: 128Gi
                  nvidia.com/gpu: 4

With enableInTreeAutoscaling: true set in the RayCluster spec, KubeRay autoscales the worker group between minReplicas and maxReplicas based on Ray's view of pending tasks. Combined with Kueue (RayJob + Kueue gang scheduling is documented and works), you get multi-tenant Ray on a multi-tenant K8s cluster, which is the production setup most "Ray shops" actually run.
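
The scaling decision can be pictured as a clamp. The sketch below is a toy model under stated assumptions, not the Ray autoscaler's algorithm: desired_workers is a hypothetical helper, demand is expressed in GPUs only, and scale-down is not modelled.

```python
import math


def desired_workers(current: int, pending_gpus: int, gpus_per_worker: int,
                    min_replicas: int, max_replicas: int) -> int:
    """Add enough workers for the pending GPU demand, clamped to the group bounds."""
    extra = math.ceil(pending_gpus / gpus_per_worker)
    return max(min_replicas, min(current + extra, max_replicas))


# gpu-workers group from the example: 4 GPUs per worker, bounds 0..8, 4 running.
print(desired_workers(4, pending_gpus=10, gpus_per_worker=4,
                      min_replicas=0, max_replicas=8))
print(desired_workers(4, pending_gpus=40, gpus_per_worker=4,
                      min_replicas=0, max_replicas=8))
```

The second call shows the ceiling at work: demand beyond maxReplicas stays queued inside Ray until running tasks finish.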

When Ray standalone (without K8s) is right: a small research lab where one person owns the cluster, dynamism matters more than multi-tenancy, the workload is heavy on RL or hyperparameter search. The hybrid most teams converge on: SLURM or K8s as the base layer that owns nodes and multi-tenancy; Ray launched inside a SLURM job or a K8s namespace for the duration of one user's workload. Do not install Ray as your top-level scheduler.

Run:ai and Determined — the commercial tier

Two paid offerings come up often enough to address. Both target the same pain point: K8s + Volcano + Kueue + GPU Operator is a lot of YAML, and some organisations would rather buy than build.

NVIDIA Run:ai (acquired 2024, rebranded 2025) is a K8s-native GPU scheduler. GPU fractioning splits a single GPU between workloads at the memory level — 0.5 GPU requests, dynamic resizing, bin-packing. Run:ai earns its price in enterprise environments with 10+ ML teams competing for shared GPUs. Below that, Kueue + GPU Operator covers most of the same ground for free.

Determined.AI (HPE, 2021) is a managed training platform — experiment tracking, hyperparameter search, checkpoint management, distributed training in one product. Best fit for research teams who want a polished dashboard without integrating five tools themselves.

The interactive notebook problem

A pattern that catches almost every shared GPU cluster: researchers want JupyterHub access to GPUs for development, and that has to coexist with long-running training jobs that hold whole nodes. The naive answer — one dedicated GPU per notebook — wastes 80% of the cluster. The strict answer — make researchers sbatch everything — drives them onto laptops with toy models.

Three workable patterns: SLURM salloc + Jupyter on the allocation (salloc --gres=gpu:1 --time=4:00:00, start a Jupyter server on the allocated node, tunnel in; hard time limit, GPU properly accounted — cleanest answer for a SLURM cluster), Kubernetes + JupyterHub + Kueue (notebook pods through a low-priority queue, training jobs preempt idle notebooks), or Run:ai with GPU fractioning (notebooks get 0.25 / 0.5 GPU each). Whichever you pick, set a time limit on interactive sessions — a notebook with no timeout is a GPU permanently retired from the cluster.

The honest small-team reality

Most scheduler content online assumes a 256-GPU cluster and a platform team. The realistic Kentino customer is closer to 1–4 nodes, 4–32 GPUs total, and 2–6 users who all know each other and coordinate on Slack. At the small end of that range, you do not need a scheduler.

# user 1
ssh node01
tmux new -s my-training
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py
# Ctrl-B D to detach

# user 2 — coordinates on Slack first
ssh node01
nvidia-smi                              # which GPUs are free?
tmux new -s other-training
CUDA_VISIBLE_DEVICES=4,5,6,7 python train.py

That is the entire job-scheduling system. SLURM starts paying for itself around 16 GPUs or 8 users, whichever comes first. Below that, the operational overhead exceeds the value. Install a scheduler when human coordination is failing, not before.
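
If eyeballing nvidia-smi gets tedious, a small helper can list the idle GPUs before anyone sets CUDA_VISIBLE_DEVICES. This is a minimal sketch: free_gpus and query_free_gpus are hypothetical names, and the thresholds for "free" (under 500 MiB used, under 5% utilization) are assumptions to adjust. It relies on nvidia-smi's --query-gpu and --format=csv,noheader,nounits options.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"]


def free_gpus(csv_text: str, max_mem_mib: int = 500, max_util_pct: int = 5) -> list[int]:
    """Parse 'index, memory.used, utilization' rows and return idle GPU indices."""
    free = []
    for line in csv_text.strip().splitlines():
        index, mem_used, util = (field.strip() for field in line.split(","))
        if int(mem_used) <= max_mem_mib and int(util) <= max_util_pct:
            free.append(int(index))
    return free


def query_free_gpus() -> list[int]:
    """Run nvidia-smi on this node and return the indices that look idle."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return free_gpus(out)


sample = "0, 40213, 98\n1, 3, 0\n2, 12, 0\n3, 78011, 100\n"
print(free_gpus(sample))
```

A GPU that looks idle may still belong to someone whose job is between steps, so this complements the Slack message rather than replacing it.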

When each tool fits — the summary

Scenario	Recommendation
1–2 nodes, 4–16 GPUs, 2–6 users who talk to each other	SSH + tmux + nvidia-smi. No scheduler.
Research lab, 4–32 nodes, batch training jobs	SLURM. Boring, proven, fits.
Inference platform serving customer traffic	Kubernetes + GPU Operator. No batch scheduler.
Mixed cluster: training + inference + tooling	Kubernetes + Kueue + Volcano + Kubeflow.
Heavy distributed Python: RL, hyperparam search	Ray (or KubeRay) on top of SLURM or K8s.
10+ ML teams competing for GPUs, budget for tooling	Run:ai on Kubernetes.
Managed training experiments + tracking	Determined.AI.
200+ GPUs, multi-team, multi-workload	Federated: SLURM for batch, K8s for services.

What to do next — the decision tree

Walk this in order. A "yes" at step 1 ends the conversation; each later step refines the answer from the steps before it.

1. Fewer than 16 GPUs and fewer than 8 users who talk to each other? Yes → no scheduler. Document conventions in a markdown file. Revisit when coordination breaks.
2. Cluster running long-running inference services as well as training? Yes → Kubernetes is the base. Add GPU Operator, then Kueue, then Volcano if distributed training is in scope. No → SLURM is the base.
3. Have a platform team or budget for one (0.5–1.0 FTE for six months, 0.25 FTE steady-state)? No → stay on SLURM regardless of workload mix. The K8s tax is real.
4. More than 10 teams competing for GPU time with strict quota needs? Yes → evaluate Run:ai. No → Kueue + cohorts gets you 80% of the way.
5. Workload heavy on RL, hyperparameter search, or distributed Python? Yes → Ray (KubeRay on K8s, or salloc + Ray on SLURM). No → leave Ray out.
6. Hardware MIG-capable (datacenter cards)? Not on the Kentino lineup → plan for whole-GPU allocation, or time-slicing in dev only. Do not design around MIG.

Most Kentino conversations end at step 1, 2, or 3. The rest is decoration.
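
For teams that prefer the tree in executable form, here is one possible encoding. It is a sketch of one reading of the questions above, not an official tool: recommend and its parameters are hypothetical, the MIG question is left out, and Run:ai is only suggested on a Kubernetes base because it is K8s-native.

```python
def recommend(gpus: int, users: int, users_coordinate: bool,
              inference_services: bool, platform_team: bool,
              teams: int, ray_heavy: bool) -> list[str]:
    """Return the suggested stack, base layer first."""
    if gpus < 16 and users < 8 and users_coordinate:
        return ["SSH + tmux + nvidia-smi"]
    # Kubernetes only when services are in scope and someone can operate it.
    if inference_services and platform_team:
        stack = ["Kubernetes", "GPU Operator", "Kueue"]
        if teams > 10:
            stack.append("evaluate Run:ai")
    else:
        stack = ["SLURM"]
    if ray_heavy:
        stack.append("KubeRay" if stack[0] == "Kubernetes" else "Ray inside salloc")
    return stack


print(recommend(8, 4, True, False, False, 1, False))
print(recommend(64, 20, False, True, False, 3, True))
print(recommend(128, 40, False, True, True, 12, False))
```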

Companion articles: distributed training in K02, inference clusters in K03, cluster storage in K04, failure handling in K06, the PCIe-bandwidth ceiling in K07.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
