k10s devlog

k10s is two things:

kitty: a DaemonSet that runs on your Kubernetes cluster and collects node-level GPU and network telemetry
k10s (kittens): a TUI/CLI that shows the ML training jobs in your cluster and surfaces misbehaving ranks

The outcomes being:

You get alerted when GPUs are idle (straggler ranks) or misbehaving
Your ML workloads are stable
You don't have to leave your terminal

Installation

Helm (recommended)

helm repo add k10s https://shvbsle.github.io/k10s
helm repo update
helm install kitty k10s/kitty

This creates the k10s namespace and deploys the kitty DaemonSet with GPU node tolerations out of the box. To see all available values, run: helm show values k10s/kitty
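A quick way to confirm the install landed is sketched below as a small shell helper. It assumes kubectl is already pointed at the same cluster. The function name check_kitty is made up for illustration; the k10s namespace is the one mentioned above.

```shell
# Hypothetical helper: list the DaemonSet and its pods in the k10s namespace.
# Assumes kubectl is installed and configured for the target cluster.
check_kitty() {
  # Desired vs. ready counts show whether kitty is scheduled on every expected node.
  kubectl get daemonset --namespace k10s
  # The wide output adds the node each pod is running on.
  kubectl get pods --namespace k10s --output wide
}
```

Run check_kitty after helm install finishes. If the ready count stays below the desired count, the pod list shows which nodes are still missing a pod.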

Add one env var to your training pods and their metrics appear at :9100/metrics, labeled by rank:

env:
  - name: KITTY_WATCH
    value: "1"
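To spot-check a single rank from the terminal, something like the following sketch may help. It assumes the endpoint serves the usual Prometheus text format and that the rank label is literally named rank; neither is confirmed here, so adjust to what your endpoint actually returns. The function name filter_rank and the METRICS_HOST variable are placeholders for illustration.

```shell
# Hypothetical helper: keep only samples whose labels include rank="<N>".
# Reads Prometheus text-format metrics on stdin. The label name "rank" is an assumption.
filter_rank() {
  # Drop "# HELP" / "# TYPE" comment lines, then match the label exactly
  # (preceded by "{" or "," so that e.g. local_rank="3" does not match).
  grep -v '^#' | grep -E "[{,]rank=\"$1\"[,}]"
}

# Example usage (METRICS_HOST is a placeholder for wherever :9100 is reachable):
#   curl -s "http://${METRICS_HOST}:9100/metrics" | filter_rank 3
```

The filter is deliberately generic: it does not depend on any particular metric name, only on the label.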

See Metrics Reference for the full list and how each metric works under the hood.

We are a bunch of nerds who are into Kubernetes and GPUs. Read the dev log to see what we are up to! I recommend starting here.