k10s is two things:
- kitty: a DaemonSet that runs on your Kubernetes cluster and collects node-level GPU and network telemetry
- k10s (kittens): a TUI/CLI that shows the ML training jobs in your cluster and surfaces misbehaving ranks
The outcomes being:
- You get alerted when GPUs are idle (straggler ranks) or misbehaving
- Your ML workloads are stable
- You don't have to leave your terminal
Installation
Helm (recommended)
helm repo add k10s https://shvbsle.github.io/k10s
helm repo update
helm install kitty k10s/kitty
This creates the k10s namespace and deploys the kitty DaemonSet with GPU node tolerations out of the box. To see all available values, run:
helm show values k10s/kitty
Add one env var to your training pods and the metrics appear at :9100/metrics, labeled by rank:
env:
  - name: KITTY_WATCH
    value: "1"
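For context, here is a minimal sketch of where that env entry sits in a full manifest. The Pod name, container name, and image below are hypothetical placeholders, not anything k10s requires; only the KITTY_WATCH variable comes from this README.

```yaml
# Hypothetical training Pod: metadata.name, the container name, and the image
# are placeholders. Only the KITTY_WATCH env var is taken from this README.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-rank-0
spec:
  containers:
    - name: trainer
      image: example.com/my-training-image:latest
      env:
        - name: KITTY_WATCH
          value: "1"   # quoted, because env values must be strings
```

If your training pods are created by a Job, StatefulSet, or similar controller, the same env entry goes under spec.template.spec.containers[].env in that resource's pod template.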
See Metrics Reference for the full list and how each metric works under the hood.
We are a bunch of nerds who are into Kubernetes and GPUs. Read the dev log to see what we are up to! We recommend starting here.