k10s devlog

k10s is two things:

The outcomes being:


Installation

helm repo add k10s https://shvbsle.github.io/k10s
helm repo update
helm install kitty k10s/kitty

This creates the k10s namespace and deploys the kitty DaemonSet with GPU node tolerations out of the box. To see all available values, run: helm show values k10s/kitty

Add one environment variable to your training pods and the metrics appear at :9100/metrics, labeled by rank:

env:
  - name: KITTY_WATCH
    value: "1"

See Metrics Reference for the full list and how each metric works under the hood.
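To consume the per-rank metrics from a script, here is a minimal sketch that parses Prometheus-style text exposition and groups samples by their rank label. It assumes the endpoint serves the standard Prometheus text format. The metric name kitty_example_metric and the helper group_by_rank are hypothetical placeholders, not names taken from k10s; check the Metrics Reference for the real metric names.

```python
import re
from collections import defaultdict

# Hypothetical sample in Prometheus text format. The metric name
# "kitty_example_metric" is invented for illustration, not taken from k10s.
SAMPLE = """\
# HELP kitty_example_metric Placeholder metric for illustration.
# TYPE kitty_example_metric gauge
kitty_example_metric{rank="0"} 1.5
kitty_example_metric{rank="1"} 2.5
"""

# Matches: metric_name{optional="labels"} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def group_by_rank(text):
    """Return {rank: {metric_name: value}} for samples that carry a rank label."""
    grouped = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if "rank" in labels:
            name = match.group("name")
            grouped[labels["rank"]][name] = float(match.group("value"))
    return dict(grouped)


if __name__ == "__main__":
    for rank, metrics in sorted(group_by_rank(SAMPLE).items()):
        print(rank, metrics)
```

Against a live deployment, the text passed to group_by_rank would come from an HTTP GET of the :9100/metrics endpoint. The host to use depends on how you reach the pod (for example through a port-forward), which this sketch leaves out.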


We are a bunch of nerds who are into Kubernetes and GPUs. Read the dev log to see what we are up to! We recommend starting here.