Designing k10s
I spent the first few weeks building k10s without really knowing what it was supposed to show.
I knew the problem: k9s is great for Kubernetes operations but it's blind to GPU cluster health. An H100 sitting at 0% utilization still costs about $3/hr[1], and k9s will tell you the node is Ready. Something had to change. But I started building before I understood the design, and that was a mistake I paid for later.
Eventually I landed on a concept that reframed everything. I'm gonna call it the design atom.
What is a design atom?
Every navigation tool picks a fundamental unit. The object it reasons about, sorts, filters, and lets you drill into. In k9s the atom is the Kubernetes resource (pods, deployments, services, nodes). That choice made total sense when k9s was built. Kubernetes in 2016 was for microservices. The things you cared about were: is my service running? Which pod is crashing? How many replicas?
Pod-centric tooling was the right design for that world.
The question I had to answer for k10s: what is the right atom for a GPU cluster running distributed training in 2026?
Spoiler: it's not pods.
What happens when you use the wrong atom
I learned this by actually running training on a small cluster. 10 g4dn.xlarge instances (one T4 each, about 5 Gbps ethernet). I was using PyTorch DDP (Distributed Data Parallel, the standard setup for multi-GPU training).
Here's how DDP works. Each rank runs compute on its own mini-batch, then all ranks do an all-reduce to average gradients across the cluster. The all-reduce is a synchronization barrier. Every rank has to contribute before any rank can proceed to the next step.
This means the slowest rank sets the pace. Always.
The problem: the slow rank is invisible. Every rank reports the same elapsed time per step because they all crossed the barrier at the same moment. The healthy ranks sit idle at the barrier waiting, but nothing in the logs shows this. kubectl logs from each pod look identical. kubectl get pods looks healthy. The cluster looks fine. It's burning money quietly.
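Here's a minimal sketch of the barrier math (hand-timed, not what DDP literally does internally -- real DDP overlaps the all-reduce with the backward pass -- but the effect on per-rank timing is the same):

```python
import time

import torch
import torch.distributed as dist

# Minimal sketch (hand-rolled, not k10s code): time one DDP-style step to
# show why every rank reports the same elapsed time. Assumes the NCCL
# process group is already initialized and each rank has its model and
# mini-batch on its own GPU.
def timed_step(model, batch, loss_fn):
    start = time.perf_counter()

    # Local compute: forward + backward on this rank's own mini-batch.
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()
    compute_time = time.perf_counter() - start

    # Gradient averaging: a synchronization barrier. A fast rank finishes
    # its compute early and then just waits here for the slowest rank.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad)  # defaults to SUM
            p.grad /= world_size
    torch.cuda.synchronize()

    step_time = time.perf_counter() - start
    # step_time comes out nearly identical on every rank; only the
    # compute/wait split differs, and that split never reaches the logs.
    return compute_time, step_time
```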
I had a run where one rank was consistently dragging. To find it I had to:
- SSH into each node separately
- Run nvidia-smi on each node and compare power draw (GPU-Util was useless here -- NCCL communication kernels count as "utilized" even when tensor cores are idle, so power draw is the real signal)
- kubectl exec -it pod-N -- ss -tii into each pod, find the NCCL connections, compare rtt/minrtt ratios and retrans counts across all pods
That's 20+ commands across 10 nodes to answer one question: which rank is the bottleneck?
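For what it's worth, the power-draw half of that round looks roughly like this when you script it (pod names are made up; this is the chore, not k10s code):

```python
import subprocess

# Rough script of the manual power-draw check across all ranks.
# Pod names here are hypothetical.
pods = [f"trainjob-worker-{i}" for i in range(10)]

for pod in pods:
    result = subprocess.run(
        ["kubectl", "exec", pod, "--",
         "nvidia-smi",
         "--query-gpu=power.draw,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(pod, result.stdout.strip())
# On this cluster the outlier in power draw was the rank to look at:
# GPU-Util read high everywhere, power draw didn't.
```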
It gets worse. On a different run, a rank OOM'd mid-training. 15.2 GB on a 15.4 GB VRAM budget. The other 9 ranks entered the all-reduce and waited. NCCL has a watchdog timer (default: 10 minutes). So for 10 minutes, 9 GPUs just sat there burning money doing nothing. Then the watchdog fired and printed an error containing an IP address (not a rank number, not a pod name, just an IP you have to reverse-lookup against kubectl get pods -o wide), and killed everything.
kubectl get pods: 0/10 completions, no explanation. The TrainJob restarted all 10 pods. The same rank OOM'd again. Crashloop. And kubectl get nodes said all nodes were Ready the entire time.
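The 10 minutes comes from the collective timeout on the NCCL process group. You can set it explicitly when the training script initializes distributed -- a sketch, with an illustrative value; the exact default depends on your PyTorch version:

```python
from datetime import timedelta

import torch.distributed as dist

# Sketch: the 10 minutes the healthy ranks spent waiting is the collective
# timeout on the NCCL process group. Setting it explicitly makes a hung
# all-reduce surface faster; 2 minutes here is illustrative.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=2),
)
```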
This is what pod-centric tooling misses. It's looking at the right layer for microservices. It's looking at the wrong layer for distributed training.
The right atoms
Once I understood the actual pain, the design decision became obvious.
The atoms in k10s are GPUs and Ranks.
A GPU is the hardware unit that costs money when idle. A rank is the logical unit of a distributed training job. These are the objects you actually think about when something goes wrong. "Rank 5 is slow." "This node's H100s have been idle for 4 hours." Not "pod gpt2-job-worker-5 is in Error state."
This also defines the information join that k10s needs to do. The stuff you need is scattered across five places:
- k8s API (pod status, events, scheduling decisions, taints)
- DCGM[2] (real GPU metrics: SM activity, tensor core utilization, memory)
- Training CRD status (PyTorchJob, RayJob, MPIJob, JobSet)
- Kueue[3] (queue state, admission, preemption)
- nvidia-smi power draw (the signal GPU-Util hides)
No existing tool joins these. k9s doesn't know about DCGM. kubectl doesn't know about Kueue admission state. DCGM dashboards don't know about training rank topology. Everyone is showing you one slice. k10s needs to be the terminal-native join.
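Concretely, the join looks something like one record per GPU with fields from each source. This is my sketch, not the actual k10s schema:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the join, one record per GPU. Field names are mine, not the
# actual k10s schema; Optional fields cover sources that aren't installed.
@dataclass
class GPURecord:
    # k8s API
    node: str
    pod: Optional[str]            # None when the GPU is unallocated
    pod_phase: Optional[str]
    taints: list[str]
    # DCGM (absent DCGM renders as "---", never a stack trace)
    sm_activity: Optional[float]
    tensor_core_util: Optional[float]
    mem_used_gb: Optional[float]
    # Training CRD (PyTorchJob, RayJob, MPIJob, JobSet)
    job_name: Optional[str]
    rank: Optional[int]
    # Kueue
    queue: Optional[str]
    admitted: Optional[bool]
    # nvidia-smi
    power_draw_w: Optional[float]
```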
Three views
From the atom decision, three views fell out naturally.
Fleet View is the default landing. GPU nodes as first-class objects, sorted idle-first. An H100 at 0% sorts to the top, highlighted amber. After 6 hours it escalates to red. The convention is inverted from normal ops tooling. Idle is loud, not quiet. An idle H100 is a problem. It should look like one.
NODE              MODEL      GPUs  UTIL              MEM   TEMP  WORKLOAD
ip-172-31-27-62   H100-80GB  8x    [          ]  0%   0%   38C   IDLE 4h12m
ip-172-31-26-30   H100-80GB  8x    [##.       ] 18%  12%   45C   inference-gemma
ip-172-31-17-242  H100-80GB  8x    [########. ] 87%  71%   68C   pytorch-job-47
ip-172-31-5-66    H100-80GB  8x    [##########] 99%  89%   74C   training-run-kappa
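The sort and highlight rule is simple enough to sketch in a few lines (thresholds are the ones above; the function names are mine, not k10s internals):

```python
# Idle-first ordering plus the amber/red escalation described above.
RED_AFTER_HOURS = 6

def severity(util_pct: float, idle_hours: float) -> str:
    if util_pct > 0:
        return "normal"
    return "red" if idle_hours >= RED_AFTER_HOURS else "amber"

def fleet_sort_key(node: dict) -> tuple:
    # Lowest utilization on top, longest-idle first on ties.
    return (node["util_pct"], -node["idle_hours"])

nodes = [
    {"name": "ip-172-31-5-66", "util_pct": 99, "idle_hours": 0.0},
    {"name": "ip-172-31-27-62", "util_pct": 0, "idle_hours": 4.2},
]
for n in sorted(nodes, key=fleet_sort_key):
    print(n["name"], severity(n["util_pct"], n["idle_hours"]))
# ip-172-31-27-62 amber
# ip-172-31-5-66 normal
```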
Jobs View groups by training CRD, not by namespace or label. The rows are ranks, not pods. Per-rank GPU utilization shows as sparklines so you can spot stragglers without running anything. Straggler detection is rule-based: if any rank's rolling-window utilization is more than 2 standard deviations below the median, it gets flagged automatically.
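The straggler rule is equally simple. A sketch, with the rolling window reduced to a plain list of recent samples:

```python
import statistics

# Flag any rank whose rolling-window utilization sits more than 2 standard
# deviations below the median across all ranks.
def stragglers(window_by_rank: dict[int, list[float]], k: float = 2.0) -> list[int]:
    means = {rank: statistics.fmean(w) for rank, w in window_by_rank.items()}
    median = statistics.median(means.values())
    spread = statistics.pstdev(means.values())
    return [rank for rank, m in means.items()
            if spread > 0 and m < median - k * spread]

# Example: rank 5 lags the other seven ranks and gets flagged.
utilization = {rank: [92.0, 94.0, 93.0] for rank in range(8)}
utilization[5] = [41.0, 39.0, 44.0]
print(stragglers(utilization))  # [5]
```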
Queue View shows Kueue state if Kueue is present. LocalQueue to ClusterQueue hierarchy, pending workloads, admitted jobs. If Kueue isn't deployed, it says so instead of crashing. Graceful degradation is mandatory throughout -- DCGM absent shows "---", not a stack trace.
The y diagnostic
Pressing y on any row answers the question I was spending 20+ commands on: why is this GPU idle? Why did this rank fail?
It runs rule-based checks against everything k10s can see simultaneously:
- Is the node tainted and does the workload tolerate it?
- Does the workload fit the available GPU memory?
- Is the queue at capacity?
- Did the last pod OOM? What was the memory headroom?
- Are there scheduling failure events in the last 10 minutes?
For the crashloop scenario above: rank OOM'd, 9 ranks waited 10 minutes, everything died. y surfaces: "Rank 9 OOM at step 247 (15.2GB / 15.4GB). 9 ranks hit NCCL watchdog timeout (10m). Reduce batch size or enable gradient checkpointing."
Boom. One keypress.
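Under the hood this is just another rule over the joined data. A sketch of the OOM-plus-watchdog check (field names follow the join sketch above; the message format is illustrative, not actual k10s output):

```python
from typing import Optional

# Detect the OOM-plus-watchdog pattern from the joined per-rank records.
def oom_watchdog_rule(ranks: list[dict]) -> Optional[str]:
    oomed = [r for r in ranks if r.get("termination_reason") == "OOMKilled"]
    if not oomed:
        return None
    victim = oomed[0]
    waiting = len(ranks) - len(oomed)
    return (
        f"Rank {victim['rank']} OOM at step {victim['last_step']} "
        f"({victim['mem_used_gb']:.1f}GB / {victim['mem_total_gb']:.1f}GB). "
        f"{waiting} ranks hit NCCL watchdog timeout. "
        "Reduce batch size or enable gradient checkpointing."
    )
```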
Keep an eye on the k10s repo.
Footnotes:
[1] H100 on-demand: ~$3.10/hr on Lambda, ~$2.50/hr on CoreWeave
[2] DCGM: NVIDIA Data Center GPU Manager, gives you real GPU metrics beyond what nvidia-smi surfaces
[3] Kueue: kubernetes-sigs/kueue, batch workload queueing for Kubernetes