Nvidia GPUs on a Bare-Metal Kubernetes Cluster with NixOS: A Rabbit Hole Adventure
To scale his machine learning framework, MAZE, the author attempted to enable Nvidia GPU support on his Kubernetes cluster, comprising three mini-PCs and a retired workstation. This proved far more challenging than anticipated, involving hurdles such as configuring the Nvidia device plugin, navigating the complexities of a NixOS environment, and deploying PKI certificates. He ultimately succeeded, sharing his experiences deploying a Kubernetes cluster using NixOS, Ansible, and Sops, alongside a deep dive into CRI, CDI, nvidia-container-toolkit, and more. He also developed nix-playground, a tool to simplify patching and building open-source projects, and leveraged Grok 3 for debugging. Along the way, he encountered further challenges like PyCharm issues with WSL NixOS and Kubernetes RuntimeClass configuration. The entire journey, akin to Alice's Adventures in Wonderland, highlights the author's impressive execution power and problem-solving skills.