
docs: update GPU operator guide with Civo-tuned flags and H100 NVLink…#211

Merged
jokestax merged 1 commit into main from docs/gpu-operator-h100-nvlink
Apr 20, 2026

@jokestax
Contributor

Summary

Updates the GPU Clusters doc (kubernetes/advanced/gpu-config.md) so customers get a reliable install on the first try on Civo's GPU image, and have clear guidance for the single-GPU H100 case, which currently fails without a workaround.

What changed

  • Install command tuned for Civo's image. Replaced the bare helm install --generate-name snippet with a helm upgrade --install that passes toolkit.enabled=false (the NVIDIA container toolkit is already baked into Civo's GPU image), plus driver.enabled=true, devicePlugin.enabled=true, gfd.enabled=true, operator.defaultRuntime=containerd, and validator.cuda.runtimeClassName=nvidia. Added a table explaining what each flag does.
  • New section: Single-GPU H100 nodes — NVLink workaround. On a node with only one H100, the driver fails to load because NVLink has no peer. Documented the nvidia-kernel-config ConfigMap with NVreg_NvLinkDisable=1 and the matching driver.kernelModuleConfig.name helm flag. Includes a warning not to apply this on multi-H100 nodes.
  • B200 added to the supported GPU list.
  • GPU Operator version pinned — called out chart/app v25.10.1 as the version Civo has validated end-to-end, with a --version 25.10.1 tip.
  • Troubleshooting expanded — H100 CrashLoopBackOff now points at the NVLink section; added guidance for pods stuck Pending on nvidia.com/gpu.
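
For reviewers who want to see the described flags in one place, the following is a minimal sketch of how the install might be assembled. The `--set` flags, the `nvidia-kernel-config` ConfigMap name, and `--version 25.10.1` come from the list above. The release name, namespace, chart reference (`nvidia/gpu-operator`), the `nvidia.conf` ConfigMap key, and the `SINGLE_H100` and `DRY_RUN` variables are assumptions made for illustration and may differ from the merged doc. The script prints the commands by default instead of running them.

```shell
#!/bin/sh
# Sketch only. Flags and version are taken from the PR description; the
# release name, namespace, chart reference, and ConfigMap key are assumptions.
set -eu

NAMESPACE="${NAMESPACE:-gpu-operator}"   # assumed namespace
RELEASE="${RELEASE:-gpu-operator}"       # assumed release name
CHART="${CHART:-nvidia/gpu-operator}"    # assumes the NVIDIA Helm repo is added as "nvidia"

# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

install_gpu_operator() {
  # Base flags: skip the toolkit (already in Civo's GPU image), enable the
  # driver, device plugin, and GPU feature discovery, and use containerd.
  set -- upgrade --install "$RELEASE" "$CHART" \
    --namespace "$NAMESPACE" --create-namespace \
    --version 25.10.1 \
    --set toolkit.enabled=false \
    --set driver.enabled=true \
    --set devicePlugin.enabled=true \
    --set gfd.enabled=true \
    --set operator.defaultRuntime=containerd \
    --set validator.cuda.runtimeClassName=nvidia

  # Single-GPU H100 nodes only: disable NVLink through a kernel module
  # ConfigMap. Do not set SINGLE_H100=1 for multi-H100 nodes.
  # Assumes the namespace already exists and that the operator reads the
  # parameters from a key named nvidia.conf (an assumption, not from the PR).
  if [ "${SINGLE_H100:-0}" = "1" ]; then
    run kubectl create configmap nvidia-kernel-config \
      --namespace "$NAMESPACE" \
      --from-literal=nvidia.conf=NVreg_NvLinkDisable=1
    set -- "$@" --set driver.kernelModuleConfig.name=nvidia-kernel-config
  fi

  run helm "$@"
}

install_gpu_operator
```

Running the script as-is prints the `helm upgrade --install` command for review. Setting `SINGLE_H100=1` additionally prints the ConfigMap creation step and appends the `driver.kernelModuleConfig.name` flag, and `DRY_RUN=0` executes the commands against the current kubectl context.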

Why

The previous install example produced a working-but-suboptimal setup on Civo clusters (the Operator would layer its own container toolkit on top of the one Civo already provides), and single-GPU H100 nodes failed with no customer-facing explanation. These edits bring the doc in line with what our internal testing has validated.

Member

@hlts2 hlts2 left a comment

LGTM

@jokestax jokestax merged commit 5060ebe into main Apr 20, 2026
3 checks passed
@jokestax jokestax deleted the docs/gpu-operator-h100-nvlink branch April 20, 2026 07:53
