docs: update GPU operator guide with Civo-tuned flags and H100 NVLink…#211
Merged
## Summary
Updates the GPU Clusters doc (`kubernetes/advanced/gpu-config.md`) so customers get a reliable install on the first try on Civo's GPU image, and so the single-GPU H100 case, which currently fails without a workaround, has clear guidance.

## What changed

- Replaced the `helm install --generate-name` snippet with a `helm upgrade --install` command that passes `toolkit.enabled=false` (the NVIDIA container toolkit is already baked into Civo's GPU image), plus `driver.enabled=true`, `devicePlugin.enabled=true`, `gfd.enabled=true`, `operator.defaultRuntime=containerd`, and `validator.cuda.runtimeClassName=nvidia`. Added a table explaining what each flag does.
- Added an `nvidia-kernel-config` ConfigMap with `NVreg_NvLinkDisable=1` and the matching `driver.kernelModuleConfig.name` Helm flag, with a warning not to apply this on multi-H100 nodes.
- Added a tip to pin the chart with `--version 25.10.1`.
- The `CrashLoopBackOff` troubleshooting entry now points at the NVLink section; added guidance for pods stuck `Pending` on `nvidia.com/gpu`.

## Why
The previous install example produced a working but suboptimal setup on Civo clusters (the Operator layered its own container toolkit on top of the one Civo already provides), and single-GPU H100 nodes failed with no customer-facing explanation. These edits bring the doc in line with what our internal testing has validated.
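For reference, the flags described above can be combined into a single idempotent install command. This is a sketch rather than the doc's exact snippet: the release name `gpu-operator`, the `nvidia` chart repo alias, and the namespace are assumptions, not taken from the PR.

```shell
# Add NVIDIA's Helm repo (the usual source for the GPU Operator chart).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Idempotent install/upgrade with the Civo-tuned flags.
# toolkit.enabled=false: the container toolkit is already in Civo's GPU image.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version 25.10.1 \
  --set toolkit.enabled=false \
  --set driver.enabled=true \
  --set devicePlugin.enabled=true \
  --set gfd.enabled=true \
  --set operator.defaultRuntime=containerd \
  --set validator.cuda.runtimeClassName=nvidia
```

Using `helm upgrade --install` instead of `helm install --generate-name` means reruns converge on the same release rather than creating a new one each time.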
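The single-H100 NVLink workaround might look like the following sketch. The ConfigMap key name `nvidia.conf` is an assumption about how the operator consumes kernel-module options, and the doc's warning applies: do not use this on multi-H100 nodes.

```shell
# Single-GPU H100 workaround only (do NOT apply on multi-H100 nodes):
# ship NVreg_NvLinkDisable=1 as a kernel-module option via a ConfigMap.
# The key name "nvidia.conf" is an assumption, not from the PR.
kubectl create configmap nvidia-kernel-config \
  --namespace gpu-operator \
  --from-literal=nvidia.conf='options nvidia NVreg_NvLinkDisable=1'

# Point the driver container at the ConfigMap, keeping all other values.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set driver.kernelModuleConfig.name=nvidia-kernel-config
```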