Node Agent

`node-agent` monitors the health of Kubernetes nodes and can automatically reboot VM instances when necessary. A reboot is triggered when a node fails one or more health checks (e.g. NodeReady, GPU count, Cilium, DiskPressure) for longer than the configured threshold.

By default it runs in monitor-only mode, logging recovery actions without executing them. Set `monitorOnly=false` to enable actual reboots.

Prerequisites: `civo-api-access` Secret

The `civo-api-access` secret is automatically provisioned by Civo in the `kube-system` namespace of every Civo Kubernetes cluster. It contains the API credentials and cluster identity used by `node-agent`:

| Key | Description |
| --- | --- |
| `api-key` | Civo API key used for reboot operations. |
| `api-url` | Civo API URL. |
| `cluster-id` | The ID of this Civo Kubernetes cluster. |
| `region` | The Civo region this cluster runs in. |

No manual setup is required — node-agent reads these values directly from the existing secret.
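
As an illustration of how the four keys above fit together, here is a minimal Go sketch that validates Secret-style data (a map of key to raw bytes, which is how Kubernetes exposes Secret values). The `CivoAccess` type and `parseCivoAccess` function are hypothetical names chosen for this example and are not taken from the node-agent source; the sample values in `main` are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// CivoAccess holds the four values documented for the civo-api-access Secret.
// The type name is hypothetical and used only for this sketch.
type CivoAccess struct {
	APIKey    string
	APIURL    string
	ClusterID string
	Region    string
}

// parseCivoAccess checks that every documented key is present and non-empty,
// then copies the values into a CivoAccess.
func parseCivoAccess(data map[string][]byte) (CivoAccess, error) {
	get := func(key string) string { return strings.TrimSpace(string(data[key])) }

	required := []string{"api-key", "api-url", "cluster-id", "region"}
	var missing []string
	for _, key := range required {
		if get(key) == "" {
			missing = append(missing, key)
		}
	}
	if len(missing) > 0 {
		return CivoAccess{}, fmt.Errorf("civo-api-access is missing keys: %s", strings.Join(missing, ", "))
	}

	return CivoAccess{
		APIKey:    get("api-key"),
		APIURL:    get("api-url"),
		ClusterID: get("cluster-id"),
		Region:    get("region"),
	}, nil
}

func main() {
	// Placeholder values; a real cluster provides these in the Secret.
	access, err := parseCivoAccess(map[string][]byte{
		"api-key":    []byte("example-key"),
		"api-url":    []byte("https://api.example.invalid"),
		"cluster-id": []byte("example-cluster"),
		"region":     []byte("example-region"),
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cluster", access.ClusterID, "in region", access.Region)
}
```

Reporting all missing keys at once, rather than failing on the first, makes a misprovisioned secret easier to diagnose from a single log line.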

NVIDIA GPU Operator (GPU clusters only)

The GPU health check relies on the `nvidia.com/gpu.count` label added by the NVIDIA GPU Feature Discovery component. Follow the Civo documentation to install the NVIDIA GPU Operator on your cluster:

Installing the NVIDIA GPU Operator

Install `node-agent` chart

Clone this repository to get the `charts` directory. Then, from the root of your cloned `node-agent` repository, run:

helm upgrade -n kube-system --install node-agent ./charts

To enable active recovery (actually reboot nodes):

helm upgrade -n kube-system --install node-agent ./charts --set monitorOnly=false

Configuration

Helm values (`values.yaml`)

| Value | Default | Description |
| --- | --- | --- |
| `nodePoolIDs` | `""` | Comma-separated node pool IDs to watch. Empty means all nodes. |
| `rebootWaitMinutes` | `10` | Minutes to wait after rebooting a standard node before retrying. |
| `gpuRebootWaitMinutes` | `40` | Minutes to wait after rebooting a GPU node before retrying. |
| `maxRebootRetries` | `5` | Maximum reboot attempts before the node transitions to `Failed` (no further reboots). |
| `monitorOnly` | `true` | If `true`, log recovery actions without executing them. Set `false` to enable reboots. |
| `metricsPort` | `9625` | Port for the Prometheus metrics endpoint. |

Health checkers

| Checker | Condition | Threshold |
| --- | --- | --- |
| `NodeReady` | `NodeReady == True` | 5 min |
| `DiskPressure` | `DiskPressure != True` | 30 min |
| `CiliumAgent` | `NetworkUnavailable == False` with reason `CiliumIsUp` (skipped for non-Cilium CNI) | 10 min |
| `GPU` | `allocatable["nvidia.com/gpu"]` equals `nvidia.com/gpu.count` label (skipped for non-GPU nodes) | 10 min |
