
feat(admin): /admin/healthz/leader for load-balancer leader probes #689

Merged

bootjp merged 2 commits into main from feat/admin-leader-healthz on Apr 27, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 27, 2026

Summary

  • Add /admin/healthz/leader so an upstream load balancer (Caddy, HAProxy, ...) can route admin traffic only to the current Raft leader, skipping the follower→leader AdminForward proxy hop on writes.
  • Mirrors the /healthz/leader endpoints already shipped on the S3 (adapter/s3.go:serveS3LeaderHealthz) and DynamoDB (adapter/dynamodb.go:serveDynamoLeaderHealthz) adapters — same semantics, same body shape, same method allowlist — so a multi-protocol load balancer probing any of the three sees identical responses.
  • New LeaderProbe interface in internal/admin; NewRouterWithLeaderProbe constructor (NewRouter delegates with nil probe, preserving existing callers); ServerDeps.LeaderProbe; newAdminLeaderProbe in main_admin.go consults the local Raft group runtimes.

Behaviour matrix

| Path                       | Probe state     | Response               |
|----------------------------|-----------------|------------------------|
| /admin/healthz             | (n/a)           | 200 ok                 |
| /admin/healthz/leader      | wired, leader   | 200 ok                 |
| /admin/healthz/leader      | wired, follower | 503 not leader         |
| /admin/healthz/leader      | not wired       | 404 not_found (JSON)   |
| /admin/healthz/leader POST | (any)           | 405 method_not_allowed |

Self-review (CLAUDE.md 5 lenses)

  1. Data loss — None. Read-only handler; no Raft proposals.
  2. Concurrency — VerifyLeader is bounded at 2 s, called only on nodes that already claim leadership via Status(). No new shared state.
  3. Performance — Cheap path: followers short-circuit via Status() (no Raft round-trip). Leader path: one ReadIndex per probe per group; load balancers typically probe every few seconds, so the overhead is bounded and well below the existing background HLC lease + ReadIndex traffic.
  4. Data consistency — VerifyLeader is the same ReadIndex round-trip the lease-read path uses; a stale-leader follower mid-silent-leadership-change cannot return 200.
  5. Test coverage — six router-level tests: 200-when-leader, 503-when-follower, HEAD on both branches, POST rejection, nil-probe → 404, classify ordering (path is not swallowed by the SPA fallback).

Test plan

  • go test -race ./internal/admin/... — passes (1.3s)
  • go test -race -run TestStartAdminServer . — passes
  • golangci-lint run ./... — clean
  • CI: full suite under -race

@claude review

Add a leader-only health endpoint at /admin/healthz/leader so an
upstream load balancer (Caddy, HAProxy, ...) can route admin
traffic only to the current Raft leader, skipping the
follower->leader AdminForward proxy hop on writes. Mirrors the
/healthz/leader endpoints already shipped on the S3
(adapter/s3.go:serveS3LeaderHealthz) and DynamoDB
(adapter/dynamodb.go:serveDynamoLeaderHealthz) adapters so a
multi-protocol load balancer probing any of the three sees
identical semantics.

Wiring:

- internal/admin: new LeaderProbe interface + LeaderProbeFunc
  adapter; ServerDeps.LeaderProbe; new
  NewRouterWithLeaderProbe constructor (NewRouter delegates with
  nil probe, preserving callers).
- Router.classify: new routeHealthzLeader matched BEFORE the SPA
  catch-all and BEFORE the /admin/healthz equality check so the
  longer path never resolves to index.html.
- Router.serveLeaderHealth: 200 + "ok\n" when probe returns true;
  503 + "not leader\n" otherwise. text/plain, Cache-Control:
  no-store, GET/HEAD only (mirrors S3/Dynamo).
- main_admin.go: newAdminLeaderProbe iterates the local raft
  group runtimes, short-circuits on Status().State !=
  StateLeader, and only nodes claiming leadership pay a 2 s
  ReadIndex VerifyLeader round-trip. "Leader of any local
  runtime" is treated as yes so a multi-group deployment that
  distributes leadership still surfaces a useful answer; in the
  common single-group / co-located case this collapses to "is
  this node THE leader". Returns nil for an empty runtimes slice
  so the router answers /admin/healthz/leader with a JSON 404
  ("feature off"), matching the Tables / Buckets / Queues
  pattern.

Behaviour matrix:

| Path                       | Probe state        | Response                   |
|----------------------------|--------------------|----------------------------|
| /admin/healthz             | (n/a)              | 200 ok                     |
| /admin/healthz/leader      | wired, leader      | 200 ok                     |
| /admin/healthz/leader      | wired, follower    | 503 not leader             |
| /admin/healthz/leader      | not wired          | 404 not_found (JSON)       |
| /admin/healthz/leader POST | (any)              | 405 method_not_allowed     |

Test coverage:

- TestRouter_HealthzLeader_ReturnsOKWhenLeader (200 + body shape)
- TestRouter_HealthzLeader_Returns503WhenNotLeader (503 + body)
- TestRouter_HealthzLeader_HeadOmitsBody (HEAD on both branches)
- TestRouter_HealthzLeader_RejectsPost (method allowlist)
- TestRouter_HealthzLeader_NilProbeReturns404 ("feature off")
- TestRouter_HealthzLeader_NotSwallowedBySPA (classify ordering)

Self-review (CLAUDE.md 5 lenses):
1. Data loss -- None. Read-only handler; no Raft proposals.
2. Concurrency -- VerifyLeader is bounded at 2 s, called only on
   nodes that already claim leadership via Status(). No new
   shared state.
3. Performance -- Cheap path: followers short-circuit via
   Status() (no Raft round-trip). Leader path: one ReadIndex per
   probe per group; load balancers typically probe every few
   seconds, so the overhead is bounded and well below the
   existing background HLC lease + ReadIndex traffic.
4. Data consistency -- VerifyLeader is the same ReadIndex round-
   trip the lease-read path uses; a stale-leader follower mid-
   silent-leadership-change cannot return 200.
5. Test coverage -- six router-level tests covering both happy
   paths, the wired-but-not-leader case, the not-wired case, the
   method allowlist, and the classify ordering that prevents the
   SPA fallback from swallowing the path.
@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 57 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d50530ef-aa89-4140-8ee7-414575aa0d35

📥 Commits

Reviewing files that changed from the base of the PR and between a3d312f and 689f2e4.

📒 Files selected for processing (5)
  • internal/admin/router.go
  • internal/admin/router_test.go
  • internal/admin/server.go
  • main_admin.go
  • main_admin_test.go


@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 57s.


Code Review: feat(admin): /admin/healthz/leader

  • Read changed files
  • Review internal/admin/router.go
  • Review internal/admin/router_test.go
  • Review internal/admin/server.go
  • Review main_admin.go
  • Review main_admin_test.go
  • Cross-check against S3/DynamoDB leader probe patterns
  • Post review findings

Overall the implementation is clean, well-structured, and consistent with the existing patterns in the codebase. I have three findings — one minor RFC compliance gap that is pre-existing but newly visible now that the endpoint claims parity with S3/DynamoDB (which do set the header), one dead-code nit, and one body-text divergence from S3 worth documenting.


Finding 1 — Missing Allow header on 405 (minor, pre-existing gap)

File: internal/admin/router.go:237–240

RFC 7231 §6.5.5 requires that a 405 Method Not Allowed response MUST include an Allow header listing the supported methods. Both reference implementations set it:

// adapter/s3.go:2404
w.Header().Set("Allow", "GET, HEAD")
w.WriteHeader(http.StatusMethodNotAllowed)

// adapter/dynamodb.go:399
w.Header().Set("Allow", "GET, HEAD")
http.Error(w, http.StatusText(http.StatusMethodNotAllowed), http.StatusMethodNotAllowed)

The new serveLeaderHealth (and all existing handlers in router.go) omit this. Since the PR explicitly claims "same semantics" as S3/DynamoDB, this gap is now more visible. A load balancer or health-check framework that inspects the Allow header will get an incomplete response.

This is pre-existing across serveHealth, serveAsset, and serveSPA too — fixing it in this PR would be the right time, since the parity claim is being made.


Finding 2 — Redundant nil guard inside serveLeaderHealth (nit, dead code)

File: internal/admin/router.go:242

func (rt *Router) serveLeaderHealth(w http.ResponseWriter, r *http.Request) {
    ...
    if rt.leader == nil || !rt.leader.IsVerifiedLeader() {  // rt.leader == nil can never be true here

dispatch() already short-circuits a nil probe to rt.notFound before serveLeaderHealth is ever called:

case routeHealthzLeader:
    if rt.leader == nil {
        return rt.notFound   // exits here when nil
    }
    return http.HandlerFunc(rt.serveLeaderHealth)  // only reached when non-nil

The rt.leader == nil branch inside serveLeaderHealth is dead code. It can only mislead a future reader into thinking it is load-bearing when it isn't. Removing it makes the precondition explicit.


Finding 3 — Body text diverges from S3 reference (documentation nit)

The PR description states "same body shape" as both S3 and DynamoDB, but there's a small difference:

| Adapter  | Success body | Failure body   |
|----------|--------------|----------------|
| S3       | "ok"         | "not leader"   |
| DynamoDB | "ok\n"       | "not leader\n" |
| Admin    | "ok\n"       | "not leader\n" |

Admin matches DynamoDB but not S3 (S3 omits the trailing newline). This won't affect load balancer behavior since LBs probe status codes, not bodies — but the PR description's "identical responses" claim is technically only true vs. DynamoDB. A one-liner update to the PR description would be sufficient.


Positive observations

  • Interface design is correct. LeaderProbe / LeaderProbeFunc mirrors ClusterInfoFunc. The nil-probe → 404 contract is clean and distinguishes "feature off" from "503 not leader" as intended.
  • Classify ordering is correct. pathHealthzLeader is evaluated before the SPA fallback (routeSPA catches all /admin/* paths). The test TestRouter_HealthzLeader_NotSwallowedBySPA locks this down well.
  • Short-circuit is correct. Status().State != raftengine.StateLeader returns early without the VerifyLeader ReadIndex round-trip for followers, matching the S3/DynamoDB pattern exactly.
  • Timeout is appropriate. The 2s adminLeaderProbeTimeout with context.Background() is the right choice here — there's no incoming request context to inherit and 2s matches the S3/DynamoDB probe reference.
  • Multi-group "any leader" semantics are documented inline. The logic is sound for the common single-group case and the multi-group edge is explained in the comment.
  • Test coverage is thorough. All six cases in the behavior matrix are covered, including the SPA regression guard. The TestRouter_HealthzLeader_HeadOmitsBody test correctly checks status code on the follower HEAD case (503 with empty body), which is the tricky part.
  • Wiring in main_admin.go is clean. Follows the existing bridge pattern; startAdminServer signature extension with nil default in tests is minimal.

Summary: Two small fixes worth making before merge (the Allow header for RFC compliance, and removing the dead nil guard); one documentation-only nit about S3 body parity. The core logic is sound and ready to ship.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new leader health check endpoint, /admin/healthz/leader, to facilitate load balancer routing to the Raft leader. It adds a LeaderProbe interface and integrates it into the admin router and server components. Review feedback recommends correcting a misleading comment about route classification, removing a redundant nil check in the health handler, and utilizing the server's lifecycle context within the leader probe to ensure proper task cancellation during shutdown.

Comment thread internal/admin/router.go Outdated
Comment on lines +153 to +155
// /admin/healthz/leader must be checked before /admin/healthz so
// the longer path does not fall through the equality check and
// resolve to the SPA fallback.

medium

The comment regarding path ordering is misleading. Since both /admin/healthz/leader and /admin/healthz are matched using exact equality (==), the order between them does not matter for correctness. The critical requirement is that both are checked before the SPA fallback (lines 162-164), which uses a prefix match and would otherwise swallow these routes.

Suggested change
// /admin/healthz/leader must be checked before /admin/healthz so
// the longer path does not fall through the equality check and
// resolve to the SPA fallback.
// Both healthz paths must be checked before the SPA catch-all
// so they are not swallowed by the index.html fallback.

Comment thread internal/admin/router.go Outdated
return
}
status, body := http.StatusOK, "ok\n"
if rt.leader == nil || !rt.leader.IsVerifiedLeader() {

medium

The check rt.leader == nil is redundant here. The dispatch method (lines 203-205) already ensures that rt.leader is non-nil before calling this handler. If it were nil, the router would have already returned a 404 JSON response via rt.notFound. Removing this check simplifies the logic and avoids suggesting that a nil probe would result in a 503 response.

Suggested change
if rt.leader == nil || !rt.leader.IsVerifiedLeader() {
if !rt.leader.IsVerifiedLeader() {

Comment thread main_admin.go Outdated
return errors.Wrap(err, "build admin leader forwarder")
}
_, err = startAdminServer(ctx, lc, eg, cfg, staticCreds, clusterSrc, tablesSrc, bucketsSrc, queuesSrc, forwarder, keyvizSampler, keyvizFanoutCfg, buildVersion())
leaderProbe := newAdminLeaderProbe(runtimes)

medium

Pass the server's lifecycle context to newAdminLeaderProbe to enable proper cancellation of background Raft verification tasks during shutdown.

Suggested change
leaderProbe := newAdminLeaderProbe(runtimes)
leaderProbe := newAdminLeaderProbe(ctx, runtimes)
References
  1. When designing interfaces, use context.Context for managing deadlines and cancellation to ensure proper resource management and graceful shutdown.

Comment thread main_admin.go Outdated
Comment on lines +167 to +188
func newAdminLeaderProbe(runtimes []*raftGroupRuntime) admin.LeaderProbe {
if len(runtimes) == 0 {
return nil
}
return admin.LeaderProbeFunc(func() bool {
for _, rt := range runtimes {
if rt == nil || rt.engine == nil {
continue
}
if rt.engine.Status().State != raftengine.StateLeader {
continue
}
ctx, cancel := context.WithTimeout(context.Background(), adminLeaderProbeTimeout)
err := rt.engine.VerifyLeader(ctx)
cancel()
if err == nil {
return true
}
}
return false
})
}
The reason will be displayed to describe this comment to others. Learn more.

medium

The newAdminLeaderProbe function should accept and use the server's lifecycle context. Using context.Background() for VerifyLeader means that these operations will not be cancelled if the server is shutting down. Since VerifyLeader involves network I/O and Raft processing, respecting the server's context is a best practice for graceful shutdown and resource management.

func newAdminLeaderProbe(ctx context.Context, runtimes []*raftGroupRuntime) admin.LeaderProbe {
	if len(runtimes) == 0 {
		return nil
	}
	return admin.LeaderProbeFunc(func() bool {
		for _, rt := range runtimes {
			if rt == nil || rt.engine == nil {
				continue
			}
			if rt.engine.Status().State != raftengine.StateLeader {
				continue
			}
			tCtx, cancel := context.WithTimeout(ctx, adminLeaderProbeTimeout)
			err := rt.engine.VerifyLeader(tCtx)
			cancel()
			if err == nil {
				return true
			}
		}
		return false
	})
}
References
  1. When designing interfaces, use context.Context for managing deadlines and cancellation to ensure proper resource management and graceful shutdown.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f50223275

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread main_admin.go Outdated
Comment on lines +172 to +176
for _, rt := range runtimes {
if rt == nil || rt.engine == nil {
continue
}
if rt.engine.Status().State != raftengine.StateLeader {

P1 Badge Base admin leader probe on default raft group

newAdminLeaderProbe returns healthy as soon as any local runtime verifies leadership, but admin write paths still key off the coordinator’s default group (kv/sharded_coordinator.go IsLeader/VerifyLeader) and SQS admin deletes have no follower-forwarding fallback (adapter/sqs_admin.go), so in multi-group deployments a node that leads a non-default group can return 200 here while still rejecting admin writes with 503 (or forcing extra forward hops). That breaks the endpoint’s stated load-balancer contract of selecting the current admin-write leader.


Address findings from PR #689 round-1 review:

- [Codex P1] Scope newAdminLeaderProbe to the **default Raft group**
  via coordinate.IsLeader / coordinate.VerifyLeader instead of the
  prior "any local runtime" loop. Admin write paths
  (kv/sharded_coordinator.go, adapter/sqs_admin.go) all key off the
  default group, so a node leading only a non-default group could
  return 200 here while still rejecting admin writes -- breaking the
  endpoint's stated load-balancer contract. Routing through the
  coordinator keeps the healthz answer aligned with what the
  AdminForward proxy and SQS admin write path treat as authoritative.

  newAdminLeaderProbe(runtimes []*raftGroupRuntime) becomes
  newAdminLeaderProbe(coordinate kv.Coordinator). The change drops
  the per-runtime engine.VerifyLeader fan-out and the
  adminLeaderProbeTimeout constant -- coordinate.VerifyLeader is the
  same ReadIndex round-trip lease reads use, with the engine's own
  bounded deadline, so an outer 2 s timeout is no longer needed.

- [Claude / Gemini] Add the RFC 7231 §6.5.5 Allow: GET, HEAD header
  to every router-served 405 response (healthz, healthz/leader,
  asset, SPA). Pulled the body shape into a writeMethodNotAllowed
  helper so the four call sites stay uniform; load balancers and
  synthetic-monitor tools now see the same Allow header the S3 and
  DynamoDB /healthz/leader handlers already advertise. New
  TestRouter_405_AllowHeader sweeps every 405-emitting path so a
  future handler that bypasses the helper would regress this test.

- [Claude / Gemini] Drop the redundant rt.leader == nil check inside
  serveLeaderHealth. dispatch() already short-circuits the nil case
  to rt.notFound before this handler is ever invoked; the in-body
  check was dead code that misled readers about the precondition.
  Documented the precondition explicitly on the function.

- [Gemini] Rewrite the misleading "ordering before /admin/healthz"
  comment in classify(). Both healthz paths are exact-match equality
  checks; their relative order does not matter for correctness. The
  load-bearing rule is that BOTH must run before the SPA prefix
  branch -- now spelled out that way.

Self-review (CLAUDE.md 5 lenses):
1. Data loss -- None. Read-only handlers; no Raft proposals touched.
2. Concurrency -- The probe now serialises through the coordinator's
   default-group view, which is the same path lease reads use.
   Behaviour during leadership change is unchanged: VerifyLeader's
   ReadIndex returns an error, the probe returns false, the LB
   takes the node out of rotation.
3. Performance -- Cheaper than the prior per-runtime loop in the
   common single-group case (one IsLeader / one VerifyLeader vs N
   per-runtime Status checks plus M VerifyLeader calls).
4. Data consistency -- coordinate.VerifyLeader is the same ReadIndex
   round-trip the lease-read path uses; the load balancer can now
   key off the same liveness signal admin writes do.
5. Test coverage -- new TestRouter_405_AllowHeader. Existing six
   /admin/healthz/leader tests still cover happy / 503 / HEAD /
   POST / nil-probe / SPA-no-swallow. The redundant-nil-guard
   removal is covered by TestRouter_HealthzLeader_NilProbeReturns404
   (still passes via dispatch()'s nil check).

Body-text parity nit: the PR description's "identical responses to
S3 and DynamoDB" is technically true vs DynamoDB but not S3 (S3
omits the trailing newline; this admin path matches DynamoDB). Not
worth a behaviour change -- LBs key off status codes, not body
strings -- but worth noting in case a future PR aligns the three.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 3m 21s.


Code Review: feat(admin): /admin/healthz/leader (round 2)

  • Read changed files and git log
  • Verify round-1 findings were addressed
  • Trace coordinate.VerifyLeader() call chain for shutdown behavior
  • Post findings

The fix commit 689f2e4 addressed all round-1 findings. Here's the status and one new observation surfaced while tracing the implementation.


Round-1 findings: all addressed ✅

| Finding | Status |
|---------|--------|
| RFC 7231 §6.5.5 Allow header missing on 405 | ✅ Fixed — writeMethodNotAllowed helper + TestRouter_405_AllowHeader sweep |
| Redundant rt.leader == nil guard in serveLeaderHealth | ✅ Removed — precondition now documented on the function |
| Codex P1 — "any group" leader could return 200 while admin writes still 503 | ✅ Fixed — newAdminLeaderProbe now takes kv.Coordinator (default group) |
| Misleading "ordering before /admin/healthz" comment in classify() | ✅ Rewritten accurately |

New finding — coordinate.VerifyLeader() uses context.Background() internally (informational)

File: kv/raft_engine.go:43-45, main_admin.go:177

The fix commit's commit message claims "coordinate.VerifyLeader is the same ReadIndex round-trip lease reads use, with the engine's own bounded deadline, so a stalled cluster surfaces 503 here on its own without the probe needing an outer timeout." This is worth verifying, because it's the stated justification for not passing the server lifecycle context (Gemini's suggestion).

Tracing the call chain:

main_admin.go:177  coordinate.VerifyLeader()
kv/coordinator.go:616  → verifyLeaderEngine(c.engine)
kv/raft_engine.go:43   → verifyLeaderEngineCtx(context.Background(), engine)
raft/engine.go:742     → engine.VerifyLeader(context.Background())
raft/engine.go:1021    → submitRead(context.Background(), false)

submitRead has two selects: the first (<-e.doneCh) returns early if the engine is already stopped before the read can be enqueued — good. But the second select (waiting for <-req.done) has no <-e.doneCh arm:

// kv/raft_engine.go — submitRead (second select)
select {
case <-ctx.Done():       // context.Background() — never fires
case res := <-req.done:  // waits indefinitely; no e.doneCh escape
}

So if the engine stops after the read request is enqueued (req written to readCh), the probe goroutine can block until the raft engine's own shutdown logic delivers a result to req.done. Whether it does so promptly depends on the engine's teardown path — and that's not a new risk introduced here, since verifyLeaderEngine is called from other places too.

Practical impact: During graceful shutdown, a health probe that is mid-VerifyLeader when the server context is cancelled won't be interrupted. In the worst case it blocks until the raft engine stops naturally, with adminShutdownTimeout = 5 s as the outer backstop. Not a correctness issue and not introduced by this PR — it's pre-existing behavior in verifyLeaderEngine. The fix commit's reasoning is directionally correct (the engine stops → the read unblocks); it's just not instant.

Gemini's suggestion (pass the server lifecycle context to the probe) would actually eliminate this — a context.WithTimeout(ctx, 2s) wrapper, or exposing a VerifyLeaderCtx(ctx) variant on Coordinator, would let the probe respect the shutdown signal. But that's a broader change to the Coordinator interface and not required for this PR to be correct.

Verdict: Informational only. No action needed here unless you want to address shutdown latency precisely — and that belongs in a follow-up that also fixes the other verifyLeaderEngine callers consistently.


Nit — TestRouter_HealthzRejectsPost does not assert the Allow header

File: internal/admin/router_test.go:59–66

The existing TestRouter_HealthzRejectsPost test was not updated to assert Allow: GET, HEAD — it only checks the status code. The new TestRouter_405_AllowHeader sweep test covers this path, so there is no regression risk. But TestRouter_HealthzLeader_RejectsPost (lines 125–132) does assert the header inline, creating a small inconsistency in test style. Minor, and arguably the sweep test is the right place to assert this invariant. Not a blocker.


Positive observations

  • writeMethodNotAllowed helper is a clean DRY abstraction. The TestRouter_405_AllowHeader table sweep is the right way to lock this down — a future handler that bypasses the helper will regress the test rather than silently shipping a non-compliant 405.
  • Default-group scoping (coordinate.IsLeader / coordinate.VerifyLeader) is the correct fix for the Codex P1 finding. The comment at main_admin.go:155–160 clearly explains the "why" so a future reader understands the tradeoff.
  • verifyLeaderEngine short-circuit: followers still exit without a ReadIndex round-trip (IsLeader() returns false → return early). Only nodes claiming leadership pay the VerifyLeader cost.
  • TestRouter_405_AllowHeader is well structured as a parallel table test; the failure message "path %s missing RFC 7231 Allow header on 405" gives immediate signal on regressions.
  • All six behavior matrix cases are covered and the SPA regression guard is still in place.

Summary: Two commits, clean implementation. All round-1 findings resolved; one informational observation about shutdown latency in the existing verifyLeaderEngine path (pre-existing, not introduced here); one test nit. Ready to merge.

@bootjp bootjp enabled auto-merge April 27, 2026 12:53
@bootjp bootjp merged commit 67cf028 into main Apr 27, 2026
9 checks passed
@bootjp bootjp deleted the feat/admin-leader-healthz branch April 27, 2026 12:53