Skip to content

fix(admin): forward session cookie on KeyViz fan-out so peers do not 401#692

Merged
bootjp merged 2 commits intomainfrom
fix/keyviz-fanout-forward-session-cookie
Apr 27, 2026
Merged

fix(admin): forward session cookie on KeyViz fan-out so peers do not 401#692
bootjp merged 2 commits intomainfrom
fix/keyviz-fanout-forward-session-cookie

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 27, 2026

Summary

Hotfix for the screenshot a user shared after enabling fan-out — the heatmap collapsed to "1 of 6 nodes responded" with every peer returning 401 missing session cookie.

Root cause: PR #686 shipped fan-out with the design assumption (#685 §3) that peers would accept anonymous calls on a private network. But the receiving side runs the same SessionAuth middleware as the browser path, so anonymous fan-out calls are rejected with 401. The "anonymous on a private network" framing was wrong: the implementation didn't put a bypass on the peer side, so anonymous calls never went through end-to-end.

Fix: forward the inbound user's cookies on every peer call. Cluster nodes already share --adminSessionSigningKey for HA, so a cookie minted on node A is verifiable on node B — no new infrastructure needed.

What changed

  • KeyVizFanout.Run gains a cookies []*http.Cookie parameter; nil preserves the legacy behaviour for tests / future non-browser callers.
  • KeyVizHandler passes r.Cookies() so a SPA poll forwards admin_session (and admin_csrf, harmless on a GET) transparently.
  • TestKeyVizFanoutRunForwardsCookies pins that both cookies reach the peer verbatim.
  • Design doc §3 rewritten: documents the cookie-forwarding scheme + the operator-trust assumption that --keyvizFanoutNodes points at trusted hosts only. Phase 2-C+ will replace cookie forwarding with a dedicated inter-node token to decouple browser-session expiry from inter-node call validity.

Five-lens self-review

  1. Data loss — n/a; auth path change.
  2. Concurrency / distributed — Cookies are passed by reference into per-peer goroutines but the slice is read-only inside fetchPeer (each goroutine req.AddCookies independently). No shared mutation.
  3. Performance — Adds N req.AddCookie calls per fan-out request. Trivial.
  4. Data consistency — Auth-only change; merge semantics unchanged.
  5. Test coverage — New TestKeyVizFanoutRunForwardsCookies asserts both cookies reach the peer with the original values. Existing fan-out tests pass nil and continue to exercise the unauthenticated path the existing code handled (peer 401s, request reports ok=false).

Test plan

  • go test -race -count=1 ./internal/admin/... — clean
  • golangci-lint run ./internal/admin/... — clean
  • Manual: 6-node cluster with shared ELASTICKV_ADMIN_SESSION_SIGNING_KEY, --keyvizFanoutNodes set on every node. Heatmap should now show "6 of 6 nodes" instead of "1 of 6".

Trust / threat model note

Cookie forwarding is safe when (a) every peer is operator-configured and trusted, and (b) the network is private. Do NOT point --keyvizFanoutNodes at an untrusted host — the user's admin session would be replayed there. The design doc §3 update is explicit about this. Phase 2-C+ removes this footgun by switching to a dedicated inter-node token.

Closes the production-impacting symptom from screenshot review.

PR #686 shipped fan-out with the design assumption that peers
would accept anonymous calls on a private network (#685 3). But
the receiving side runs the same SessionAuth middleware as the
browser-facing path, so anonymous fan-out calls are rejected with
401 missing-session-cookie -- the cluster heatmap collapsed to
"1 of N nodes responded" in any production-like deploy.

Fix: forward the inbound user's cookies on every peer call so the
peer's SessionAuth middleware sees a valid principal. Cluster nodes
already share --adminSessionSigningKey for HA, so a cookie minted on
node A is verifiable on node B without any new infrastructure.

Behaviour delta:

- KeyVizFanout.Run gains a cookies []*http.Cookie parameter; nil
  preserves the legacy unauthenticated behaviour for tests / future
  non-browser callers.
- The handler passes r.Cookies() so a SPA poll forwards admin_session
  and admin_csrf transparently.
- New TestKeyVizFanoutRunForwardsCookies pins that both cookies
  reach the peer verbatim.

Design doc 3 rewritten: the "anonymous on a private network" claim
was wrong (the earlier draft assumed the peer side would also be
anonymous, which is not how the admin path is built). Phase 2-C MVP
now documents cookie-forwarding plus the operator-trust assumption
that --keyvizFanoutNodes points at trusted hosts only. Phase 2-C+
will replace cookie forwarding with a dedicated inter-node token so
browser-session expiry and inter-node call validity are decoupled.

Tested against a 6-node cluster: with this fix the heatmap shows
all 6 nodes (was 1 of 6 with anonymous calls).
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 7 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51911d4c-fae1-43e9-8df4-c6e3a18ee62e

📥 Commits

Reviewing files that changed from the base of the PR and between edaaa64 and 3c42378.

📒 Files selected for processing (5)
  • docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md
  • internal/admin/keyviz_fanout.go
  • internal/admin/keyviz_fanout_test.go
  • internal/admin/keyviz_handler.go
  • internal/admin/keyviz_handler_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/keyviz-fanout-forward-session-cookie

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 49s —— View job


Code Review

  • Read changed files
  • Review internal/admin/keyviz_fanout.go
  • Review internal/admin/keyviz_fanout_test.go
  • Review internal/admin/keyviz_handler.go
  • Review docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md
  • Post review findings

The fix is correct and the root-cause analysis is accurate. protect() in server.go:286 wraps every route — including /admin/api/v1/keyviz/matrix — in SessionAuth → Audit → CSRFDoubleSubmit, so the "anonymous on private network" design assumption was dead on arrival. Cookie forwarding against a shared --adminSessionSigningKey is the right minimal fix for the MVP.


P1 — Recursive fan-out storm (new behaviour enabled by this fix)

Before this PR, every fan-out call to a peer would 401, so peers' own fan-out logic was never reachable in practice. Now that cookies are forwarded and calls succeed, a symmetric cluster (all nodes have --keyvizFanoutNodes pointing at each other) will fan-out recursively:

Browser → A.fanout({B,C})
          B receives call, has fanout → {A,C} → fans out to A and C
          A receives B's call, fans out to B and C again…

Timeouts (2 s default) prevent infinite hangs but will cause significant amplification and error noise in any real multi-node deployment. The fix is simple: add a lightweight header on fan-out requests and skip the fan-out step when the handler sees it:

// in fetchPeer — mark it as a peer call so the receiver skips its own fan-out
req.Header.Set("X-Fanout-Peer", "1")
// in ServeHTTP — skip fan-out when called as a peer
if h.fanout != nil && r.Header.Get("X-Fanout-Peer") == "" {
    matrix = h.fanout.Run(r.Context(), params, matrix, r.Cookies())
}

Fix this →


P2 — Missing assertion guard in new test (keyviz_fanout_test.go:187)

merged.Fanout.Nodes[1].OK will panic if Fanout is nil or Nodes has fewer than 2 entries, obscuring the real failure. The other tests in this file all check require.NotNil(t, merged.Fanout) and require.Len(t, merged.Fanout.Nodes, 2) first:

// add these two lines before line 187
require.NotNil(t, merged.Fanout)
require.Len(t, merged.Fanout.Nodes, 2)
require.True(t, merged.Fanout.Nodes[1].OK)

Fix this →


Minor observations (no change required)

  • CSRF forwarding is harmless: CSRFDoubleSubmit only enforces on state-changing methods (POST/PUT/DELETE/PATCH); GET fan-out calls skip that check. The test comment acknowledges this correctly.
  • Nil-slice safety: for _, c := range cookies with a nil slice is a no-op — the nil-cookies path works correctly.
  • Thread safety: The cookies slice is read-only inside each goroutine (req.AddCookie copies to the request, never mutates the input slice). No races.
  • Design doc is accurate and the correction of the "anonymous on private network" assumption is clearly written.

Summary

The 401 fix is correct. The one actionable issue before merge is the recursive fan-out risk (P1) — without a peer-call marker, a symmetric cluster will fan-out exponentially until per-call timeouts kick in. The test guard (P2) is a low-priority quality improvement.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the KeyViz cluster fan-out mechanism to forward user session cookies to peer nodes, ensuring that requests are properly authenticated. The changes include updates to the design documentation, the core fan-out logic, and the handler to pass cookies from the inbound request. Review feedback highlights a security concern regarding the forwarding of all cookies and recommends whitelisting only necessary ones, while also requesting more detail in the design document on operational risks and deployment strategies.

Comment on lines +136 to +137
user's `admin_session` cookie (and any other cookies the request
carries) on every peer call. The peer's `SessionAuth` middleware
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

To minimize the risk of leaking unrelated sensitive data, the implementation should whitelist only the necessary cookies (admin_session and admin_csrf) rather than forwarding all cookies. Additionally, as this design proposes a cluster fanout mechanism, the document must detail mitigation strategies for operational risks (such as rolling upgrades) and strategies for zero-downtime cutovers (e.g., blue/green or dual-write proxy) to align with repository standards for production-grade system changes.

References
  1. Design documents identifying significant operational risks, such as rolling upgrade limitations, must detail mitigation strategies like bridge or proxy modes.
  2. Production-grade system migrations or architectural changes must detail strategies for zero-downtime cutovers to avoid service interruption.

Comment thread internal/admin/keyviz_fanout.go Outdated
Comment on lines +312 to +314
for _, c := range cookies {
req.AddCookie(c)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Forwarding all inbound cookies to peer nodes can expose unrelated sensitive data (e.g., session cookies from other applications on the same domain) to the internal network. It is safer to whitelist only the cookies required for authentication and validation by the peer's middleware (admin_session and admin_csrf).

	for _, c := range cookies {
		if c.Name == sessionCookieName || c.Name == csrfCookieName {
			req.AddCookie(c)
		}
	}

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 2m 46s —— View job


Code Review

  • Read changed files
  • Review internal/admin/keyviz_fanout.go
  • Review internal/admin/keyviz_fanout_test.go
  • Review internal/admin/keyviz_handler.go
  • Review docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md
  • Post review findings

The root cause fix is correct — protect() in server.go:325 wraps keyvizChain in SessionAuth → Audit → CSRFDoubleSubmit, so anonymous fan-out was always going to 401. Cookie forwarding against a shared --adminSessionSigningKey is the right minimal fix, and the design doc correction is clear. The code is clean and the nil-slice range in fetchPeer is safe.

Three issues worth resolving before merge:


P1 — Recursive fan-out storm (unresolved; critical for symmetric deployments)

This was flagged in the previous review and is still not addressed. Now that cookies are forwarded and peers respond 200, any symmetric cluster (where every node lists all others in --keyvizFanoutNodes) will fan out recursively:

Browser → A.fanout({B,C})
  B receives A's call → B.fanout({A,C})   ← B fans out back to A
    A receives B's call → A.fanout({B,C}) ← infinite, bounded only by 2 s timeout

The buildKeyVizFanout in main_admin.go:905 only filters self from the peer list; it does not prevent a peer that receives a fan-out request from re-fanning-out to everyone. With the previous 401 regime this never surfaced — every peer immediately failed. Now they succeed and recurse.

The fix is a single header and one extra condition:

internal/admin/keyviz_fanout.go:302 — mark requests as peer-originated:

req.Header.Set("Accept", "application/json")
req.Header.Set("X-Admin-Fanout-Peer", "1")  // ← add this

internal/admin/keyviz_handler.go:180 — skip fan-out when called as a peer:

if h.fanout != nil && r.Header.Get("X-Admin-Fanout-Peer") == "" {
    matrix = h.fanout.Run(r.Context(), params, matrix, r.Cookies())
}

Fix this →


P2 — All cookies forwarded instead of just the auth pair (security hygiene)

fetchPeer forwards every cookie on the request. In practice a browser session on the same domain could carry more than just admin_session / admin_csrf. Whitelisting by name using the already-defined constants is strictly safer and is a trivial change:

internal/admin/keyviz_fanout.go:312:

for _, c := range cookies {
    if c.Name == sessionCookieName || c.Name == csrfCookieName {
        req.AddCookie(c)
    }
}

Fix this →


P3 — Missing assert guards in new test (keyviz_fanout_test.go:187)

merged.Fanout.Nodes[1].OK at line 187 will panic before require can format the failure if merged.Fanout is nil or Nodes has fewer than 2 entries (e.g., peer timeout before the assertion runs). Every other test in this file guards first. Add two guards before that line:

require.NotNil(t, merged.Fanout)
require.Len(t, merged.Fanout.Nodes, 2)
require.True(t, merged.Fanout.Nodes[1].OK)

Fix this →


Minor observations (no changes required)

  • CSRF on GET is harmless: CSRFDoubleSubmit in middleware.go:131 only enforces on state-changing methods. Forwarding admin_csrf on a GET fan-out call is a no-op at the receiving end. The test comment is accurate.
  • Nil-slice safety: for _, c := range cookies with a nil slice is a no-op in Go — existing tests passing nil are correct.
  • Thread safety: The cookies slice is read-only inside each goroutine; each goroutine calls req.AddCookie on its own private *http.Request. No shared mutation, no race.
  • Design doc §3 correction is clear and accurate; the "anonymous on private network" framing is properly retracted.

Summary

Cookie forwarding is the right fix and the implementation is correct. P1 (recursive fan-out) is the blocker — a symmetric 3-node cluster with this PR as-is will produce an amplified cascade of fan-out calls on every SPA poll once peers can 401 no longer. P2 and P3 are low-risk but worth a quick pass before merge.

…r1)

Three items from round-1 review on PR #692:

- Claude bot P1 / recursion storm: with cookie forwarding now
  letting peer calls succeed, a symmetric --keyvizFanoutNodes
  configuration (every node lists every other) would generate
  O(N^2) HTTP calls per browser poll as each peer recursively
  fanned out. Add a marker header X-Admin-Fanout-Peer that the
  fetcher sets on every peer request; KeyVizHandler.ServeHTTP
  short-circuits its own fan-out when the header is present.
  TestKeyVizHandlerSkipsFanoutForPeerCall pins this with a peer
  stub that fails the test if dialled.

- Gemini security-medium / cookie whitelist: forward only
  admin_session and admin_csrf to peers, not every inbound cookie.
  Operators may have unrelated cookies on the same domain
  (analytics, feature flags, other-app sessions); blasting those
  across the internal network needlessly widens the data exposure.
  attachAdminCookies helper centralises the whitelist.
  TestKeyVizFanoutRunForwardsCookies extended to assert
  unrelated_app_session is dropped while admin_session and
  admin_csrf pass through verbatim.

- Claude bot P2 / nil-guard the test: TestKeyVizFanoutRunForwardsCookies
  now require.NotNil(merged.Fanout) and require.Len(Nodes, 2)
  before indexing Nodes[1], so a regression that produced a nil
  Fanout block surfaces with a precise assertion failure rather
  than a generic nil-pointer panic.

Design doc 3 updated with the whitelist + recursion-guard
behavior; the Phase 2-C MVP now matches what the code actually
does.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-1 follow-ups (3c42378) addressing all 3 items:

  • Claude bot P1 / recursion storm — Added X-Admin-Fanout-Peer marker header set by fetchPeer; KeyVizHandler.ServeHTTP short-circuits its own fan-out when present. Without this, a symmetric --keyvizFanoutNodes configuration would have produced O(N²) peer HTTP calls per browser poll once cookies started actually reaching peers. New TestKeyVizHandlerSkipsFanoutForPeerCall pins the guard with a peer stub that fails the test if dialled.
  • Gemini security-medium / cookie whitelistattachAdminCookies helper centralises the whitelist: only admin_session and admin_csrf are forwarded; unrelated cookies (analytics, feature flags, other-app sessions on the same domain) are dropped at the fan-out boundary. TestKeyVizFanoutRunForwardsCookies now also stages an unrelated_app_session cookie and asserts it does not reach the peer.
  • Claude bot P2 / nil-guardTestKeyVizFanoutRunForwardsCookies now require.NotNil(merged.Fanout) and require.Len(Nodes, 2) before indexing, so a regression surfaces with a precise message rather than a panic.

Also extended the test to assert keyVizFanoutPeerHeader == "1" on the peer-side, locking the recursion-guard contract end-to-end.

Design doc §3 updated to document both the whitelist and the recursion-guard behavior so the Phase 2-C MVP description matches the code.

go test -race -count=1 ./internal/admin/... clean. golangci-lint run ./internal/admin/... clean.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 2m 12s —— View job


Code Review — Round 2

  • Read changed files
  • Review recursion guard (P1 follow-up)
  • Review cookie whitelist (attachAdminCookies)
  • Review test coverage
  • Review design doc updates
  • Post findings

All three prior findings are correctly resolved. The implementation is clean.

P1 recursion guard — ✓ Correct

keyVizFanoutPeerHeader is defined once as a package-level constant (keyviz_fanout.go:43), set in fetchPeer (keyviz_fanout.go:336), and checked in ServeHTTP (keyviz_handler.go:180). The guard r.Header.Get(keyVizFanoutPeerHeader) == "" is the right shape — an absent or empty header means "browser call, fan out normally"; a present header means "peer call, serve local only."

One minor observation: an authenticated browser client can also set X-Admin-Fanout-Peer: 1 on their own request to bypass fan-out (getting only the local node's heatmap). This is benign — they're already authenticated and they just get fewer data — but it's an undocumented shortcut that could confuse someone debugging. Not a bug, just noting it for completeness.

Cookie whitelist (attachAdminCookies) — ✓ Correct

attachAdminCookies (keyviz_fanout.go:212) uses the same sessionCookieName / csrfCookieName constants defined in middleware.go:19-20 that SessionAuth and CSRFDoubleSubmit already depend on. This is the right level of sharing — no string literals, no drift risk.

Test coverage — ✓ Thorough

TestKeyVizFanoutRunForwardsCookies now asserts three things end-to-end:

  1. admin_session and admin_csrf reach the peer verbatim.
  2. unrelated_app_session does not reach the peer.
  3. X-Admin-Fanout-Peer: 1 is set on the outbound peer request.

TestKeyVizHandlerSkipsFanoutForPeerCall uses an httptest peer stub that records whether it was ever dialed, then asserts peerHits == 0. The require.Nil(t, matrix.Fanout) assertion is the right contract — a peer-call response is a raw local matrix with no Fanout block, since there's no aggregation to report. ✓

One nit in TestKeyVizHandlerSkipsFanoutForPeerCall: peerHits is a bare int accessed across goroutine boundaries. In the happy path (guard works, peer never dialed) this is fine because there's no concurrent write. In the failure path (guard regresses, peer is dialed), there's a write in the handler goroutine and a read in the test goroutine; the happens-before chain through net/http's internals makes this safe in practice, but it's inconsistent with seenMu pattern in the sibling test. Using var peerHits atomic.Int32 would make this formally race-safe in all paths and match the pattern. Very low priority — -race won't flag it in the success case.

Design doc — ✓ Accurate

The doc now correctly:

  • Documents cookie forwarding + the whitelist rationale (Gemini's point).
  • Documents the recursion guard with the O(N²) framing.
  • Retracts the "anonymous on private network" claim explicitly.
  • Calls out the trust model risk (--keyvizFanoutNodes must point at trusted hosts).
  • Previews Phase 2-C+ inter-node token decoupling.

The nesting of the recursion guard content under the trust-model bullet is a little hard to parse — the recursion guard and the auth story are sibling concerns logically — but it reads clearly enough and isn't worth a round-trip.


Summary

The three outstanding items (P1 recursion guard, cookie whitelist, nil-guard) are all correctly addressed. The only remaining observation is the non-atomic peerHits in one test — no functional risk, and make test -race won't flag it in the success case. LGTM with the nit noted; no blocking issues.

@bootjp bootjp merged commit 08b2fb4 into main Apr 27, 2026
9 checks passed
@bootjp bootjp deleted the fix/keyviz-fanout-forward-session-cookie branch April 27, 2026 16:06
bootjp added a commit that referenced this pull request Apr 27, 2026
Address claude bot review on 6c7ae79. Findings 1 and 2 already
landed in 94ade0d (gemini's nameref-quote + locals-snapshot
suggestion accepted by the author). This commit picks up the two
remaining items:

- Stale comment at scripts/rolling-update.sh:1233 still said
  "two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*)" -- the
  validation loop above also covers KEYVIZ_ENABLED now. Rewrite
  to "boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*, KEYVIZ_ENABLED)".

- env.example fan-out note was both ambiguously phrased and
  outdated. PR #692 ("forward session cookie on KeyViz fan-out so
  peers do not 401") shipped on main while this PR was in review,
  so the prior "Phase 2-C does not yet ship inter-node auth" note
  is no longer accurate. Rewrite to describe the as-shipped
  behaviour: the aggregator forwards the operator's session
  cookie, so peers with --adminEnabled accept the call as long as
  the same signing key and role allow-lists are configured
  cluster-wide; peers without --adminEnabled expose an
  unauthenticated endpoint and respond unconditionally.

bash -n scripts/rolling-update.sh passes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant