fix(admin): forward session cookie on KeyViz fan-out so peers do not 401#692
fix(admin): forward session cookie on KeyViz fan-out so peers do not 401#692
Conversation
PR #686 shipped fan-out with the design assumption that peers would accept anonymous calls on a private network (#685 3). But the receiving side runs the same SessionAuth middleware as the browser-facing path, so anonymous fan-out calls are rejected with 401 missing-session-cookie -- the cluster heatmap collapsed to "1 of N nodes responded" in any production-like deploy. Fix: forward the inbound user's cookies on every peer call so the peer's SessionAuth middleware sees a valid principal. Cluster nodes already share --adminSessionSigningKey for HA, so a cookie minted on node A is verifiable on node B without any new infrastructure. Behaviour delta: - KeyVizFanout.Run gains a cookies []*http.Cookie parameter; nil preserves the legacy unauthenticated behaviour for tests / future non-browser callers. - The handler passes r.Cookies() so a SPA poll forwards admin_session and admin_csrf transparently. - New TestKeyVizFanoutRunForwardsCookies pins that both cookies reach the peer verbatim. Design doc 3 rewritten: the "anonymous on a private network" claim was wrong (the earlier draft assumed the peer side would also be anonymous, which is not how the admin path is built). Phase 2-C MVP now documents cookie-forwarding plus the operator-trust assumption that --keyvizFanoutNodes points at trusted hosts only. Phase 2-C+ will replace cookie forwarding with a dedicated inter-node token so browser-session expiry and inter-node call validity are decoupled. Tested against a 6-node cluster: with this fix the heatmap shows all 6 nodes (was 1 of 6 with anonymous calls).
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review |
|
Claude finished @bootjp's task in 1m 49s —— View job Code Review
The fix is correct and the root-cause analysis is accurate. P1 — Recursive fan-out storm (new behaviour enabled by this fix)Before this PR, every fan-out call to a peer would 401, so peers' own fan-out logic was never reachable in practice. Now that cookies are forwarded and calls succeed, a symmetric cluster (all nodes have Timeouts (2 s default) prevent infinite hangs but will cause significant amplification and error noise in any real multi-node deployment. The fix is simple: add a lightweight header on fan-out requests and skip the fan-out step when the handler sees it: // in fetchPeer — mark it as a peer call so the receiver skips its own fan-out
req.Header.Set("X-Fanout-Peer", "1")// in ServeHTTP — skip fan-out when called as a peer
if h.fanout != nil && r.Header.Get("X-Fanout-Peer") == "" {
matrix = h.fanout.Run(r.Context(), params, matrix, r.Cookies())
}P2 — Missing assertion guard in new test (
|
There was a problem hiding this comment.
Code Review
This pull request updates the KeyViz cluster fan-out mechanism to forward user session cookies to peer nodes, ensuring that requests are properly authenticated. The changes include updates to the design documentation, the core fan-out logic, and the handler to pass cookies from the inbound request. Review feedback highlights a security concern regarding the forwarding of all cookies and recommends whitelisting only necessary ones, while also requesting more detail in the design document on operational risks and deployment strategies.
| user's `admin_session` cookie (and any other cookies the request | ||
| carries) on every peer call. The peer's `SessionAuth` middleware |
There was a problem hiding this comment.
To minimize the risk of leaking unrelated sensitive data, the implementation should whitelist only the necessary cookies (admin_session and admin_csrf) rather than forwarding all cookies. Additionally, as this design proposes a cluster fanout mechanism, the document must detail mitigation strategies for operational risks (such as rolling upgrades) and strategies for zero-downtime cutovers (e.g., blue/green or dual-write proxy) to align with repository standards for production-grade system changes.
References
- Design documents identifying significant operational risks, such as rolling upgrade limitations, must detail mitigation strategies like bridge or proxy modes.
- Production-grade system migrations or architectural changes must detail strategies for zero-downtime cutovers to avoid service interruption.
| for _, c := range cookies { | ||
| req.AddCookie(c) | ||
| } |
There was a problem hiding this comment.
Forwarding all inbound cookies to peer nodes can expose unrelated sensitive data (e.g., session cookies from other applications on the same domain) to the internal network. It is safer to whitelist only the cookies required for authentication and validation by the peer's middleware (admin_session and admin_csrf).
for _, c := range cookies {
if c.Name == sessionCookieName || c.Name == csrfCookieName {
req.AddCookie(c)
}
}|
@claude review |
|
Claude finished @bootjp's task in 2m 46s —— View job Code Review
The root cause fix is correct — Three issues worth resolving before merge: P1 — Recursive fan-out storm (unresolved; critical for symmetric deployments)This was flagged in the previous review and is still not addressed. Now that cookies are forwarded and peers respond 200, any symmetric cluster (where every node lists all others in The The fix is a single header and one extra condition:
req.Header.Set("Accept", "application/json")
req.Header.Set("X-Admin-Fanout-Peer", "1") // ← add this
if h.fanout != nil && r.Header.Get("X-Admin-Fanout-Peer") == "" {
matrix = h.fanout.Run(r.Context(), params, matrix, r.Cookies())
}P2 — All cookies forwarded instead of just the auth pair (security hygiene)
for _, c := range cookies {
if c.Name == sessionCookieName || c.Name == csrfCookieName {
req.AddCookie(c)
}
}P3 — Missing assert guards in new test (
|
…r1) Three items from round-1 review on PR #692: - Claude bot P1 / recursion storm: with cookie forwarding now letting peer calls succeed, a symmetric --keyvizFanoutNodes configuration (every node lists every other) would generate O(N^2) HTTP calls per browser poll as each peer recursively fanned out. Add a marker header X-Admin-Fanout-Peer that the fetcher sets on every peer request; KeyVizHandler.ServeHTTP short-circuits its own fan-out when the header is present. TestKeyVizHandlerSkipsFanoutForPeerCall pins this with a peer stub that fails the test if dialled. - Gemini security-medium / cookie whitelist: forward only admin_session and admin_csrf to peers, not every inbound cookie. Operators may have unrelated cookies on the same domain (analytics, feature flags, other-app sessions); blasting those across the internal network needlessly widens the data exposure. attachAdminCookies helper centralises the whitelist. TestKeyVizFanoutRunForwardsCookies extended to assert unrelated_app_session is dropped while admin_session and admin_csrf pass through verbatim. - Claude bot P2 / nil-guard the test: TestKeyVizFanoutRunForwardsCookies now require.NotNil(merged.Fanout) and require.Len(Nodes, 2) before indexing Nodes[1], so a regression that produced a nil Fanout block surfaces with a precise assertion failure rather than a generic nil-pointer panic. Design doc 3 updated with the whitelist + recursion-guard behavior; the Phase 2-C MVP now matches what the code actually does.
|
@claude review Round-1 follow-ups (3c42378) addressing all 3 items:
Also extended the test to assert Design doc §3 updated to document both the whitelist and the recursion-guard behavior so the Phase 2-C MVP description matches the code.
|
|
Claude finished @bootjp's task in 2m 12s —— View job Code Review — Round 2
All three prior findings are correctly resolved. The implementation is clean. P1 recursion guard — ✓ Correct
One minor observation: an authenticated browser client can also set Cookie whitelist (
|
Address claude bot review on 6c7ae79. Findings 1 and 2 already landed in 94ade0d (gemini's nameref-quote + locals-snapshot suggestion accepted by the author). This commit picks up the two remaining items: - Stale comment at scripts/rolling-update.sh:1233 still said "two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*)" -- the validation loop above also covers KEYVIZ_ENABLED now. Rewrite to "boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*, KEYVIZ_ENABLED)". - env.example fan-out note was both ambiguously phrased and outdated. PR #692 ("forward session cookie on KeyViz fan-out so peers do not 401") shipped on main while this PR was in review, so the prior "Phase 2-C does not yet ship inter-node auth" note is no longer accurate. Rewrite to describe the as-shipped behaviour: the aggregator forwards the operator's session cookie, so peers with --adminEnabled accept the call as long as the same signing key and role allow-lists are configured cluster-wide; peers without --adminEnabled expose an unauthenticated endpoint and respond unconditionally. bash -n scripts/rolling-update.sh passes.
Summary
Hotfix for the screenshot a user shared after enabling fan-out — the heatmap collapsed to "1 of 6 nodes responded" with every peer returning
401 missing session cookie.Root cause: PR #686 shipped fan-out with the design assumption (#685 §3) that peers would accept anonymous calls on a private network. But the receiving side runs the same
SessionAuthmiddleware as the browser path, so anonymous fan-out calls are rejected with 401. The "anonymous on a private network" framing was wrong: the implementation didn't put a bypass on the peer side, so anonymous calls never went through end-to-end.Fix: forward the inbound user's cookies on every peer call. Cluster nodes already share
--adminSessionSigningKeyfor HA, so a cookie minted on node A is verifiable on node B — no new infrastructure needed.What changed
KeyVizFanout.Rungains acookies []*http.Cookieparameter; nil preserves the legacy behaviour for tests / future non-browser callers.KeyVizHandlerpassesr.Cookies()so a SPA poll forwardsadmin_session(andadmin_csrf, harmless on a GET) transparently.TestKeyVizFanoutRunForwardsCookiespins that both cookies reach the peer verbatim.--keyvizFanoutNodespoints at trusted hosts only. Phase 2-C+ will replace cookie forwarding with a dedicated inter-node token to decouple browser-session expiry from inter-node call validity.Five-lens self-review
fetchPeer(each goroutinereq.AddCookies independently). No shared mutation.req.AddCookiecalls per fan-out request. Trivial.TestKeyVizFanoutRunForwardsCookiesasserts both cookies reach the peer with the original values. Existing fan-out tests passniland continue to exercise the unauthenticated path the existing code handled (peer 401s, request reportsok=false).Test plan
go test -race -count=1 ./internal/admin/...— cleangolangci-lint run ./internal/admin/...— cleanELASTICKV_ADMIN_SESSION_SIGNING_KEY,--keyvizFanoutNodesset on every node. Heatmap should now show "6 of 6 nodes" instead of "1 of 6".Trust / threat model note
Cookie forwarding is safe when (a) every peer is operator-configured and trusted, and (b) the network is private. Do NOT point
--keyvizFanoutNodesat an untrusted host — the user's admin session would be replayed there. The design doc §3 update is explicit about this. Phase 2-C+ removes this footgun by switching to a dedicated inter-node token.Closes the production-impacting symptom from screenshot review.