docs: KeyViz cluster fan-out proposal (Phase 2-C) #685
Conversation
Documents the MVP path for cluster-wide heatmaps: a static node list via `--keyvizFanoutNodes`, per-node parallel GETs against the existing `/admin/api/v1/keyviz/matrix`, and a deliberately conservative merge rule (reads sum, writes max with a conflict flag) that ships the operator-visible value without yet requiring the proto extension parent design §9.1 needs for the canonical merge.

Scope:

- PR 1: server-side aggregator + flag plumbing (next).
- PR 2: SPA degraded banner + conflict hatching.
- PR 3 (Phase 2-C+): proto + JSON extension with `raftGroupID` + `leaderTerm`; replace max-merge with the canonical `(bucketID, raftGroupID, leaderTerm, windowStart)` rule.

Explicitly defers `GetClusterOverview` membership discovery, the `cmd/elastickv-admin` standalone binary, and TLS for fan-out — each gets its own follow-up if and when needed.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5064129ec4
> divergence the operator should investigate; we surface it as a
> warning in the per-node status payload but do not block the
Define a wire field for catalog-divergence warnings
Section 4.4 says Start/End mismatches should be surfaced as a warning in the per-node status payload, but the wire format in §5 only defines node, ok, and error for each fanout.nodes item. Without a dedicated warning field, implementations either cannot report this divergence at all or must overload error (which incorrectly implies node failure), so operators can miss a real consistency issue or see false degraded-state signals.
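For concreteness, a sketch of what a dedicated warning field could look like on the per-node status entry. The `node` / `ok` / `error` names come from the §5 wire format as described above; the `warnings` array and the `catalog_divergence` code are the suggested addition, not the current wire format.

```go
package admin

// FanoutNodeStatus sketches one fanout.nodes entry with the proposed
// warnings field. Only node/ok/error are in the current §5 wire format;
// Warnings and the "catalog_divergence" code are the suggested addition.
type FanoutNodeStatus struct {
	Node string `json:"node"`
	OK   bool   `json:"ok"`
	// Error stays reserved for genuine node failures (timeout, 5xx, decode error).
	Error string `json:"error,omitempty"`
	// Warnings carries non-fatal signals, e.g. "catalog_divergence" when a
	// peer's row catalog (Start/End bounds) disagrees with the aggregator's.
	Warnings []string `json:"warnings,omitempty"`
}
```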
Code Review
This pull request proposes a design for KeyViz Cluster Fan-out (Phase 2-C), enabling a cluster-wide heatmap view by aggregating data from multiple nodes. The design includes merge rules for metrics, a static node configuration, and a phased implementation plan. Feedback on the design suggests detailing rolling upgrade and zero-downtime migration strategies, resolving internal inconsistencies regarding the 'conflict' flag, and clarifying inter-node authentication. Further recommendations include rounding timestamps to handle clock skew, limiting response sizes to prevent OOM, and caching DNS lookups to improve performance.
> ---
> status: proposed
> phase: 2-C
> parent_design: docs/admin_ui_key_visualizer_design.md
> date: 2026-04-27
> ---
>
> # KeyViz Cluster Fan-out (Phase 2-C)
>
> ## 1. Background
The design for cluster fanout introduces significant operational changes to node communication. To comply with repository standards, the document must detail a strategy for rolling upgrades, including mitigation strategies like a temporary 'bridge' or 'proxy' mode to handle communication between upgraded and non-upgraded nodes.
References
- Design documents identifying significant operational risks, such as rolling upgrade issues, must detail mitigation strategies like bridge or proxy modes.
As this proposal outlines a migration to a new production-grade cluster aggregation model, the plan must include specific strategies for live or zero-downtime cutovers (e.g., blue/green deployment or dual-write proxy) to ensure no service interruption occurs during the transition.
References
- Production-grade system migrations must detail strategies for zero-downtime cutovers to avoid service interruption.
> - **Writes**: max across nodes, with a `conflict=true` flag on
>   cells where the per-node values disagree (best-effort dedup;
>   correct under stable leadership, conservative under leadership
>   flip).
The definition of conflict=true is inconsistent across the document. Section 2.1 describes it as being per-cell and triggered by any disagreement, while Section 5 and Section 6 specify it as per-row and intended for cases where multiple nodes report non-zero values (e.g., during a leadership flip). Clarifying this to be per-row and based on multiple non-zero reports will ensure the implementation matches the intended MVP behavior and remains usable.
> - The fan-out caller authenticates with the same admin session
>   cookie the upstream request carried. The `internal/admin`
>   middleware mints a short-lived inter-node token signed with the
>   shared admin secret (already used for browser sessions) so a
>   compromised browser cookie cannot be replayed beyond its TTL on
The authentication mechanism for fan-out requests is unclear. The text mentions both replaying the browser's session cookie and minting a short-lived inter-node token. Replaying user cookies between nodes is generally discouraged. It would be better to clarify that the aggregator uses a dedicated inter-node token for peer requests, as suggested by the mention of the shared admin secret.
> arrays line up modulo clock skew. The aggregator pivots on
> `column_unix_ms[i]` exactly: a column present in node A but
> absent in node B contributes only A's values for that timestamp,
> with B's missing column treated as zero. Operators with NTP
> drift > Step should fix NTP — the heatmap is not designed to
> hide clock skew.
Pivoting on exact column_unix_ms values is fragile in a distributed system. Even with NTP, slight offsets or different start times can cause nodes to produce slightly different timestamps for the same logical window, leading to a sparse or jagged heatmap. Consider rounding timestamps to the nearest keyvizStep boundary during aggregation to ensure columns are correctly merged.
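A minimal sketch of bucket-aligning column timestamps to the step boundary before merging, the approach the round-1 follow-up below adopts for §4.5; the helper name and signature are illustrative, not the project's actual code.

```go
package admin

// alignToStep floors a column timestamp to its keyvizStep boundary so that
// sub-step clock jitter between nodes lands on the same merged column.
// Illustrative helper; the shipped code may differ.
func alignToStep(columnUnixMs, stepMs int64) int64 {
	if stepMs <= 0 {
		return columnUnixMs
	}
	return columnUnixMs - (columnUnixMs % stepMs)
}
```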
> 3. **Performance** — Aggregator cost is O(N × M × C) where N is
>    nodes, M is rows per node, C is columns. At the 1024-row
>    budget × 3 nodes × 60 columns this is ~180k cells per
>    request — well below the SPA's existing render budget. No
>    coordinated round trip per `Observe`; the existing hot path is
>    not touched.
While the aggregator cost is estimated to be low, the design should explicitly require limiting the size of the response bodies accepted from peer nodes (e.g., using http.MaxBytesReader). This prevents a single misconfigured or compromised node from sending an oversized payload that could cause an OOM on the aggregator node.
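A minimal sketch of capping the peer response body before decoding, assuming the `io.LimitReader` approach the later follow-up commits describe; the function shape is illustrative, and the error wrapping mirrors the `pkgerrors` alias visible in the quoted hunks.

```go
package admin

import (
	"encoding/json"
	"io"
	"net/http"

	pkgerrors "github.com/pkg/errors"
)

// decodePeerMatrix refuses to read more than maxBody bytes of a peer
// response before handing it to the JSON decoder, so a misbehaving peer
// cannot balloon the aggregator's memory. Illustrative sketch; the shipped
// code adds overshoot detection on top of the plain limit.
func decodePeerMatrix(resp *http.Response, maxBody int64, out any) error {
	limited := io.LimitReader(resp.Body, maxBody)
	if err := json.NewDecoder(limited).Decode(out); err != nil {
		return pkgerrors.Wrap(err, "decode peer matrix")
	}
	return nil
}
```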
> addresses? Proposing **DNS lookups happen on every fan-out
> request** (no caching) — small clusters that this MVP targets
> do not need a resolver cache.
Performing a DNS lookup for every node on every fan-out request can introduce significant latency and load on the DNS infrastructure. Even for small clusters, transient DNS issues would directly impact the responsiveness of the Admin UI. It is recommended to resolve hostnames once at configuration load or use a short-lived cache for the resolutions.
Seven items from Codex P2 + Gemini medium / security-medium / high:

- §2.1 conflict flag definition: clarify per-row scope and signal that 2-C+ moves it to per-cell, removing the inconsistency between §2.1 and §5/§6.
- §3 auth: replace the conflated cookie-replay-plus-mint-token paragraph with a clean two-tier story. Phase 2-C MVP is anonymous on a private network; Phase 2-C+ adds a short-lived signed inter-node token derived from the existing admin signing key. Adds an explicit do-not-enable-on-untrusted warning.
- §4.5 time alignment: bucket-align column timestamps to the keyvizStep boundary before merge so sub-step NTP jitter does not split each window into two adjacent merged columns.
- §5 wire format: add per-node warnings array with a catalog_divergence code so the wire form actually carries the catalog-mismatch signal §4.4 promises.
- §8 lens 3: document the 64 MiB io.LimitReader cap on peer response bodies (matches the cap implemented in PR #686).
- §9 open Q1: resolve the DNS question. Resolution rides the stdlib resolver / nscd cache; no aggregator-local cache. Operators on flaky DNS colocate a resolver.
- §1 background: add a Rollout notes paragraph addressing the Gemini high asks. The feature is off by default, mixed-version rollout is correct by construction (no schema migration, no state, forwards-compatible wire format), so a dual-write proxy / blue-green plan is not justified for this scope.
@claude review

Round-1 follow-ups (1cd6ed5):

- Codex P2 — added a per-node `warnings` array with a `catalog_divergence` code to the §5 wire format, so the catalog-mismatch signal §4.4 promises is actually carried.
- Gemini medium / consistency — clarified the §2.1 `conflict` flag as per-row for the MVP, with the per-cell move deferred to Phase 2-C+.
- Gemini security-medium / auth — rewrote §3 with a clean two-tier story: Phase 2-C MVP is anonymous on a private network; Phase 2-C+ adds a short-lived signed inter-node token derived from the existing admin signing key, plus an explicit do-not-enable-on-untrusted warning.
- Gemini medium / column drift — §4.5 now bucket-aligns column timestamps to the `keyvizStep` boundary before merge.
- Gemini security-medium / body cap — §8 lens 3 documents the 64 MiB `io.LimitReader` cap on peer response bodies.
- Gemini medium / DNS — §9 Q1 resolved: resolution rides the stdlib resolver and nscd cache; no aggregator-local cache. Per-call deadline already protects against transient DNS hangs. Operators on flaky DNS colocate a caching resolver — same answer for every short-lived HTTP client, not unique to this feature.
- Gemini high / rollout strategy — pushed back in §1 "Rollout notes". This feature is off by default, carries no schema migration or state, and ships a forwards-compatible wire format.

A dual-write proxy / blue-green isn't a fit for this scope; mixed-version rollout is correct by construction. The worst-case partial deploy surfaces as a per-node `ok=false` status in the fan-out block.

Open to pushback on any of these calls if the trade-offs aren't right.
Claude finished @bootjp's task in 3m 25s

Round-2 Review

Summary: All round-1 issues are resolved. There are a few targeted follow-up items, mostly around the body-cap math, a 5s/2s timeout discrepancy, and one behavioral precision issue with row-identity (`Start`/`End`) selection.

Round-1 follow-ups: addressed ✓

Round-2 findings

1. §8 lens 3 — body-cap arithmetic is internally inconsistent (medium)

The current prose sizes the worst case at roughly "1024 rows × 4096 columns × ~32 B per uint64 ≈ 128 MiB raw, ~32 MiB JSON-encoded". The math doesn't add up in two ways: a uint64 is 8 bytes, not 32, so the raw figure is ~32 MiB rather than 128 MiB; and JSON encoding does not shrink the payload, so "~32 MiB JSON-encoded" cannot follow from 128 MiB raw. The cap itself (64 MiB) is likely fine for typical deployments, but the prose math should be corrected so a future reader can trust it. Suggested rewording: 1024 × 4096 × 8 B per uint64 ≈ 32 MiB raw, JSON of similar size, so the 64 MiB cap leaves roughly 2x headroom.

2. §8 lens 3 — 4096 is not the column ceiling (minor): `MaxHistoryColumns` is 100,000, so the sizing prose should reference that constant; the cap only becomes a real bound at extreme column counts.
Seven follow-ups from Claude bot's round-2 review on 1cd6ed5, all documentation-level (no architectural change):

- §4.4 row identity: state explicitly that selection is by lowest-indexed node in --keyvizFanoutNodes whose response is non-empty for the row, AFTER all per-node responses have been collected. Selection by config-list position rather than wall-clock arrival is the load-bearing property; without it Start/End would flap between polls.
- §8 lens 2: replace "5-second ceiling per node" with "configured per-call timeout (default 2 s, override via --keyvizFanoutTimeout per §3)". Aligns with §3 and §9 Q2.
- §8 lens 3: fix the body-cap arithmetic. The earlier "32 B per uint64 ≈ 128 MiB raw, ~32 MiB JSON-encoded" was internally inconsistent (JSON does not compress). Restate as 1024 × 4096 × 8 B/uint64 = ~32 MiB raw, JSON similar size, 64 MiB cap = 2x headroom. Also reference MaxHistoryColumns (100,000) so the reader knows the cap fires at extreme column counts.
- §8 lens 3: tighten the over-cap behavioral description. Writing "log a warning rather than silently truncating" implied an automatic detection step; what actually happens is the json.Decoder hits EOF mid-stream (LimitReader-induced), the aggregator records ok=false with the decode error, and emits a WARN-level log. No partial data is accepted.
- §8 lens 5: list the three actual scenarios in §7 PR-1 plus the url-builder / per-node-order / over-cap tests rather than claiming "four scenarios" with no fourth scenario in §7.
- §9 Q2: mark Resolved with the same shape as Q1.
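The restated sizing is easy to sanity-check; a tiny sketch of the arithmetic, using the budgets quoted in the commit message (the constants are the design's stated numbers, not values read from the code):

```go
package main

import "fmt"

func main() {
	const (
		rows         = 1024     // row budget from the design
		columns      = 4096     // column budget used in the sizing
		bytesPerCell = 8        // a uint64 is 8 bytes
		capBytes     = 64 << 20 // 64 MiB peer-body cap
	)
	raw := rows * columns * bytesPerCell
	// Prints: raw ≈ 32 MiB, cap = 64 MiB (roughly 2x headroom)
	fmt.Printf("raw ≈ %d MiB, cap = %d MiB\n", raw>>20, capBytes>>20)
}
```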
@claude review

Round-2 follow-ups (d012f4c) addressing all 7 doc-level items.

Q3 (per-node `generated_at`) stays intentionally open, deferred to PR 2.
Claude finished @bootjp's task in 1m 44s

Round-3 Review
Round-2 follow-ups: all addressed ✓
Round-3 findings

Three genuinely new items (all doc-level) plus one confirmation.

1. §3 — self-detection logic is under-specified (minor)

The doc says the aggregator skips its own entry in `--keyvizFanoutNodes`, but not how a node recognises that an entry is itself.

Suggested addition to §3: spell out the matching shape (string equality against `--adminAddress`, plus wildcard-bind / loopback handling for the `0.0.0.0:8080` vs `127.0.0.1:8080` asymmetry).

2. §5 — `conflict` field presence in local-only mode (minor)

State explicitly whether the field is always present (defaulting to `false` when fan-out is disabled) or omitted, so SPAs that pattern-match on its presence don't regress when fan-out flips on.
| Q | Status |
|---|---|
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (`generated_at` per node) | Intentionally open, deferred to PR 2 ✓ |
Verdict
All round-2 items are cleanly addressed. The three round-3 items are genuinely minor (one implementation-hint for self-detection, one wire-format behavior clarification, one terminology nit). None of them are blockers. The design is ready to move to PR-1 implementation. Recommend landing the doc on this commit and opening the implementation PR.
Branch: docs/keyviz-fanout-phase2c
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d012f4c219
> - **Writes**: max across nodes, with a row-level `conflict=true`
>   flag raised when *any* per-node values disagreed for *any* cell
>   in the row (best-effort dedup; correct under stable leadership,
Define conflict to ignore leader-vs-follower zeros
The spec currently says to set row-level conflict=true when any per-node values disagree for any cell, but in steady state writes naturally disagree (leader>0, followers 0), so this rule would flag nearly every active write row as conflicted. That contradicts the later stable-leader expectation (no conflict) and would make the UI hatch normal traffic as if it were a leadership anomaly. Please narrow the conflict condition to true disagreement among competing non-zero writers (or an equivalent predicate) so expected follower zeros are not treated as conflicts.
Three minor doc items + one nit from Claude bot round-3:

- §3 self-detection: document the matching shape (string equality against --adminAddress, plus wildcard-bind / loopback handling for the 0.0.0.0:8080 vs 127.0.0.1:8080 asymmetry). Reverse-proxy / DNS-aliased entries fall through to a harmless loopback call.
- §5 conflict field: state explicitly that the field is always present, defaults to false in local-only mode. Wire-stable so pattern-match-on-presence SPAs do not regress when fan-out flips.
- §4.5 keyvizStep_ms: define the variable in-section so the formula is self-contained.
- §8 lens 3 sizing prose nit: replace "similar size for typical small values" with the precise statement (decimal encoding is smaller than 8-byte binary for ~4-digit counters; the cap becomes a real ceiling only for traffic-saturated heatmaps with ~20-digit counters).
@claude review

Round-3 follow-ups (0b5bfa0) addressing the three doc items plus the sizing nit.
Claude finished @bootjp's task in 1m 53s

Round-4 Review
Round-3 follow-ups: all addressed ✓
Round-4 finding

§2.1 vs §4.2 — the conflict predicate is still stated more loosely in §2.1 than §4.2 intends: §2.1 says the flag fires when *any* per-node values disagree for *any* cell, which under stable leadership would include the leader-says-5 vs follower-says-0 split, while §4.2 and the §7 stable-leader test expect "max wins, no conflict" in that case. Tightening §2.1 to match §4.2's intent (follower zeros are not disagreements) closes the gap.
| Q | Status |
|---|---|
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (`generated_at` per node) | Intentionally open, deferred to PR 2 ✓ |
Verdict
Three full rounds of careful review, and we're down to one issue: tighten the §2.1 conflict predicate wording to match §4.2's intent (follower zeros are not disagreements). That's a two-sentence doc fix. Everything else — merge rules, wire format, auth story, self-detection, cap math, timeout handling, test scenarios — is solid. Recommend fixing §2.1 and landing.
Branch: docs/keyviz-fanout-phase2c
The §2.1 wording said conflict fires "when any per-node values
disagreed for any cell," which under stable leadership would
include the leader-says-5 vs follower-says-0 split (the
steady-state shape) and contradicted both §4.2 and the §7 stable-
leader test expectation ("max wins, no conflict").

Narrow the predicate to its actual shape: conflict fires only
when two or more nodes report a non-zero value for the same cell.
Follower zeros against a leader's non-zero value do not count.
This matches the implementation in PR #686 (maxMerge ignores zero
on either side and returns conflict=false).

Mirrored the clarification in §4.2 so the conflict-predicate rule
is explicit in both the scope summary (§2.1) and the merge-rule
detail (§4.2).
@claude review

Round-4 follow-up (3035181): tightened the §2.1 conflict predicate.

The predicate is now stated as "≥ 2 nodes report a non-zero value for the same cell", not "any disagreement". Follower zeros against a leader's non-zero value do not count — that's the steady-state shape, not a conflict. This matches the implementation in PR #686 (`maxMerge` ignores a zero on either side and returns `conflict=false`).

Both §2.1 (scope summary) and §4.2 (merge-rule detail) now state the rule explicitly so future readers can't reach the looser interpretation.
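A minimal sketch of the narrowed predicate as stated here, as a per-cell merge helper; the `maxMerge` name follows the comment above, and the signature is illustrative.

```go
package admin

// maxMerge keeps the larger of two per-node write counters and flags a
// conflict only when both nodes report a non-zero value for the cell.
// A follower's zero against the leader's non-zero value is the steady-state
// shape and is not a conflict. Illustrative; the shipped code may differ.
func maxMerge(a, b uint64) (merged uint64, conflict bool) {
	merged = a
	if b > merged {
		merged = b
	}
	return merged, a != 0 && b != 0
}
```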
Implements the server-side half of the fan-out path proposed in
docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md. When the
operator passes --keyvizFanoutNodes, the existing
/admin/api/v1/keyviz/matrix handler aggregates the local matrix
with peer responses; with the flag empty (default), behaviour is
unchanged.
Aggregator (internal/admin/keyviz_fanout.go):
- Parallel HTTP GETs to each peer with a per-call timeout (default
2 s, --keyvizFanoutTimeout overrides).
- Forward-compatible request shape: the per-peer URL is rebuilt
from the parsed parameters so a peer running an older or newer
server cannot trip on a parameter we never intended to forward.
- Merge rules per design §4: reads / read_bytes sum across nodes
(distinct local serves are independent counts); writes /
write_bytes take the max with a row-level Conflict flag when
inputs disagree (conservative dedup that never overcounts during
a leadership flip; the canonical (raftGroupID, leaderTerm) rule
lands in Phase 2-C+).
- Degraded mode: a peer that times out, errors, or returns non-OK
contributes FanoutNodeStatus{OK: false, Error: ...}. The local
matrix still ships and Responded reflects partial success.
Wire format additions (forward-compatible):
- KeyVizMatrix.Fanout *FanoutResult — Nodes / Responded /
Expected; omitted when fan-out is disabled.
- KeyVizRow.Conflict bool — true when the max-merge collapsed
disagreeing values for any cell in the row.
Flag plumbing (main.go / main_admin.go):
- --keyvizFanoutNodes accepts comma-separated host:port endpoints.
- --keyvizFanoutTimeout sets the per-peer ceiling.
- Self-filter: a peer entry naming this node's own admin listener
(literal match, or 0.0.0.0/:: bind matched against a
127.0.0.1/localhost/::1 entry) is skipped so a symmetric
--keyvizFanoutNodes=node1,node2,node3 stamped onto every host
does not loop back over HTTP.
Tests (internal/admin/keyviz_fanout_test.go):
- mergeKeyVizMatrices table covering sum, stable-leader max,
leadership-flip max with conflict, union-of-columns, and
distinct-row order preservation.
- KeyVizFanout.Run end-to-end against httptest peers: single peer
ok, peer 5xx surfaces in degraded mode, peer hang respected by
the per-call timeout.
- buildKeyVizPeerURL table covering scheme/no-scheme/with-bounds
variants.
- Peer order is operator-supplied with self always first.
SPA changes follow in PR-2 (degraded banner + conflict hatching).
The proto / JSON extension with raftGroupID + leaderTerm follows in
PR-3 (Phase 2-C+).
Two items from Gemini round-1 on PR #686:

- security-high: io.LimitReader the JSON decode at 64 MiB. A misbehaving peer that streams gigabytes back at us would otherwise pin a goroutine on json.Decode and balloon memory. The cap is generous enough for the design's worst-case payload (1024 rows x 4096 columns x ~32 bytes per uint64 ~= 128 MiB raw, ~32 MiB JSON-encoded) but still bounded.
- medium: log resp.Body.Close errors at debug level via the existing KeyVizFanout.logger field (which was wired but unused).

peerResponseExceededLimit reuses the +1 byte from the LimitReader to detect overshoot without parsing the streamed body. Adds TestKeyVizFanoutRunPeerOverlargeBody to pin that an in-cap response decodes promptly and reports OK; the over-cap case is covered by manual inspection of the warning log path.
Two items from Claude bot round-2 on PR #686:

- Replace the LimitReader+probe pattern with a countingReader that wraps the LimitReader. The decoder's bufio buffering meant the external one-byte probe could return EOF even when the decoder itself had pulled past the cap into its own buffer; the warning fired only when the read happened to land at the boundary. The countingReader increments on every Read (including the chunks the decoder buffers internally), so the post-decode `n > cap` check fires reliably. The security bound (cap+1 max bytes off the wire) is unchanged; only the *detection* of overshoot is now deterministic.
- Add WithResponseBodyLimit so tests can drive the over-cap path without serving 64 MiB. Backed by a new TestKeyVizFanoutRunPeerExceedsBodyLimit test that lowers the cap to 1 KiB and verifies the call completes promptly with a per-node status surfaced.

Updated the keyVizPeerResponseBodyLimit comment to be honest about the worst-case sizing: 1024 rows x 4096 columns x 20 bytes/uint64 (max-encoded) is ~80 MiB, slightly over the 64 MiB cap. That is intentional - the operator-visible failure mode is "warning logged, matrix may be truncated", not "DoS". A future flag can expose the cap when an extreme-traffic deployment actually trips it.
… test

Two minor follow-ups from Claude bot round-2 on PR #686:

- countingReader retains the //nolint:wrapcheck but with a longer comment explaining why: json.Decoder uses err == io.EOF (pointer compare, not errors.Is) to decide whether to keep pulling, so wrapping the underlying io.Reader error with %w would silently break the decoder's stop condition. CLAUDE.md generally says "avoid nolint, refactor instead", but here refactoring is impossible -- the only options are pass-through-and-suppress (this) or wrap-and-break-the-decoder.
- Add TestKeyVizFanoutRunPeerNearCapSucceedsWithWarning to cover the warning-fires path: small JSON envelope (~30 B) + 256 B of trailing whitespace against a 100 B cap. Decoder returns success; countingReader.n == 101 > 100 triggers the warning. The warning log itself is best-effort and not asserted from the test (slog has no clean injection point on the package), but the reachability of the success path under cap overrun is now pinned.
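A minimal sketch of the countingReader shape described in these follow-ups: the type wraps the size-capped reader, and a post-decode check compares its count against the cap. Details beyond what the commit messages state are illustrative.

```go
package admin

import "io"

// countingReader wraps the size-capped reader and counts every byte the
// JSON decoder pulls, including what the decoder buffers internally, so a
// post-decode "n > cap" check detects overshoot deterministically.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	// Pass the underlying error through unwrapped: json.Decoder compares
	// err == io.EOF by identity to decide when to stop reading, so wrapping
	// with %w would break its stop condition.
	n, err := c.r.Read(p) //nolint:wrapcheck
	c.n += int64(n)
	return n, err
}
```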
Claude finished @bootjp's task in 1m 18s

Round-5 Review
Round-4 follow-up: addressed ✓
Both §2.1 and §4.2 now state the same predicate in complementary ways — §2.1 as a scope summary, §4.2 as the detailed rule — and both are internally consistent with the §7 test expectations.

Remaining issues

None. This is the last outstanding item from across five rounds of review.

Open questions status

| Q | Status |
|---|---|
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (`generated_at` per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

All five rounds of review are complete and every finding has been addressed. The design is consistent end-to-end: merge rules, wire format, auth story, self-detection, cap math, timeout handling, time alignment, test scenarios, and the conflict predicate all agree. Ready to merge and open PR 1.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3035181133
> - Both sides round `column_unix_ms[i]` down to the nearest
>   multiple of `keyvizStep_ms` before merging.
Round timestamps to nearest step before merge
Section 4.5 says fan-out should absorb routine sub-step jitter, but this rule floors each column_unix_ms to the previous step boundary; with a 1s step, samples for the same logical window at 10.95s and 11.05s still split into different merged columns (10s vs 11s). That recreates the column fragmentation this section is trying to eliminate and can halve apparent traffic in affected windows. Use nearest-boundary rounding (or an explicitly aligned window-start timestamp contract) so ±step/2 skew coalesces correctly.
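A minimal sketch of the nearest-boundary rounding this comment suggests, as an alternative to the floor rule; the helper name is illustrative, and this is not the shipped behaviour.

```go
package admin

// roundToNearestStep snaps a column timestamp to the nearest keyvizStep
// boundary, so samples at 10.95s and 11.05s with a 1s step both land on
// 11s instead of splitting across the 10s and 11s columns.
// Illustrative alternative to flooring; not the shipped behaviour.
func roundToNearestStep(columnUnixMs, stepMs int64) int64 {
	if stepMs <= 0 {
		return columnUnixMs
	}
	return ((columnUnixMs + stepMs/2) / stepMs) * stepMs
}
```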
> nodes, M is rows per node, C is columns. At the 1024-row
> budget × 3 nodes × 60 columns this is ~180k cells per
Merge untrimmed peer rows before top-k selection
This design treats fan-out input as up to 1024 rows per node, but /admin/api/v1/keyviz/matrix already enforces local top-k trimming, so merging those truncated sets can miss globally hot read buckets (for example, a bucket ranked just below 1024 on each node can still be top-ranked after cross-node summation). In high-cardinality workloads this causes false negatives in the cluster heatmap. The aggregator should merge full peer data (or a larger internal limit) and apply row-budget truncation only after cluster-wide merge.
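A minimal sketch of the ordering this comment argues for: merge the untrimmed per-node rows first, then apply the row budget to the cluster-wide totals. The row type and ranking key are illustrative assumptions.

```go
package admin

import "sort"

// mergedRow is an illustrative stand-in for a cluster-merged KeyViz row.
type mergedRow struct {
	Start, End string
	Reads      uint64 // cluster-wide sum after merge
}

// trimAfterMerge applies the 1024-row budget only after the cross-node sum,
// so a bucket ranked just below each node's local cut can still surface
// when its cluster-wide total is hot. Illustrative sketch.
func trimAfterMerge(rows []mergedRow, budget int) []mergedRow {
	sort.Slice(rows, func(i, j int) bool { return rows[i].Reads > rows[j].Reads })
	if budget >= 0 && len(rows) > budget {
		rows = rows[:budget]
	}
	return rows
}
```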
## Summary

Server-side half of Phase 2-C cluster fan-out, per `docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md` (PR #685). With `--keyvizFanoutNodes` set, the existing `/admin/api/v1/keyviz/matrix` handler aggregates the local matrix with peer responses; with the flag empty (default), behaviour is unchanged.

This is **PR-1** of the implementation plan in §7 of the design. PR-2 (SPA degraded banner + conflict hatching) and PR-3 (proto extension with `raftGroupID` + `leaderTerm` for the canonical merge) are separate. **Depends on PR #685** (the design doc) for review context — branched off that branch.

## Aggregator (`internal/admin/keyviz_fanout.go`)

- Parallel HTTP GETs to each peer with a per-call timeout (default 2 s, `--keyvizFanoutTimeout` overrides).
- Forward-compatible request shape: the per-peer URL is rebuilt from the parsed parameters, so a peer running an older or newer server cannot trip on an unrecognised parameter.
- **Merge rules per design §4**:
  - `reads` / `read_bytes` → sum across nodes
  - `writes` / `write_bytes` → max with row-level `Conflict` flag when inputs disagree
- Degraded mode: a peer that times out, 5xx's, or fails to decode contributes `FanoutNodeStatus{OK: false, Error: ...}`. The local matrix still ships.

## Wire format additions (forward-compatible)

```jsonc
{
  "rows": [{
    "...": "...",
    "conflict": false        // NEW: row-level disagreement flag
  }],
  "fanout": {                // NEW: omitted when fan-out is disabled
    "nodes": [{"node": "...", "ok": true, "error": ""}],
    "responded": 2,
    "expected": 3
  }
}
```

Old SPA against fan-out server keeps working — extra fields are additive.

## Flag plumbing

- `--keyvizFanoutNodes` accepts comma-separated `host:port` endpoints.
- `--keyvizFanoutTimeout` sets the per-peer ceiling (default 2 s).
- Self-filter handles the bind-vs-advertise asymmetry: `0.0.0.0`/`::` bound listener matched against `127.0.0.1`/`localhost`/`::1` entries; a symmetric `--keyvizFanoutNodes=node1,node2,node3` stamped on every host does not loop back over HTTP.

## Five-lens self-review

1. **Data loss** — Read-only against the existing sampler; conservative max-merge can under-count during a leadership flip but never over-counts. Conflict flag tells operators when the cell is soft.
2. **Concurrency / distributed** — Per-peer calls run in parallel with a context deadline; the aggregator itself waits for all responses synchronously then merges. Test `TestKeyVizFanoutRunPeerTimeout` confirms a hanging peer cannot extend the wall time.
3. **Performance** — Aggregator cost is O(N × M × C); at the 1024-row × 60-column × 3-node ceiling that's ~180k cells per request — well under the SPA's existing render budget. Hot path (`Observe`) is untouched.
4. **Data consistency** — Merge rules conservative under leadership transitions. Reads exact in steady state and during transitions. The §9.1 canonical rule is preserved as a Phase 2-C+ contract.
5. **Test coverage** — Table-driven over the four canonical scenarios in the design + URL-builder variants + per-node order. `go test -race -count=1 ./internal/admin/...` clean; `golangci-lint run ./...` clean.

## Test plan

- [x] `go test -race -count=1 ./internal/admin/...`
- [x] `go test -race -count=1 .` (main_admin tests with the new `keyVizFanoutConfig{}` arg)
- [x] `go build ./...`
- [x] `golangci-lint run ./...`
- [ ] Manual: 3-node cluster with `--keyvizFanoutNodes=...` on each, drive write traffic against one node, confirm the heatmap reflects writes regardless of which node the SPA hits
- [ ] Manual: kill one node, confirm the response carries `responded: 2, expected: 3` plus the failed node's error string

## What this PR does NOT do

- **SPA changes**. Wire-format additions are forwards-compatible; the SPA renders correctly without knowing about `Fanout` or `Conflict`. UI for those lands in PR-2.
- **`raftGroupID` / `leaderTerm` proto extension**. PR-3 of the plan.
- **Membership discovery** via `GetClusterOverview`. Static `--keyvizFanoutNodes` is the contract for Phase 2-C; cache + refresh-interval lands when clusters grow beyond hand-configured size.
- **Inter-node auth**. The fan-out path is currently anonymous; intra-cluster admin traffic is assumed to ride a private network. Auth follows in a follow-up before this hits production.
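A minimal sketch of the degraded-mode fan-out loop the Aggregator section above describes: one goroutine per peer, a per-call deadline, and failures recorded as a per-node status instead of failing the request. The status struct mirrors the wire format; the orchestration details are illustrative rather than the shipped code.

```go
package admin

import (
	"context"
	"net/http"
	"sync"
	"time"
)

// FanoutNodeStatus mirrors the per-node entry in the fanout wire block.
type FanoutNodeStatus struct {
	Node  string `json:"node"`
	OK    bool   `json:"ok"`
	Error string `json:"error,omitempty"`
}

// fetchAll issues one GET per peer in parallel, bounded by a per-call
// timeout, and records failures as per-node statuses so the local matrix
// can still ship in degraded mode. Illustrative sketch.
func fetchAll(ctx context.Context, client *http.Client, peers []string, timeout time.Duration) []FanoutNodeStatus {
	statuses := make([]FanoutNodeStatus, len(peers))
	var wg sync.WaitGroup
	for i, peer := range peers {
		wg.Add(1)
		go func(i int, peer string) {
			defer wg.Done()
			callCtx, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()
			url := "http://" + peer + "/admin/api/v1/keyviz/matrix"
			req, err := http.NewRequestWithContext(callCtx, http.MethodGet, url, http.NoBody)
			if err != nil {
				statuses[i] = FanoutNodeStatus{Node: peer, Error: err.Error()}
				return
			}
			resp, err := client.Do(req)
			if err != nil {
				statuses[i] = FanoutNodeStatus{Node: peer, Error: err.Error()}
				return
			}
			defer resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				statuses[i] = FanoutNodeStatus{Node: peer, Error: resp.Status}
				return
			}
			// Decoding and merging of the peer matrix would happen here.
			statuses[i] = FanoutNodeStatus{Node: peer, OK: true}
		}(i, peer)
	}
	wg.Wait()
	return statuses
}
```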
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9ec6ca26c8
> req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, http.NoBody)
> if err != nil {
>     return nil, pkgerrors.Wrap(err, "new request")
> }
> req.Header.Set("Accept", "application/json")
Forward admin auth when calling peer keyviz endpoints
fetchPeer constructs a brand-new GET with only an Accept header, but /admin/api/v1/keyviz/matrix is mounted behind the protected middleware chain (SessionAuth) in buildAPIMux. In normal deployments that means every peer fan-out call is unauthenticated and returns 401, so cluster fan-out degrades to local-only data with persistent peer failures instead of aggregating remote matrices.
> if isWildcardHost(selfHost) {
>     return isLoopbackHost(host)
> }
Recognize non-loopback self addresses in fan-out filtering
When --adminListen is a wildcard bind (for example 0.0.0.0:8080), this logic only treats loopback hosts as self, so a node’s own advertised hostname/IP is misclassified as a peer in --keyvizFanoutNodes. That creates a self-call on every request; today it shows up as a spurious failed node (401), and once peer auth is added it can recurse through the same fan-out handler path.
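A hedged sketch of the broader self-check this comment asks for: also match the node's advertised admin address, not only wildcard-bind plus loopback. The `advertisedHost` parameter is an assumption for illustration, and the two helpers are minimal stand-ins for the project's `isWildcardHost` / `isLoopbackHost`.

```go
package admin

// Minimal stand-ins for the project's helpers, included only so the sketch
// is self-contained; the real implementations may differ.
func isWildcardHost(h string) bool { return h == "" || h == "0.0.0.0" || h == "::" }
func isLoopbackHost(h string) bool { return h == "127.0.0.1" || h == "::1" || h == "localhost" }

// isSelfPeer treats a fan-out entry as "self" when it matches the literal
// bind host, the node's advertised admin host (assumed parameter), or, for
// wildcard binds, a loopback entry. Illustrative; not the shipped check.
func isSelfPeer(entryHost, bindHost, advertisedHost string) bool {
	if entryHost == bindHost {
		return true
	}
	if advertisedHost != "" && entryHost == advertisedHost {
		return true
	}
	if isWildcardHost(bindHost) {
		return isLoopbackHost(entryHost)
	}
	return false
}
```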
@claude review
Claude finished @bootjp's task in 4m 47s

Round-6 Review
Round-5 follow-ups: addressed ✓

All five rounds of doc-level fixes are confirmed in the design doc.

Round-6 findings

Four of these are code/doc discrepancies — the design says one thing, the code does another. The first is the most serious.

1. Peer fan-out calls are sent unauthenticated, but the peer endpoint sits behind `SessionAuth`, so every peer GET returns 401 and the feature silently degrades to local-only data (functional blocker).
| Q | Status |
|---|---|
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (`generated_at` per node) | Intentionally open, deferred to PR 2 ✓ |
Verdict
Items 1–4 are code/doc discrepancies that need to be resolved before this PR merges:

- Item 1 is a functional blocker: the feature silently degrades to local-only mode in every production deployment because `SessionAuth` rejects unauthenticated peer GETs.
- Item 2 is a one-character fix (`conflict,omitempty` → `conflict`) but violates a documented wire-stability invariant.
- Items 3 and 4 are implementation gaps where the design promises behaviour the code does not deliver.

Item 5 is a documentation gap that should be acknowledged in §2.2 but does not block merging.
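Item 2 is purely a struct-tag change; a sketch of the before/after on the row type, with all other fields elided. The field name comes from the PR text; the tag change is the fix being asked for.

```go
package admin

// KeyVizRow (other fields elided). With ",omitempty" a false value drops the
// key from the JSON, so local-only and fan-out responses differ in shape;
// without it the field is always present and defaults to false.
type KeyVizRow struct {
	// Before: Conflict bool `json:"conflict,omitempty"`
	Conflict bool `json:"conflict"` // after: wire-stable, always serialised
}
```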
…se 2-C PR-2) (#687)

## Summary

Phase 2-C **PR-2**, per the design doc landed in #685 §6: wire the fan-out `Fanout` / `conflict` shapes shipped by PR #686 into the admin SPA so operators can see when a peer failed and which rows are soft due to a leadership flip.

Backend is unchanged. PR #686's `KeyVizMatrix.Fanout` block is currently emitted on every fan-out request but the SPA was discarding it.

## What this PR adds

| Piece | Behavior |
|---|---|
| **Header counter** | `cluster view (N of M nodes)` in the per-matrix metadata strip. Hidden when `fanout` is absent (single-node mode, no behavior change). |
| **FanoutBanner** | Degraded-mode card above the heatmap when `responded < expected`. Lists every failed node + error string. Hidden when the cluster is healthy. |
| **ConflictOverlay** | SVG hatch over rows whose `conflict === true`. Layered inside the scroll container so the hatch tracks horizontal scroll (same idiom as `TimeAxis`). 4 px diagonal stripes, alpha 0.45, inheriting `currentColor` so it works against both light and dark themes. |
| **RowDetail pill** | Hovering a conflicting row reveals a `conflict` pill with a hover-tooltip explaining the leadership-flip semantics. |

## Five-lens self-review

1. **Data loss** — n/a; SPA-only change reading existing fields.
2. **Concurrency / distributed** — n/a; single browser tab consuming an existing API.
3. **Performance** — Banner: zero-cost when healthy (`null` returned). Hatch: SVG with at most 1024 `<rect>` elements (matches `KeyVizRow` budget); empirically negligible. Overlay only renders when at least one row has `conflict === true`.
4. **Data consistency** — UI surfaces existing server signals; semantics come from PR #685 design §4.2 and #686 implementation. Not relitigated here.
5. **Test coverage** — `tsc -b --noEmit` and `vite build` clean. UI behavior (banner visibility, hatch presence, header text) is hard to assert in `tsc` alone; manual verification documented below.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output writes to `internal/admin/dist`
- [ ] Manual: `make run` with `--keyvizEnabled --keyvizFanoutNodes=127.0.0.1:8080`. Open `/keyviz`; header reads `cluster view (1 of 1 nodes)`, no banner, no hatch.
- [ ] Manual: configure `--keyvizFanoutNodes=127.0.0.1:8080,127.0.0.1:9999` (port 9999 unused). Header reads `1 of 2 nodes`, the failed-node banner appears with `connection refused` for `127.0.0.1:9999`.
- [ ] Manual: smoke a leadership flip mid-window (or fake `conflict=true` server-side via `/admin/api/v1/keyviz/matrix?...`); the affected row renders with diagonal hatch overlay; hovering reveals the `conflict` pill.

## Out of scope

- **Per-cell hatching** — Phase 2-C+ once the proto extension lands and `conflict` becomes per-cell instead of per-row.
- **Per-node `generated_at` skew indicator** — design §9 Q3, deferred.
- **Banner auto-dismiss / sound notification** — out of scope; an operator who has the page open will see the banner. CI alert pages are a separate system.

Closes the SPA half of Phase 2-C; the proto extension follows in PR-3.

## Summary by CodeRabbit

* **New Features**
  * KeyViz now displays cluster fan-out status with node availability and responsiveness counts.
  * Added visual indicators for rows experiencing conflicts during leadership transitions.
  * Row details now include conflict status labels with informative tooltips.
…401 (#692)

## Summary

**Hotfix for the screenshot a user shared after enabling fan-out** — the heatmap collapsed to "1 of 6 nodes responded" with every peer returning `401 missing session cookie`.

**Root cause**: PR #686 shipped fan-out with the design assumption (#685 §3) that peers would accept anonymous calls on a private network. But the receiving side runs the same `SessionAuth` middleware as the browser path, so anonymous fan-out calls are rejected with 401. The "anonymous on a private network" framing was wrong: the implementation didn't put a bypass on the peer side, so anonymous calls never went through end-to-end.

**Fix**: forward the inbound user's cookies on every peer call. Cluster nodes already share `--adminSessionSigningKey` for HA, so a cookie minted on node A is verifiable on node B — no new infrastructure needed.

## What changed

- `KeyVizFanout.Run` gains a `cookies []*http.Cookie` parameter; nil preserves the legacy behaviour for tests / future non-browser callers.
- `KeyVizHandler` passes `r.Cookies()` so a SPA poll forwards `admin_session` (and `admin_csrf`, harmless on a GET) transparently.
- `TestKeyVizFanoutRunForwardsCookies` pins that both cookies reach the peer verbatim.
- Design doc §3 rewritten: documents the cookie-forwarding scheme + the operator-trust assumption that `--keyvizFanoutNodes` points at trusted hosts only. Phase 2-C+ will replace cookie forwarding with a dedicated inter-node token to decouple browser-session expiry from inter-node call validity.

## Five-lens self-review

1. **Data loss** — n/a; auth path change.
2. **Concurrency / distributed** — Cookies are passed by reference into per-peer goroutines but the slice is read-only inside `fetchPeer` (each goroutine `req.AddCookie`s independently). No shared mutation.
3. **Performance** — Adds N `req.AddCookie` calls per fan-out request. Trivial.
4. **Data consistency** — Auth-only change; merge semantics unchanged.
5. **Test coverage** — New `TestKeyVizFanoutRunForwardsCookies` asserts both cookies reach the peer with the original values. Existing fan-out tests pass `nil` and continue to exercise the unauthenticated path the existing code handled (peer 401s, request reports `ok=false`).

## Test plan

- [x] `go test -race -count=1 ./internal/admin/...` — clean
- [x] `golangci-lint run ./internal/admin/...` — clean
- [ ] Manual: 6-node cluster with shared `ELASTICKV_ADMIN_SESSION_SIGNING_KEY`, `--keyvizFanoutNodes` set on every node. Heatmap should now show "6 of 6 nodes" instead of "1 of 6".

## Trust / threat model note

Cookie forwarding is safe when (a) every peer is operator-configured and trusted, and (b) the network is private. **Do NOT point `--keyvizFanoutNodes` at an untrusted host** — the user's admin session would be replayed there. The design doc §3 update is explicit about this. Phase 2-C+ removes this footgun by switching to a dedicated inter-node token.

Closes the production-impacting symptom from screenshot review.
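A minimal sketch of the cookie forwarding described under "What changed": the handler hands `r.Cookies()` to the fan-out, and each peer request replays them via `req.AddCookie`. The function shape is illustrative; PR #692 threads the slice through `KeyVizFanout.Run`.

```go
package admin

import (
	"context"
	"net/http"
)

// newPeerRequest builds a peer GET that replays the inbound request's
// cookies (admin_session, plus admin_csrf which is harmless on a GET), so a
// peer sharing the admin session signing key accepts the call.
// Illustrative sketch of the PR #692 fix, not the exact shipped code.
func newPeerRequest(ctx context.Context, target string, cookies []*http.Cookie) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, http.NoBody)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	for _, c := range cookies {
		// A nil slice means zero iterations: the legacy unauthenticated path.
		req.AddCookie(c)
	}
	return req, nil
}
```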
## Summary
Phase 2-C proposal — cluster-wide KeyViz heatmaps via fan-out.
Phase 2-A and 2-B ship a heatmap that only renders the local node's view. Per parent design §5.1, reads are recorded by the follower that serves them and writes by the leader, so a single-node SPA shows a partial picture by construction. This proposal scopes a minimum-viable fan-out that ships the operator-visible value without yet requiring the proto extension parent §9.1 calls for.
## Key design decisions worth a careful read

- Static `--keyvizFanoutNodes` flag instead of `GetClusterOverview` discovery for the MVP (§3). The membership cache lands when clusters grow beyond hand-configured size.
- Max-merge with a `conflict` flag instead of the canonical `(bucketID, raftGroupID, leaderTerm, windowStart)` rule (§4.2). The canonical rule needs proto/JSON extensions and lands in Phase 2-C+. The MVP rule never overcounts (operators react badly to phantom hotspots) but may undercount during a leadership flip — surfaced via `conflict=true`.
- Embedded `internal/admin` aggregator instead of a standalone `cmd/elastickv-admin` binary (§2.2). The standalone binary was deferred indefinitely; the embedded path is the go-forward.

## What this proposal does NOT do
Listed in §2.2 to keep the PR-1 scope reviewable: no membership discovery, no proto extension, no standalone binary, no response caching. Each gets its own follow-up if and when needed.
## Implementation plan (per §7)

- PR 1: server-side aggregator + flag plumbing (next).
- PR 2: SPA degraded banner + conflict hatching.
- PR 3 (Phase 2-C+): proto + JSON extension; `conflict` moves from per-row to per-cell.

## Test plan

- [ ] Confirm the `--keyvizFanoutNodes` flag shape (host:port list, self-included) is the right contract for now