
docs: KeyViz cluster fan-out proposal (Phase 2-C)#685

Merged
bootjp merged 11 commits into main from docs/keyviz-fanout-phase2c
Apr 27, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 27, 2026

Summary

Phase 2-C proposal — cluster-wide KeyViz heatmaps via fan-out.

Phase 2-A and 2-B ship a heatmap that only renders the local node's view. Per parent design §5.1, reads are recorded by the follower that serves them and writes by the leader, so a single-node SPA shows a partial picture by construction. This proposal scopes a minimum-viable fan-out that ships the operator-visible value without yet requiring the proto extension that parent design §9.1 calls for.

Key design decisions worth a careful read

  • Static --keyvizFanoutNodes flag instead of GetClusterOverview discovery for the MVP (§3); a flag-plumbing sketch follows this list. The membership cache lands when clusters grow beyond hand-configured size.
  • Conservative max-merge for writes with a conflict flag instead of the canonical (bucketID, raftGroupID, leaderTerm, windowStart) rule (§4.2). The canonical rule needs proto/JSON extensions and lands in Phase 2-C+. The MVP rule never overcounts (operators react badly to phantom hotspots) but may undercount during a leadership flip — surfaced via conflict=true.
  • Embedded internal/admin aggregator instead of a standalone cmd/elastickv-admin binary (§2.2). The standalone binary was deferred indefinitely; the embedded path is the go-forward.
  • Forward-compatible wire format (§5). Old SPA against fan-out server keeps working — the new fields are additive.
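
A minimal sketch of the flag plumbing behind the first bullet; the flag names, the comma-separated host:port shape, and the 2 s timeout default come from this proposal, while the parsing helper and everything around it are illustrative:

```go
package main

import (
	"flag"
	"strings"
	"time"
)

var (
	// Empty list (the default) keeps today's local-only behaviour.
	fanoutNodes = flag.String("keyvizFanoutNodes", "",
		"comma-separated host:port admin endpoints, self-included")
	fanoutTimeout = flag.Duration("keyvizFanoutTimeout", 2*time.Second,
		"per-peer call ceiling")
)

// parseFanoutNodes splits and trims the operator-supplied list,
// preserving order (order matters for §4.4 row-metadata selection).
func parseFanoutNodes(raw string) []string {
	var nodes []string
	for _, p := range strings.Split(raw, ",") {
		if p = strings.TrimSpace(p); p != "" {
			nodes = append(nodes, p)
		}
	}
	return nodes
}
```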

What this proposal does NOT do

Listed in §2.2 to keep the PR-1 scope reviewable: no membership discovery, no proto extension, no standalone binary, no response caching. Each gets its own follow-up if and when needed.

Implementation plan (per §7)

  • PR 1 — server-side aggregator + flag plumbing + table-driven test
  • PR 2 — SPA degraded banner + conflict hatching + header counter
  • PR 3 (Phase 2-C+) — proto/JSON extension; replace max-merge with the canonical rule; promote conflict from per-row to per-cell

Test plan

  • Design review: confirm the §4.2 conservative max-merge trade is acceptable, or push back with the alternative (e.g., always sum + always conflict-flag if multiple nodes report > 0)
  • Confirm --keyvizFanoutNodes flag shape (host:port list, self-included) is the right contract for now
  • Confirm the timeout shape in §9 open question 2 (proposed: fixed 2 s with operator override)

Documents the MVP path for cluster-wide heatmaps: static node list
via --keyvizFanoutNodes, per-node parallel GETs against the existing
/admin/api/v1/keyviz/matrix, and a deliberately conservative merge
rule (reads sum, writes max with conflict flag) that ships the
operator-visible value without yet requiring the proto extension
parent design 9.1 needs for the canonical merge.

Scopes:
- PR 1: server-side aggregator + flag plumbing (next).
- PR 2: SPA degraded banner + conflict hatching.
- PR 3 (Phase 2-C+): proto + JSON extension with raftGroupID +
  leaderTerm; replace max-merge with the canonical
  (bucketID, raftGroupID, leaderTerm, windowStart) rule.

Explicitly defers GetClusterOverview membership discovery, the
cmd/elastickv-admin standalone binary, and TLS for fan-out — each
gets its own follow-up if and when needed.
@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 14 minutes and 5 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 775cdd0c-e3fa-4d39-86e1-9133f09c7fa2

📥 Commits

Reviewing files that changed from the base of the PR and between df5ea5e and 9ec6ca2.

📒 Files selected for processing (8)
  • docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md
  • internal/admin/keyviz_fanout.go
  • internal/admin/keyviz_fanout_test.go
  • internal/admin/keyviz_handler.go
  • internal/admin/server.go
  • main.go
  • main_admin.go
  • main_admin_test.go


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5064129ec4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +157 to +158
divergence the operator should investigate; we surface it as a
warning in the per-node status payload but do not block the


P2 Badge Define a wire field for catalog-divergence warnings

Section 4.4 says Start/End mismatches should be surfaced as a warning in the per-node status payload, but the wire format in §5 only defines node, ok, and error for each fanout.nodes item. Without a dedicated warning field, implementations either cannot report this divergence at all or must overload error (which incorrectly implies node failure), so operators can miss a real consistency issue or see false degraded-state signals.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request proposes a design for KeyViz Cluster Fan-out (Phase 2-C), enabling a cluster-wide heatmap view by aggregating data from multiple nodes. The design includes merge rules for metrics, a static node configuration, and a phased implementation plan. Feedback on the design suggests detailing rolling upgrade and zero-downtime migration strategies, resolving internal inconsistencies regarding the 'conflict' flag, and clarifying inter-node authentication. Further recommendations include rounding timestamps to handle clock skew, limiting response sizes to prevent OOM, and caching DNS lookups to improve performance.

Comment on lines +1 to +10
---
status: proposed
phase: 2-C
parent_design: docs/admin_ui_key_visualizer_design.md
date: 2026-04-27
---

# KeyViz Cluster Fan-out (Phase 2-C)

## 1. Background
Contributor


high

The design for cluster fanout introduces significant operational changes to node communication. To comply with repository standards, the document must detail a strategy for rolling upgrades, including mitigation strategies like a temporary 'bridge' or 'proxy' mode to handle communication between upgraded and non-upgraded nodes.

References
  1. Design documents identifying significant operational risks, such as rolling upgrade issues, must detail mitigation strategies like bridge or proxy modes.

Comment on lines +1 to +10
---
status: proposed
phase: 2-C
parent_design: docs/admin_ui_key_visualizer_design.md
date: 2026-04-27
---

# KeyViz Cluster Fan-out (Phase 2-C)

## 1. Background
Contributor


high

As this proposal outlines a migration to a new production-grade cluster aggregation model, the plan must include specific strategies for live or zero-downtime cutovers (e.g., blue/green deployment or dual-write proxy) to ensure no service interruption occurs during the transition.

References
  1. Production-grade system migrations must detail strategies for zero-downtime cutovers to avoid service interruption.

Comment on lines +41 to +44
- **Writes**: max across nodes, with a `conflict=true` flag on
cells where the per-node values disagree (best-effort dedup;
correct under stable leadership, conservative under leadership
flip).
Contributor


medium

The definition of conflict=true is inconsistent across the document. Section 2.1 describes it as being per-cell and triggered by any disagreement, while Section 5 and Section 6 specify it as per-row and intended for cases where multiple nodes report non-zero values (e.g., during a leadership flip). Clarifying this to be per-row and based on multiple non-zero reports will ensure the implementation matches the intended MVP behavior and remains usable.

Comment on lines +90 to +94
- The fan-out caller authenticates with the same admin session
cookie the upstream request carried. The `internal/admin`
middleware mints a short-lived inter-node token signed with the
shared admin secret (already used for browser sessions) so a
compromised browser cookie cannot be replayed beyond its TTL on
Contributor


security-medium medium

The authentication mechanism for fan-out requests is unclear. The text mentions both replaying the browser's session cookie and minting a short-lived inter-node token. Replaying user cookies between nodes is generally discouraged. It would be better to clarify that the aggregator uses a dedicated inter-node token for peer requests, as suggested by the mention of the shared admin secret.

Comment on lines +164 to +169
arrays line up modulo clock skew. The aggregator pivots on
`column_unix_ms[i]` exactly: a column present in node A but
absent in node B contributes only A's values for that timestamp,
with B's missing column treated as zero. Operators with NTP
drift > Step should fix NTP — the heatmap is not designed to
hide clock skew.
Contributor


medium

Pivoting on exact column_unix_ms values is fragile in a distributed system. Even with NTP, slight offsets or different start times can cause nodes to produce slightly different timestamps for the same logical window, leading to a sparse or jagged heatmap. Consider rounding timestamps to the nearest keyvizStep boundary during aggregation to ensure columns are correctly merged.

Comment on lines +260 to +265
3. **Performance** — Aggregator cost is O(N × M × C) where N is
nodes, M is rows per node, C is columns. At the 1024-row
budget × 3 nodes × 60 columns this is ~180k cells per
request — well below the SPA's existing render budget. No
coordinated round trip per `Observe`; the existing hot path is
not touched.
Contributor


security-medium medium

While the aggregator cost is estimated to be low, the design should explicitly require limiting the size of the response bodies accepted from peer nodes (e.g., using http.MaxBytesReader). This prevents a single misconfigured or compromised node from sending an oversized payload that could cause an OOM on the aggregator node.

Comment on lines +282 to +284
addresses? Proposing **DNS lookups happen on every fan-out
request** (no caching) — small clusters that this MVP targets
do not need a resolver cache.
Contributor


medium

Performing a DNS lookup for every node on every fan-out request can introduce significant latency and load on the DNS infrastructure. Even for small clusters, transient DNS issues would directly impact the responsiveness of the Admin UI. It is recommended to resolve hostnames once at configuration load or use a short-lived cache for the resolutions.

Seven items from Codex P2 + Gemini medium / security-medium / high:

- 2.1 conflict flag definition: clarify per-row scope and signal
  that 2-C+ moves it to per-cell, removing the inconsistency
  between 2.1 and 5/6.
- 3 auth: replace the conflated cookie-replay-plus-mint-token
  paragraph with a clean two-tier story. Phase 2-C MVP is
  anonymous on a private network; Phase 2-C+ adds a short-lived
  signed inter-node token derived from the existing admin
  signing key. Adds an explicit do-not-enable-on-untrusted
  warning.
- 4.5 time alignment: bucket-align column timestamps to the
  keyvizStep boundary before merge so sub-step NTP jitter does
  not split each window into two adjacent merged columns.
- 5 wire format: add per-node warnings array with a
  catalog_divergence code so the wire form actually carries
  the catalog-mismatch signal 4.4 promises.
- 8 lens 3: document the 64 MiB io.LimitReader cap on peer
  response bodies (matches the cap implemented in PR #686).
- 9 open Q1: resolve the DNS question. Resolution rides the
  stdlib resolver / nscd cache; no aggregator-local cache.
  Operators on flaky DNS colocate a resolver.
- 1 background: add a Rollout notes paragraph addressing the
  Gemini high asks. The feature is off by default, mixed-version
  rollout is correct by construction (no schema migration, no
  state, forwards-compatible wire format), so a dual-write proxy
  / blue-green plan is not justified for this scope.
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-1 follow-ups (1cd6ed5):

Codex P2 — added per-node warnings array with catalog_divergence code in §5; gives the §4.4 catalog-mismatch signal an actual wire field.
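
For reference, the new field slots into the per-node status entry roughly as follows. The JSON keys node / ok / error and the catalog_divergence code are fixed by §5; the Warnings addition is what round 1 introduced, and the Go struct shape here is a sketch rather than the PR's actual declaration:

```go
// FanoutNodeStatus mirrors one entry of fanout.nodes on the wire.
type FanoutNodeStatus struct {
	Node     string   `json:"node"`
	OK       bool     `json:"ok"`
	Error    string   `json:"error,omitempty"`
	Warnings []string `json:"warnings,omitempty"` // e.g. "catalog_divergence"
}
```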

Gemini medium / consistency — conflict flag scope now stated up front in §2.1 as row-level (with the per-cell upgrade explicitly tied to Phase 2-C+); removes the §2.1 vs §5/§6 mismatch.

Gemini security-medium / auth — rewrote §3 with a clean two-tier story:

  • Phase 2-C MVP: anonymous fan-out on a private network with an explicit "do NOT enable across an untrusted network" warning.
  • Phase 2-C+: signed inter-node token derived from the existing admin session-signing key, NOT a replay of the browser cookie. The earlier draft conflated the two; this paragraph supersedes it.

Gemini medium / column drift — §4.5 now bucket-aligns column timestamps to keyvizStep boundaries before merging, so sub-step NTP jitter does not split each window into two adjacent merged columns.
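
The alignment rule reduces to a one-line floor; a sketch, with keyvizStep_ms derived as §4.5 defines it and the function name illustrative:

```go
package main

import "time"

// alignToStep floors a column timestamp to its keyvizStep window so
// sub-step jitter from two nodes lands in the same merged column.
func alignToStep(tsMs int64, step time.Duration) int64 {
	stepMs := int64(step / time.Millisecond) // keyvizStep_ms per §4.5
	return tsMs - tsMs%stepMs
}
```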

Gemini security-medium / body cap — §8 lens 3 documents the 64 MiB io.LimitReader cap (matches what PR #686 implements).

Gemini medium / DNS — §9 Q1 resolved: resolution rides the stdlib resolver and nscd cache; no aggregator-local cache. Per-call deadline already protects against transient DNS hangs. Operators on flaky DNS colocate a caching resolver — same answer for every short-lived HTTP client, not unique to this feature.

Gemini high / rollout strategy — pushed back in §1 "Rollout notes". This feature:

  • Is off by default (--keyvizFanoutNodes empty)
  • Lives entirely on the admin path (no data-plane impact)
  • Stores nothing on disk (no schema migration)
  • Has a forwards-compatible wire format (old SPA + new server, and old node + new aggregator, both work)

A dual-write proxy / blue-green isn't a fit for this scope; mixed-version rollout is correct by construction. The worst-case partial deploy surfaces as a per-node ok=false in the response, not data loss.

Open to pushback on any of these calls if the trade-offs aren't right.

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 3m 25s


Round-2 Review

Summary: All round-1 issues are resolved. There are a few targeted follow-up items, mostly around the body-cap math, a 5s/2s timeout discrepancy, and one behavioral precision issue with io.LimitReader.


Round-1 follow-ups: addressed ✓

| Issue | Resolution |
| --- | --- |
| Codex P2 — missing warnings wire field | §5 adds per-node warnings array with catalog_divergence code ✓ |
| Gemini — conflict scope inconsistency | §2.1 now states row-level up front, per-cell tied to Phase 2-C+ ✓ |
| Gemini — auth conflation (cookie vs. inter-node token) | §3 two-tier story is clear; Phase 2-C MVP anonymous + warning, Phase 2-C+ signed token from session-signing key ✓ |
| Gemini — column timestamp drift | §4.5 bucket-aligns to keyvizStep boundary before pivot ✓ |
| Gemini — no body cap | §8 lens 3 documents 64 MiB io.LimitReader cap ✓ |
| Gemini — DNS resolution strategy | §9 Q1 resolved: rides stdlib/nscd, per-call deadline protects against hangs ✓ |
| Gemini — rolling upgrade / zero-downtime | §1 "Rollout notes" justifies why mixed-version rollout is correct by construction ✓ |

Round-2 findings

1. §8 lens 3 — body-cap arithmetic is internally inconsistent (medium)

The current prose:

1024 rows × 4096 columns × ~32 B per uint64 ≈ 128 MiB raw, ~32 MiB JSON-encoded

The math doesn't add up in two ways:

  • "32 B per uint64" × 1024 × 4096 = 128 MiB — but then it claims the JSON-encoded result is only 32 MiB, implying 4× compression. JSON doesn't compress; a uint64 serialized as a decimal string is smaller than 8 bytes raw for small values (e.g. "5," is 2 bytes vs 8 bytes binary), not larger.
  • The actual raw binary size is 1024 × 4096 × 8 B = 32 MiB; the JSON representation of the same data would be roughly similar (small traffic counters encode shorter than 8 chars), not 4× larger.

The cap itself (64 MiB) is likely fine for typical deployments, but the prose math should be corrected so a future reader can trust it. Suggested rewording:

1024 rows × 4096 columns × 8 B/uint64 ≈ 32 MiB raw binary; JSON-encoded text is similar in size for small counters. The 64 MiB cap provides ≥2× headroom.

2. §8 lens 3 — 4096 is not MaxHistoryColumns (minor, same section)

The design uses 4096 columns as the example ceiling, but the actual upper bound in the sampler is MaxHistoryColumns = 100_000 (keyviz/sampler.go:92). At 100k columns × 1024 rows × 8 B, a single peer response is ~800 MiB raw — well above the 64 MiB cap. The cap will fire for any request spanning roughly 13,000+ columns at default config (or fewer with sub-second steps), returning a partial/error result.

The document should either:

  • Reference MaxHistoryColumns and acknowledge the cap fires at extreme column counts (treating this as the intended operator contract), or
  • Add a server-side column cap on the fan-out path (analogous to keyVizRowBudgetCap) so a broad time-range query doesn't surprise the aggregator.

Neither path is wrong — but leaving it implicit means the first operator who queries a node with a large ring buffer will get a confusing ok=false without knowing why.

3. §8 lens 2 vs. §3 — timeout value mismatch (minor)

§8 lens 2 says "5-second ceiling per node" but §3 says "Per-call timeout: 2 s default, override via --keyvizFanoutTimeout". These should agree. If the 2 s default is correct (matching §9 Q2's recommendation), §8 should say "per-call timeout (default 2 s, configurable via --keyvizFanoutTimeout)" rather than "5-second ceiling".

4. §9 Q2 still written as open but §3 already resolved it (minor)

§3 already commits to "2 s default, override via --keyvizFanoutTimeout". §9 Q2 still reads as a proposal ("probably the right shape — proposing this"). Mark it resolved the same way Q1 was, e.g.:

Resolved: Fixed 2 s default, operator override via --keyvizFanoutTimeout. See §3.

5. §4.4 — "first non-empty, deterministic by node order" needs precision (minor)

Nodes are fanned out in parallel, so responses arrive in non-deterministic wall-clock order. The "deterministic by node order in --keyvizFanoutNodes" guarantee requires the aggregator to collect all results first and then pick by config-list position — not by arrival order. If the implementation picks on first arrival it's non-deterministic, which means Start/End could change between polls for the same bucket. Worth making the sequential-selection invariant explicit:

After all per-node results are collected, Start/End/... are taken from the lowest-indexed node in --keyvizFanoutNodes whose response is non-empty for the row.
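
In code, the invariant is a post-collection scan in config-list order. A sketch with illustrative names:

```go
package main

type nodeResult struct {
	hasRow bool
	// per-row metadata (Start/End/...) would live here
}

// pickRowSource runs only after every per-node result is collected;
// results is indexed by position in --keyvizFanoutNodes, so the pick
// is deterministic regardless of wall-clock arrival order.
func pickRowSource(results []*nodeResult) *nodeResult {
	for _, r := range results {
		if r != nil && r.hasRow {
			return r
		}
	}
	return nil
}
```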

6. §7 / §8 test scenario count mismatch (minor)

§8 references "the four scenarios in §7 PR-1" but §7 lists only three:

  1. stable-leader (max wins, no conflict)
  2. mid-window flip (max wins, conflict=true)
  3. partial failure (one node times out)

Either a fourth scenario was dropped from §7 (e.g., catalog_divergence warning), or the count in §8 should be corrected to three. If catalog_divergence is the missing fourth, add it to the §7 test list.

7. §8 lens 3 — io.LimitReader over-cap behavior description is imprecise (minor)

The document says:

Over-cap responses log a warning rather than silently truncating.

With io.LimitReader, reads stop exactly at the limit — there is no automatic warning. What actually happens is: the JSON decoder hits EOF mid-stream, returns a parse error, and the aggregator treats that peer as ok=false (error: "unexpected EOF" or similar). This is the right behavior but should be described accurately, both to set operator expectations and to ensure the implementation matches:

Responses exceeding the cap cause a JSON-decode error; the peer is treated as ok=false with the decode error in its error field, and a WARN-level log is emitted. Partial data is not used.
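
At the call site the contract is small. A sketch assuming the 64 MiB cap from §8 lens 3; the helper name and error mapping are illustrative:

```go
package main

import (
	"encoding/json"
	"io"
)

const peerBodyCap = 64 << 20 // 64 MiB, §8 lens 3

// decodePeer reads at most peerBodyCap bytes. An over-cap response
// starves the decoder mid-stream, so Decode returns an error (e.g.
// unexpected EOF); the caller records that peer as ok=false with the
// error string and emits a WARN log. Partial data is never used.
func decodePeer(body io.Reader, out any) error {
	return json.NewDecoder(io.LimitReader(body, peerBodyCap)).Decode(out)
}
```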


Open questions status

  • Q1: Resolved ✓
  • Q2: Effectively resolved (§3), but not marked resolved in §9 — see item 4 above.
  • Q3 (per-node generated_at in response): Still open, deferred to PR 2 — this is fine.

Verdict

The design is solid and ready to move to implementation on PR 1. The round-1 concerns are well addressed. Items 1–7 above are documentation-level fixes (no architectural changes needed). Recommend updating the doc to fix the arithmetic, resolve Q2, and tighten the io.LimitReader description before merging.

Seven follow-ups from Claude bot's round-2 review on 1cd6ed5, all
documentation-level (no architectural change):

- 4.4 row identity: state explicitly that selection is by lowest-
  indexed node in --keyvizFanoutNodes whose response is non-empty
  for the row, AFTER all per-node responses have been collected.
  Selection by config-list position rather than wall-clock arrival
  is the load-bearing property; without it Start/End would flap
  between polls.
- 8 lens 2: replace "5-second ceiling per node" with "configured
  per-call timeout (default 2 s, override via --keyvizFanoutTimeout
  per 3)". Aligns with 3 and 9 Q2.
- 8 lens 3: fix the body-cap arithmetic. The earlier "32 B per
  uint64 ≈ 128 MiB raw, ~32 MiB JSON-encoded" was internally
  inconsistent (JSON does not compress). Restate as 1024 × 4096 ×
  8 B/uint64 = ~32 MiB raw, JSON similar size, 64 MiB cap = 2x
  headroom. Also reference MaxHistoryColumns (100,000) so the
  reader knows the cap fires at extreme column counts.
- 8 lens 3: tighten the over-cap behavioral description. Writing
  "log a warning rather than silently truncating" implied an
  automatic detection step; what actually happens is the
  json.Decoder hits EOF mid-stream (LimitReader-induced), the
  aggregator records ok=false with the decode error, and emits a
  WARN-level log. No partial data is accepted.
- 8 lens 5: list the three actual scenarios in 7 PR-1 plus the
  url-builder / per-node-order / over-cap tests rather than
  claiming "four scenarios" with no fourth scenario in 7.
- 9 Q2: mark Resolved with the same shape as Q1.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-2 follow-ups (d012f4c) addressing all 7 doc-level items:

  1. §8 lens 3 / cap math — replaced 32 B/uint64 ≈ 128 MiB raw, ~32 MiB JSON (inconsistent: JSON doesn't compress) with 1024 × 4096 × 8 B/uint64 ≈ 32 MiB raw; 64 MiB cap = 2× headroom.
  2. §8 lens 3 / MaxHistoryColumns — explicit reference to MaxHistoryColumns = 100 000 and the ~800 MiB raw upper bound that trips the cap on extreme ring queries.
  3. §8 lens 2 / timeout — replaced 5-second ceiling with configured per-call timeout (default 2 s, override via --keyvizFanoutTimeout per §3).
  4. §9 Q2 / mark resolved — restated as Resolved with the same shape as Q1, citing §3.
  5. §4.4 / first-non-empty precision — explicit "after all per-node responses have been collected" + selection by lowest-indexed-in-flag rather than wall-clock arrival. Notes why (peers respond out-of-order).
  6. §8 lens 5 / scenario count — was "four scenarios" with no fourth listed; now lists the three §7 PR-1 scenarios plus the url-builder / per-node-order / over-cap tests.
  7. §8 lens 3 / over-cap behavior — clarified that io.LimitReader causes a JSON decode error (EOF mid-stream), recorded as ok=false with the error in that node's status entry plus a WARN-level log. No partial data accepted.

Q3 (per-node generated_at) intentionally still open — deferred to PR 2 (SPA UI).

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 44s


Round-3 Review

  • Read design doc (docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md)
  • Verify all 7 round-2 fixes
  • Check for any remaining issues
  • Post review findings

Round-2 follow-ups: all addressed ✓

| # | Item verified |
| --- | --- |
| 1 | §8 cap math — 8 B/uint64 × 1024 × 4096 ≈ 32 MiB raw; 64 MiB = 2× headroom |
| 2 | §8 MaxHistoryColumns = 100 000 + ~800 MiB upper-bound call-out |
| 3 | §8 lens 2 timeout — "5-second ceiling" → "configured per-call timeout (default 2 s)" |
| 4 | §9 Q2 marked Resolved with citation to §3 |
| 5 | §4.4 explicit post-collection selection + lowest-indexed-in-flag rule with rationale |
| 6 | §8 lens 5 scenario count — lists three §7 scenarios plus url-builder / per-node-order / over-cap |
| 7 | §8 lens 3 io.LimitReader over-cap → JSON-decode error → ok=false + WARN log, no partial data |

Round-3 findings

Three genuinely new items (all doc-level) plus one confirmation.

1. §3 — self-detection logic is under-specified (minor)

The doc says "internal/admin resolves --adminAddress and skips the network round-trip for the local entry." It does not say how self is detected. Two edge cases are worth documenting:

  • If the operator writes 127.0.0.1:8080 in --keyvizFanoutNodes but passes 0.0.0.0:8080 as --adminAddress, a naive string-equality check misses the match and the node fans out to itself.
  • If the admin listener is behind a reverse proxy or DNS alias, the address in the flag differs from --adminAddress and the self-skip never fires, resulting in a loopback round-trip that is harmless but wasteful.

Suggested addition to §3:

Self-detection uses string-equality against the normalized --adminAddress value. Operators who listen on 0.0.0.0 should include the fully-qualified address in --keyvizFanoutNodes (e.g., 10.0.0.1:8080, not 0.0.0.0:8080) so the skip fires. A loopback call is harmless — it degrades to an extra HTTP round-trip — but wastes a goroutine per fan-out request.

2. §5 — conflict field behavior for local-only (fan-out disabled) mode is unstated (minor)

conflict is a row-level field added to every row in the response. When --keyvizFanoutNodes is empty (local-only mode), there is no merge and no possible conflict, yet the field would still appear in the JSON (as false). This is fine from a wire-compatibility standpoint, but the document should state it explicitly so the implementation doesn't accidentally omit the field in fan-out mode and include it in local mode (or vice versa):

conflict is always present in the row object and defaults to false. It is only ever set to true by the fan-out aggregator — in local-only mode it is always false.

This also guards against a future SPA that pattern-matches "conflict" in row rather than checking its value.

3. §4.5 — keyvizStep_ms used but never defined (minor)

The section uses keyvizStep_ms as a variable name in the "Specifics" bullets, but the flag is --keyvizStep and its unit is not stated in §4.5 itself. One sentence is enough:

In the formulas below, keyvizStep_ms denotes the step duration in milliseconds (the value of --keyvizStep converted from its native unit).

4. §8 lens 3 — "similar size for typical small values" is defensible but could be tighter (nit)

The existing prose: "JSON encodes uint64 traffic counters at a similar size for typical small values." This is plausible (counters in the 0–9999 range encode shorter in decimal than 8 bytes binary) but "similar" is doing a lot of work. A precise version avoids any future argument:

For sparse heatmaps where most counters are ≤ 4 digits, JSON decimal encoding is smaller than raw binary; the 64 MiB cap provides comfortable headroom for the 32 MiB raw-binary ceiling at 1024 × 4096.


Open questions status

| Q | Status |
| --- | --- |
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (generated_at per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

All round-2 items are cleanly addressed. The three round-3 items are genuinely minor (one implementation-hint for self-detection, one wire-format behavior clarification, one terminology nit). None of them are blockers. The design is ready to move to PR-1 implementation. Recommend landing the doc on this commit and opening the implementation PR.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d012f4c219

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +63 to +65
- **Writes**: max across nodes, with a row-level `conflict=true`
flag raised when *any* per-node values disagreed for *any* cell
in the row (best-effort dedup; correct under stable leadership,


P1 Badge Define conflict to ignore leader-vs-follower zeros

The spec currently says to set row-level conflict=true when any per-node values disagree for any cell, but in steady state writes naturally disagree (leader>0, followers 0), so this rule would flag nearly every active write row as conflicted. That contradicts the later stable-leader expectation (no conflict) and would make the UI hatch normal traffic as if it were a leadership anomaly. Please narrow the conflict condition to true disagreement among competing non-zero writers (or an equivalent predicate) so expected follower zeros are not treated as conflicts.


Three minor doc items + one nit from Claude bot round-3:

- 3 self-detection: document the matching shape (string equality
  against --adminAddress, plus wildcard-bind / loopback handling
  for the 0.0.0.0:8080 vs 127.0.0.1:8080 asymmetry). Reverse-proxy
  / DNS-aliased entries fall through to a harmless loopback call.
- 5 conflict field: state explicitly that the field is always
  present, defaults to false in local-only mode. Wire-stable so
  pattern-match-on-presence SPAs do not regress when fan-out flips.
- 4.5 keyvizStep_ms: define the variable in-section so the formula
  is self-contained.
- 8 lens 3 sizing prose nit: replace "similar size for typical
  small values" with the precise statement (decimal encoding is
  smaller than 8-byte binary for ~4-digit counters; cap becomes a
  real ceiling only for traffic-saturated heatmaps with ~20-digit
  counters).
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-3 follow-ups (0b5bfa0):

  1. §3 self-detection — documented the matching shape: exact host:port match vs --adminAddress, plus wildcard-bind/loopback handling (0.0.0.0:8080 listener with 127.0.0.1:8080 in the flag still self-skips). Reverse-proxy / DNS-aliased entries fall through to a harmless loopback call. The behavior matches isSelfFanoutNode in PR #686 (feat(admin): KeyViz cluster fan-out aggregator, Phase 2-C PR-1).
  2. §5 conflict field — explicit "always present, defaults false in local-only mode" so an SPA that pattern-matches "conflict" in row rather than the value doesn't regress when --keyvizFanoutNodes flips.
  3. §4.5 keyvizStep_ms — defined in-section as time.Duration(--keyvizStep) / time.Millisecond so the formula is self-contained.
  4. §8 lens 3 sizing prose — replaced "similar size for typical small values" with the precise statement: decimal encoding is smaller than 8-byte binary for ~4-digit counters; cap only becomes a real ceiling for traffic-saturated heatmaps with ~20-digit counters.

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 53s


Round-4 Review

  • Read design doc
  • Verify round-3 fixes (4 items)
  • Check Codex P1 finding (conflict semantics in steady state)
  • Check for any remaining issues
  • Post findings

Round-3 follow-ups: all addressed ✓

| # | Item verified |
| --- | --- |
| 1 | §3 self-detection: exact host:port + wildcard-bind + loopback; reverse-proxy falls through to harmless extra call |
| 2 | §5 conflict always present, defaults false in local-only mode, only fan-out aggregator ever sets it true |
| 3 | §4.5 keyvizStep_ms defined in-section as time.Duration(--keyvizStep) / time.Millisecond |
| 4 | §8 lens 3: decimal encoding is smaller than 8-byte binary for ≤4-digit counters; cap becomes real ceiling only for ~20-digit traffic-saturated counters |

Round-4 finding

§2.1 vs §4.2 — conflict predicate is still broader than intent (medium, Codex P1 carry-over)

This is the Codex P1 issue raised on d012f4c and it remains in 0b5bfa03.

§2.1 (lines 63–65):

max across nodes, with a row-level conflict=true flag raised when any per-node values disagreed for any cell in the row

§4.2 (lines 167–172):

Under stable leadership, exactly one node (the current leader for the route's group) records writes for any given window; other nodes report 0. max returns the leader's count; non-leader zeros do not perturb it.

The §2.1 wording says conflict fires whenever "any values disagreed." Under stable leadership, the leader reports e.g. writes=5 and every follower reports writes=0 — these disagree. So the §2.1 predicate triggers conflict=true on every active write row in steady state, not just during leadership flips. That directly contradicts §4.2's intent and the §7 test expectation ("stable-leader: max wins, no conflict").

The fix is a one-line clarification in §2.1 to narrow the predicate to multiple non-zero reporters:

-  flag raised when *any* per-node values disagreed for *any* cell
-  in the row (best-effort dedup; correct under stable leadership,
-  conservative under leadership flip).
+  flag raised when *multiple nodes report a non-zero value for
+  any cell in the row* (i.e., at least two nodes have counts > 0
+  for the same cell; follower zeros do not count as disagreement).
+  Correct under stable leadership (leader is the sole non-zero
+  reporter), conservative under a leadership flip.

The same wording needs a matching clarification in the §4.2 max rule: "conflict fires when the count of nodes with value > 0 for a given cell is ≥ 2 after aggregation" — making the "non-leader zeros do not perturb it" sentence also apply to the conflict flag, not just to the max value.

This is the only outstanding issue. All other aspects of the design are clean and ready for PR-1 implementation.


Open questions status

| Q | Status |
| --- | --- |
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (generated_at per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

Three full rounds of careful review, and we're down to one issue: tighten the §2.1 conflict predicate wording to match §4.2's intent (follower zeros are not disagreements). That's a two-sentence doc fix. Everything else — merge rules, wire format, auth story, self-detection, cap math, timeout handling, test scenarios — is solid. Recommend fixing §2.1 and landing.

The 2.1 wording said conflict fires "when any per-node values
disagreed for any cell," which under stable leadership would
include the leader-says-5 vs follower-says-0 split (the
steady-state shape) and contradicted both 4.2 and the 7 stable-
leader test expectation ("max wins, no conflict").

Narrow the predicate to its actual shape: conflict fires only
when two or more nodes report a non-zero value for the same cell.
Follower zeros against a leader's non-zero value do not count.

This matches the implementation in PR #686 (maxMerge ignores zero
on either side and returns conflict=false).

Mirrored the clarification in 4.2 so the conflict-predicate rule
is explicit in both the scope summary (2.1) and the merge-rule
detail (4.2).
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-4 follow-up (3035181): tightened the conflict predicate in §2.1 and mirrored the clarification in §4.2.

The predicate is now stated as "≥ 2 nodes report a non-zero value for the same cell", not "any disagreement". Follower zeros against a leader's non-zero value do not count — that's the steady-state shape, not a conflict. This matches the implementation in PR #686 (maxMerge ignores zero on either side and returns conflict=false) and the §7 "stable-leader: max wins, no conflict" test expectation.

Both §2.1 (scope summary) and §4.2 (merge-rule detail) now state the rule explicitly so future readers can't reach the looser interpretation.
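
Pairwise, the narrowed predicate folds down to a few lines. A sketch consistent with the maxMerge behaviour described above, not the PR #686 code itself:

```go
package main

// maxMergeCell merges one write cell from two reporters. Conflict
// fires only when both sides are non-zero: a follower's zero against
// the leader's count is the steady-state shape, not a disagreement.
func maxMergeCell(a, b uint64) (merged uint64, conflict bool) {
	merged = a
	if b > merged {
		merged = b
	}
	return merged, a > 0 && b > 0
}
```

Folding this across N nodes preserves the rule: the running max is non-zero exactly when some earlier node reported non-zero, so the flag first fires on the second non-zero reporter.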

bootjp added 5 commits April 27, 2026 18:27
Implements the server-side half of the fan-out path proposed in
docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md. When the
operator passes --keyvizFanoutNodes, the existing
/admin/api/v1/keyviz/matrix handler aggregates the local matrix
with peer responses; with the flag empty (default), behaviour is
unchanged.

Aggregator (internal/admin/keyviz_fanout.go):

- Parallel HTTP GETs to each peer with a per-call timeout (default
  2 s, --keyvizFanoutTimeout overrides).
- Forward-compatible request shape: the per-peer URL is rebuilt
  from the parsed parameters so a peer running an older or newer
  server cannot trip on a parameter we never intended to forward.
- Merge rules per design 4: reads / read_bytes sum across nodes
  (distinct local serves are independent counts); writes /
  write_bytes take the max with a row-level Conflict flag when
  inputs disagree (conservative dedup that never overcounts during
  a leadership flip; the canonical (raftGroupID, leaderTerm) rule
  lands in Phase 2-C+).
- Degraded mode: a peer that times out, errors, or returns non-OK
  contributes FanoutNodeStatus{OK: false, Error: ...}. The local
  matrix still ships and Responded reflects partial success.

Wire format additions (forward-compatible):

- KeyVizMatrix.Fanout *FanoutResult — Nodes / Responded /
  Expected; omitted when fan-out is disabled.
- KeyVizRow.Conflict bool — true when the max-merge collapsed
  disagreeing values for any cell in the row.

Flag plumbing (main.go / main_admin.go):

- --keyvizFanoutNodes accepts comma-separated host:port endpoints.
- --keyvizFanoutTimeout sets the per-peer ceiling.
- Self-filter: a peer entry naming this node's own admin listener
  (literal match, or 0.0.0.0/:: bind matched against a
  127.0.0.1/localhost/::1 entry) is skipped so a symmetric
  --keyvizFanoutNodes=node1,node2,node3 stamped onto every host
  does not loop back over HTTP.

Tests (internal/admin/keyviz_fanout_test.go):

- mergeKeyVizMatrices table covering sum, stable-leader max,
  leadership-flip max with conflict, union-of-columns, and
  distinct-row order preservation.
- KeyVizFanout.Run end-to-end against httptest peers: single peer
  ok, peer 5xx surfaces in degraded mode, peer hang respected by
  the per-call timeout.
- buildKeyVizPeerURL table covering scheme/no-scheme/with-bounds
  variants.
- Peer order is operator-supplied with self always first.

SPA changes follow in PR-2 (degraded banner + conflict hatching).
The proto / JSON extension with raftGroupID + leaderTerm follows in
PR-3 (Phase 2-C+).
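
In outline, the aggregator this commit describes runs one goroutine per peer under a per-call deadline. A sketch: the endpoint path, the timeout default, and the degraded-mode behaviour come from the commit message; the function shape and the trimmed status struct are illustrative:

```go
package main

import (
	"context"
	"net/http"
	"sync"
	"time"
)

type FanoutNodeStatus struct {
	Node  string `json:"node"`
	OK    bool   `json:"ok"`
	Error string `json:"error,omitempty"`
}

// fanOut queries every peer in parallel; each call is bounded by the
// per-peer timeout (default 2 s, --keyvizFanoutTimeout). Failures
// surface as ok=false; the local matrix still ships.
func fanOut(ctx context.Context, client *http.Client, peers []string, timeout time.Duration) []FanoutNodeStatus {
	statuses := make([]FanoutNodeStatus, len(peers))
	var wg sync.WaitGroup
	for i, peer := range peers {
		wg.Add(1)
		go func(i int, peer string) {
			defer wg.Done()
			callCtx, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()
			req, err := http.NewRequestWithContext(callCtx, http.MethodGet,
				"http://"+peer+"/admin/api/v1/keyviz/matrix", http.NoBody)
			if err != nil {
				statuses[i] = FanoutNodeStatus{Node: peer, Error: err.Error()}
				return
			}
			resp, err := client.Do(req)
			if err != nil {
				statuses[i] = FanoutNodeStatus{Node: peer, Error: err.Error()}
				return
			}
			defer resp.Body.Close()
			// decode + merge elided; non-200 or decode failure → ok=false
			statuses[i] = FanoutNodeStatus{Node: peer, OK: resp.StatusCode == http.StatusOK}
		}(i, peer)
	}
	wg.Wait() // everything collected before merging, per design §4.4
	return statuses
}
```
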
Two items from Gemini round-1 on PR #686:

- security-high: io.LimitReader the JSON decode at 64 MiB. A
  misbehaving peer that streams gigabytes back at us would otherwise
  pin a goroutine on json.Decode and balloon memory. The cap is
  generous enough for the design's worst-case payload (1024 rows x
  4096 columns x ~32 bytes per uint64 ~= 128 MiB raw, ~32 MiB
  JSON-encoded) but still bounded.
- medium: log resp.Body.Close errors at debug level via the existing
  KeyVizFanout.logger field (which was wired but unused).
  peerResponseExceededLimit reuses the +1 byte from the LimitReader
  to detect overshoot without parsing the streamed body.

Adds TestKeyVizFanoutRunPeerOverlargeBody to pin that an in-cap
response decodes promptly and reports OK; the over-cap case is
covered by manual inspection of the warning log path.
Two items from Claude bot round-2 on PR #686:

- Replace the LimitReader+probe pattern with a countingReader that
  wraps the LimitReader. The decoder's bufio buffering meant the
  external one-byte probe could return EOF even when the decoder
  itself had pulled past the cap into its own buffer; the warning
  fired only when the read happened to land at the boundary. The
  countingReader increments on every Read (including the chunks
  the decoder buffers internally), so the post-decode `n > cap`
  check fires reliably. The security bound (cap+1 max bytes off
  the wire) is unchanged; only the *detection* of overshoot is
  now deterministic.
- Add WithResponseBodyLimit so tests can drive the over-cap path
  without serving 64 MiB. Backed by a new TestKeyVizFanoutRunPeerExceedsBodyLimit
  test that lowers the cap to 1 KiB and verifies the call completes
  promptly with a per-node status surfaced.

Updated the keyVizPeerResponseBodyLimit comment to be honest about
the worst-case sizing: 1024 rows x 4096 columns x 20 bytes/uint64
(max-encoded) is ~80 MiB, slightly over the 64 MiB cap. That is
intentional - the operator-visible failure mode is "warning logged,
matrix may be truncated", not "DoS". A future flag can expose the
cap when an extreme-traffic deployment actually trips it.
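
The countingReader the commit describes is tiny; a sketch in which the unwrapped error return is the load-bearing detail, since json.Decoder stops on err == io.EOF by pointer comparison:

```go
package main

import "io"

// countingReader tallies every byte the decoder pulls, including what
// it buffers internally, so a post-decode n > cap check detects
// overshoot deterministically.
type countingReader struct {
	r io.Reader // typically io.LimitReader(resp.Body, cap+1)
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err // deliberately unwrapped: the decoder needs the raw io.EOF
}
```
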
… test

Two minor follow-ups from Claude bot round-2 on PR #686:

- countingReader retains the //nolint:wrapcheck but with a longer
  comment explaining why: json.Decoder uses err == io.EOF (pointer
  compare, not errors.Is) to decide whether to keep pulling, so
  wrapping the underlying io.Reader error with %w would silently
  break the decoder's stop condition. CLAUDE.md generally says
  "avoid nolint, refactor instead", but here refactoring is
  impossible -- the only options are pass-through-and-suppress
  (this) or wrap-and-break-the-decoder.
- Add TestKeyVizFanoutRunPeerNearCapSucceedsWithWarning to cover
  the warning-fires path: small JSON envelope (~30 B) + 256 B of
  trailing whitespace against a 100 B cap. Decoder returns success;
  countingReader.n == 101 > 100 triggers the warning. The
  warning log itself is best-effort and not asserted from the
  test (slog has no clean injection point on the package), but the
  reachability of the success path under cap overrun is now pinned.
@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 18s


Round-5 Review

  • Read design doc
  • Verify round-4 fix (conflict predicate in §2.1 and §4.2)
  • Check for any remaining issues
  • Post findings

Round-4 follow-up: addressed ✓

| # | Item verified |
| --- | --- |
| 1 | §2.1 predicate narrowed to "two or more nodes report a non-zero value for the same cell" — follower zeros explicitly excluded |
| 2 | §4.2 adds "and do not raise the conflict flag" to the non-leader-zeros sentence; adds explicit contrast: "≥ 2 nodes report a value > 0 for the same cell" vs. "any two values differ" |
| 3 | §4.2 leadership-flip bullets updated: "both nodes report values > 0 for the same cell. The conflict predicate fires." — now matches the new predicate rather than the vague "both nodes report values" |

Both §2.1 and §4.2 now state the same predicate in complementary ways — §2.1 as a scope summary, §4.2 as the detailed rule — and both are internally consistent with the §7 test expectations:

  • Stable-leader: leader=5, followers=0 → only 1 node has value > 0 → conflict=false ✓
  • Mid-window flip: ex-leader > 0, new-leader > 0 → 2 nodes > 0 → conflict=true ✓

Remaining issues

None. This is the last outstanding item from across five rounds of review.


Open questions status

| Q | Status |
| --- | --- |
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (generated_at per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

All five rounds of review are complete and every finding has been addressed. The design is consistent end-to-end: merge rules, wire format, auth story, self-detection, cap math, timeout handling, time alignment, test scenarios, and the conflict predicate all agree. Ready to merge and open PR 1.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3035181133

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +240 to +241
- Both sides round `column_unix_ms[i]` down to the nearest
multiple of `keyvizStep_ms` before merging.


P1 Badge Round timestamps to nearest step before merge

Section 4.5 says fan-out should absorb routine sub-step jitter, but this rule floors each column_unix_ms to the previous step boundary; with a 1s step, samples for the same logical window at 10.95s and 11.05s still split into different merged columns (10s vs 11s). That recreates the column fragmentation this section is trying to eliminate and can halve apparent traffic in affected windows. Use nearest-boundary rounding (or an explicitly aligned window-start timestamp contract) so ±step/2 skew coalesces correctly.
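
Nearest-boundary rounding, as the finding proposes, in sketch form (contrast with the floor rule quoted above):

```go
package main

// roundToNearestStep coalesces ±stepMs/2 skew into one merged column;
// the floor rule splits 10.95 s and 11.05 s at a 1 s step, while this
// maps both to 11 000 ms. Assumes non-negative timestamps.
func roundToNearestStep(tsMs, stepMs int64) int64 {
	return (tsMs + stepMs/2) / stepMs * stepMs
}
```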


Comment on lines +359 to +360
nodes, M is rows per node, C is columns. At the 1024-row
budget × 3 nodes × 60 columns this is ~180k cells per


P1 Badge Merge untrimmed peer rows before top-k selection

This design treats fan-out input as up to 1024 rows per node, but /admin/api/v1/keyviz/matrix already enforces local top-k trimming, so merging those truncated sets can miss globally hot read buckets (for example, a bucket ranked just below 1024 on each node can still be top-ranked after cross-node summation). In high-cardinality workloads this causes false negatives in the cluster heatmap. The aggregator should merge full peer data (or a larger internal limit) and apply row-budget truncation only after cluster-wide merge.


## Summary

Server-side half of Phase 2-C cluster fan-out, per
`docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md` (PR #685).
With `--keyvizFanoutNodes` set, the existing
`/admin/api/v1/keyviz/matrix` handler aggregates the local matrix with
peer responses; with the flag empty (default), behaviour is unchanged.

This is **PR-1** of the implementation plan in §7 of the design. PR-2
(SPA degraded banner + conflict hatching) and PR-3 (proto extension with
`raftGroupID` + `leaderTerm` for the canonical merge) are separate.

**Depends on PR #685** (the design doc) for review context — branched
off that branch.

## Aggregator (`internal/admin/keyviz_fanout.go`)

- Parallel HTTP GETs to each peer with a per-call timeout (default 2 s,
`--keyvizFanoutTimeout` overrides).
- Forward-compatible request shape: the per-peer URL is rebuilt from the
parsed parameters, so a peer running an older or newer server cannot
trip on an unrecognised parameter.
- **Merge rules per design §4**:
  - `reads` / `read_bytes` → sum across nodes
  - `writes` / `write_bytes` → max with row-level `Conflict` flag when
    inputs disagree
- Degraded mode: a peer that times out, 5xx's, or fails to decode
contributes `FanoutNodeStatus{OK: false, Error: ...}`. The local matrix
still ships.

## Wire format additions (forward-compatible)

```jsonc
{
  "rows": [{
    "...": "...",
    "conflict": false        // NEW: row-level disagreement flag
  }],
  "fanout": {                // NEW: omitted when fan-out is disabled
    "nodes": [{"node":"...", "ok":true, "error":""}],
    "responded": 2,
    "expected": 3
  }
}
```

Old SPA against fan-out server keeps working — extra fields are
additive.

## Flag plumbing

- `--keyvizFanoutNodes` accepts comma-separated `host:port` endpoints.
- `--keyvizFanoutTimeout` sets the per-peer ceiling (default 2 s).
- Self-filter handles the bind-vs-advertise asymmetry: `0.0.0.0`/`::`
bound listener matched against `127.0.0.1`/`localhost`/`::1` entries; a
symmetric `--keyvizFanoutNodes=node1,node2,node3` stamped on every host
does not loop back over HTTP. A sketch of the matching rule follows.
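
The matching rule, in sketch form; isSelfFanoutNode, isWildcardHost, and isLoopbackHost are real names in this PR, but the decomposition below is illustrative rather than the actual code:

```go
package main

import "net"

func isWildcardHost(h string) bool {
	ip := net.ParseIP(h)
	return ip != nil && ip.IsUnspecified() // 0.0.0.0 or ::
}

func isLoopbackHost(h string) bool {
	if h == "localhost" {
		return true
	}
	ip := net.ParseIP(h)
	return ip != nil && ip.IsLoopback() // 127.0.0.1 or ::1
}

// isSelf reports whether a --keyvizFanoutNodes entry names this node's
// own admin listener: literal host match, or wildcard bind vs loopback.
func isSelf(selfHost, selfPort, host, port string) bool {
	if port != selfPort {
		return false
	}
	if host == selfHost {
		return true
	}
	return isWildcardHost(selfHost) && isLoopbackHost(host)
}
```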

## Five-lens self-review

1. **Data loss** — Read-only against the existing sampler; conservative
max-merge can under-count during a leadership flip but never
over-counts. Conflict flag tells operators when the cell is soft.
2. **Concurrency / distributed** — Per-peer calls run in parallel with a
context deadline; the aggregator itself waits for all responses
synchronously then merges. Test `TestKeyVizFanoutRunPeerTimeout`
confirms a hanging peer cannot extend the wall time.
3. **Performance** — Aggregator cost is O(N × M × C); at the 1024-row ×
60-column × 3-node ceiling that's ~180k cells per request — well under
the SPA's existing render budget. Hot path (`Observe`) is untouched.
4. **Data consistency** — Merge rules conservative under leadership
transitions. Reads exact in steady state and during transitions. The
§9.1 canonical rule is preserved as a Phase 2-C+ contract.
5. **Test coverage** — Table-driven over the four canonical scenarios in
the design + URL-builder variants + per-node order. `go test -race
-count=1 ./internal/admin/...` clean; `golangci-lint run ./...` clean.

## Test plan

- [x] `go test -race -count=1 ./internal/admin/...`
- [x] `go test -race -count=1 .` (main_admin tests with the new
`keyVizFanoutConfig{}` arg)
- [x] `go build ./...`
- [x] `golangci-lint run ./...`
- [ ] Manual: 3-node cluster with `--keyvizFanoutNodes=...` on each,
drive write traffic against one node, confirm the heatmap reflects
writes regardless of which node the SPA hits
- [ ] Manual: kill one node, confirm the response carries `responded: 2,
expected: 3` plus the failed node's error string

## What this PR does NOT do

- **SPA changes**. Wire-format additions are forwards-compatible; the
SPA renders correctly without knowing about `Fanout` or `Conflict`. UI
for those lands in PR-2.
- **`raftGroupID` / `leaderTerm` proto extension**. PR-3 of the plan.
- **Membership discovery** via `GetClusterOverview`. Static
`--keyvizFanoutNodes` is the contract for Phase 2-C; cache +
refresh-interval lands when clusters grow beyond hand-configured size.
- **Inter-node auth**. The fan-out path is currently anonymous;
intra-cluster admin traffic is assumed to ride a private network. Auth
follows in a follow-up before this hits production.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9ec6ca26c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +290 to +294
req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, http.NoBody)
if err != nil {
	return nil, pkgerrors.Wrap(err, "new request")
}
req.Header.Set("Accept", "application/json")


P1 Badge Forward admin auth when calling peer keyviz endpoints

fetchPeer constructs a brand-new GET with only an Accept header, but /admin/api/v1/keyviz/matrix is mounted behind the protected middleware chain (SessionAuth) in buildAPIMux. In normal deployments that means every peer fan-out call is unauthenticated and returns 401, so cluster fan-out degrades to local-only data with persistent peer failures instead of aggregating remote matrices.


Comment thread main_admin.go
Comment on lines +906 to +908
if isWildcardHost(selfHost) {
	return isLoopbackHost(host)
}


P2 Badge Recognize non-loopback self addresses in fan-out filtering

When --adminListen is a wildcard bind (for example 0.0.0.0:8080), this logic only treats loopback hosts as self, so a node’s own advertised hostname/IP is misclassified as a peer in --keyvizFanoutNodes. That creates a self-call on every request; today it shows up as a spurious failed node (401), and once peer auth is added it can recurse through the same fan-out handler path.


@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 4m 47s


Round-6 Review

  • Post initial comment
  • Read implementation files (keyviz_fanout.go, main_admin.go, keyviz_handler.go, server.go, middleware.go)
  • Read design doc (all 9 sections)
  • Analyze Codex P1/P2 findings from latest commit
  • Check for additional code/doc discrepancies
  • Post complete review findings

Round-5 follow-ups: addressed ✓

All five rounds of doc-level fixes are confirmed in the design doc.


Round-6 findings

Four of the five findings below are code/doc discrepancies — the design says one thing, the code does another. The first is the most serious.

1. `fetchPeer` vs `SessionAuth`: every peer call returns 401 in production (P1 — blocker)

`fetchPeer` (keyviz_fanout.go:290-295) constructs a GET with only an `Accept` header:

```go
req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, http.NoBody)
// ...
req.Header.Set("Accept", "application/json")
resp, err := f.client.Do(req)
```

The peer's endpoint is wired at server.go:316:

```go
keyvizChain := protect(keyvizHandler)
```

`protect` at server.go:277-285 wraps `SessionAuth`, which reads the `admin_session` cookie and rejects the request with 401 if it is absent. CSRF (`CSRFDoubleSubmit`) is GET-exempt (middleware.go:140) so it is not a factor, but the session check is unconditional on every method.

Net effect: every fan-out call to a peer node returns `401 unauthenticated: missing session cookie`. The aggregator records each peer as `ok=false` and the response is always local-only, regardless of cluster size. The feature does not work in production as shipped.

The design doc §3 explicitly describes the MVP as "anonymous — the aggregator issues unauthenticated GETs to peers", which is the intended behaviour. The contradiction is that no unauthenticated path exists for the endpoint. The two options that preserve the design intent are:

  • Option A — Register an additional route (e.g. `/admin/internal/keyviz/matrix`) outside the `protect` chain, bound only when fan-out is enabled, with an explicit warning in the flag docs that it must not be exposed across an untrusted boundary (a rough sketch follows at the end of this finding).
  • Option B — Strip `SessionAuth` from the keyviz endpoint when the server starts with `--keyvizFanoutNodes` and add a flag note that the admin port must be on a private interface. This is less precise but requires fewer changes.

The test stubs (`newKeyVizPeerStub`) have no auth middleware, so all fan-out tests pass — they do not cover this path.
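For concreteness, a minimal sketch of Option A under plain net/http wiring; `registerKeyVizRoutes` and the `fanoutEnabled` parameter are illustrative, while `protect`, `keyvizHandler`, and the two paths come from this finding:

```go
package admin // illustrative

import "net/http"

// registerKeyVizRoutes is hypothetical; the repository wires routes in
// buildAPIMux instead.
func registerKeyVizRoutes(mux *http.ServeMux, keyvizHandler http.Handler,
	protect func(http.Handler) http.Handler, fanoutEnabled bool) {
	// Browser-facing route stays behind SessionAuth via the protect chain.
	mux.Handle("/admin/api/v1/keyviz/matrix", protect(keyvizHandler))
	if fanoutEnabled {
		// Peer-facing route deliberately skips protect. It is bound only when
		// --keyvizFanoutNodes is set and must never be reachable across an
		// untrusted boundary.
		mux.Handle("/admin/internal/keyviz/matrix", keyvizHandler)
	}
}
```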

2. `KeyVizRow.Conflict` uses `omitempty` — contradicts design §5 invariant (P1)

keyviz_handler.go:90:

```go
Conflict bool `json:"conflict,omitempty"`
```

Design doc §5 states explicitly:

> The field is always present on every row in the response — even when fan-out is disabled (single-node mode), `conflict` defaults to `false` and is emitted on the wire. This is wire-stable: an SPA that pattern-matches `"conflict" in row` rather than checking the value will not regress when the operator toggles `--keyvizFanoutNodes`.

`omitempty` on a bool field silently drops the field when its value is `false`. That breaks any SPA code that writes `if ("conflict" in row)` instead of `if (row.conflict)`. The design doc specifically called out this failure mode and documented it as a guarantee — the code violates that guarantee.
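The pitfall is easy to reproduce; this self-contained demo (illustrative struct names, not the project's `KeyVizRow`) shows the field vanishing:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type withOmitEmpty struct {
	Conflict bool `json:"conflict,omitempty"`
}

type alwaysPresent struct {
	Conflict bool `json:"conflict"`
}

func main() {
	a, _ := json.Marshal(withOmitEmpty{})
	b, _ := json.Marshal(alwaysPresent{})
	fmt.Println(string(a)) // {}  (the field is silently dropped when false)
	fmt.Println(string(b)) // {"conflict":false}
}
```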

3. §4.5 timestamp alignment described but not implemented (P1)

Design doc §4.5 says:

> Both sides round `column_unix_ms[i]` down to the nearest multiple of `keyvizStep_ms` before merging.

`unionColumns` (keyviz_fanout.go:553-566) inserts raw millisecond values directly into the `seen` map. `mergeRowInto` (keyviz_fanout.go:490) looks up `indexByColumn[ts]` on the exact timestamp. No alignment is applied anywhere in the merge pipeline.

With a 1 s step, node A flushing at t=11.03 s and node B flushing at t=11.42 s produce `ColumnUnixMs` values of 11030 and 11420. After the union, these land in two separate columns; each carries only one node's contribution. The result is two half-populated columns instead of one merged column — the exact fragmentation §4.5's floor rule exists to prevent (both values align to 11000).
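A minimal sketch of the floor rule §4.5 prescribes, assuming alignment is applied once at union time; `alignColumnMs` is an illustrative name, not an existing helper:

```go
package admin // illustrative

// alignColumnMs floors a column timestamp to the step boundary. With a
// 1000 ms step, both 11030 and 11420 map to 11000, so unionColumns would
// produce one column key and mergeRowInto would combine both contributions.
func alignColumnMs(ts, stepMs int64) int64 {
	if stepMs <= 0 {
		return ts
	}
	return ts - ts%stepMs
}
```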

4. `FanoutNodeStatus` has no `Warnings` field — `catalog_divergence` never emitted (P1)

Design doc §5 shows the per-node wire format as:

{"node": "10.0.0.2:8080", "ok": true, "error": "", "warnings": [
  {"code": "catalog_divergence", "bucket_id": "route:42", "detail": "start/end disagree with prior node"}
]}

The struct at keyviz_fanout.go:78-82:

```go
type FanoutNodeStatus struct {
	Node  string `json:"node"`
	OK    bool   `json:"ok"`
	Error string `json:"error,omitempty"`
}
```

No `Warnings` field. Furthermore, `mergeRowInto` (keyviz_fanout.go:466-501) sets `Start`/`End` from the first-seen node and merges subsequent rows' `Values`, but never compares `Start`/`End` across nodes — so `catalog_divergence` cannot be detected or emitted even if the field existed. The design doc §4.4 says this divergence "surfaces as a warning in the per-node status payload", but neither the struct nor the merge logic implements it.
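One possible Go shape for the missing field, inferred purely from the §5 JSON above; the type name `FanoutWarning` and the choice to skip `omitempty` on the slice are assumptions:

```go
package admin // illustrative

// FanoutWarning mirrors the warning objects in the §5 example.
type FanoutWarning struct {
	Code     string `json:"code"`
	BucketID string `json:"bucket_id"`
	Detail   string `json:"detail"`
}

// FanoutNodeStatus as §5 implies it, with the Warnings slice added.
type FanoutNodeStatus struct {
	Node     string          `json:"node"`
	OK       bool            `json:"ok"`
	Error    string          `json:"error,omitempty"`
	Warnings []FanoutWarning `json:"warnings"` // no omitempty, per the finding-2 lesson
}
```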

5. Codex P1 (3035181133) — pre-truncated top-k merge for reads-sum is still unaddressed (P2)

Each peer's `/admin/api/v1/keyviz/matrix` response has already been trimmed to the per-node rows budget (1024 by default) by `applyKeyVizRowBudget` before the fan-out aggregator receives it. For reads (sum), a bucket ranked 1025 on every individual node can rank first after cross-node summation, but the aggregator never sees it — it was dropped before the response left each peer.

The previous round-5 review verdict said "ready to merge and open PR 1", but this finding was raised by Codex on the same commit that round-5 reviewed (3035181133) and was not addressed. It is a reads-specific blind spot: for writes (max), the dominant leader's bucket is also the globally dominant one, so pre-truncation is harmless. For reads (sum), globally-hot-but-per-node-rank-just-outside-budget buckets are invisible.

The design doc does not acknowledge this limitation. It should either add a note in §2.2 ("not in scope: pre-merge row filtering; the aggregator merges already-trimmed sets") so future operators know the constraint, or add a server-side `fanout_rows` parameter that requests a larger per-peer budget when fan-out is active; a sketch of the latter follows.
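If the parameter route is taken, the change could be as small as the sketch below; `fanout_rows` is only this review's suggested name, the peer handler would have to honour it, and the cluster-size multiplier is an arbitrary assumption:

```go
package admin // illustrative

import "fmt"

// peerMatrixURL over-fetches per peer so a bucket ranked just outside each
// node's top-k can still win the cross-node reads-sum merge.
func peerMatrixURL(node string, rowBudget, clusterSize int) string {
	return fmt.Sprintf("http://%s/admin/api/v1/keyviz/matrix?fanout_rows=%d",
		node, rowBudget*clusterSize)
}
```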


Open questions status

| Q | Status |
|---|---|
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (`generated_at` per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

Items 1–4 are code/doc discrepancies that need to be resolved before this PR merges:

  • Item 1 is a functional blocker: the feature silently degrades to local-only mode in every production deployment because SessionAuth rejects unauthenticated peer GETs.
  • Item 2 is a one-line fix (`conflict,omitempty` → `conflict`) but violates a documented wire-stability invariant.
  • Items 3 and 4 are implementation gaps where the design promises behaviour the code does not deliver.

Item 5 is a documentation gap that should be acknowledged in §2.2 but does not block merging.


Branch: `docs/keyviz-fanout-phase2c`

@bootjp bootjp merged commit a3d312f into main Apr 27, 2026
12 checks passed
@bootjp bootjp deleted the docs/keyviz-fanout-phase2c branch April 27, 2026 11:45
bootjp added a commit that referenced this pull request Apr 27, 2026
…se 2-C PR-2) (#687)

## Summary

Phase 2-C **PR-2**, per the design doc landed in #685 §6: wire the
fan-out `Fanout` / `conflict` shapes shipped by PR #686 into the admin
SPA so operators can see when a peer failed and which rows are soft due
to a leadership flip.

Backend is unchanged. PR #686's `KeyVizMatrix.Fanout` block is currently
emitted on every fan-out request but the SPA was discarding it.

## What this PR adds

| Piece | Behavior |
|---|---|
| **Header counter** | `cluster view (N of M nodes)` in the per-matrix metadata strip. Hidden when `fanout` is absent (single-node mode, no behavior change). |
| **FanoutBanner** | Degraded-mode card above the heatmap when `responded < expected`. Lists every failed node + error string. Hidden when the cluster is healthy. |
| **ConflictOverlay** | SVG hatch over rows whose `conflict === true`. Layered inside the scroll container so the hatch tracks horizontal scroll (same idiom as `TimeAxis`). 4 px diagonal stripes, alpha 0.45, inheriting `currentColor` so it works against both light and dark themes. |
| **RowDetail pill** | Hovering a conflicting row reveals a `conflict` pill with a hover-tooltip explaining the leadership-flip semantics. |

## Five-lens self-review

1. **Data loss** — n/a; SPA-only change reading existing fields.
2. **Concurrency / distributed** — n/a; single browser tab consuming an
existing API.
3. **Performance** — Banner: zero-cost when healthy (`null` returned).
Hatch: SVG with at most 1024 `<rect>` elements (matches `KeyVizRow`
budget); empirically negligible. Overlay only renders when at least one
row has `conflict === true`.
4. **Data consistency** — UI surfaces existing server signals; semantics
come from PR #685 design §4.2 and #686 implementation. Not relitigated
here.
5. **Test coverage** — `tsc -b --noEmit` and `vite build` clean. UI
behavior (banner visibility, hatch presence, header text) is hard to
assert in `tsc` alone; manual verification documented below.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output writes to
`internal/admin/dist`
- [ ] Manual: `make run` with `--keyvizEnabled
--keyvizFanoutNodes=127.0.0.1:8080`. Open `/keyviz`; header reads
`cluster view (1 of 1 nodes)`, no banner, no hatch.
- [ ] Manual: configure
`--keyvizFanoutNodes=127.0.0.1:8080,127.0.0.1:9999` (port 9999 unused).
Header reads `1 of 2 nodes`, the failed-node banner appears with
`connection refused` for `127.0.0.1:9999`.
- [ ] Manual: smoke a leadership flip mid-window (or fake
`conflict=true` server-side via `/admin/api/v1/keyviz/matrix?...`); the
affected row renders with diagonal hatch overlay; hovering reveals the
`conflict` pill.

## Out of scope

- **Per-cell hatching** — Phase 2-C+ once the proto extension lands and
`conflict` becomes per-cell instead of per-row.
- **Per-node `generated_at` skew indicator** — design §9 Q3, deferred.
- **Banner auto-dismiss / sound notification** — out of scope; an
operator who has the page open will see the banner. CI alert pages are a
separate system.

Closes the SPA half of Phase 2-C; the proto extension follows in PR-3.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* KeyViz now displays cluster fan-out status with node availability and
responsiveness counts.
* Added visual indicators for rows experiencing conflicts during
leadership transitions.
* Row details now include conflict status labels with informative
tooltips.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 27, 2026
…401 (#692)

## Summary

**Hotfix for the screenshot a user shared after enabling fan-out** — the
heatmap collapsed to "1 of 6 nodes responded" with every peer returning
`401 missing session cookie`.

**Root cause**: PR #686 shipped fan-out with the design assumption (#685
§3) that peers would accept anonymous calls on a private network. But
the receiving side runs the same `SessionAuth` middleware as the browser
path, so anonymous fan-out calls are rejected with 401. The "anonymous
on a private network" framing was wrong: the implementation didn't put a
bypass on the peer side, so anonymous calls never went through
end-to-end.

**Fix**: forward the inbound user's cookies on every peer call. Cluster
nodes already share `--adminSessionSigningKey` for HA, so a cookie
minted on node A is verifiable on node B — no new infrastructure needed.

## What changed

- `KeyVizFanout.Run` gains a `cookies []*http.Cookie` parameter; nil
preserves the legacy behaviour for tests / future non-browser callers.
- `KeyVizHandler` passes `r.Cookies()` so a SPA poll forwards
`admin_session` (and `admin_csrf`, harmless on a GET) transparently; a
sketch of the forwarding shape follows after this list.
- `TestKeyVizFanoutRunForwardsCookies` pins that both cookies reach the
peer verbatim.
- Design doc §3 rewritten: documents the cookie-forwarding scheme + the
operator-trust assumption that `--keyvizFanoutNodes` points at trusted
hosts only. Phase 2-C+ will replace cookie forwarding with a dedicated
inter-node token to decouple browser-session expiry from inter-node call
validity.
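As a sketch of the forwarding shape (the real change threads cookies through `KeyVizFanout.Run`; this standalone signature and the `fetchPeerWithCookies` name are assumptions for illustration):

```go
package admin // illustrative

import (
	"context"
	"net/http"
)

func fetchPeerWithCookies(ctx context.Context, client *http.Client,
	target string, cookies []*http.Cookie) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, http.NoBody)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	// nil keeps the legacy anonymous path for tests and future non-browser
	// callers. Each per-peer goroutine builds its own request, so the shared
	// slice is only ever read.
	for _, c := range cookies {
		req.AddCookie(c) // forwards admin_session (and admin_csrf, harmless on a GET)
	}
	return client.Do(req)
}
```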

## Five-lens self-review

1. **Data loss** — n/a; auth path change.
2. **Concurrency / distributed** — Cookies are passed by reference into
per-peer goroutines but the slice is read-only inside `fetchPeer` (each
goroutine `req.AddCookie`s independently). No shared mutation.
3. **Performance** — Adds N `req.AddCookie` calls per fan-out request.
Trivial.
4. **Data consistency** — Auth-only change; merge semantics unchanged.
5. **Test coverage** — New `TestKeyVizFanoutRunForwardsCookies` asserts
both cookies reach the peer with the original values. Existing fan-out
tests pass `nil` and continue to exercise the unauthenticated path the
existing code handled (peer 401s, request reports `ok=false`).

## Test plan

- [x] `go test -race -count=1 ./internal/admin/...` — clean
- [x] `golangci-lint run ./internal/admin/...` — clean
- [ ] Manual: 6-node cluster with shared
`ELASTICKV_ADMIN_SESSION_SIGNING_KEY`, `--keyvizFanoutNodes` set on
every node. Heatmap should now show "6 of 6 nodes" instead of "1 of 6".

## Trust / threat model note

Cookie forwarding is safe when (a) every peer is operator-configured and
trusted, and (b) the network is private. **Do NOT point
`--keyvizFanoutNodes` at an untrusted host** — the user's admin session
would be replayed there. The design doc §3 update is explicit about
this. Phase 2-C+ removes this footgun by switching to a dedicated
inter-node token.

Closes the production-impacting symptom from screenshot review.