
docs: KeyViz cluster fan-out proposal (Phase 2-C)#685

Merged
bootjp merged 11 commits into main from docs/keyviz-fanout-phase2c
Apr 27, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 27, 2026

Summary

Phase 2-C proposal — cluster-wide KeyViz heatmaps via fan-out.

Phase 2-A and 2-B ship a heatmap that only renders the local node's view. Per parent design §5.1, reads are recorded by the follower that serves them and writes by the leader, so a single-node SPA shows a partial picture by construction. This proposal scopes a minimum-viable fan-out that ships the operator-visible value without yet requiring the proto extension that parent design §9.1 calls for.

Key design decisions worth a careful read

  • Static --keyvizFanoutNodes flag instead of GetClusterOverview discovery for the MVP (§3); a flag-plumbing sketch follows this list. The membership cache lands when clusters grow beyond hand-configured size.
  • Conservative max-merge for writes with a conflict flag instead of the canonical (bucketID, raftGroupID, leaderTerm, windowStart) rule (§4.2). The canonical rule needs proto/JSON extensions and lands in Phase 2-C+. The MVP rule never overcounts (operators react badly to phantom hotspots) but may undercount during a leadership flip — surfaced via conflict=true.
  • Embedded internal/admin aggregator instead of a standalone cmd/elastickv-admin binary (§2.2). The standalone binary was deferred indefinitely; the embedded path is the go-forward.
  • Forward-compatible wire format (§5). Old SPA against fan-out server keeps working — the new fields are additive.
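
A minimal sketch of the flag plumbing behind the first bullet; the flag names, the comma-separated host:port shape, and the 2 s timeout default come from this proposal, while the parsing helper and everything around it are illustrative:

```go
package main

import (
	"flag"
	"strings"
	"time"
)

var (
	// Empty list (the default) keeps today's local-only behaviour.
	fanoutNodes = flag.String("keyvizFanoutNodes", "",
		"comma-separated host:port admin endpoints, self-included")
	fanoutTimeout = flag.Duration("keyvizFanoutTimeout", 2*time.Second,
		"per-peer call ceiling")
)

// parseFanoutNodes splits and trims the operator-supplied list,
// preserving order (order matters for §4.4 row-metadata selection).
func parseFanoutNodes(raw string) []string {
	var nodes []string
	for _, p := range strings.Split(raw, ",") {
		if p = strings.TrimSpace(p); p != "" {
			nodes = append(nodes, p)
		}
	}
	return nodes
}
```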

What this proposal does NOT do

Listed in §2.2 to keep the PR-1 scope reviewable: no membership discovery, no proto extension, no standalone binary, no response caching. Each gets its own follow-up if and when needed.

Implementation plan (per §7)

  • PR 1 — server-side aggregator + flag plumbing + table-driven test
  • PR 2 — SPA degraded banner + conflict hatching + header counter
  • PR 3 (Phase 2-C+) — proto/JSON extension; replace max-merge with the canonical rule; promote conflict from per-row to per-cell

Test plan

  • Design review: confirm the §4.2 conservative max-merge trade is acceptable, or push back with the alternative (e.g., always sum + always conflict-flag if multiple nodes report > 0)
  • Confirm --keyvizFanoutNodes flag shape (host:port list, self-included) is the right contract for now
  • Confirm the timeout shape in §9 open question 2 (proposed: fixed 2 s with operator override)

Documents the MVP path for cluster-wide heatmaps: static node list
via --keyvizFanoutNodes, per-node parallel GETs against the existing
/admin/api/v1/keyviz/matrix, and a deliberately conservative merge
rule (reads sum, writes max with conflict flag) that ships the
operator-visible value without yet requiring the proto extension
parent design 9.1 needs for the canonical merge.

Scopes:
- PR 1: server-side aggregator + flag plumbing (next).
- PR 2: SPA degraded banner + conflict hatching.
- PR 3 (Phase 2-C+): proto + JSON extension with raftGroupID +
  leaderTerm; replace max-merge with the canonical
  (bucketID, raftGroupID, leaderTerm, windowStart) rule.

Explicitly defers GetClusterOverview membership discovery, the
cmd/elastickv-admin standalone binary, and TLS for fan-out — each
gets its own follow-up if and when needed.
@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 14 minutes and 5 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 775cdd0c-e3fa-4d39-86e1-9133f09c7fa2

📥 Commits

Reviewing files that changed from the base of the PR and between df5ea5e and 9ec6ca2.

📒 Files selected for processing (8)
  • docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md
  • internal/admin/keyviz_fanout.go
  • internal/admin/keyviz_fanout_test.go
  • internal/admin/keyviz_handler.go
  • internal/admin/server.go
  • main.go
  • main_admin.go
  • main_admin_test.go


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5064129ec4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +157 to +158
divergence the operator should investigate; we surface it as a
warning in the per-node status payload but do not block the


P2 Badge Define a wire field for catalog-divergence warnings

Section 4.4 says Start/End mismatches should be surfaced as a warning in the per-node status payload, but the wire format in §5 only defines node, ok, and error for each fanout.nodes item. Without a dedicated warning field, implementations either cannot report this divergence at all or must overload error (which incorrectly implies node failure), so operators can miss a real consistency issue or see false degraded-state signals.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request proposes a design for KeyViz Cluster Fan-out (Phase 2-C), enabling a cluster-wide heatmap view by aggregating data from multiple nodes. The design includes merge rules for metrics, a static node configuration, and a phased implementation plan. Feedback on the design suggests detailing rolling upgrade and zero-downtime migration strategies, resolving internal inconsistencies regarding the 'conflict' flag, and clarifying inter-node authentication. Further recommendations include rounding timestamps to handle clock skew, limiting response sizes to prevent OOM, and caching DNS lookups to improve performance.

Comment on lines +1 to +10
---
status: proposed
phase: 2-C
parent_design: docs/admin_ui_key_visualizer_design.md
date: 2026-04-27
---

# KeyViz Cluster Fan-out (Phase 2-C)

## 1. Background
Contributor


high

The design for cluster fanout introduces significant operational changes to node communication. To comply with repository standards, the document must detail a strategy for rolling upgrades, including mitigation strategies like a temporary 'bridge' or 'proxy' mode to handle communication between upgraded and non-upgraded nodes.

References
  1. Design documents identifying significant operational risks, such as rolling upgrade issues, must detail mitigation strategies like bridge or proxy modes.

Comment on lines +1 to +10
---
status: proposed
phase: 2-C
parent_design: docs/admin_ui_key_visualizer_design.md
date: 2026-04-27
---

# KeyViz Cluster Fan-out (Phase 2-C)

## 1. Background
Contributor


high

As this proposal outlines a migration to a new production-grade cluster aggregation model, the plan must include specific strategies for live or zero-downtime cutovers (e.g., blue/green deployment or dual-write proxy) to ensure no service interruption occurs during the transition.

References
  1. Production-grade system migrations must detail strategies for zero-downtime cutovers to avoid service interruption.

Comment on lines +41 to +44
- **Writes**: max across nodes, with a `conflict=true` flag on
cells where the per-node values disagree (best-effort dedup;
correct under stable leadership, conservative under leadership
flip).
Contributor


medium

The definition of conflict=true is inconsistent across the document. Section 2.1 describes it as being per-cell and triggered by any disagreement, while Section 5 and Section 6 specify it as per-row and intended for cases where multiple nodes report non-zero values (e.g., during a leadership flip). Clarifying this to be per-row and based on multiple non-zero reports will ensure the implementation matches the intended MVP behavior and remains usable.

Comment on lines +90 to +94
- The fan-out caller authenticates with the same admin session
cookie the upstream request carried. The `internal/admin`
middleware mints a short-lived inter-node token signed with the
shared admin secret (already used for browser sessions) so a
compromised browser cookie cannot be replayed beyond its TTL on
Contributor


security-medium medium

The authentication mechanism for fan-out requests is unclear. The text mentions both replaying the browser's session cookie and minting a short-lived inter-node token. Replaying user cookies between nodes is generally discouraged. It would be better to clarify that the aggregator uses a dedicated inter-node token for peer requests, as suggested by the mention of the shared admin secret.

Comment on lines +164 to +169
arrays line up modulo clock skew. The aggregator pivots on
`column_unix_ms[i]` exactly: a column present in node A but
absent in node B contributes only A's values for that timestamp,
with B's missing column treated as zero. Operators with NTP
drift > Step should fix NTP — the heatmap is not designed to
hide clock skew.
Contributor


medium

Pivoting on exact column_unix_ms values is fragile in a distributed system. Even with NTP, slight offsets or different start times can cause nodes to produce slightly different timestamps for the same logical window, leading to a sparse or jagged heatmap. Consider rounding timestamps to the nearest keyvizStep boundary during aggregation to ensure columns are correctly merged.

Comment on lines +260 to +265
3. **Performance** — Aggregator cost is O(N × M × C) where N is
nodes, M is rows per node, C is columns. At the 1024-row
budget × 3 nodes × 60 columns this is ~180k cells per
request — well below the SPA's existing render budget. No
coordinated round trip per `Observe`; the existing hot path is
not touched.
Contributor


security-medium medium

While the aggregator cost is estimated to be low, the design should explicitly require limiting the size of the response bodies accepted from peer nodes (e.g., using http.MaxBytesReader). This prevents a single misconfigured or compromised node from sending an oversized payload that could cause an OOM on the aggregator node.

Comment on lines +282 to +284
addresses? Proposing **DNS lookups happen on every fan-out
request** (no caching) — small clusters that this MVP targets
do not need a resolver cache.
Contributor


medium

Performing a DNS lookup for every node on every fan-out request can introduce significant latency and load on the DNS infrastructure. Even for small clusters, transient DNS issues would directly impact the responsiveness of the Admin UI. It is recommended to resolve hostnames once at configuration load or use a short-lived cache for the resolutions.

Seven items from Codex P2 + Gemini medium / security-medium / high:

- 2.1 conflict flag definition: clarify per-row scope and signal
  that 2-C+ moves it to per-cell, removing the inconsistency
  between 2.1 and 5/6.
- 3 auth: replace the conflated cookie-replay-plus-mint-token
  paragraph with a clean two-tier story. Phase 2-C MVP is
  anonymous on a private network; Phase 2-C+ adds a short-lived
  signed inter-node token derived from the existing admin
  signing key. Adds an explicit do-not-enable-on-untrusted
  warning.
- 4.5 time alignment: bucket-align column timestamps to the
  keyvizStep boundary before merge so sub-step NTP jitter does
  not split each window into two adjacent merged columns.
- 5 wire format: add per-node warnings array with a
  catalog_divergence code so the wire form actually carries
  the catalog-mismatch signal 4.4 promises.
- 8 lens 3: document the 64 MiB io.LimitReader cap on peer
  response bodies (matches the cap implemented in PR #686).
- 9 open Q1: resolve the DNS question. Resolution rides the
  stdlib resolver / nscd cache; no aggregator-local cache.
  Operators on flaky DNS colocate a resolver.
- 1 background: add a Rollout notes paragraph addressing the
  Gemini high asks. The feature is off by default, mixed-version
  rollout is correct by construction (no schema migration, no
  state, forwards-compatible wire format), so a dual-write proxy
  / blue-green plan is not justified for this scope.
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-1 follow-ups (1cd6ed5):

Codex P2 — added per-node warnings array with catalog_divergence code in §5; gives the §4.4 catalog-mismatch signal an actual wire field.
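
For reference, the new field slots into the per-node status entry roughly as follows. The JSON keys node / ok / error and the catalog_divergence code are fixed by §5; the Warnings addition is what round 1 introduced, and the Go struct shape here is a sketch rather than the PR's actual declaration:

```go
// FanoutNodeStatus mirrors one entry of fanout.nodes on the wire.
type FanoutNodeStatus struct {
	Node     string   `json:"node"`
	OK       bool     `json:"ok"`
	Error    string   `json:"error,omitempty"`
	Warnings []string `json:"warnings,omitempty"` // e.g. "catalog_divergence"
}
```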

Gemini medium / consistency — conflict flag scope now stated up front in §2.1 as row-level (with the per-cell upgrade explicitly tied to Phase 2-C+); removes the §2.1 vs §5/§6 mismatch.

Gemini security-medium / auth — rewrote §3 with a clean two-tier story:

  • Phase 2-C MVP: anonymous fan-out on a private network with an explicit "do NOT enable across an untrusted network" warning.
  • Phase 2-C+: signed inter-node token derived from the existing admin session-signing key, NOT a replay of the browser cookie. The earlier draft conflated the two; this paragraph supersedes it.

Gemini medium / column drift — §4.5 now bucket-aligns column timestamps to keyvizStep boundaries before merging, so sub-step NTP jitter does not split each window into two adjacent merged columns.
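
The alignment rule reduces to a one-line floor; a sketch, with keyvizStep_ms derived as §4.5 defines it and the function name illustrative:

```go
package main

import "time"

// alignToStep floors a column timestamp to its keyvizStep window so
// sub-step jitter from two nodes lands in the same merged column.
func alignToStep(tsMs int64, step time.Duration) int64 {
	stepMs := int64(step / time.Millisecond) // keyvizStep_ms per §4.5
	return tsMs - tsMs%stepMs
}
```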

Gemini security-medium / body cap — §8 lens 3 documents the 64 MiB io.LimitReader cap (matches what PR #686 implements).

Gemini medium / DNS — §9 Q1 resolved: resolution rides the stdlib resolver and nscd cache; no aggregator-local cache. Per-call deadline already protects against transient DNS hangs. Operators on flaky DNS colocate a caching resolver — same answer for every short-lived HTTP client, not unique to this feature.

Gemini high / rollout strategy — pushed back in §1 "Rollout notes". This feature:

  • Is off by default (--keyvizFanoutNodes empty)
  • Lives entirely on the admin path (no data-plane impact)
  • Stores nothing on disk (no schema migration)
  • Has a forwards-compatible wire format (old SPA + new server, and old node + new aggregator, both work)

A dual-write proxy / blue-green isn't a fit for this scope; mixed-version rollout is correct by construction. The worst-case partial deploy surfaces as a per-node ok=false in the response, not data loss.

Open to pushback on any of these calls if the trade-offs aren't right.

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 3m 25s


Round-2 Review

Summary: All round-1 issues are resolved. There are a few targeted follow-up items, mostly around the body-cap math, a 5s/2s timeout discrepancy, and one behavioral precision issue with io.LimitReader.


Round-1 follow-ups: addressed ✓

| Issue | Resolution |
| --- | --- |
| Codex P2 — missing warnings wire field | §5 adds per-node warnings array with catalog_divergence code ✓ |
| Gemini — conflict scope inconsistency | §2.1 now states row-level up front, per-cell tied to Phase 2-C+ ✓ |
| Gemini — auth conflation (cookie vs. inter-node token) | §3 two-tier story is clear; Phase 2-C MVP anonymous + warning, Phase 2-C+ signed token from session-signing key ✓ |
| Gemini — column timestamp drift | §4.5 bucket-aligns to keyvizStep boundary before pivot ✓ |
| Gemini — no body cap | §8 lens 3 documents 64 MiB io.LimitReader cap ✓ |
| Gemini — DNS resolution strategy | §9 Q1 resolved: rides stdlib/nscd, per-call deadline protects against hangs ✓ |
| Gemini — rolling upgrade / zero-downtime | §1 "Rollout notes" justifies why mixed-version rollout is correct by construction ✓ |

Round-2 findings

1. §8 lens 3 — body-cap arithmetic is internally inconsistent (medium)

The current prose:

1024 rows × 4096 columns × ~32 B per uint64 ≈ 128 MiB raw, ~32 MiB JSON-encoded

The math doesn't add up in two ways:

  • "32 B per uint64" × 1024 × 4096 = 128 MiB — but then it claims the JSON-encoded result is only 32 MiB, implying 4× compression. JSON doesn't compress; a uint64 serialized as a decimal string is smaller than 8 bytes raw for small values (e.g. "5," is 2 bytes vs 8 bytes binary), not larger.
  • The actual raw binary size is 1024 × 4096 × 8 B = 32 MiB; the JSON representation of the same data would be roughly similar (small traffic counters encode shorter than 8 chars), not 4× larger.

The cap itself (64 MiB) is likely fine for typical deployments, but the prose math should be corrected so a future reader can trust it. Suggested rewording:

1024 rows × 4096 columns × 8 B/uint64 ≈ 32 MiB raw binary; JSON-encoded text is similar in size for small counters. The 64 MiB cap provides ≥2× headroom.

2. §8 lens 3 — 4096 is not MaxHistoryColumns (minor, same section)

The design uses 4096 columns as the example ceiling, but the actual upper bound in the sampler is MaxHistoryColumns = 100_000 (keyviz/sampler.go:92). At 100k columns × 1024 rows × 8 B, a single peer response is ~800 MiB raw — well above the 64 MiB cap. The cap will fire for any request spanning roughly 13,000+ columns at default config (or fewer with sub-second steps), returning a partial/error result.

The document should either:

  • Reference MaxHistoryColumns and acknowledge the cap fires at extreme column counts (treating this as the intended operator contract), or
  • Add a server-side column cap on the fan-out path (analogous to keyVizRowBudgetCap) so a broad time-range query doesn't surprise the aggregator.

Neither path is wrong — but leaving it implicit means the first operator who queries a node with a large ring buffer will get a confusing ok=false without knowing why.

3. §8 lens 2 vs. §3 — timeout value mismatch (minor)

§8 lens 2 says "5-second ceiling per node" but §3 says "Per-call timeout: 2 s default, override via --keyvizFanoutTimeout". These should agree. If the 2 s default is correct (matching §9 Q2's recommendation), §8 should say "per-call timeout (default 2 s, configurable via --keyvizFanoutTimeout)" rather than "5-second ceiling".

4. §9 Q2 still written as open but §3 already resolved it (minor)

§3 already commits to "2 s default, override via --keyvizFanoutTimeout". §9 Q2 still reads as a proposal ("probably the right shape — proposing this"). Mark it resolved the same way Q1 was, e.g.:

Resolved: Fixed 2 s default, operator override via --keyvizFanoutTimeout. See §3.

5. §4.4 — "first non-empty, deterministic by node order" needs precision (minor)

Nodes are fanned out in parallel, so responses arrive in non-deterministic wall-clock order. The "deterministic by node order in --keyvizFanoutNodes" guarantee requires the aggregator to collect all results first and then pick by config-list position — not by arrival order. If the implementation picks on first arrival it's non-deterministic, which means Start/End could change between polls for the same bucket. Worth making the sequential-selection invariant explicit:

After all per-node results are collected, Start/End/... are taken from the lowest-indexed node in --keyvizFanoutNodes whose response is non-empty for the row.
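
In code, the invariant is a post-collection scan in config-list order. A sketch with illustrative names:

```go
package main

type nodeResult struct {
	hasRow bool
	// per-row metadata (Start/End/...) would live here
}

// pickRowSource runs only after every per-node result is collected;
// results is indexed by position in --keyvizFanoutNodes, so the pick
// is deterministic regardless of wall-clock arrival order.
func pickRowSource(results []*nodeResult) *nodeResult {
	for _, r := range results {
		if r != nil && r.hasRow {
			return r
		}
	}
	return nil
}
```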

6. §7 / §8 test scenario count mismatch (minor)

§8 references "the four scenarios in §7 PR-1" but §7 lists only three:

  1. stable-leader (max wins, no conflict)
  2. mid-window flip (max wins, conflict=true)
  3. partial failure (one node times out)

Either a fourth scenario was dropped from §7 (e.g., catalog_divergence warning), or the count in §8 should be corrected to three. If catalog_divergence is the missing fourth, add it to the §7 test list.

7. §8 lens 3 — io.LimitReader over-cap behavior description is imprecise (minor)

The document says:

Over-cap responses log a warning rather than silently truncating.

With io.LimitReader, reads stop exactly at the limit — there is no automatic warning. What actually happens is: the JSON decoder hits EOF mid-stream, returns a parse error, and the aggregator treats that peer as ok=false (error: "unexpected EOF" or similar). This is the right behavior but should be described accurately, both to set operator expectations and to ensure the implementation matches:

Responses exceeding the cap cause a JSON-decode error; the peer is treated as ok=false with the decode error in its error field, and a WARN-level log is emitted. Partial data is not used.
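
At the call site the contract is small. A sketch assuming the 64 MiB cap from §8 lens 3; the helper name and error mapping are illustrative:

```go
package main

import (
	"encoding/json"
	"io"
)

const peerBodyCap = 64 << 20 // 64 MiB, §8 lens 3

// decodePeer reads at most peerBodyCap bytes. An over-cap response
// starves the decoder mid-stream, so Decode returns an error (e.g.
// unexpected EOF); the caller records that peer as ok=false with the
// error string and emits a WARN log. Partial data is never used.
func decodePeer(body io.Reader, out any) error {
	return json.NewDecoder(io.LimitReader(body, peerBodyCap)).Decode(out)
}
```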


Open questions status

  • Q1: Resolved ✓
  • Q2: Effectively resolved (§3), but not marked resolved in §9 — see item 4 above.
  • Q3 (per-node generated_at in response): Still open, deferred to PR 2 — this is fine.

Verdict

The design is solid and ready to move to implementation on PR 1. The round-1 concerns are well addressed. Items 1–7 above are documentation-level fixes (no architectural changes needed). Recommend updating the doc to fix the arithmetic, resolve Q2, and tighten the io.LimitReader description before merging.

Seven follow-ups from Claude bot's round-2 review on 1cd6ed5, all
documentation-level (no architectural change):

- 4.4 row identity: state explicitly that selection is by lowest-
  indexed node in --keyvizFanoutNodes whose response is non-empty
  for the row, AFTER all per-node responses have been collected.
  Selection by config-list position rather than wall-clock arrival
  is the load-bearing property; without it Start/End would flap
  between polls.
- 8 lens 2: replace "5-second ceiling per node" with "configured
  per-call timeout (default 2 s, override via --keyvizFanoutTimeout
  per 3)". Aligns with 3 and 9 Q2.
- 8 lens 3: fix the body-cap arithmetic. The earlier "32 B per
  uint64 ≈ 128 MiB raw, ~32 MiB JSON-encoded" was internally
  inconsistent (JSON does not compress). Restate as 1024 × 4096 ×
  8 B/uint64 = ~32 MiB raw, JSON similar size, 64 MiB cap = 2x
  headroom. Also reference MaxHistoryColumns (100,000) so the
  reader knows the cap fires at extreme column counts.
- 8 lens 3: tighten the over-cap behavioral description. Writing
  "log a warning rather than silently truncating" implied an
  automatic detection step; what actually happens is the
  json.Decoder hits EOF mid-stream (LimitReader-induced), the
  aggregator records ok=false with the decode error, and emits a
  WARN-level log. No partial data is accepted.
- 8 lens 5: list the three actual scenarios in 7 PR-1 plus the
  url-builder / per-node-order / over-cap tests rather than
  claiming "four scenarios" with no fourth scenario in 7.
- 9 Q2: mark Resolved with the same shape as Q1.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-2 follow-ups (d012f4c) addressing all 7 doc-level items:

  1. §8 lens 3 / cap math — replaced 32 B/uint64 ≈ 128 MiB raw, ~32 MiB JSON (inconsistent: JSON doesn't compress) with 1024 × 4096 × 8 B/uint64 ≈ 32 MiB raw; 64 MiB cap = 2× headroom.
  2. §8 lens 3 / MaxHistoryColumns — explicit reference to MaxHistoryColumns = 100 000 and the ~800 MiB raw upper bound that trips the cap on extreme ring queries.
  3. §8 lens 2 / timeout — replaced 5-second ceiling with configured per-call timeout (default 2 s, override via --keyvizFanoutTimeout per §3).
  4. §9 Q2 / mark resolved — restated as Resolved with the same shape as Q1, citing §3.
  5. §4.4 / first-non-empty precision — explicit "after all per-node responses have been collected" + selection by lowest-indexed-in-flag rather than wall-clock arrival. Notes why (peers respond out-of-order).
  6. §8 lens 5 / scenario count — was "four scenarios" with no fourth listed; now lists the three §7 PR-1 scenarios plus the url-builder / per-node-order / over-cap tests.
  7. §8 lens 3 / over-cap behavior — clarified that io.LimitReader causes a JSON decode error (EOF mid-stream), recorded as ok=false with the error in that node's status entry plus a WARN-level log. No partial data accepted.

Q3 (per-node generated_at) intentionally still open — deferred to PR 2 (SPA UI).

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 44s


Round-3 Review

  • Read design doc (docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md)
  • Verify all 7 round-2 fixes
  • Check for any remaining issues
  • Post review findings

Round-2 follow-ups: all addressed ✓

| # | Item verified |
| --- | --- |
| 1 | §8 cap math — 8 B/uint64 × 1024 × 4096 ≈ 32 MiB raw; 64 MiB = 2× headroom |
| 2 | §8 MaxHistoryColumns = 100 000 + ~800 MiB upper-bound call-out |
| 3 | §8 lens 2 timeout — "5-second ceiling" → "configured per-call timeout (default 2 s)" |
| 4 | §9 Q2 marked Resolved with citation to §3 |
| 5 | §4.4 explicit post-collection selection + lowest-indexed-in-flag rule with rationale |
| 6 | §8 lens 5 scenario count — lists three §7 scenarios plus url-builder / per-node-order / over-cap |
| 7 | §8 lens 3 io.LimitReader over-cap → JSON-decode error → ok=false + WARN log, no partial data |

Round-3 findings

Three genuinely new items (all doc-level) plus one confirmation.

1. §3 — self-detection logic is under-specified (minor)

The doc says "internal/admin resolves --adminAddress and skips the network round-trip for the local entry." It does not say how self is detected. Two edge cases are worth documenting:

  • If the operator writes 127.0.0.1:8080 in --keyvizFanoutNodes but passes 0.0.0.0:8080 as --adminAddress, a naive string-equality check misses the match and the node fans out to itself.
  • If the admin listener is behind a reverse proxy or DNS alias, the address in the flag differs from --adminAddress and the self-skip never fires, resulting in a loopback round-trip that is harmless but wasteful.

Suggested addition to §3:

Self-detection uses string-equality against the normalized --adminAddress value. Operators who listen on 0.0.0.0 should include the fully-qualified address in --keyvizFanoutNodes (e.g., 10.0.0.1:8080, not 0.0.0.0:8080) so the skip fires. A loopback call is harmless — it degrades to an extra HTTP round-trip — but wastes a goroutine per fan-out request.

2. §5 — conflict field behavior for local-only (fan-out disabled) mode is unstated (minor)

conflict is a row-level field added to every row in the response. When --keyvizFanoutNodes is empty (local-only mode), there is no merge and no possible conflict, yet the field would still appear in the JSON (as false). This is fine from a wire-compatibility standpoint, but the document should state it explicitly so the implementation doesn't accidentally omit the field in fan-out mode and include it in local mode (or vice versa):

conflict is always present in the row object and defaults to false. It is only ever set to true by the fan-out aggregator — in local-only mode it is always false.

This also guards against a future SPA that pattern-matches "conflict" in row rather than checking its value.

3. §4.5 — keyvizStep_ms used but never defined (minor)

The section uses keyvizStep_ms as a variable name in the "Specifics" bullets, but the flag is --keyvizStep and its unit is not stated in §4.5 itself. One sentence is enough:

In the formulas below, keyvizStep_ms denotes the step duration in milliseconds (the value of --keyvizStep converted from its native unit).

4. §8 lens 3 — "similar size for typical small values" is defensible but could be tighter (nit)

The existing prose: "JSON encodes uint64 traffic counters at a similar size for typical small values." This is plausible (counters in the 0–9999 range encode shorter in decimal than 8 bytes binary) but "similar" is doing a lot of work. A precise version avoids any future argument:

For sparse heatmaps where most counters are ≤ 4 digits, JSON decimal encoding is smaller than raw binary; the 64 MiB cap provides comfortable headroom for the 32 MiB raw-binary ceiling at 1024 × 4096.


Open questions status

| Q | Status |
| --- | --- |
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (generated_at per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

All round-2 items are cleanly addressed. The three round-3 items are genuinely minor (one implementation-hint for self-detection, one wire-format behavior clarification, one terminology nit). None of them are blockers. The design is ready to move to PR-1 implementation. Recommend landing the doc on this commit and opening the implementation PR.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d012f4c219

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +63 to +65
- **Writes**: max across nodes, with a row-level `conflict=true`
flag raised when *any* per-node values disagreed for *any* cell
in the row (best-effort dedup; correct under stable leadership,


P1 Badge Define conflict to ignore leader-vs-follower zeros

The spec currently says to set row-level conflict=true when any per-node values disagree for any cell, but in steady state writes naturally disagree (leader>0, followers 0), so this rule would flag nearly every active write row as conflicted. That contradicts the later stable-leader expectation (no conflict) and would make the UI hatch normal traffic as if it were a leadership anomaly. Please narrow the conflict condition to true disagreement among competing non-zero writers (or an equivalent predicate) so expected follower zeros are not treated as conflicts.


Three minor doc items + one nit from Claude bot round-3:

- 3 self-detection: document the matching shape (string equality
  against --adminAddress, plus wildcard-bind / loopback handling
  for the 0.0.0.0:8080 vs 127.0.0.1:8080 asymmetry). Reverse-proxy
  / DNS-aliased entries fall through to a harmless loopback call.
- 5 conflict field: state explicitly that the field is always
  present, defaults to false in local-only mode. Wire-stable so
  pattern-match-on-presence SPAs do not regress when fan-out flips.
- 4.5 keyvizStep_ms: define the variable in-section so the formula
  is self-contained.
- 8 lens 3 sizing prose nit: replace "similar size for typical
  small values" with the precise statement (decimal encoding is
  smaller than 8-byte binary for ~4-digit counters; cap becomes a
  real ceiling only for traffic-saturated heatmaps with ~20-digit
  counters).
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-3 follow-ups (0b5bfa0):

  1. §3 self-detection — documented the matching shape: exact host:port match vs --adminAddress, plus wildcard-bind/loopback handling (0.0.0.0:8080 listener with 127.0.0.1:8080 in the flag still self-skips). Reverse-proxy / DNS-aliased entries fall through to a harmless loopback call. The behavior matches isSelfFanoutNode in PR #686 (feat(admin): KeyViz cluster fan-out aggregator, Phase 2-C PR-1).
  2. §5 conflict field — explicit "always present, defaults false in local-only mode" so an SPA that pattern-matches "conflict" in row rather than the value doesn't regress when --keyvizFanoutNodes flips.
  3. §4.5 keyvizStep_ms — defined in-section as time.Duration(--keyvizStep) / time.Millisecond so the formula is self-contained.
  4. §8 lens 3 sizing prose — replaced "similar size for typical small values" with the precise statement: decimal encoding is smaller than 8-byte binary for ~4-digit counters; cap only becomes a real ceiling for traffic-saturated heatmaps with ~20-digit counters.

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 53s


Round-4 Review

  • Read design doc
  • Verify round-3 fixes (4 items)
  • Check Codex P1 finding (conflict semantics in steady state)
  • Check for any remaining issues
  • Post findings

Round-3 follow-ups: all addressed ✓

| # | Item verified |
| --- | --- |
| 1 | §3 self-detection: exact host:port + wildcard-bind + loopback; reverse-proxy falls through to harmless extra call |
| 2 | §5 conflict always present, defaults false in local-only mode, only fan-out aggregator ever sets it true |
| 3 | §4.5 keyvizStep_ms defined in-section as time.Duration(--keyvizStep) / time.Millisecond |
| 4 | §8 lens 3: decimal encoding is smaller than 8-byte binary for ≤4-digit counters; cap becomes real ceiling only for ~20-digit traffic-saturated counters |

Round-4 finding

§2.1 vs §4.2 — conflict predicate is still broader than intent (medium, Codex P1 carry-over)

This is the Codex P1 issue raised on d012f4c and it remains in 0b5bfa03.

§2.1 (lines 63–65):

max across nodes, with a row-level conflict=true flag raised when any per-node values disagreed for any cell in the row

§4.2 (lines 167–172):

Under stable leadership, exactly one node (the current leader for the route's group) records writes for any given window; other nodes report 0. max returns the leader's count; non-leader zeros do not perturb it.

The §2.1 wording says conflict fires whenever "any values disagreed." Under stable leadership, the leader reports e.g. writes=5 and every follower reports writes=0 — these disagree. So the §2.1 predicate triggers conflict=true on every active write row in steady state, not just during leadership flips. That directly contradicts §4.2's intent and the §7 test expectation ("stable-leader: max wins, no conflict").

The fix is a one-line clarification in §2.1 to narrow the predicate to multiple non-zero reporters:

-  flag raised when *any* per-node values disagreed for *any* cell
-  in the row (best-effort dedup; correct under stable leadership,
-  conservative under leadership flip).
+  flag raised when *multiple nodes report a non-zero value for
+  any cell in the row* (i.e., at least two nodes have counts > 0
+  for the same cell; follower zeros do not count as disagreement).
+  Correct under stable leadership (leader is the sole non-zero
+  reporter), conservative under a leadership flip.

The same wording needs a matching clarification in the §4.2 max rule: "conflict fires when the count of nodes with value > 0 for a given cell is ≥ 2 after aggregation" — making the "non-leader zeros do not perturb it" sentence also apply to the conflict flag, not just to the max value.

This is the only outstanding issue. All other aspects of the design are clean and ready for PR-1 implementation.


Open questions status

| Q | Status |
| --- | --- |
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (generated_at per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

Three full rounds of careful review, and we're down to one issue: tighten the §2.1 conflict predicate wording to match §4.2's intent (follower zeros are not disagreements). That's a two-sentence doc fix. Everything else — merge rules, wire format, auth story, self-detection, cap math, timeout handling, test scenarios — is solid. Recommend fixing §2.1 and landing.

The 2.1 wording said conflict fires "when any per-node values
disagreed for any cell," which under stable leadership would
include the leader-says-5 vs follower-says-0 split (the
steady-state shape) and contradicted both 4.2 and the 7 stable-
leader test expectation ("max wins, no conflict").

Narrow the predicate to its actual shape: conflict fires only
when two or more nodes report a non-zero value for the same cell.
Follower zeros against a leader's non-zero value do not count.

This matches the implementation in PR #686 (maxMerge ignores zero
on either side and returns conflict=false).

Mirrored the clarification in 4.2 so the conflict-predicate rule
is explicit in both the scope summary (2.1) and the merge-rule
detail (4.2).
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-4 follow-up (3035181): tightened the conflict predicate in §2.1 and mirrored the clarification in §4.2.

The predicate is now stated as "≥ 2 nodes report a non-zero value for the same cell", not "any disagreement". Follower zeros against a leader's non-zero value do not count — that's the steady-state shape, not a conflict. This matches the implementation in PR #686 (maxMerge ignores zero on either side and returns conflict=false) and the §7 "stable-leader: max wins, no conflict" test expectation.

Both §2.1 (scope summary) and §4.2 (merge-rule detail) now state the rule explicitly so future readers can't reach the looser interpretation.
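
Pairwise, the narrowed predicate folds down to a few lines. A sketch consistent with the maxMerge behaviour described above, not the PR #686 code itself:

```go
package main

// maxMergeCell merges one write cell from two reporters. Conflict
// fires only when both sides are non-zero: a follower's zero against
// the leader's count is the steady-state shape, not a disagreement.
func maxMergeCell(a, b uint64) (merged uint64, conflict bool) {
	merged = a
	if b > merged {
		merged = b
	}
	return merged, a > 0 && b > 0
}
```

Folding this across N nodes preserves the rule: the running max is non-zero exactly when some earlier node reported non-zero, so the flag first fires on the second non-zero reporter.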

bootjp added 5 commits April 27, 2026 18:27
Implements the server-side half of the fan-out path proposed in
docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md. When the
operator passes --keyvizFanoutNodes, the existing
/admin/api/v1/keyviz/matrix handler aggregates the local matrix
with peer responses; with the flag empty (default), behaviour is
unchanged.

Aggregator (internal/admin/keyviz_fanout.go):

- Parallel HTTP GETs to each peer with a per-call timeout (default
  2 s, --keyvizFanoutTimeout overrides).
- Forward-compatible request shape: the per-peer URL is rebuilt
  from the parsed parameters so a peer running an older or newer
  server cannot trip on a parameter we never intended to forward.
- Merge rules per design 4: reads / read_bytes sum across nodes
  (distinct local serves are independent counts); writes /
  write_bytes take the max with a row-level Conflict flag when
  inputs disagree (conservative dedup that never overcounts during
  a leadership flip; the canonical (raftGroupID, leaderTerm) rule
  lands in Phase 2-C+).
- Degraded mode: a peer that times out, errors, or returns non-OK
  contributes FanoutNodeStatus{OK: false, Error: ...}. The local
  matrix still ships and Responded reflects partial success.

Wire format additions (forward-compatible):

- KeyVizMatrix.Fanout *FanoutResult — Nodes / Responded /
  Expected; omitted when fan-out is disabled.
- KeyVizRow.Conflict bool — true when the max-merge collapsed
  disagreeing values for any cell in the row.

Flag plumbing (main.go / main_admin.go):

- --keyvizFanoutNodes accepts comma-separated host:port endpoints.
- --keyvizFanoutTimeout sets the per-peer ceiling.
- Self-filter: a peer entry naming this node's own admin listener
  (literal match, or 0.0.0.0/:: bind matched against a
  127.0.0.1/localhost/::1 entry) is skipped so a symmetric
  --keyvizFanoutNodes=node1,node2,node3 stamped onto every host
  does not loop back over HTTP.

Tests (internal/admin/keyviz_fanout_test.go):

- mergeKeyVizMatrices table covering sum, stable-leader max,
  leadership-flip max with conflict, union-of-columns, and
  distinct-row order preservation.
- KeyVizFanout.Run end-to-end against httptest peers: single peer
  ok, peer 5xx surfaces in degraded mode, peer hang respected by
  the per-call timeout.
- buildKeyVizPeerURL table covering scheme/no-scheme/with-bounds
  variants.
- Peer order is operator-supplied with self always first.

SPA changes follow in PR-2 (degraded banner + conflict hatching).
The proto / JSON extension with raftGroupID + leaderTerm follows in
PR-3 (Phase 2-C+).
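
In outline, the aggregator this commit describes runs one goroutine per peer under a per-call deadline. A sketch: the endpoint path, the timeout default, and the degraded-mode behaviour come from the commit message; the function shape and the trimmed status struct are illustrative:

```go
package main

import (
	"context"
	"net/http"
	"sync"
	"time"
)

type FanoutNodeStatus struct {
	Node  string `json:"node"`
	OK    bool   `json:"ok"`
	Error string `json:"error,omitempty"`
}

// fanOut queries every peer in parallel; each call is bounded by the
// per-peer timeout (default 2 s, --keyvizFanoutTimeout). Failures
// surface as ok=false; the local matrix still ships.
func fanOut(ctx context.Context, client *http.Client, peers []string, timeout time.Duration) []FanoutNodeStatus {
	statuses := make([]FanoutNodeStatus, len(peers))
	var wg sync.WaitGroup
	for i, peer := range peers {
		wg.Add(1)
		go func(i int, peer string) {
			defer wg.Done()
			callCtx, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()
			req, err := http.NewRequestWithContext(callCtx, http.MethodGet,
				"http://"+peer+"/admin/api/v1/keyviz/matrix", http.NoBody)
			if err != nil {
				statuses[i] = FanoutNodeStatus{Node: peer, Error: err.Error()}
				return
			}
			resp, err := client.Do(req)
			if err != nil {
				statuses[i] = FanoutNodeStatus{Node: peer, Error: err.Error()}
				return
			}
			defer resp.Body.Close()
			// decode + merge elided; non-200 or decode failure → ok=false
			statuses[i] = FanoutNodeStatus{Node: peer, OK: resp.StatusCode == http.StatusOK}
		}(i, peer)
	}
	wg.Wait() // everything collected before merging, per design §4.4
	return statuses
}
```
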
Two items from Gemini round-1 on PR #686:

- security-high: io.LimitReader the JSON decode at 64 MiB. A
  misbehaving peer that streams gigabytes back at us would otherwise
  pin a goroutine on json.Decode and balloon memory. The cap is
  generous enough for the design's worst-case payload (1024 rows x
  4096 columns x ~32 bytes per uint64 ~= 128 MiB raw, ~32 MiB
  JSON-encoded) but still bounded.
- medium: log resp.Body.Close errors at debug level via the existing
  KeyVizFanout.logger field (which was wired but unused).
  peerResponseExceededLimit reuses the +1 byte from the LimitReader
  to detect overshoot without parsing the streamed body.

Adds TestKeyVizFanoutRunPeerOverlargeBody to pin that an in-cap
response decodes promptly and reports OK; the over-cap case is
covered by manual inspection of the warning log path.
Two items from Claude bot round-2 on PR #686:

- Replace the LimitReader+probe pattern with a countingReader that
  wraps the LimitReader. The decoder's bufio buffering meant the
  external one-byte probe could return EOF even when the decoder
  itself had pulled past the cap into its own buffer; the warning
  fired only when the read happened to land at the boundary. The
  countingReader increments on every Read (including the chunks
  the decoder buffers internally), so the post-decode `n > cap`
  check fires reliably. The security bound (cap+1 max bytes off
  the wire) is unchanged; only the *detection* of overshoot is
  now deterministic.
- Add WithResponseBodyLimit so tests can drive the over-cap path
  without serving 64 MiB. Backed by a new TestKeyVizFanoutRunPeerExceedsBodyLimit
  test that lowers the cap to 1 KiB and verifies the call completes
  promptly with a per-node status surfaced.

Updated the keyVizPeerResponseBodyLimit comment to be honest about
the worst-case sizing: 1024 rows x 4096 columns x 20 bytes/uint64
(max-encoded) is ~80 MiB, slightly over the 64 MiB cap. That is
intentional - the operator-visible failure mode is "warning logged,
matrix may be truncated", not "DoS". A future flag can expose the
cap when an extreme-traffic deployment actually trips it.
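
The countingReader the commit describes is tiny; a sketch in which the unwrapped error return is the load-bearing detail, since json.Decoder stops on err == io.EOF by pointer comparison:

```go
package main

import "io"

// countingReader tallies every byte the decoder pulls, including what
// it buffers internally, so a post-decode n > cap check detects
// overshoot deterministically.
type countingReader struct {
	r io.Reader // typically io.LimitReader(resp.Body, cap+1)
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err // deliberately unwrapped: the decoder needs the raw io.EOF
}
```
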
… test

Two minor follow-ups from Claude bot round-2 on PR #686:

- countingReader retains the //nolint:wrapcheck but with a longer
  comment explaining why: json.Decoder uses err == io.EOF (pointer
  compare, not errors.Is) to decide whether to keep pulling, so
  wrapping the underlying io.Reader error with %w would silently
  break the decoder's stop condition. CLAUDE.md generally says
  "avoid nolint, refactor instead", but here refactoring is
  impossible -- the only options are pass-through-and-suppress
  (this) or wrap-and-break-the-decoder.
- Add TestKeyVizFanoutRunPeerNearCapSucceedsWithWarning to cover
  the warning-fires path: small JSON envelope (~30 B) + 256 B of
  trailing whitespace against a 100 B cap. Decoder returns success;
  countingReader.n == 101 > 100 triggers the warning. The
  warning log itself is best-effort and not asserted from the
  test (slog has no clean injection point on the package), but the
  reachability of the success path under cap overrun is now pinned.
@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 1m 18s


Round-5 Review

  • Read design doc
  • Verify round-4 fix (conflict predicate in §2.1 and §4.2)
  • Check for any remaining issues
  • Post findings

Round-4 follow-up: addressed ✓

| # | Item verified |
| --- | --- |
| 1 | §2.1 predicate narrowed to "two or more nodes report a non-zero value for the same cell" — follower zeros explicitly excluded |
| 2 | §4.2 adds "and do not raise the conflict flag" to the non-leader-zeros sentence; adds explicit contrast: "≥ 2 nodes report a value > 0 for the same cell" vs. "any two values differ" |
| 3 | §4.2 leadership-flip bullets updated: "both nodes report values > 0 for the same cell. The conflict predicate fires." — now matches the new predicate rather than the vague "both nodes report values" |

Both §2.1 and §4.2 now state the same predicate in complementary ways — §2.1 as a scope summary, §4.2 as the detailed rule — and both are internally consistent with the §7 test expectations:

  • Stable-leader: leader=5, followers=0 → only 1 node has value > 0 → conflict=false ✓
  • Mid-window flip: ex-leader > 0, new-leader > 0 → 2 nodes > 0 → conflict=true ✓

Remaining issues

None. This is the last outstanding item from across five rounds of review.


Open questions status

| Q | Status |
| --- | --- |
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (generated_at per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

All five rounds of review are complete and every finding has been addressed. The design is consistent end-to-end: merge rules, wire format, auth story, self-detection, cap math, timeout handling, time alignment, test scenarios, and the conflict predicate all agree. Ready to merge and open PR 1.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3035181133

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +240 to +241
- Both sides round `column_unix_ms[i]` down to the nearest
multiple of `keyvizStep_ms` before merging.


P1 Badge Round timestamps to nearest step before merge

Section 4.5 says fan-out should absorb routine sub-step jitter, but this rule floors each column_unix_ms to the previous step boundary; with a 1s step, samples for the same logical window at 10.95s and 11.05s still split into different merged columns (10s vs 11s). That recreates the column fragmentation this section is trying to eliminate and can halve apparent traffic in affected windows. Use nearest-boundary rounding (or an explicitly aligned window-start timestamp contract) so ±step/2 skew coalesces correctly.
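
Nearest-boundary rounding, as the finding proposes, in sketch form (contrast with the floor rule quoted above):

```go
package main

// roundToNearestStep coalesces ±stepMs/2 skew into one merged column;
// the floor rule splits 10.95 s and 11.05 s at a 1 s step, while this
// maps both to 11 000 ms. Assumes non-negative timestamps.
func roundToNearestStep(tsMs, stepMs int64) int64 {
	return (tsMs + stepMs/2) / stepMs * stepMs
}
```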


Comment on lines +359 to +360
nodes, M is rows per node, C is columns. At the 1024-row
budget × 3 nodes × 60 columns this is ~180k cells per


P1 Badge Merge untrimmed peer rows before top-k selection

This design treats fan-out input as up to 1024 rows per node, but /admin/api/v1/keyviz/matrix already enforces local top-k trimming, so merging those truncated sets can miss globally hot read buckets (for example, a bucket ranked just below 1024 on each node can still be top-ranked after cross-node summation). In high-cardinality workloads this causes false negatives in the cluster heatmap. The aggregator should merge full peer data (or a larger internal limit) and apply row-budget truncation only after cluster-wide merge.


## Summary

Server-side half of Phase 2-C cluster fan-out, per
`docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md` (PR #685).
With `--keyvizFanoutNodes` set, the existing
`/admin/api/v1/keyviz/matrix` handler aggregates the local matrix with
peer responses; with the flag empty (default), behaviour is unchanged.

This is **PR-1** of the implementation plan in §7 of the design. PR-2
(SPA degraded banner + conflict hatching) and PR-3 (proto extension with
`raftGroupID` + `leaderTerm` for the canonical merge) are separate.

**Depends on PR #685** (the design doc) for review context — branched
off that branch.

## Aggregator (`internal/admin/keyviz_fanout.go`)

- Parallel HTTP GETs to each peer with a per-call timeout (default 2 s,
`--keyvizFanoutTimeout` overrides).
- Forward-compatible request shape: the per-peer URL is rebuilt from the
parsed parameters, so a peer running an older or newer server cannot
trip on an unrecognised parameter.
- **Merge rules per design §4**:
  - `reads` / `read_bytes` → sum across nodes
  - `writes` / `write_bytes` → max with row-level `Conflict` flag when
    inputs disagree
- Degraded mode: a peer that times out, 5xx's, or fails to decode
contributes `FanoutNodeStatus{OK: false, Error: ...}`. The local matrix
still ships.

## Wire format additions (forward-compatible)

```jsonc
{
  "rows": [{
    "...": "...",
    "conflict": false        // NEW: row-level disagreement flag
  }],
  "fanout": {                // NEW: omitted when fan-out is disabled
    "nodes": [{"node":"...", "ok":true, "error":""}],
    "responded": 2,
    "expected": 3
  }
}
```

Old SPA against fan-out server keeps working — extra fields are
additive.

## Flag plumbing

- `--keyvizFanoutNodes` accepts comma-separated `host:port` endpoints.
- `--keyvizFanoutTimeout` sets the per-peer ceiling (default 2 s).
- Self-filter handles the bind-vs-advertise asymmetry: `0.0.0.0`/`::`
bound listener matched against `127.0.0.1`/`localhost`/`::1` entries; a
symmetric `--keyvizFanoutNodes=node1,node2,node3` stamped on every host
does not loop back over HTTP. A sketch of the matching rule follows.
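
The matching rule, in sketch form; isSelfFanoutNode, isWildcardHost, and isLoopbackHost are real names in this PR, but the decomposition below is illustrative rather than the actual code:

```go
package main

import "net"

func isWildcardHost(h string) bool {
	ip := net.ParseIP(h)
	return ip != nil && ip.IsUnspecified() // 0.0.0.0 or ::
}

func isLoopbackHost(h string) bool {
	if h == "localhost" {
		return true
	}
	ip := net.ParseIP(h)
	return ip != nil && ip.IsLoopback() // 127.0.0.1 or ::1
}

// isSelf reports whether a --keyvizFanoutNodes entry names this node's
// own admin listener: literal host match, or wildcard bind vs loopback.
func isSelf(selfHost, selfPort, host, port string) bool {
	if port != selfPort {
		return false
	}
	if host == selfHost {
		return true
	}
	return isWildcardHost(selfHost) && isLoopbackHost(host)
}
```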

## Five-lens self-review

1. **Data loss** — Read-only against the existing sampler; conservative
max-merge can under-count during a leadership flip but never
over-counts. Conflict flag tells operators when the cell is soft.
2. **Concurrency / distributed** — Per-peer calls run in parallel with a
context deadline; the aggregator itself waits for all responses
synchronously then merges. Test `TestKeyVizFanoutRunPeerTimeout`
confirms a hanging peer cannot extend the wall time.
3. **Performance** — Aggregator cost is O(N × M × C); at the 1024-row ×
60-column × 3-node ceiling that's ~180k cells per request — well under
the SPA's existing render budget. Hot path (`Observe`) is untouched.
4. **Data consistency** — Merge rules conservative under leadership
transitions. Reads exact in steady state and during transitions. The
§9.1 canonical rule is preserved as a Phase 2-C+ contract.
5. **Test coverage** — Table-driven over the four canonical scenarios in
the design + URL-builder variants + per-node order. `go test -race
-count=1 ./internal/admin/...` clean; `golangci-lint run ./...` clean.

## Test plan

- [x] `go test -race -count=1 ./internal/admin/...`
- [x] `go test -race -count=1 .` (main_admin tests with the new
`keyVizFanoutConfig{}` arg)
- [x] `go build ./...`
- [x] `golangci-lint run ./...`
- [ ] Manual: 3-node cluster with `--keyvizFanoutNodes=...` on each,
drive write traffic against one node, confirm the heatmap reflects
writes regardless of which node the SPA hits
- [ ] Manual: kill one node, confirm the response carries `responded: 2,
expected: 3` plus the failed node's error string

## What this PR does NOT do

- **SPA changes**. Wire-format additions are forwards-compatible; the
SPA renders correctly without knowing about `Fanout` or `Conflict`. UI
for those lands in PR-2.
- **`raftGroupID` / `leaderTerm` proto extension**. PR-3 of the plan.
- **Membership discovery** via `GetClusterOverview`. Static
`--keyvizFanoutNodes` is the contract for Phase 2-C; cache +
refresh-interval lands when clusters grow beyond hand-configured size.
- **Inter-node auth**. The fan-out path is currently anonymous;
intra-cluster admin traffic is assumed to ride a private network. Auth
follows in a follow-up before this hits production.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9ec6ca26c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +290 to +294
req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, http.NoBody)
if err != nil {
	return nil, pkgerrors.Wrap(err, "new request")
}
req.Header.Set("Accept", "application/json")


P1 Badge Forward admin auth when calling peer keyviz endpoints

fetchPeer constructs a brand-new GET with only an Accept header, but /admin/api/v1/keyviz/matrix is mounted behind the protected middleware chain (SessionAuth) in buildAPIMux. In normal deployments that means every peer fan-out call is unauthenticated and returns 401, so cluster fan-out degrades to local-only data with persistent peer failures instead of aggregating remote matrices.


Comment thread main_admin.go
Comment on lines +906 to +908
if isWildcardHost(selfHost) {
	return isLoopbackHost(host)
}


P2 Badge Recognize non-loopback self addresses in fan-out filtering

When --adminListen is a wildcard bind (for example 0.0.0.0:8080), this logic only treats loopback hosts as self, so a node’s own advertised hostname/IP is misclassified as a peer in --keyvizFanoutNodes. That creates a self-call on every request; today it shows up as a spurious failed node (401), and once peer auth is added it can recurse through the same fan-out handler path.


@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 4m 47s


Round-6 Review

  • Post initial comment
  • Read implementation files (keyviz_fanout.go, main_admin.go, keyviz_handler.go, server.go, middleware.go)
  • Read design doc (all 9 sections)
  • Analyze Codex P1/P2 findings from latest commit
  • Check for additional code/doc discrepancies
  • Post complete review findings

Round-5 follow-ups: addressed ✓

All five rounds of doc-level fixes are confirmed in the design doc.


Round-6 findings

Four of the five findings below are code/doc discrepancies — the design says one thing, the code does another. The first is the most serious.

1. `fetchPeer` vs `SessionAuth`: every peer call returns 401 in production (P1 — blocker)

`fetchPeer` (keyviz_fanout.go:290-295) constructs a GET with only an `Accept` header:

```go
req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, http.NoBody)
// ...
req.Header.Set("Accept", "application/json")
resp, err := f.client.Do(req)
```

The peer's endpoint is wired at server.go:316:

```go
keyvizChain := protect(keyvizHandler)
```

`protect` at server.go:277-285 wraps `SessionAuth`, which reads the `admin_session` cookie and rejects the request with 401 if it is absent. CSRF (`CSRFDoubleSubmit`) is GET-exempt (middleware.go:140) so it is not a factor, but the session check is unconditional on every method.

Net effect: every fan-out call to a peer node returns `401 unauthenticated: missing session cookie`. The aggregator records each peer as `ok=false` and the response is always local-only, regardless of cluster size. The feature does not work in production as shipped.

The design doc §3 explicitly describes the MVP as "anonymous — the aggregator issues unauthenticated GETs to peers", which is the intended behaviour. The contradiction is that no unauthenticated path exists for the endpoint. The two options that preserve the design intent are:

  • Option A — Register an additional route (e.g. `/admin/internal/keyviz/matrix`) outside the `protect` chain, bound only when fan-out is enabled, with an explicit warning in the flag docs that it must not be exposed across an untrusted boundary (a rough sketch follows at the end of this finding).
  • Option B — Strip `SessionAuth` from the keyviz endpoint when the server starts with `--keyvizFanoutNodes` and add a flag note that the admin port must be on a private interface. This is less precise but requires fewer changes.

The test stubs (`newKeyVizPeerStub`) have no auth middleware, so all fan-out tests pass — they do not cover this path.
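For concreteness, a minimal sketch of Option A under plain net/http wiring; `registerKeyVizRoutes` and the `fanoutEnabled` parameter are illustrative, while `protect`, `keyvizHandler`, and the two paths come from this finding:

```go
package admin // illustrative

import "net/http"

// registerKeyVizRoutes is hypothetical; the repository wires routes in
// buildAPIMux instead.
func registerKeyVizRoutes(mux *http.ServeMux, keyvizHandler http.Handler,
	protect func(http.Handler) http.Handler, fanoutEnabled bool) {
	// Browser-facing route stays behind SessionAuth via the protect chain.
	mux.Handle("/admin/api/v1/keyviz/matrix", protect(keyvizHandler))
	if fanoutEnabled {
		// Peer-facing route deliberately skips protect. It is bound only when
		// --keyvizFanoutNodes is set and must never be reachable across an
		// untrusted boundary.
		mux.Handle("/admin/internal/keyviz/matrix", keyvizHandler)
	}
}
```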

2. `KeyVizRow.Conflict` uses `omitempty` — contradicts design §5 invariant (P1)

keyviz_handler.go:90:

```go
Conflict bool `json:"conflict,omitempty"`
```

Design doc §5 states explicitly:

> The field is always present on every row in the response — even when fan-out is disabled (single-node mode), `conflict` defaults to `false` and is emitted on the wire. This is wire-stable: an SPA that pattern-matches `"conflict" in row` rather than checking the value will not regress when the operator toggles `--keyvizFanoutNodes`.

`omitempty` on a bool field silently drops the field when its value is `false`. That breaks any SPA code that writes `if ("conflict" in row)` instead of `if (row.conflict)`. The design doc specifically called out this failure mode and documented it as a guarantee — the code violates that guarantee.
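The pitfall is easy to reproduce; this self-contained demo (illustrative struct names, not the project's `KeyVizRow`) shows the field vanishing:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type withOmitEmpty struct {
	Conflict bool `json:"conflict,omitempty"`
}

type alwaysPresent struct {
	Conflict bool `json:"conflict"`
}

func main() {
	a, _ := json.Marshal(withOmitEmpty{})
	b, _ := json.Marshal(alwaysPresent{})
	fmt.Println(string(a)) // {}  (the field is silently dropped when false)
	fmt.Println(string(b)) // {"conflict":false}
}
```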

3. §4.5 timestamp alignment described but not implemented (P1)

Design doc §4.5 says:

> Both sides round `column_unix_ms[i]` down to the nearest multiple of `keyvizStep_ms` before merging.

`unionColumns` (keyviz_fanout.go:553-566) inserts raw millisecond values directly into the `seen` map. `mergeRowInto` (keyviz_fanout.go:490) looks up `indexByColumn[ts]` on the exact timestamp. No alignment is applied anywhere in the merge pipeline.

With a 1 s step, node A flushing at t=11.03 s and node B flushing at t=11.42 s produce `ColumnUnixMs` values of 11030 and 11420. After the union, these land in two separate columns; each carries only one node's contribution. The result is two half-populated columns instead of one merged column — the exact fragmentation §4.5's floor rule exists to prevent (both values align to 11000).
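A minimal sketch of the floor rule §4.5 prescribes, assuming alignment is applied once at union time; `alignColumnMs` is an illustrative name, not an existing helper:

```go
package admin // illustrative

// alignColumnMs floors a column timestamp to the step boundary. With a
// 1000 ms step, both 11030 and 11420 map to 11000, so unionColumns would
// produce one column key and mergeRowInto would combine both contributions.
func alignColumnMs(ts, stepMs int64) int64 {
	if stepMs <= 0 {
		return ts
	}
	return ts - ts%stepMs
}
```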

4. `FanoutNodeStatus` has no `Warnings` field — `catalog_divergence` never emitted (P1)

Design doc §5 shows the per-node wire format as:

{"node": "10.0.0.2:8080", "ok": true, "error": "", "warnings": [
  {"code": "catalog_divergence", "bucket_id": "route:42", "detail": "start/end disagree with prior node"}
]}

The struct at keyviz_fanout.go:78-82:

```go
type FanoutNodeStatus struct {
	Node  string `json:"node"`
	OK    bool   `json:"ok"`
	Error string `json:"error,omitempty"`
}
```

No `Warnings` field. Furthermore, `mergeRowInto` (keyviz_fanout.go:466-501) sets `Start`/`End` from the first-seen node and merges subsequent rows' `Values`, but never compares `Start`/`End` across nodes — so `catalog_divergence` cannot be detected or emitted even if the field existed. The design doc §4.4 says this divergence "surfaces as a warning in the per-node status payload", but neither the struct nor the merge logic implements it.
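One possible Go shape for the missing field, inferred purely from the §5 JSON above; the type name `FanoutWarning` and the choice to skip `omitempty` on the slice are assumptions:

```go
package admin // illustrative

// FanoutWarning mirrors the warning objects in the §5 example.
type FanoutWarning struct {
	Code     string `json:"code"`
	BucketID string `json:"bucket_id"`
	Detail   string `json:"detail"`
}

// FanoutNodeStatus as §5 implies it, with the Warnings slice added.
type FanoutNodeStatus struct {
	Node     string          `json:"node"`
	OK       bool            `json:"ok"`
	Error    string          `json:"error,omitempty"`
	Warnings []FanoutWarning `json:"warnings"` // no omitempty, per the finding-2 lesson
}
```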

5. Codex P1 (3035181133) — pre-truncated top-k merge for reads-sum is still unaddressed (P2)

Each peer's `/admin/api/v1/keyviz/matrix` response has already been trimmed to the per-node rows budget (1024 by default) by `applyKeyVizRowBudget` before the fan-out aggregator receives it. For reads (sum), a bucket ranked 1025 on every individual node can rank first after cross-node summation, but the aggregator never sees it — it was dropped before the response left each peer.

The previous round-5 review verdict said "ready to merge and open PR 1", but this finding was raised by Codex on the same commit that round-5 reviewed (3035181133) and was not addressed. It is a reads-specific blind spot: for writes (max), the dominant leader's bucket is also the globally dominant one, so pre-truncation is harmless. For reads (sum), globally-hot-but-per-node-rank-just-outside-budget buckets are invisible.

The design doc does not acknowledge this limitation. It should either add a note in §2.2 ("not in scope: pre-merge row filtering; the aggregator merges already-trimmed sets") so future operators know the constraint, or add a server-side `fanout_rows` parameter that requests a larger per-peer budget when fan-out is active; a sketch of the latter follows.
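If the parameter route is taken, the change could be as small as the sketch below; `fanout_rows` is only this review's suggested name, the peer handler would have to honour it, and the cluster-size multiplier is an arbitrary assumption:

```go
package admin // illustrative

import "fmt"

// peerMatrixURL over-fetches per peer so a bucket ranked just outside each
// node's top-k can still win the cross-node reads-sum merge.
func peerMatrixURL(node string, rowBudget, clusterSize int) string {
	return fmt.Sprintf("http://%s/admin/api/v1/keyviz/matrix?fanout_rows=%d",
		node, rowBudget*clusterSize)
}
```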


Open questions status

| Q | Status |
|---|---|
| Q1 | Resolved ✓ |
| Q2 | Resolved ✓ |
| Q3 (`generated_at` per node) | Intentionally open, deferred to PR 2 ✓ |

Verdict

Items 1–4 are code/doc discrepancies that need to be resolved before this PR merges:

  • Item 1 is a functional blocker: the feature silently degrades to local-only mode in every production deployment because SessionAuth rejects unauthenticated peer GETs.
  • Item 2 is a one-line fix (`conflict,omitempty` → `conflict`) but violates a documented wire-stability invariant.
  • Items 3 and 4 are implementation gaps where the design promises behaviour the code does not deliver.

Item 5 is a documentation gap that should be acknowledged in §2.2 but does not block merging.


Branch: `docs/keyviz-fanout-phase2c`

@bootjp bootjp merged commit a3d312f into main Apr 27, 2026
12 checks passed
@bootjp bootjp deleted the docs/keyviz-fanout-phase2c branch April 27, 2026 11:45
bootjp added a commit that referenced this pull request Apr 27, 2026
…se 2-C PR-2) (#687)

## Summary

Phase 2-C **PR-2**, per the design doc landed in #685 §6: wire the
fan-out `Fanout` / `conflict` shapes shipped by PR #686 into the admin
SPA so operators can see when a peer failed and which rows are soft due
to a leadership flip.

Backend is unchanged. PR #686's `KeyVizMatrix.Fanout` block is currently
emitted on every fan-out request but the SPA was discarding it.

## What this PR adds

| Piece | Behavior |
|---|---|
| **Header counter** | `cluster view (N of M nodes)` in the per-matrix metadata strip. Hidden when `fanout` is absent (single-node mode, no behavior change). |
| **FanoutBanner** | Degraded-mode card above the heatmap when `responded < expected`. Lists every failed node + error string. Hidden when the cluster is healthy. |
| **ConflictOverlay** | SVG hatch over rows whose `conflict === true`. Layered inside the scroll container so the hatch tracks horizontal scroll (same idiom as `TimeAxis`). 4 px diagonal stripes, alpha 0.45, inheriting `currentColor` so it works against both light and dark themes. |
| **RowDetail pill** | Hovering a conflicting row reveals a `conflict` pill with a hover-tooltip explaining the leadership-flip semantics. |

## Five-lens self-review

1. **Data loss** — n/a; SPA-only change reading existing fields.
2. **Concurrency / distributed** — n/a; single browser tab consuming an
existing API.
3. **Performance** — Banner: zero-cost when healthy (`null` returned).
Hatch: SVG with at most 1024 `<rect>` elements (matches `KeyVizRow`
budget); empirically negligible. Overlay only renders when at least one
row has `conflict === true`.
4. **Data consistency** — UI surfaces existing server signals; semantics
come from PR #685 design §4.2 and #686 implementation. Not relitigated
here.
5. **Test coverage** — `tsc -b --noEmit` and `vite build` clean. UI
behavior (banner visibility, hatch presence, header text) is hard to
assert in `tsc` alone; manual verification documented below.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output writes to
`internal/admin/dist`
- [ ] Manual: `make run` with `--keyvizEnabled
--keyvizFanoutNodes=127.0.0.1:8080`. Open `/keyviz`; header reads
`cluster view (1 of 1 nodes)`, no banner, no hatch.
- [ ] Manual: configure
`--keyvizFanoutNodes=127.0.0.1:8080,127.0.0.1:9999` (port 9999 unused).
Header reads `1 of 2 nodes`, the failed-node banner appears with
`connection refused` for `127.0.0.1:9999`.
- [ ] Manual: smoke a leadership flip mid-window (or fake
`conflict=true` server-side via `/admin/api/v1/keyviz/matrix?...`); the
affected row renders with diagonal hatch overlay; hovering reveals the
`conflict` pill.

## Out of scope

- **Per-cell hatching** — Phase 2-C+ once the proto extension lands and
`conflict` becomes per-cell instead of per-row.
- **Per-node `generated_at` skew indicator** — design §9 Q3, deferred.
- **Banner auto-dismiss / sound notification** — out of scope; an
operator who has the page open will see the banner. CI alert pages are a
separate system.

Closes the SPA half of Phase 2-C; the proto extension follows in PR-3.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* KeyViz now displays cluster fan-out status with node availability and
responsiveness counts.
* Added visual indicators for rows experiencing conflicts during
leadership transitions.
* Row details now include conflict status labels with informative
tooltips.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 27, 2026
…401 (#692)

## Summary

**Hotfix for the screenshot a user shared after enabling fan-out** — the
heatmap collapsed to "1 of 6 nodes responded" with every peer returning
`401 missing session cookie`.

**Root cause**: PR #686 shipped fan-out with the design assumption (#685
§3) that peers would accept anonymous calls on a private network. But
the receiving side runs the same `SessionAuth` middleware as the browser
path, so anonymous fan-out calls are rejected with 401. The "anonymous
on a private network" framing was wrong: the implementation didn't put a
bypass on the peer side, so anonymous calls never went through
end-to-end.

**Fix**: forward the inbound user's cookies on every peer call. Cluster
nodes already share `--adminSessionSigningKey` for HA, so a cookie
minted on node A is verifiable on node B — no new infrastructure needed.

## What changed

- `KeyVizFanout.Run` gains a `cookies []*http.Cookie` parameter; nil
preserves the legacy behaviour for tests / future non-browser callers.
- `KeyVizHandler` passes `r.Cookies()` so a SPA poll forwards
`admin_session` (and `admin_csrf`, harmless on a GET) transparently; a
sketch of the forwarding shape follows after this list.
- `TestKeyVizFanoutRunForwardsCookies` pins that both cookies reach the
peer verbatim.
- Design doc §3 rewritten: documents the cookie-forwarding scheme + the
operator-trust assumption that `--keyvizFanoutNodes` points at trusted
hosts only. Phase 2-C+ will replace cookie forwarding with a dedicated
inter-node token to decouple browser-session expiry from inter-node call
validity.
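As a sketch of the forwarding shape (the real change threads cookies through `KeyVizFanout.Run`; this standalone signature and the `fetchPeerWithCookies` name are assumptions for illustration):

```go
package admin // illustrative

import (
	"context"
	"net/http"
)

func fetchPeerWithCookies(ctx context.Context, client *http.Client,
	target string, cookies []*http.Cookie) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, http.NoBody)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	// nil keeps the legacy anonymous path for tests and future non-browser
	// callers. Each per-peer goroutine builds its own request, so the shared
	// slice is only ever read.
	for _, c := range cookies {
		req.AddCookie(c) // forwards admin_session (and admin_csrf, harmless on a GET)
	}
	return client.Do(req)
}
```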

## Five-lens self-review

1. **Data loss** — n/a; auth path change.
2. **Concurrency / distributed** — Cookies are passed by reference into
per-peer goroutines but the slice is read-only inside `fetchPeer` (each
goroutine `req.AddCookie`s independently). No shared mutation.
3. **Performance** — Adds N `req.AddCookie` calls per fan-out request.
Trivial.
4. **Data consistency** — Auth-only change; merge semantics unchanged.
5. **Test coverage** — New `TestKeyVizFanoutRunForwardsCookies` asserts
both cookies reach the peer with the original values. Existing fan-out
tests pass `nil` and continue to exercise the unauthenticated path the
existing code handled (peer 401s, request reports `ok=false`).

## Test plan

- [x] `go test -race -count=1 ./internal/admin/...` — clean
- [x] `golangci-lint run ./internal/admin/...` — clean
- [ ] Manual: 6-node cluster with shared
`ELASTICKV_ADMIN_SESSION_SIGNING_KEY`, `--keyvizFanoutNodes` set on
every node. Heatmap should now show "6 of 6 nodes" instead of "1 of 6".

## Trust / threat model note

Cookie forwarding is safe when (a) every peer is operator-configured and
trusted, and (b) the network is private. **Do NOT point
`--keyvizFanoutNodes` at an untrusted host** — the user's admin session
would be replayed there. The design doc §3 update is explicit about
this. Phase 2-C+ removes this footgun by switching to a dedicated
inter-node token.

Closes the production-impacting symptom from screenshot review.