feat(admin): KeyViz cluster fan-out aggregator (Phase 2-C PR-1) #686
bootjp merged 5 commits into docs/keyviz-fanout-phase2c from
Conversation
Code Review
This pull request implements cluster-wide fan-out for KeyViz, enabling a node to aggregate traffic matrices from peers in parallel. It includes specific merging logic for different metric types and adds conflict detection for write metrics. Feedback highlights a security concern regarding unbounded memory allocation when decoding peer responses and suggests improving observability by logging network resource cleanup errors.
```go
		return nil, fmt.Errorf("%w: status %d: %s", errKeyVizPeer, resp.StatusCode, string(body))
	}
	var m KeyVizMatrix
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
```
The JSON decoder reads from the response body without a size limit. A misbehaving or compromised peer could send an excessively large response, leading to high memory consumption or OOM on the aggregator node. Since the matrix size is bounded (max 1024 rows), we should use io.LimitReader to enforce a reasonable limit (e.g., 10MB).
```diff
-	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
+	if err := json.NewDecoder(io.LimitReader(resp.Body, 10<<20)).Decode(&m); err != nil {
```
References
- When handling HTTP requests, always limit the size of the request body to prevent Denial of Service attacks from oversized payloads.
```go
	if err != nil {
		return nil, pkgerrors.Wrap(err, "peer request")
	}
	defer func() { _ = resp.Body.Close() }()
```
The error from resp.Body.Close() is ignored, which violates the general rule about logging Close() errors on network resources. Additionally, the logger field in KeyVizFanout is currently unused. We should use the logger to record peer failures (including Close() errors and request errors) to provide server-side visibility into cluster issues.
```diff
-	defer func() { _ = resp.Body.Close() }()
+	defer func() {
+		if err := resp.Body.Close(); err != nil {
+			f.logger.LogAttrs(ctx, slog.LevelWarn, "failed to close peer response body",
+				slog.String("peer", peer),
+				slog.String("error", err.Error()),
+			)
+		}
+	}()
```
References
- Do not silently ignore errors from Close() methods on resources like network connections. Log them to ensure resource leaks or other cleanup problems are visible.
Two items from Gemini round-1 on PR #686:

- security-high: io.LimitReader the JSON decode at 64 MiB. A misbehaving peer that streams gigabytes back at us would otherwise pin a goroutine on json.Decode and balloon memory. The cap is generous enough for the design's worst-case payload (1024 rows x 4096 columns x ~32 bytes per uint64 ~= 128 MiB raw, ~32 MiB JSON-encoded) but still bounded.
- medium: log resp.Body.Close errors at debug level via the existing KeyVizFanout.logger field (which was wired but unused).

peerResponseExceededLimit reuses the +1 byte from the LimitReader to detect overshoot without parsing the streamed body. Adds TestKeyVizFanoutRunPeerOverlargeBody to pin that an in-cap response decodes promptly and reports OK; the over-cap case is covered by manual inspection of the warning log path.
@claude review Round-1 follow-ups (97e09a8) addressing both Gemini items on PR #686:
Claude finished @bootjp's task in 3m 3s. Round-1 follow-up review
Both Gemini items are addressed correctly in terms of their primary goals. One subtle correctness issue in the overshoot detection is worth calling out.
Seven items from Codex P2 + Gemini medium / security-medium / high:

- 2.1 conflict flag definition: clarify per-row scope and signal that 2-C+ moves it to per-cell, removing the inconsistency between 2.1 and 5/6.
- 3 auth: replace the conflated cookie-replay-plus-mint-token paragraph with a clean two-tier story. Phase 2-C MVP is anonymous on a private network; Phase 2-C+ adds a short-lived signed inter-node token derived from the existing admin signing key. Adds an explicit do-not-enable-on-untrusted warning.
- 4.5 time alignment: bucket-align column timestamps to the keyvizStep boundary before merge so sub-step NTP jitter does not split each window into two adjacent merged columns (see the sketch after this list).
- 5 wire format: add a per-node warnings array with a catalog_divergence code so the wire format actually carries the catalog-mismatch signal 4.4 promises.
- 8 lens 3: document the 64 MiB io.LimitReader cap on peer response bodies (matches the cap implemented in PR #686).
- 9 open Q1: resolve the DNS question. Resolution rides the stdlib resolver / nscd cache; no aggregator-local cache. Operators on flaky DNS colocate a resolver.
- 1 background: add a Rollout notes paragraph addressing the Gemini high asks. The feature is off by default, and mixed-version rollout is correct by construction (no schema migration, no state, forwards-compatible wire format), so a dual-write proxy / blue-green plan is not justified for this scope.
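A minimal sketch of the §4.5 bucket alignment, assuming a hypothetical `alignToStep` helper and a 30 s `keyvizStep` (both names illustrative, not the doc's actual API):

```go
package main

import (
	"fmt"
	"time"
)

// alignToStep is a hypothetical helper illustrating the §4.5 rule: truncate
// a column timestamp to its keyvizStep boundary before merging.
func alignToStep(ts time.Time, step time.Duration) time.Time {
	return ts.Truncate(step)
}

func main() {
	step := 30 * time.Second
	// Two nodes stamp the same window with sub-step NTP jitter.
	a := time.Date(2026, 4, 27, 12, 0, 0, 200_000_000, time.UTC) // node A, +200 ms
	b := time.Date(2026, 4, 27, 12, 0, 0, 700_000_000, time.UTC) // node B, +700 ms
	// Unaligned, these are two distinct merge keys; aligned, they collapse
	// into one merged column.
	fmt.Println(alignToStep(a, step).Equal(alignToStep(b, step))) // true
}
```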
Force-pushed from 97e09a8 to 3c94ba4.
Two items from Claude bot round-2 on PR #686:

- Replace the LimitReader+probe pattern with a countingReader that wraps the LimitReader. The decoder's bufio buffering meant the external one-byte probe could return EOF even when the decoder itself had pulled past the cap into its own buffer; the warning fired only when the read happened to land at the boundary. The countingReader increments on every Read (including the chunks the decoder buffers internally), so the post-decode `n > cap` check fires reliably. The security bound (cap+1 max bytes off the wire) is unchanged; only the *detection* of overshoot is now deterministic.
- Add WithResponseBodyLimit so tests can drive the over-cap path without serving 64 MiB. Backed by a new TestKeyVizFanoutRunPeerExceedsBodyLimit test that lowers the cap to 1 KiB and verifies the call completes promptly with a per-node status surfaced.

Updated the keyVizPeerResponseBodyLimit comment to be honest about the worst-case sizing: 1024 rows x 4096 columns x 20 bytes/uint64 (max-encoded) is ~80 MiB, slightly over the 64 MiB cap. That is intentional - the operator-visible failure mode is "warning logged, matrix may be truncated", not "DoS". A future flag can expose the cap when an extreme-traffic deployment actually trips it.
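A minimal sketch of that countingReader shape, with assumed field names (the real type and its cap check live in internal/admin/keyviz_fanout.go):

```go
package admin

import "io"

// countingReader wraps the capped body reader and counts every byte any
// consumer pulls, including chunks a decoder buffers internally, so a
// post-decode n > cap comparison is deterministic.
type countingReader struct {
	r io.Reader // typically io.LimitReader(resp.Body, cap+1)
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err // pass err through unwrapped; see the wrapcheck note below
}
```

With the limit set to cap+1, a successful decode followed by `cr.n > cap` means the peer sent past the cap even though the decode finished out of buffered data, so the overshoot warning fires reliably.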
@claude review Round-2 follow-ups (2addffe):
Claude finished @bootjp's task in 2m 30s. Round-2 review
| Item | Status |
|---|---|
| Security bound (≤ cap+1 bytes off wire) | ✅ Correct and unchanged |
| `countingReader` false-negative fix | ✅ Correct |
| Over-cap warning for near-cap success case | ✅ Reliable via byte-count post-decode |
| Over-cap path when decode fails (massively over cap) | ℹ️ Warning not logged (returns error instead), acknowledged |
| `WithResponseBodyLimit` test hook | ✅ Clean |
| Sizing comment honesty (80 MiB > 64 MiB cap) | ✅ Now accurate |
| Automated test for warning-fires path | |
| `//nolint:wrapcheck` | ℹ️ Justified but worth noting for future reviewers |
No blocking issues. The PR is shippable. The remaining gap (automated test for the near-cap success + warning path) is the only thing worth tracking as a follow-up ticket if extreme-traffic deployments eventually trip the cap.
… test

Two minor follow-ups from Claude bot round-2 on PR #686:

- countingReader retains the //nolint:wrapcheck but with a longer comment explaining why: json.Decoder uses err == io.EOF (pointer compare, not errors.Is) to decide whether to keep pulling, so wrapping the underlying io.Reader error with %w would silently break the decoder's stop condition (see the illustration below). CLAUDE.md generally says "avoid nolint, refactor instead", but here refactoring is impossible -- the only options are pass-through-and-suppress (this) or wrap-and-break-the-decoder.
- Add TestKeyVizFanoutRunPeerNearCapSucceedsWithWarning to cover the warning-fires path: small JSON envelope (~30 B) + 256 B of trailing whitespace against a 100 B cap. Decoder returns success; countingReader.n == 101 > 100 triggers the warning. The warning log itself is best-effort and not asserted from the test (slog has no clean injection point on the package), but the reachability of the success path under cap overrun is now pinned.
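A self-contained illustration of why the unwrapped pass-through matters, using only the standard library: a consumer that checks err == io.EOF by identity stops seeing EOF once the error is wrapped.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	wrapped := fmt.Errorf("peer body: %w", io.EOF)
	fmt.Println(errors.Is(wrapped, io.EOF)) // true: the chain still contains EOF
	fmt.Println(wrapped == io.EOF)          // false: an identity compare no longer matches,
	// so a decoder testing err == io.EOF would treat end-of-stream as a real error.
}
```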
@claude review Round-3 follow-ups (4538f1d) addressing both round-2 minor notes:
Force-pushed from 4538f1d to ec8c1e7.
Claude finished @bootjp's task in 2m 17s. Round-3 review
Both round-2 minor notes are addressed. One small precision issue in the nolint comment is worth recording for future readers; everything else is clean.
| Item | Status |
|---|---|
| `//nolint:wrapcheck` suppression correctness | ✅ Justified |
| Comment's stated reason (`json.Decoder` pointer-compare) | ℹ️ Minor imprecision; goccy/go-json is the actual library |
| Warning-fires path covered by automated test | ✅ Byte-count path reachable and verified |
| Warning log emission asserted | ℹ️ Not asserted; acknowledged in PR description |
| Security bound (≤ cap+1 bytes off wire) | ✅ Unchanged and correct |
| `go test -race` / `golangci-lint` | ✅ Clean per author |
No blocking issues. The two round-2 notes are closed. The PR is ready to merge.
Force-pushed from ec8c1e7 to 186b6e9.
@claude review Round-3 follow-up (186b6e9 after rebase on updated #685): generalized the
Claude finished @bootjp's task in 1m 6s. Round-3 follow-up review
The generalized
| Item | Status |
|---|---|
| `//nolint:wrapcheck` suppression correctness | ✅ Justified |
| Comment attribution (io.Reader contract, not specific library) | ✅ Fixed — contract correctly named as root cause |
| Both `encoding/json` and `goccy/go-json` called out | ✅ Present, with correct framing |
| Contract-holds-independent-of-consumer clause | ✅ Explicit |
| `go test -race` / `golangci-lint` | ✅ Clean per author |
All three rounds of review items are closed. No blocking issues remain. The PR is ready to merge.
The 2.1 wording said conflict fires "when any per-node values disagreed for any cell," which under stable leadership would include the leader-says-5 vs follower-says-0 split (the steady-state shape) and contradicted both 4.2 and the 7 stable-leader test expectation ("max wins, no conflict").

Narrow the predicate to its actual shape: conflict fires only when two or more nodes report a non-zero value for the same cell (see the sketch below). Follower zeros against a leader's non-zero value do not count. This matches the implementation in PR #686 (maxMerge ignores zero on either side and returns conflict=false).

Mirrored the clarification in 4.2 so the conflict-predicate rule is explicit in both the scope summary (2.1) and the merge-rule detail (4.2).
Implements the server-side half of the fan-out path proposed in
docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md. When the
operator passes --keyvizFanoutNodes, the existing
/admin/api/v1/keyviz/matrix handler aggregates the local matrix
with peer responses; with the flag empty (default), behaviour is
unchanged.
Aggregator (internal/admin/keyviz_fanout.go):
- Parallel HTTP GETs to each peer with a per-call timeout (default
  2 s, --keyvizFanoutTimeout overrides); sketched after this list.
- Forward-compatible request shape: the per-peer URL is rebuilt
from the parsed parameters so a peer running an older or newer
server cannot trip on a parameter we never intended to forward.
- Merge rules per design 4: reads / read_bytes sum across nodes
(distinct local serves are independent counts); writes /
write_bytes take the max with a row-level Conflict flag when
inputs disagree (conservative dedup that never overcounts during
a leadership flip; the canonical (raftGroupID, leaderTerm) rule
lands in Phase 2-C+).
- Degraded mode: a peer that times out, errors, or returns non-OK
contributes FanoutNodeStatus{OK: false, Error: ...}. The local
matrix still ships and Responded reflects partial success.
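A minimal sketch of the parallel per-peer shape described in the first bullet above, with assumed names (FanoutNodeStatus mirrors the wire-format section below; fetchPeer stands in for the real per-peer call):

```go
package admin

import (
	"context"
	"sync"
	"time"
)

// FanoutNodeStatus mirrors the per-node status in the wire format below.
type FanoutNodeStatus struct {
	Node  string
	OK    bool
	Error string
}

// fanOut issues one goroutine per peer, each bounded by its own timeout,
// and never fails the whole request: a slow or broken peer degrades to
// OK=false while the other slots fill in normally.
func fanOut(ctx context.Context, peers []string, timeout time.Duration,
	fetchPeer func(context.Context, string) error) []FanoutNodeStatus {
	statuses := make([]FanoutNodeStatus, len(peers))
	var wg sync.WaitGroup
	for i, peer := range peers {
		wg.Add(1)
		go func(i int, peer string) {
			defer wg.Done()
			callCtx, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()
			st := FanoutNodeStatus{Node: peer, OK: true}
			if err := fetchPeer(callCtx, peer); err != nil {
				st.OK = false
				st.Error = err.Error()
			}
			statuses[i] = st // distinct index per goroutine: no shared mutation
		}(i, peer)
	}
	wg.Wait()
	return statuses
}
```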
Wire format additions (forward-compatible):
- KeyVizMatrix.Fanout *FanoutResult — Nodes / Responded /
Expected; omitted when fan-out is disabled.
- KeyVizRow.Conflict bool — true when the max-merge collapsed
disagreeing values for any cell in the row.
Flag plumbing (main.go / main_admin.go):
- --keyvizFanoutNodes accepts comma-separated host:port endpoints.
- --keyvizFanoutTimeout sets the per-peer ceiling.
- Self-filter: a peer entry naming this node's own admin listener
(literal match, or 0.0.0.0/:: bind matched against a
127.0.0.1/localhost/::1 entry) is skipped so a symmetric
--keyvizFanoutNodes=node1,node2,node3 stamped onto every host
does not loop back over HTTP.
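A sketch of the self-filter match just described, assuming a hypothetical isSelf helper (the real check lives in the flag plumbing):

```go
package admin

import "net"

// isSelf reports whether a --keyvizFanoutNodes entry names this node's own
// admin listener: either a literal host:port match, or a wildcard bind
// (0.0.0.0 / ::) matched against a loopback entry on the same port.
func isSelf(peer, bindHost, bindPort string) bool {
	host, port, err := net.SplitHostPort(peer)
	if err != nil || port != bindPort {
		return false
	}
	if host == bindHost {
		return true // literal match
	}
	wildcard := bindHost == "0.0.0.0" || bindHost == "::"
	loopback := host == "127.0.0.1" || host == "localhost" || host == "::1"
	return wildcard && loopback
}
```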
Tests (internal/admin/keyviz_fanout_test.go):
- mergeKeyVizMatrices table covering sum, stable-leader max,
leadership-flip max with conflict, union-of-columns, and
distinct-row order preservation.
- KeyVizFanout.Run end-to-end against httptest peers: single peer
ok, peer 5xx surfaces in degraded mode, peer hang respected by
the per-call timeout.
- buildKeyVizPeerURL table covering scheme/no-scheme/with-bounds
variants.
- Peer order is operator-supplied with self always first.
SPA changes follow in PR-2 (degraded banner + conflict hatching).
The proto / JSON extension with raftGroupID + leaderTerm follows in
PR-3 (Phase 2-C+).
Force-pushed from 186b6e9 to 7747be1.
…se 2-C PR-2) (#687)

## Summary

Phase 2-C **PR-2**, per the design doc landed in #685 §6: wire the fan-out `Fanout` / `conflict` shapes shipped by PR #686 into the admin SPA so operators can see when a peer failed and which rows are soft due to a leadership flip.

Backend is unchanged. PR #686's `KeyVizMatrix.Fanout` block is currently emitted on every fan-out request but the SPA was discarding it.

## What this PR adds

| Piece | Behavior |
|---|---|
| **Header counter** | `cluster view (N of M nodes)` in the per-matrix metadata strip. Hidden when `fanout` is absent (single-node mode, no behavior change). |
| **FanoutBanner** | Degraded-mode card above the heatmap when `responded < expected`. Lists every failed node + error string. Hidden when the cluster is healthy. |
| **ConflictOverlay** | SVG hatch over rows whose `conflict === true`. Layered inside the scroll container so the hatch tracks horizontal scroll (same idiom as `TimeAxis`). 4 px diagonal stripes, alpha 0.45, inheriting `currentColor` so it works against both light and dark themes. |
| **RowDetail pill** | Hovering a conflicting row reveals a `conflict` pill with a hover-tooltip explaining the leadership-flip semantics. |

## Five-lens self-review

1. **Data loss** — n/a; SPA-only change reading existing fields.
2. **Concurrency / distributed** — n/a; single browser tab consuming an existing API.
3. **Performance** — Banner: zero-cost when healthy (`null` returned). Hatch: SVG with at most 1024 `<rect>` elements (matches `KeyVizRow` budget); empirically negligible. Overlay only renders when at least one row has `conflict === true`.
4. **Data consistency** — UI surfaces existing server signals; semantics come from PR #685 design §4.2 and #686 implementation. Not relitigated here.
5. **Test coverage** — `tsc -b --noEmit` and `vite build` clean. UI behavior (banner visibility, hatch presence, header text) is hard to assert in `tsc` alone; manual verification documented below.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output writes to `internal/admin/dist`
- [ ] Manual: `make run` with `--keyvizEnabled --keyvizFanoutNodes=127.0.0.1:8080`. Open `/keyviz`; header reads `cluster view (1 of 1 nodes)`, no banner, no hatch.
- [ ] Manual: configure `--keyvizFanoutNodes=127.0.0.1:8080,127.0.0.1:9999` (port 9999 unused). Header reads `1 of 2 nodes`, the failed-node banner appears with `connection refused` for `127.0.0.1:9999`.
- [ ] Manual: smoke a leadership flip mid-window (or fake `conflict=true` server-side via `/admin/api/v1/keyviz/matrix?...`); the affected row renders with diagonal hatch overlay; hovering reveals the `conflict` pill.

## Out of scope

- **Per-cell hatching** — Phase 2-C+ once the proto extension lands and `conflict` becomes per-cell instead of per-row.
- **Per-node `generated_at` skew indicator** — design §9 Q3, deferred.
- **Banner auto-dismiss / sound notification** — out of scope; an operator who has the page open will see the banner. CI alert pages are a separate system.

Closes the SPA half of Phase 2-C; the proto extension follows in PR-3.

## Summary by CodeRabbit

* **New Features**
  * KeyViz now displays cluster fan-out status with node availability and responsiveness counts.
  * Added visual indicators for rows experiencing conflicts during leadership transitions.
  * Row details now include conflict status labels with informative tooltips.
…401 (#692)

## Summary

**Hotfix for the screenshot a user shared after enabling fan-out** — the heatmap collapsed to "1 of 6 nodes responded" with every peer returning `401 missing session cookie`.

**Root cause**: PR #686 shipped fan-out with the design assumption (#685 §3) that peers would accept anonymous calls on a private network. But the receiving side runs the same `SessionAuth` middleware as the browser path, so anonymous fan-out calls are rejected with 401. The "anonymous on a private network" framing was wrong: the implementation didn't put a bypass on the peer side, so anonymous calls never went through end-to-end.

**Fix**: forward the inbound user's cookies on every peer call. Cluster nodes already share `--adminSessionSigningKey` for HA, so a cookie minted on node A is verifiable on node B — no new infrastructure needed.

## What changed

- `KeyVizFanout.Run` gains a `cookies []*http.Cookie` parameter; nil preserves the legacy behaviour for tests / future non-browser callers.
- `KeyVizHandler` passes `r.Cookies()` so a SPA poll forwards `admin_session` (and `admin_csrf`, harmless on a GET) transparently.
- `TestKeyVizFanoutRunForwardsCookies` pins that both cookies reach the peer verbatim.
- Design doc §3 rewritten: documents the cookie-forwarding scheme + the operator-trust assumption that `--keyvizFanoutNodes` points at trusted hosts only. Phase 2-C+ will replace cookie forwarding with a dedicated inter-node token to decouple browser-session expiry from inter-node call validity.

## Five-lens self-review

1. **Data loss** — n/a; auth path change.
2. **Concurrency / distributed** — Cookies are passed by reference into per-peer goroutines but the slice is read-only inside `fetchPeer` (each goroutine `req.AddCookie`s independently). No shared mutation.
3. **Performance** — Adds N `req.AddCookie` calls per fan-out request. Trivial.
4. **Data consistency** — Auth-only change; merge semantics unchanged.
5. **Test coverage** — New `TestKeyVizFanoutRunForwardsCookies` asserts both cookies reach the peer with the original values. Existing fan-out tests pass `nil` and continue to exercise the unauthenticated path the existing code handled (peer 401s, request reports `ok=false`).

## Test plan

- [x] `go test -race -count=1 ./internal/admin/...` — clean
- [x] `golangci-lint run ./internal/admin/...` — clean
- [ ] Manual: 6-node cluster with shared `ELASTICKV_ADMIN_SESSION_SIGNING_KEY`, `--keyvizFanoutNodes` set on every node. Heatmap should now show "6 of 6 nodes" instead of "1 of 6".

## Trust / threat model note

Cookie forwarding is safe when (a) every peer is operator-configured and trusted, and (b) the network is private. **Do NOT point `--keyvizFanoutNodes` at an untrusted host** — the user's admin session would be replayed there. The design doc §3 update is explicit about this. Phase 2-C+ removes this footgun by switching to a dedicated inter-node token.

Closes the production-impacting symptom from screenshot review.
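A minimal sketch of the forwarding path described above, with assumed names (the real fetchPeer also rebuilds the per-peer URL from the parsed parameters):

```go
package admin

import (
	"context"
	"net/http"
)

// fetchPeer forwards the inbound browser cookies verbatim so the peer's
// SessionAuth middleware can verify them against the shared signing key.
// A nil slice preserves the legacy anonymous behaviour.
func fetchPeer(ctx context.Context, client *http.Client, url string,
	cookies []*http.Cookie) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for _, c := range cookies { // read-only iteration: safe across goroutines
		req.AddCookie(c)
	}
	return client.Do(req)
}
```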
## Summary

Server-side half of Phase 2-C cluster fan-out, per `docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md` (PR #685). With `--keyvizFanoutNodes` set, the existing `/admin/api/v1/keyviz/matrix` handler aggregates the local matrix with peer responses; with the flag empty (default), behaviour is unchanged.

This is PR-1 of the implementation plan in §7 of the design. PR-2 (SPA degraded banner + conflict hatching) and PR-3 (proto extension with `raftGroupID` + `leaderTerm` for the canonical merge) are separate.

Depends on PR #685 (the design doc) for review context — branched off that branch.
## Aggregator (`internal/admin/keyviz_fanout.go`)

- Parallel HTTP GETs to each peer with a per-call timeout (default 2 s, `--keyvizFanoutTimeout` overrides).
- `reads` / `read_bytes` → sum across nodes.
- `writes` / `write_bytes` → max with row-level `Conflict` flag when inputs disagree.
- Degraded mode: a failed peer contributes `FanoutNodeStatus{OK: false, Error: ...}`. The local matrix still ships.

## Wire format additions (forward-compatible)

```jsonc
{
  "rows": [{
    "...": "...",
    "conflict": false    // NEW: row-level disagreement flag
  }],
  "fanout": {            // NEW: omitted when fan-out is disabled
    "nodes": [{"node": "...", "ok": true, "error": ""}],
    "responded": 2,
    "expected": 3
  }
}
```

Old SPA against a fan-out server keeps working — extra fields are additive.
## Flag plumbing

- `--keyvizFanoutNodes` accepts comma-separated `host:port` endpoints.
- `--keyvizFanoutTimeout` sets the per-peer ceiling (default 2 s).
- Self-filter: a `0.0.0.0`/`::` bound listener is matched against `127.0.0.1`/`localhost`/`::1` entries; a symmetric `--keyvizFanoutNodes=node1,node2,node3` stamped on every host does not loop back over HTTP.

## Five-lens self-review

- … `TestKeyVizFanoutRunPeerTimeout` confirms a hanging peer cannot extend the wall time.
- … `Observe`) is untouched.
- … `go test -race -count=1 ./internal/admin/...` clean; `golangci-lint run ./...` clean.

## Test plan

- `go test -race -count=1 ./internal/admin/...`
- `go test -race -count=1 .` (main_admin tests with the new `keyVizFanoutConfig{}` arg)
- `go build ./...`
- `golangci-lint run ./...`
- Manual: … `--keyvizFanoutNodes=...` on each, drive write traffic against one node, confirm the heatmap reflects writes regardless of which node the SPA hits.
- Manual: … `responded: 2, expected: 3` plus the failed node's error string.

## What this PR does NOT do

- … `Fanout` or `Conflict`. UI for those lands in PR-2.
- … `raftGroupID` / `leaderTerm` proto extension. PR-3 of the plan.
- … `GetClusterOverview`. Static `--keyvizFanoutNodes` is the contract for Phase 2-C; cache + refresh-interval lands when clusters grow beyond hand-configured size.