fix(redis): bound BZPOPMIN fast mode to redisBlockWaitFallback by bootjp · Pull Request #683 · bootjp/elastickv

bootjp · 2026-04-26T20:14:09Z

Address Codex P1 / Claude bot review on 6173f58:

The previous bzpopminWaitLoop set fast directly from waitForBlockedCommandUpdate's return value. Under sustained zsetWaiters.Signal pressure (a ZADD/ZINCRBY hot key whose signals fire faster than each bzpopminTryAllKeys round finishes) the buffered-1 waiterC stays perpetually full: every wait returns via the signal arm, fast stays true forever, the 100 ms fallback timer never fires. A wrongType write (HSET / SET / etc.) on a co- registered key (multi-key BZPOPMIN) then goes undetected for the entire BLOCK window — a regression vs #666's strict 100 ms ceiling.

Fix: track the wall time of the last full check inside bzpopminWaitLoop and demote fast back to false once redisBlockWaitFallback has elapsed since that check. Signal-driven wakes still take the fast path during the first 100 ms after a full check; after that, the next wake (signal or timer) runs the full check and resets the budget. The 100 ms WRONGTYPE-detection ceiling is restored regardless of signal rate.

Per CLAUDE.md regression-test convention, the failing scenario is locked down by TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad before the fix:

A signal-pressure goroutine drives zsetWaiters.Signal in a tight loop on bzpop-press-hot. The reader's waiter, registered on both bzpop-press-cold and bzpop-press-hot, wakes on every signal.
After the reader settles into fast mode (~200 ms), a string is SET at bzpop-press-cold.
Pre-fix: the loop never demotes fast=false, the fast-path probe on bzpop-press-cold never detects the wrongType, BZPOPMIN times out at the 3 s BLOCK deadline.
Post-fix: the wall-time gate forces a full check within ~100 ms of the SET, WRONGTYPE surfaces well before the BLOCK deadline.

Verified: the test fails on parent commit 6173f58 (loop ran for the full 3 s without surfacing WRONGTYPE) and passes on this commit (returns WRONGTYPE in ~100 ms).

Drive-by cleanups from the same review pass:

Collapse tryBZPopMin / tryBZPopMinFast wrappers; bzpopminTryAllKeys calls tryBZPopMinWithMode directly with the fast bool. Both wrappers had no other callers (per grep) and only added one level of indirection.
Move TestRedis_BZPopMinTimesOutOnEmptyKey's stranded doc comment back over its own function; the previous PR mistakenly prepended it to TestRedis_BZPopMinRejectsInitialWrongType.

Self-review (CLAUDE.md 5 lenses):

Data loss -- None. Loop control flow only; persistBZPopMinResult path unchanged.
Concurrency -- The wall-time gate is a local variable; no shared state added. waitForBlockedCommandUpdate's signaled-bool contract is unchanged.
Performance -- One time.Since per loop iteration on top of the existing ScanAt cost; negligible. Worst-case fast-path utilisation drops from "always" to "first 100 ms of every full check window," but the fast-mode CPU win is still claimed for the dominant signal-driven case (most BZPOPMIN BLOCK windows complete within one full-check window because the consumer pops the first matched ZADD).
Data consistency -- WRONGTYPE detection ceiling is back to 100 ms regardless of signal rate. Fast-path correctness invariants from perf(redis): empty-zset fast path for BZPOPMIN signal-driven wakes #677 (signal-driven wake guarantees no wrongType transition) are unchanged.
Test coverage -- regression-test-first per CLAUDE.md convention. Existing TestRedis_BZPopMinRejectsInitialWrongType, TestRedis_BZPopMinDetectsMidBlockWrongType, and TestRedis_BZPopMinTimesOutOnEmptyKey continue to lock down the other branches.

Summary by CodeRabbit

Bug Fixes
- BZPOPMIN now correctly detects wrong-type errors even under sustained signal-driven conditions, preventing timeouts in edge cases
Tests
- Added test to verify BZPOPMIN wrong-type error detection under high signal load

Address Codex P1 / Claude bot review on 6173f58: The previous bzpopminWaitLoop set `fast` directly from waitForBlockedCommandUpdate's return value. Under sustained zsetWaiters.Signal pressure (a ZADD/ZINCRBY hot key whose signals fire faster than each bzpopminTryAllKeys round finishes) the buffered-1 waiterC stays perpetually full: every wait returns via the signal arm, fast stays true forever, the 100 ms fallback timer never fires. A wrongType write (HSET / SET / etc.) on a co- registered key (multi-key BZPOPMIN) then goes undetected for the entire BLOCK window — a regression vs #666's strict 100 ms ceiling. Fix: track the wall time of the last full check inside bzpopminWaitLoop and demote `fast` back to false once redisBlockWaitFallback has elapsed since that check. Signal-driven wakes still take the fast path during the first 100 ms after a full check; after that, the next wake (signal or timer) runs the full check and resets the budget. The 100 ms WRONGTYPE-detection ceiling is restored regardless of signal rate. Per CLAUDE.md regression-test convention, the failing scenario is locked down by TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad before the fix: - A signal-pressure goroutine drives zsetWaiters.Signal in a tight loop on bzpop-press-hot. The reader's waiter, registered on both bzpop-press-cold and bzpop-press-hot, wakes on every signal. - After the reader settles into fast mode (~200 ms), a string is SET at bzpop-press-cold. - Pre-fix: the loop never demotes fast=false, the fast-path probe on bzpop-press-cold never detects the wrongType, BZPOPMIN times out at the 3 s BLOCK deadline. - Post-fix: the wall-time gate forces a full check within ~100 ms of the SET, WRONGTYPE surfaces well before the BLOCK deadline. Verified: the test fails on parent commit 6173f58 (loop ran for the full 3 s without surfacing WRONGTYPE) and passes on this commit (returns WRONGTYPE in ~100 ms). Drive-by cleanups from the same review pass: - Collapse tryBZPopMin / tryBZPopMinFast wrappers; bzpopminTryAllKeys calls tryBZPopMinWithMode directly with the fast bool. Both wrappers had no other callers (per grep) and only added one level of indirection. - Move TestRedis_BZPopMinTimesOutOnEmptyKey's stranded doc comment back over its own function; the previous PR mistakenly prepended it to TestRedis_BZPopMinRejectsInitialWrongType. Self-review (CLAUDE.md 5 lenses): 1. Data loss -- None. Loop control flow only; persistBZPopMinResult path unchanged. 2. Concurrency -- The wall-time gate is a local variable; no shared state added. waitForBlockedCommandUpdate's signaled-bool contract is unchanged. 3. Performance -- One time.Since per loop iteration on top of the existing ScanAt cost; negligible. Worst-case fast-path utilisation drops from "always" to "first 100 ms of every full check window," but the fast-mode CPU win is still claimed for the dominant signal-driven case (most BZPOPMIN BLOCK windows complete within one full-check window because the consumer pops the first matched ZADD). 4. Data consistency -- WRONGTYPE detection ceiling is back to 100 ms regardless of signal rate. Fast-path correctness invariants from #677 (signal-driven wake guarantees no wrongType transition) are unchanged. 5. Test coverage -- regression-test-first per CLAUDE.md convention. Existing TestRedis_BZPopMinRejectsInitialWrongType, TestRedis_BZPopMinDetectsMidBlockWrongType, and TestRedis_BZPopMinTimesOutOnEmptyKey continue to lock down the other branches.

chatgpt-codex-connector · 2026-04-26T20:14:13Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

coderabbitai · 2026-04-26T20:14:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe5b5002-f418-4658-8a69-0aa8e6063b64

📥 Commits

Reviewing files that changed from the base of the PR and between 98d38b5 and fe5b718.

📒 Files selected for processing (2)

adapter/redis_bzpopmin_wake_test.go
adapter/redis_compat_commands.go

📝 Walkthrough

Walkthrough

The PR adds a test case for BZPOPMIN wrong-type detection under sustained signal pressure and refactors the waiting loop to enforce time-bounded fast-mode checks instead of separate entry points, preventing starvation while maintaining wrong-type error detection.

Changes

Cohort / File(s)	Summary
BZPOPMIN Signal Load Test `adapter/redis_bzpopmin_wake_test.go`	Added `TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad` to verify BZPOPMIN returns WRONGTYPE error under repeated signal-driven wakes; relocated existing timeout contract test.
BZPOPMIN Fast Mode Refactoring `adapter/redis_compat_commands.go`	Removed separate `tryBZPopMin`/`tryBZPopMinFast` entry points; `bzpopminWaitLoop` now uses `lastFullCheck` to enforce time-bounded fast mode, demoting to full mode if threshold exceeded to prevent starvation and ensure wrong-type detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A signal wakes the waiting pop,
Through fast-path loops it won't just stop—
But time keeps watch with lastFullCheck,
So wrong-type errors won't be wreck'd!
Hopping safe, both quick and slow. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically identifies the main fix: bounding BZPOPMIN's fast mode to the redisBlockWaitFallback timeout, which directly addresses the regression bug described in the objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/bzpopmin-empty-zset-fastpath

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request addresses a bug in the BZPOPMIN command where sustained signal pressure could cause the command to remain in 'fast mode' indefinitely, potentially missing WRONGTYPE errors on co-registered keys. The implementation introduces a lastFullCheck timestamp to ensure a full probe is performed at least once every redisBlockWaitFallback interval (100ms), even under heavy signal load. A new regression test, TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad, has been added to verify the fix. Review feedback suggests evaluating whether a shorter fallback duration, such as 10ms, would be more appropriate for low-latency performance.

gemini-code-assist · 2026-04-26T20:19:30Z

 		}
-		fast = waitForBlockedCommandUpdate(handlerCtx, w.C, deadline)
+		signaled := waitForBlockedCommandUpdate(handlerCtx, w.C, deadline)
+		fast = signaled && time.Since(lastFullCheck) < redisBlockWaitFallback


The logic for demoting fast mode based on wall time is sound. However, for low-latency systems, consider if the 100ms ceiling (redisBlockWaitFallback) is too high; a shorter maximum backoff duration like 10ms is often preferred to handle contention more effectively.

References

For low-latency systems, it is acceptable to use a short maximum backoff duration for retries (e.g., 10ms) as long-lived contention is not the primary scenario to handle.

bootjp · 2026-04-27T10:46:24Z

@claude review

claude · 2026-04-27T10:46:38Z

Claude finished @bootjp's task in 2m 23s —— View job

Code Review — PR #683

Summary

This PR correctly fixes a liveness regression introduced in #677 (commit 6173f584) and follows the CLAUDE.md regression-test-first convention. The core bug and fix are sound; review details below.

`adapter/redis_compat_commands.go` — Core fix

The bug (pre-fix)

fast = waitForBlockedCommandUpdate(handlerCtx, w.C, deadline)

Under sustained signal pressure, waiterC (buffered-1) is perpetually full, so the select in waitForBlockedCommandUpdate always picks the <-waiterC arm. The fallback timer never fires, fast stays true forever, and a wrongType write on any co-registered key goes undetected for the entire BLOCK window.

The fix

signaled := waitForBlockedCommandUpdate(handlerCtx, w.C, deadline)
fast = signaled && time.Since(lastFullCheck) < redisBlockWaitFallback

lastFullCheck is updated only when !fast (i.e., when a full check actually ran). After redisBlockWaitFallback (100 ms) elapses, time.Since(lastFullCheck) >= 100ms, so fast is demoted to false even when signals keep arriving. This restores the 100 ms WRONGTYPE-detection ceiling from #666.

Logic trace — no correctness issues found

State	Action
`fast=false` (full check)	`bzpopminTryAllKeys` runs full probe; `lastFullCheck = time.Now()` after
`fast=true`, within 100 ms	Fast probe; `lastFullCheck` not updated
`fast=true`, >100 ms since last full check	`time.Since(lastFullCheck) >= 100ms` → `fast = false` on next iteration
Timer/cancellation wake	`waitForBlockedCommandUpdate` returns `false` → `fast = false` unconditionally

All four branches correctly reach a full check within one fallback interval.

One observation on lastFullCheck initialization

fast := false
lastFullCheck := time.Now()  // ← initial value

Since fast starts as false, the very first iteration runs a full check, and then lastFullCheck = time.Now() is immediately overwritten. The initial value has no effect on correctness but is mildly redundant — time.Time{} (zero value) would work equally well (since time.Since(time.Time{}) is ~56 years, well above 100 ms, which would keep fast=false). Not worth changing; just noting it's a no-op.

On Gemini's suggestion to use 10 ms instead of 100 ms

The existing redisBlockWaitFallback = 100 ms constant is deliberately scoped to be low-CPU (1/10th of a busy-poll per the constant's own comment) while bounding stale-poll latency to values clients already tolerate from network round-trips. Using the same constant for the wall-time gate keeps the two ceilings aligned: WRONGTYPE takes at most ~100 ms whether detected via the timer path or the wall-time path. Introducing a separate tighter constant would diverge the two paths for no clear benefit. The 100 ms ceiling is correct as-is.

`adapter/redis_compat_commands.go` — Drive-by cleanups

tryBZPopMin / tryBZPopMinFast wrapper collapse

Clean removal. bzpopminTryAllKeys now calls tryBZPopMinWithMode directly:

// before
if fast {
    result, err = r.tryBZPopMinFast(key)
} else {
    result, err = r.tryBZPopMin(key)
}

// after
result, err := r.tryBZPopMinWithMode(key, fast)

One level of indirection removed; no callers of the wrappers remain. The tryBZPopMinWithMode doc comment is updated to describe the caller's obligation clearly (the caller MUST guarantee that the only mutations since the previous full check are signalling writes).

`adapter/redis_bzpopmin_wake_test.go` — New test

TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad

The test correctly locks down the regression:

Synthetic signal driver: directly invokes signalServer.zsetWaiters.Signal(key) in a tight loop (same-package access, appropriate for an internal test). runtime.Gosched() lets the reader goroutine make forward progress — necessary to avoid starving the reader of CPU.
Timing: 200 ms sleep before introducing the wrongType write. With the fix, a forced full check has already run within this window (100 ms budget). Pre-fix, the reader never demotes fast, so bzpop-press-cold is never probed with keyTypeAtExpect.
Assertion: uses require.Contains(t, res.err.Error(), "WRONGTYPE") rather than exact-matching an error code — good, as the wire encoding may vary.
Timeout: 3500 ms catches the pre-fix regression (BZPOPMIN running full 3 s BLOCK without returning WRONGTYPE) while the fix surfaces WRONGTYPE well within that window.

Doc comment relocation (TestRedis_BZPopMinTimesOutOnEmptyKey): the misplaced comment (previously prepended to TestRedis_BZPopMinRejectsInitialWrongType) is correctly restored above its own function.

One minor nit: the test creates a full 3-node cluster via createNode(t, 3). The signal-pressure scenario is entirely in-process (same node), so nodes[0] is the only node exercised. A single-node setup via createNode(t, 1) would reduce test overhead, though this is a style preference — the existing tests in the file all use 3 nodes for consistency.

Self-review lens check (CLAUDE.md)

Lens	Status
Data loss	No path changes to `persistBZPopMinResult`; loop-control only
Concurrency	`lastFullCheck` is a local variable; no shared state added; `waitForBlockedCommandUpdate` contract unchanged
Performance	One `time.Since` per loop iteration; negligible vs. existing `ScanAt` cost
Data consistency	WRONGTYPE ceiling restored to ≤100 ms regardless of signal rate
Test coverage	Regression test added before fix, per CLAUDE.md convention

Verdict

Approve. The fix is minimal, correct, and well-tested. No blocking issues. The nit on 3-node vs. 1-node setup and the observation on lastFullCheck initialization are both non-blocking.

gemini-code-assist Bot reviewed Apr 26, 2026

View reviewed changes

bootjp merged commit 303ebe2 into main Apr 27, 2026
11 checks passed

bootjp deleted the perf/bzpopmin-empty-zset-fastpath branch April 27, 2026 16:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(redis): bound BZPOPMIN fast mode to redisBlockWaitFallback#683

fix(redis): bound BZPOPMIN fast mode to redisBlockWaitFallback#683
bootjp merged 1 commit intomainfrom
perf/bzpopmin-empty-zset-fastpath

bootjp commented Apr 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

chatgpt-codex-connector Bot commented Apr 26, 2026

Uh oh!

coderabbitai Bot commented Apr 26, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 26, 2026

Uh oh!

bootjp commented Apr 27, 2026

Uh oh!

claude Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bootjp commented Apr 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

chatgpt-codex-connector Bot commented Apr 26, 2026

Uh oh!

coderabbitai Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Apr 27, 2026

Uh oh!

claude Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #683

Summary

adapter/redis_compat_commands.go — Core fix

adapter/redis_compat_commands.go — Drive-by cleanups

adapter/redis_bzpopmin_wake_test.go — New test

Self-review lens check (CLAUDE.md)

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bootjp commented Apr 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 26, 2026 •

edited

Loading

claude Bot commented Apr 27, 2026 •

edited

Loading

`adapter/redis_compat_commands.go` — Core fix

`adapter/redis_compat_commands.go` — Drive-by cleanups

`adapter/redis_bzpopmin_wake_test.go` — New test