fix(redis): bound BZPOPMIN fast mode to redisBlockWaitFallback#683
fix(redis): bound BZPOPMIN fast mode to redisBlockWaitFallback#683
Conversation
Address Codex P1 / Claude bot review on 6173f58: The previous bzpopminWaitLoop set `fast` directly from waitForBlockedCommandUpdate's return value. Under sustained zsetWaiters.Signal pressure (a ZADD/ZINCRBY hot key whose signals fire faster than each bzpopminTryAllKeys round finishes) the buffered-1 waiterC stays perpetually full: every wait returns via the signal arm, fast stays true forever, the 100 ms fallback timer never fires. A wrongType write (HSET / SET / etc.) on a co- registered key (multi-key BZPOPMIN) then goes undetected for the entire BLOCK window — a regression vs #666's strict 100 ms ceiling. Fix: track the wall time of the last full check inside bzpopminWaitLoop and demote `fast` back to false once redisBlockWaitFallback has elapsed since that check. Signal-driven wakes still take the fast path during the first 100 ms after a full check; after that, the next wake (signal or timer) runs the full check and resets the budget. The 100 ms WRONGTYPE-detection ceiling is restored regardless of signal rate. Per CLAUDE.md regression-test convention, the failing scenario is locked down by TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad before the fix: - A signal-pressure goroutine drives zsetWaiters.Signal in a tight loop on bzpop-press-hot. The reader's waiter, registered on both bzpop-press-cold and bzpop-press-hot, wakes on every signal. - After the reader settles into fast mode (~200 ms), a string is SET at bzpop-press-cold. - Pre-fix: the loop never demotes fast=false, the fast-path probe on bzpop-press-cold never detects the wrongType, BZPOPMIN times out at the 3 s BLOCK deadline. - Post-fix: the wall-time gate forces a full check within ~100 ms of the SET, WRONGTYPE surfaces well before the BLOCK deadline. Verified: the test fails on parent commit 6173f58 (loop ran for the full 3 s without surfacing WRONGTYPE) and passes on this commit (returns WRONGTYPE in ~100 ms). Drive-by cleanups from the same review pass: - Collapse tryBZPopMin / tryBZPopMinFast wrappers; bzpopminTryAllKeys calls tryBZPopMinWithMode directly with the fast bool. Both wrappers had no other callers (per grep) and only added one level of indirection. - Move TestRedis_BZPopMinTimesOutOnEmptyKey's stranded doc comment back over its own function; the previous PR mistakenly prepended it to TestRedis_BZPopMinRejectsInitialWrongType. Self-review (CLAUDE.md 5 lenses): 1. Data loss -- None. Loop control flow only; persistBZPopMinResult path unchanged. 2. Concurrency -- The wall-time gate is a local variable; no shared state added. waitForBlockedCommandUpdate's signaled-bool contract is unchanged. 3. Performance -- One time.Since per loop iteration on top of the existing ScanAt cost; negligible. Worst-case fast-path utilisation drops from "always" to "first 100 ms of every full check window," but the fast-mode CPU win is still claimed for the dominant signal-driven case (most BZPOPMIN BLOCK windows complete within one full-check window because the consumer pops the first matched ZADD). 4. Data consistency -- WRONGTYPE detection ceiling is back to 100 ms regardless of signal rate. Fast-path correctness invariants from #677 (signal-driven wake guarantees no wrongType transition) are unchanged. 5. Test coverage -- regression-test-first per CLAUDE.md convention. Existing TestRedis_BZPopMinRejectsInitialWrongType, TestRedis_BZPopMinDetectsMidBlockWrongType, and TestRedis_BZPopMinTimesOutOnEmptyKey continue to lock down the other branches.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR adds a test case for BZPOPMIN wrong-type detection under sustained signal pressure and refactors the waiting loop to enforce time-bounded fast-mode checks instead of separate entry points, preventing starvation while maintaining wrong-type error detection. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses a bug in the BZPOPMIN command where sustained signal pressure could cause the command to remain in 'fast mode' indefinitely, potentially missing WRONGTYPE errors on co-registered keys. The implementation introduces a lastFullCheck timestamp to ensure a full probe is performed at least once every redisBlockWaitFallback interval (100ms), even under heavy signal load. A new regression test, TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad, has been added to verify the fix. Review feedback suggests evaluating whether a shorter fallback duration, such as 10ms, would be more appropriate for low-latency performance.
| } | ||
| fast = waitForBlockedCommandUpdate(handlerCtx, w.C, deadline) | ||
| signaled := waitForBlockedCommandUpdate(handlerCtx, w.C, deadline) | ||
| fast = signaled && time.Since(lastFullCheck) < redisBlockWaitFallback |
There was a problem hiding this comment.
The logic for demoting fast mode based on wall time is sound. However, for low-latency systems, consider if the 100ms ceiling (redisBlockWaitFallback) is too high; a shorter maximum backoff duration like 10ms is often preferred to handle contention more effectively.
References
- For low-latency systems, it is acceptable to use a short maximum backoff duration for retries (e.g., 10ms) as long-lived contention is not the primary scenario to handle.
|
@claude review |
|
Claude finished @bootjp's task in 2m 23s —— View job Code Review — PR #683
SummaryThis PR correctly fixes a liveness regression introduced in #677 (commit
|
| State | Action |
|---|---|
fast=false (full check) |
bzpopminTryAllKeys runs full probe; lastFullCheck = time.Now() after |
fast=true, within 100 ms |
Fast probe; lastFullCheck not updated |
fast=true, >100 ms since last full check |
time.Since(lastFullCheck) >= 100ms → fast = false on next iteration |
| Timer/cancellation wake | waitForBlockedCommandUpdate returns false → fast = false unconditionally |
All four branches correctly reach a full check within one fallback interval.
One observation on lastFullCheck initialization
fast := false
lastFullCheck := time.Now() // ← initial valueSince fast starts as false, the very first iteration runs a full check, and then lastFullCheck = time.Now() is immediately overwritten. The initial value has no effect on correctness but is mildly redundant — time.Time{} (zero value) would work equally well (since time.Since(time.Time{}) is ~56 years, well above 100 ms, which would keep fast=false). Not worth changing; just noting it's a no-op.
On Gemini's suggestion to use 10 ms instead of 100 ms
The existing redisBlockWaitFallback = 100 ms constant is deliberately scoped to be low-CPU (1/10th of a busy-poll per the constant's own comment) while bounding stale-poll latency to values clients already tolerate from network round-trips. Using the same constant for the wall-time gate keeps the two ceilings aligned: WRONGTYPE takes at most ~100 ms whether detected via the timer path or the wall-time path. Introducing a separate tighter constant would diverge the two paths for no clear benefit. The 100 ms ceiling is correct as-is.
adapter/redis_compat_commands.go — Drive-by cleanups
tryBZPopMin / tryBZPopMinFast wrapper collapse
Clean removal. bzpopminTryAllKeys now calls tryBZPopMinWithMode directly:
// before
if fast {
result, err = r.tryBZPopMinFast(key)
} else {
result, err = r.tryBZPopMin(key)
}
// after
result, err := r.tryBZPopMinWithMode(key, fast)One level of indirection removed; no callers of the wrappers remain. The tryBZPopMinWithMode doc comment is updated to describe the caller's obligation clearly (the caller MUST guarantee that the only mutations since the previous full check are signalling writes).
adapter/redis_bzpopmin_wake_test.go — New test
TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad
The test correctly locks down the regression:
-
Synthetic signal driver: directly invokes
signalServer.zsetWaiters.Signal(key)in a tight loop (same-package access, appropriate for an internal test).runtime.Gosched()lets the reader goroutine make forward progress — necessary to avoid starving the reader of CPU. -
Timing: 200 ms sleep before introducing the wrongType write. With the fix, a forced full check has already run within this window (100 ms budget). Pre-fix, the reader never demotes
fast, sobzpop-press-coldis never probed withkeyTypeAtExpect. -
Assertion: uses
require.Contains(t, res.err.Error(), "WRONGTYPE")rather than exact-matching an error code — good, as the wire encoding may vary. -
Timeout: 3500 ms catches the pre-fix regression (BZPOPMIN running full 3 s BLOCK without returning WRONGTYPE) while the fix surfaces WRONGTYPE well within that window.
Doc comment relocation (TestRedis_BZPopMinTimesOutOnEmptyKey): the misplaced comment (previously prepended to TestRedis_BZPopMinRejectsInitialWrongType) is correctly restored above its own function.
One minor nit: the test creates a full 3-node cluster via createNode(t, 3). The signal-pressure scenario is entirely in-process (same node), so nodes[0] is the only node exercised. A single-node setup via createNode(t, 1) would reduce test overhead, though this is a style preference — the existing tests in the file all use 3 nodes for consistency.
Self-review lens check (CLAUDE.md)
| Lens | Status |
|---|---|
| Data loss | No path changes to persistBZPopMinResult; loop-control only |
| Concurrency | lastFullCheck is a local variable; no shared state added; waitForBlockedCommandUpdate contract unchanged |
| Performance | One time.Since per loop iteration; negligible vs. existing ScanAt cost |
| Data consistency | WRONGTYPE ceiling restored to ≤100 ms regardless of signal rate |
| Test coverage | Regression test added before fix, per CLAUDE.md convention |
Verdict
Approve. The fix is minimal, correct, and well-tested. No blocking issues. The nit on 3-node vs. 1-node setup and the observation on lastFullCheck initialization are both non-blocking.
Address Codex P1 / Claude bot review on 6173f58:
The previous bzpopminWaitLoop set
fastdirectly from waitForBlockedCommandUpdate's return value. Under sustained zsetWaiters.Signal pressure (a ZADD/ZINCRBY hot key whose signals fire faster than each bzpopminTryAllKeys round finishes) the buffered-1 waiterC stays perpetually full: every wait returns via the signal arm, fast stays true forever, the 100 ms fallback timer never fires. A wrongType write (HSET / SET / etc.) on a co- registered key (multi-key BZPOPMIN) then goes undetected for the entire BLOCK window — a regression vs #666's strict 100 ms ceiling.Fix: track the wall time of the last full check inside bzpopminWaitLoop and demote
fastback to false once redisBlockWaitFallback has elapsed since that check. Signal-driven wakes still take the fast path during the first 100 ms after a full check; after that, the next wake (signal or timer) runs the full check and resets the budget. The 100 ms WRONGTYPE-detection ceiling is restored regardless of signal rate.Per CLAUDE.md regression-test convention, the failing scenario is locked down by TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad before the fix:
Verified: the test fails on parent commit 6173f58 (loop ran for the full 3 s without surfacing WRONGTYPE) and passes on this commit (returns WRONGTYPE in ~100 ms).
Drive-by cleanups from the same review pass:
Self-review (CLAUDE.md 5 lenses):
Summary by CodeRabbit
Bug Fixes
Tests