Skip to content

fix(redis): bound BZPOPMIN fast mode to redisBlockWaitFallback#683

Merged
bootjp merged 1 commit intomainfrom
perf/bzpopmin-empty-zset-fastpath
Apr 27, 2026
Merged

fix(redis): bound BZPOPMIN fast mode to redisBlockWaitFallback#683
bootjp merged 1 commit intomainfrom
perf/bzpopmin-empty-zset-fastpath

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 26, 2026

Address Codex P1 / Claude bot review on 6173f58:

The previous bzpopminWaitLoop set fast directly from waitForBlockedCommandUpdate's return value. Under sustained zsetWaiters.Signal pressure (a ZADD/ZINCRBY hot key whose signals fire faster than each bzpopminTryAllKeys round finishes) the buffered-1 waiterC stays perpetually full: every wait returns via the signal arm, fast stays true forever, the 100 ms fallback timer never fires. A wrongType write (HSET / SET / etc.) on a co- registered key (multi-key BZPOPMIN) then goes undetected for the entire BLOCK window — a regression vs #666's strict 100 ms ceiling.

Fix: track the wall time of the last full check inside bzpopminWaitLoop and demote fast back to false once redisBlockWaitFallback has elapsed since that check. Signal-driven wakes still take the fast path during the first 100 ms after a full check; after that, the next wake (signal or timer) runs the full check and resets the budget. The 100 ms WRONGTYPE-detection ceiling is restored regardless of signal rate.

Per CLAUDE.md regression-test convention, the failing scenario is locked down by TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad before the fix:

  • A signal-pressure goroutine drives zsetWaiters.Signal in a tight loop on bzpop-press-hot. The reader's waiter, registered on both bzpop-press-cold and bzpop-press-hot, wakes on every signal.
  • After the reader settles into fast mode (~200 ms), a string is SET at bzpop-press-cold.
  • Pre-fix: the loop never demotes fast=false, the fast-path probe on bzpop-press-cold never detects the wrongType, BZPOPMIN times out at the 3 s BLOCK deadline.
  • Post-fix: the wall-time gate forces a full check within ~100 ms of the SET, WRONGTYPE surfaces well before the BLOCK deadline.

Verified: the test fails on parent commit 6173f58 (loop ran for the full 3 s without surfacing WRONGTYPE) and passes on this commit (returns WRONGTYPE in ~100 ms).

Drive-by cleanups from the same review pass:

  • Collapse tryBZPopMin / tryBZPopMinFast wrappers; bzpopminTryAllKeys calls tryBZPopMinWithMode directly with the fast bool. Both wrappers had no other callers (per grep) and only added one level of indirection.
  • Move TestRedis_BZPopMinTimesOutOnEmptyKey's stranded doc comment back over its own function; the previous PR mistakenly prepended it to TestRedis_BZPopMinRejectsInitialWrongType.

Self-review (CLAUDE.md 5 lenses):

  1. Data loss -- None. Loop control flow only; persistBZPopMinResult path unchanged.
  2. Concurrency -- The wall-time gate is a local variable; no shared state added. waitForBlockedCommandUpdate's signaled-bool contract is unchanged.
  3. Performance -- One time.Since per loop iteration on top of the existing ScanAt cost; negligible. Worst-case fast-path utilisation drops from "always" to "first 100 ms of every full check window," but the fast-mode CPU win is still claimed for the dominant signal-driven case (most BZPOPMIN BLOCK windows complete within one full-check window because the consumer pops the first matched ZADD).
  4. Data consistency -- WRONGTYPE detection ceiling is back to 100 ms regardless of signal rate. Fast-path correctness invariants from perf(redis): empty-zset fast path for BZPOPMIN signal-driven wakes #677 (signal-driven wake guarantees no wrongType transition) are unchanged.
  5. Test coverage -- regression-test-first per CLAUDE.md convention. Existing TestRedis_BZPopMinRejectsInitialWrongType, TestRedis_BZPopMinDetectsMidBlockWrongType, and TestRedis_BZPopMinTimesOutOnEmptyKey continue to lock down the other branches.

Summary by CodeRabbit

  • Bug Fixes

    • BZPOPMIN now correctly detects wrong-type errors even under sustained signal-driven conditions, preventing timeouts in edge cases
  • Tests

    • Added test to verify BZPOPMIN wrong-type error detection under high signal load

Address Codex P1 / Claude bot review on 6173f58:

The previous bzpopminWaitLoop set `fast` directly from
waitForBlockedCommandUpdate's return value. Under sustained
zsetWaiters.Signal pressure (a ZADD/ZINCRBY hot key whose signals
fire faster than each bzpopminTryAllKeys round finishes) the
buffered-1 waiterC stays perpetually full: every wait returns via
the signal arm, fast stays true forever, the 100 ms fallback timer
never fires. A wrongType write (HSET / SET / etc.) on a co-
registered key (multi-key BZPOPMIN) then goes undetected for the
entire BLOCK window — a regression vs #666's strict 100 ms
ceiling.

Fix: track the wall time of the last full check inside
bzpopminWaitLoop and demote `fast` back to false once
redisBlockWaitFallback has elapsed since that check. Signal-driven
wakes still take the fast path during the first 100 ms after a
full check; after that, the next wake (signal or timer) runs the
full check and resets the budget. The 100 ms WRONGTYPE-detection
ceiling is restored regardless of signal rate.

Per CLAUDE.md regression-test convention, the failing scenario is
locked down by TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad
before the fix:

- A signal-pressure goroutine drives zsetWaiters.Signal in a tight
  loop on bzpop-press-hot. The reader's waiter, registered on both
  bzpop-press-cold and bzpop-press-hot, wakes on every signal.
- After the reader settles into fast mode (~200 ms), a string is
  SET at bzpop-press-cold.
- Pre-fix: the loop never demotes fast=false, the fast-path probe
  on bzpop-press-cold never detects the wrongType, BZPOPMIN times
  out at the 3 s BLOCK deadline.
- Post-fix: the wall-time gate forces a full check within ~100 ms
  of the SET, WRONGTYPE surfaces well before the BLOCK deadline.

Verified: the test fails on parent commit 6173f58 (loop ran for
the full 3 s without surfacing WRONGTYPE) and passes on this
commit (returns WRONGTYPE in ~100 ms).

Drive-by cleanups from the same review pass:

- Collapse tryBZPopMin / tryBZPopMinFast wrappers; bzpopminTryAllKeys
  calls tryBZPopMinWithMode directly with the fast bool. Both
  wrappers had no other callers (per grep) and only added one
  level of indirection.
- Move TestRedis_BZPopMinTimesOutOnEmptyKey's stranded doc comment
  back over its own function; the previous PR mistakenly prepended
  it to TestRedis_BZPopMinRejectsInitialWrongType.

Self-review (CLAUDE.md 5 lenses):
1. Data loss -- None. Loop control flow only; persistBZPopMinResult
   path unchanged.
2. Concurrency -- The wall-time gate is a local variable; no
   shared state added. waitForBlockedCommandUpdate's signaled-bool
   contract is unchanged.
3. Performance -- One time.Since per loop iteration on top of the
   existing ScanAt cost; negligible. Worst-case fast-path
   utilisation drops from "always" to "first 100 ms of every full
   check window," but the fast-mode CPU win is still claimed for
   the dominant signal-driven case (most BZPOPMIN BLOCK windows
   complete within one full-check window because the consumer pops
   the first matched ZADD).
4. Data consistency -- WRONGTYPE detection ceiling is back to 100
   ms regardless of signal rate. Fast-path correctness invariants
   from #677 (signal-driven wake guarantees no wrongType
   transition) are unchanged.
5. Test coverage -- regression-test-first per CLAUDE.md convention.
   Existing TestRedis_BZPopMinRejectsInitialWrongType,
   TestRedis_BZPopMinDetectsMidBlockWrongType, and
   TestRedis_BZPopMinTimesOutOnEmptyKey continue to lock down the
   other branches.
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe5b5002-f418-4658-8a69-0aa8e6063b64

📥 Commits

Reviewing files that changed from the base of the PR and between 98d38b5 and fe5b718.

📒 Files selected for processing (2)
  • adapter/redis_bzpopmin_wake_test.go
  • adapter/redis_compat_commands.go

📝 Walkthrough

Walkthrough

The PR adds a test case for BZPOPMIN wrong-type detection under sustained signal pressure and refactors the waiting loop to enforce time-bounded fast-mode checks instead of separate entry points, preventing starvation while maintaining wrong-type error detection.

Changes

Cohort / File(s) Summary
BZPOPMIN Signal Load Test
adapter/redis_bzpopmin_wake_test.go
Added TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad to verify BZPOPMIN returns WRONGTYPE error under repeated signal-driven wakes; relocated existing timeout contract test.
BZPOPMIN Fast Mode Refactoring
adapter/redis_compat_commands.go
Removed separate tryBZPopMin/tryBZPopMinFast entry points; bzpopminWaitLoop now uses lastFullCheck to enforce time-bounded fast mode, demoting to full mode if threshold exceeded to prevent starvation and ensure wrong-type detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A signal wakes the waiting pop,
Through fast-path loops it won't just stop—
But time keeps watch with lastFullCheck,
So wrong-type errors won't be wreck'd!
Hopping safe, both quick and slow. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main fix: bounding BZPOPMIN's fast mode to the redisBlockWaitFallback timeout, which directly addresses the regression bug described in the objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/bzpopmin-empty-zset-fastpath

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a bug in the BZPOPMIN command where sustained signal pressure could cause the command to remain in 'fast mode' indefinitely, potentially missing WRONGTYPE errors on co-registered keys. The implementation introduces a lastFullCheck timestamp to ensure a full probe is performed at least once every redisBlockWaitFallback interval (100ms), even under heavy signal load. A new regression test, TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad, has been added to verify the fix. Review feedback suggests evaluating whether a shorter fallback duration, such as 10ms, would be more appropriate for low-latency performance.

}
fast = waitForBlockedCommandUpdate(handlerCtx, w.C, deadline)
signaled := waitForBlockedCommandUpdate(handlerCtx, w.C, deadline)
fast = signaled && time.Since(lastFullCheck) < redisBlockWaitFallback
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for demoting fast mode based on wall time is sound. However, for low-latency systems, consider if the 100ms ceiling (redisBlockWaitFallback) is too high; a shorter maximum backoff duration like 10ms is often preferred to handle contention more effectively.

References
  1. For low-latency systems, it is acceptable to use a short maximum backoff duration for retries (e.g., 10ms) as long-lived contention is not the primary scenario to handle.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 2m 23s —— View job


Code Review — PR #683

  • Gather context
  • Read changed files
  • Review implementation correctness
  • Review test quality
  • Post final review

Summary

This PR correctly fixes a liveness regression introduced in #677 (commit 6173f584) and follows the CLAUDE.md regression-test-first convention. The core bug and fix are sound; review details below.


adapter/redis_compat_commands.go — Core fix

The bug (pre-fix)

fast = waitForBlockedCommandUpdate(handlerCtx, w.C, deadline)

Under sustained signal pressure, waiterC (buffered-1) is perpetually full, so the select in waitForBlockedCommandUpdate always picks the <-waiterC arm. The fallback timer never fires, fast stays true forever, and a wrongType write on any co-registered key goes undetected for the entire BLOCK window.

The fix

signaled := waitForBlockedCommandUpdate(handlerCtx, w.C, deadline)
fast = signaled && time.Since(lastFullCheck) < redisBlockWaitFallback

lastFullCheck is updated only when !fast (i.e., when a full check actually ran). After redisBlockWaitFallback (100 ms) elapses, time.Since(lastFullCheck) >= 100ms, so fast is demoted to false even when signals keep arriving. This restores the 100 ms WRONGTYPE-detection ceiling from #666.

Logic trace — no correctness issues found

State Action
fast=false (full check) bzpopminTryAllKeys runs full probe; lastFullCheck = time.Now() after
fast=true, within 100 ms Fast probe; lastFullCheck not updated
fast=true, >100 ms since last full check time.Since(lastFullCheck) >= 100msfast = false on next iteration
Timer/cancellation wake waitForBlockedCommandUpdate returns falsefast = false unconditionally

All four branches correctly reach a full check within one fallback interval.

One observation on lastFullCheck initialization

fast := false
lastFullCheck := time.Now()  // ← initial value

Since fast starts as false, the very first iteration runs a full check, and then lastFullCheck = time.Now() is immediately overwritten. The initial value has no effect on correctness but is mildly redundant — time.Time{} (zero value) would work equally well (since time.Since(time.Time{}) is ~56 years, well above 100 ms, which would keep fast=false). Not worth changing; just noting it's a no-op.

On Gemini's suggestion to use 10 ms instead of 100 ms

The existing redisBlockWaitFallback = 100 ms constant is deliberately scoped to be low-CPU (1/10th of a busy-poll per the constant's own comment) while bounding stale-poll latency to values clients already tolerate from network round-trips. Using the same constant for the wall-time gate keeps the two ceilings aligned: WRONGTYPE takes at most ~100 ms whether detected via the timer path or the wall-time path. Introducing a separate tighter constant would diverge the two paths for no clear benefit. The 100 ms ceiling is correct as-is.


adapter/redis_compat_commands.go — Drive-by cleanups

tryBZPopMin / tryBZPopMinFast wrapper collapse

Clean removal. bzpopminTryAllKeys now calls tryBZPopMinWithMode directly:

// before
if fast {
    result, err = r.tryBZPopMinFast(key)
} else {
    result, err = r.tryBZPopMin(key)
}

// after
result, err := r.tryBZPopMinWithMode(key, fast)

One level of indirection removed; no callers of the wrappers remain. The tryBZPopMinWithMode doc comment is updated to describe the caller's obligation clearly (the caller MUST guarantee that the only mutations since the previous full check are signalling writes).


adapter/redis_bzpopmin_wake_test.go — New test

TestRedis_BZPopMinDetectsWrongTypeUnderSignalLoad

The test correctly locks down the regression:

  1. Synthetic signal driver: directly invokes signalServer.zsetWaiters.Signal(key) in a tight loop (same-package access, appropriate for an internal test). runtime.Gosched() lets the reader goroutine make forward progress — necessary to avoid starving the reader of CPU.

  2. Timing: 200 ms sleep before introducing the wrongType write. With the fix, a forced full check has already run within this window (100 ms budget). Pre-fix, the reader never demotes fast, so bzpop-press-cold is never probed with keyTypeAtExpect.

  3. Assertion: uses require.Contains(t, res.err.Error(), "WRONGTYPE") rather than exact-matching an error code — good, as the wire encoding may vary.

  4. Timeout: 3500 ms catches the pre-fix regression (BZPOPMIN running full 3 s BLOCK without returning WRONGTYPE) while the fix surfaces WRONGTYPE well within that window.

Doc comment relocation (TestRedis_BZPopMinTimesOutOnEmptyKey): the misplaced comment (previously prepended to TestRedis_BZPopMinRejectsInitialWrongType) is correctly restored above its own function.

One minor nit: the test creates a full 3-node cluster via createNode(t, 3). The signal-pressure scenario is entirely in-process (same node), so nodes[0] is the only node exercised. A single-node setup via createNode(t, 1) would reduce test overhead, though this is a style preference — the existing tests in the file all use 3 nodes for consistency.


Self-review lens check (CLAUDE.md)

Lens Status
Data loss No path changes to persistBZPopMinResult; loop-control only
Concurrency lastFullCheck is a local variable; no shared state added; waitForBlockedCommandUpdate contract unchanged
Performance One time.Since per loop iteration; negligible vs. existing ScanAt cost
Data consistency WRONGTYPE ceiling restored to ≤100 ms regardless of signal rate
Test coverage Regression test added before fix, per CLAUDE.md convention

Verdict

Approve. The fix is minimal, correct, and well-tested. No blocking issues. The nit on 3-node vs. 1-node setup and the observation on lastFullCheck initialization are both non-blocking.

@bootjp bootjp merged commit 303ebe2 into main Apr 27, 2026
11 checks passed
@bootjp bootjp deleted the perf/bzpopmin-empty-zset-fastpath branch April 27, 2026 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant