
refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline#57

Open
prestwich wants to merge 2 commits into swanny/fix-cold-reader-deadlock from prestwich/cold-permit-refactor

Conversation

@prestwich
Member

Summary

Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:

  1. Write-drain is not cancellable. A single stuck reader holds its permit, blocks acquire_many_owned(64), and wedges shutdown — same class of bug as the original, just a narrower failure envelope.
  2. Two backpressure mechanisms (256-slot mpsc + 64-permit semaphore) overlap without a purpose. The channel buffer is 4× the in-flight cap.
  3. No handler-side operation deadline. At the 12s write cadence, a single stuck reader extends every drain and eventually re-fills the write mpsc — regenerating the exact production failure that PR #56 (fix(cold): replace read-arm TaskTracker backpressure with Semaphore) fixes.

Design

See docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.md on the design discussion thread (PR #56 comment).

Permit-attached messages. ColdStorageHandle acquires a semaphore permit before sending; the permit travels in PermittedReadRequest and is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match permit count so try_send on the handle side is infallible (modulo shutdown).

Split task runner. run_dispatcher pulls PermittedReadRequests and spawns handlers. run_writer consumes writes sequentially, drains via acquire_many_owned(64) wrapped in a cancel-select, then executes the write. Dispatcher runs continuously so permits attached to queued messages never strand during drain.

Per-request deadline. ColdStorageTask::with_read_deadline(Duration) (default 5s) wraps each non-stream handler in tokio::time::timeout. On expiry the caller receives ColdStorageError::Timeout, a WARN is emitted with the operation variant, and the permit returns to the pool.

Tests

crates/cold/tests/concurrency.rs expanded with a GatedBackend helper that blocks every read call on a test-controlled semaphore:

  • reads_above_concurrency_cap_do_not_deadlock (carried over)
  • write_after_saturating_reads_makes_progress (carried over)
  • fairness_write_serves_before_later_readers (new) — verifies tokio FIFO fairness keeps the writer ahead of later readers
  • cancel_during_reader_backpressure_shuts_down (new)
  • cancel_during_write_drain_shuts_down (new) — would fail without the cancel-select on the writer's drain
  • operation_deadline_releases_permit (new) — verifies Timeout is returned and the permit rejoins the pool

Behavioral note

UnifiedStorage::append_blocks dispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code at components/crates/node-tests/src/context.rs:380-393 already polls for cold to catch up; two in-repo unit tests (append_and_read_back, drain_above_empty_when_at_tip) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.

Test plan

  • cargo test -p signet-cold (conformance + 6 concurrency tests)
  • cargo test --workspace
  • cargo +nightly fmt -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo clippy --workspace --all-targets --no-default-features -- -D warnings
  • RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps
  • Reviewer spot-check: cut a patch release of signet-cold + signet-storage, bump init4tech/node-components, rebuild signet-sidecar:latest, redeploy to dev mainnet, confirm no backpressure-induced crashes over a full day.

🤖 Generated with Claude Code

prestwich and others added 2 commits April 17, 2026 00:09
Moves semaphore permit acquisition to `ColdStorageHandle` so permits
travel with read requests into the channel. The task runner splits
into two concurrent subtasks:

- **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers,
  wrapping each in a per-request deadline (default 5s).
- **Writer**: consumes writes sequentially. Drain-before-write uses
  `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select
  so shutdown cannot hang on a stuck reader.

The semaphore is now the single backpressure mechanism. The read
channel is sized to match permit count, so `try_send` from a caller
holding a permit is guaranteed to have capacity.

New `ColdStorageError::Timeout` is returned to callers whose handler
exceeds the deadline; dropping the handler future releases its permit
back to the pool, so a stuck backend call self-heals.

Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend`
helper and four new regression cases: fairness under saturation,
cancel during reader backpressure, cancel during write drain, and
operation-deadline permit release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`UnifiedStorage::append_blocks` dispatches to cold asynchronously.
With the cold task's dispatcher and writer now running on separate
subtasks, there is no biased ordering between a fire-and-forget
write and a subsequent read — the tests that assumed one were
relying on an implementation detail that production code already
polls around (see `components/crates/node-tests/src/context.rs`).

Add a `wait_for_cold_height` helper matching the production pattern
and use it in the two tests that issued a read immediately after
`append_blocks`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prestwich prestwich requested review from Evalir and Fraser999 April 17, 2026 04:15
@prestwich
Member Author

[Claude Code]

@rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands.

@Fraser999 @Evalir requesting review.

@Evalir
Member

Evalir commented Apr 17, 2026

[Claude Code]

Second-pass review surfaced a deadlock hazard that I'd like to flag before this merges.

Critical: StreamLogs setup holds the outer read-semaphore permit across unbounded awaits

The exemption at runner.rs:236-242 skips run_with_deadline for the entire handle_stream_logs call, not just the spawned producer. The dispatcher-spawned task still holds its outer permit for the full duration, and inside handle_stream_logs there are two awaits with no bound visible to the writer's drain:

  1. stream_semaphore.acquire_owned().await (runner.rs:264-269) — blocks when all 8 stream slots are held; upper bound is the existing stream's remaining deadline, up to max_stream_deadline = 60s.
  2. self.read_backend.get_latest_block().await (runner.rs:275) — unbounded backend call, no deadline applied.

Throughout both, acquire_many_owned(64) on the writer side is waiting on that permit.

Scenario A — stream saturation. 8 long-running streams in flight; a 9th stream_logs call parks in handle_stream_logs for up to 60s while holding its outer permit. Writer drain blocks for the same window. At the 12s write cadence, writes stack in the 256-slot mpsc; not a permanent wedge, but enough to cause visible backpressure under a steady state of 8 saturated streams.

Scenario B — backend hang on get_latest_block. A single stream_logs call with to_block = None hits a stuck pool connection. No deadline applies. Outer permit held indefinitely. Writer drain never completes. Write mpsc fills at 12s cadence → Backpressure errors → the original production crash, verbatim. The deadline doc comment explicitly sizes the 5s deadline "well below the 12s write cadence at which a stuck reader would otherwise repressurize the write mpsc" — this path is the one that can exceed it.

Scenario B is the sharper concern, especially with the SQL backend index paths from #54 now live.

Fix options (cheapest first)

  1. Wrap handle_stream_logs in tokio::time::timeout(read_deadline, …). One-line bound on the setup phase. The producer runs inside stream_tracker.spawn, so it's already decoupled from the outer permit and untouched by this change.
  2. Timeout only the inner get_latest_block inside handle_stream_logs. Addresses (B) only. Marginal win over (1).
  3. Don't take an outer permit for StreamLogs at all. Streams are explicitly exempt from the drain-before-write invariant, so consuming the drain barrier for stream setup is incidental and harmful. Structural fix: either detect StreamLogs in the dispatcher and drop the permit before dispatching, or split the handle-side path so stream_logs uses stream_semaphore alone.

(1) is the lowest-risk ship-blocker fix. (3) is the architecturally correct answer if a follow-up refactor is acceptable.

Secondary: shutdown time bounded by the longest active stream

run() awaits task_tracker.wait() before closing stream_tracker. A StreamLogs handler parked in handle_stream_logs at cancellation time blocks shutdown until it returns naturally — up to 60s. The option-1 fix above also bounds this.

Re-verified minor items from the previous pass

  • Handle-side acquire_owned after shutdown still waits for in-flight readers to release before reporting TaskTerminated. Calling read_semaphore.close() at the end of run() would make the failure immediate.
  • Backend cancel-safety on deadline expiry — still worth soak-validating on the SQL backend.
  • Permit-attached-message invariant, cancel-select on acquire_many_owned, and FIFO drain fairness all hold on re-trace.

The permit/drain machinery itself is well-designed — this is specifically about the StreamLogs path not having been updated for the new invariant.
