
refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline#57

Open
prestwich wants to merge 2 commits into swanny/fix-cold-reader-deadlock from prestwich/cold-permit-refactor

Conversation

@prestwich
Member

Summary

Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:

  1. Write-drain is not cancellable. A single stuck reader holds its permit, blocks acquire_many_owned(64), and wedges shutdown — same class of bug as the original, just a narrower failure envelope.
  2. Two backpressure mechanisms (256-slot mpsc + 64-permit semaphore) overlap without a purpose. The channel buffer is 4× the in-flight cap.
  3. No handler-side operation deadline. At the 12s write cadence, a single stuck reader extends every drain and eventually re-fills the write mpsc — regenerating the exact production failure that PR #56 (fix(cold): replace read-arm TaskTracker backpressure with Semaphore) fixes.

Design

See docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.md on the design discussion thread (PR #56 comment).

Permit-attached messages. ColdStorageHandle acquires a semaphore permit before sending; the permit travels in PermittedReadRequest and is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match permit count so try_send on the handle side is infallible (modulo shutdown).

Split task runner. run_dispatcher pulls PermittedReadRequests and spawns handlers. run_writer consumes writes sequentially, drains via acquire_many_owned(64) wrapped in a cancel-select, then executes the write. Dispatcher runs continuously so permits attached to queued messages never strand during drain.

Per-request deadline. ColdStorageTask::with_read_deadline(Duration) (default 5s) wraps each non-stream handler in tokio::time::timeout. On expiry the caller receives ColdStorageError::Timeout, a WARN is emitted with the operation variant, and the permit returns to the pool.

Tests

crates/cold/tests/concurrency.rs expanded with a GatedBackend helper that blocks every read call on a test-controlled semaphore:

  • reads_above_concurrency_cap_do_not_deadlock (carried over)
  • write_after_saturating_reads_makes_progress (carried over)
  • fairness_write_serves_before_later_readers (new) — verifies tokio FIFO fairness keeps the writer ahead of later readers
  • cancel_during_reader_backpressure_shuts_down (new)
  • cancel_during_write_drain_shuts_down (new) — would fail without the cancel-select on the writer's drain
  • operation_deadline_releases_permit (new) — verifies Timeout is returned and the permit rejoins the pool

Behavioral note

UnifiedStorage::append_blocks dispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code at components/crates/node-tests/src/context.rs:380-393 already polls for cold to catch up; two in-repo unit tests (append_and_read_back, drain_above_empty_when_at_tip) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.

Test plan

  • cargo test -p signet-cold (conformance + 6 concurrency tests)
  • cargo test --workspace
  • cargo +nightly fmt -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo clippy --workspace --all-targets --no-default-features -- -D warnings
  • RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps
  • Reviewer spot-check: cut a patch release of signet-cold + signet-storage, bump init4tech/node-components, rebuild signet-sidecar:latest, redeploy to dev mainnet, confirm no backpressure-induced crashes over a full day.

🤖 Generated with Claude Code

prestwich and others added 2 commits April 17, 2026 00:09
Moves semaphore permit acquisition to `ColdStorageHandle` so permits
travel with read requests into the channel. The task runner splits
into two concurrent subtasks:

- **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers,
  wrapping each in a per-request deadline (default 5s).
- **Writer**: consumes writes sequentially. Drain-before-write uses
  `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select
  so shutdown cannot hang on a stuck reader.

The semaphore is now the single backpressure mechanism. The read
channel is sized to match permit count, so `try_send` from a caller
holding a permit is guaranteed to have capacity.

New `ColdStorageError::Timeout` is returned to callers whose handler
exceeds the deadline; dropping the handler future releases its permit
back to the pool, so a stuck backend call self-heals.

Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend`
helper and four new regression cases: fairness under saturation,
cancel during reader backpressure, cancel during write drain, and
operation-deadline permit release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`UnifiedStorage::append_blocks` dispatches to cold asynchronously.
With the cold task's dispatcher and writer now running on separate
subtasks, there is no biased ordering between a fire-and-forget
write and a subsequent read — the tests that assumed one were
relying on an implementation detail that production code already
polls around (see `components/crates/node-tests/src/context.rs`).

Add a `wait_for_cold_height` helper matching the production pattern
and use it in the two tests that issued a read immediately after
`append_blocks`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prestwich prestwich requested review from Evalir and Fraser999 April 17, 2026 04:15
@prestwich
Member Author

[Claude Code]

@rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands.

@Fraser999 @Evalir requesting review.

@Evalir
Member

Evalir commented Apr 17, 2026

[Claude Code]

Second-pass review surfaced a deadlock hazard that I'd like to flag before this merges.

Critical: StreamLogs setup holds the outer read-semaphore permit across unbounded awaits

The exemption at runner.rs:236-242 skips run_with_deadline for the entire handle_stream_logs call, not just the spawned producer. The dispatcher-spawned task still holds its outer permit for the full duration, and inside handle_stream_logs there are two awaits with no bound visible to the writer's drain:

  1. stream_semaphore.acquire_owned().await (runner.rs:264-269) — blocks when all 8 stream slots are held; upper bound is the existing stream's remaining deadline, up to max_stream_deadline = 60s.
  2. self.read_backend.get_latest_block().await (runner.rs:275) — unbounded backend call, no deadline applied.

Throughout both, acquire_many_owned(64) on the writer side is waiting on that permit.

Scenario A — stream saturation. 8 long-running streams in flight; a 9th stream_logs call parks in handle_stream_logs for up to 60s while holding its outer permit. Writer drain blocks for the same window. At the 12s write cadence, writes stack in the 256-slot mpsc; not a permanent wedge, but enough to cause visible backpressure under a steady state of 8 saturated streams.

Scenario B — backend hang on get_latest_block. A single stream_logs call with to_block = None hits a stuck pool connection. No deadline applies. Outer permit held indefinitely. Writer drain never completes. Write mpsc fills at 12s cadence → Backpressure errors → the original production crash, verbatim. The deadline doc comment explicitly sizes the 5s deadline "well below the 12s write cadence at which a stuck reader would otherwise repressurize the write mpsc" — this path is the one that can exceed it.

Scenario B is the sharper concern, especially with the SQL backend index paths from #54 now live.

Fix options (cheapest first)

  1. Wrap handle_stream_logs in tokio::time::timeout(read_deadline, …). One-line bound on the setup phase. The producer runs inside stream_tracker.spawn, so it's already decoupled from the outer permit and untouched by this change.
  2. Timeout only the inner get_latest_block inside handle_stream_logs. Addresses (B) only. Marginal win over (1).
  3. Don't take an outer permit for StreamLogs at all. Streams are explicitly exempt from the drain-before-write invariant, so consuming the drain barrier for stream setup is incidental and harmful. Structural fix: either detect StreamLogs in the dispatcher and drop the permit before dispatching, or split the handle-side path so stream_logs uses stream_semaphore alone.

(1) is the lowest-risk ship-blocker fix. (3) is the architecturally correct answer if a follow-up refactor is acceptable.

Secondary: shutdown time bounded by the longest active stream

run() awaits task_tracker.wait() before closing stream_tracker. A StreamLogs handler parked in handle_stream_logs at cancellation time blocks shutdown until it returns naturally — up to 60s. The option-1 fix above also bounds this.

Re-verified minor items from the previous pass

  • Handle-side acquire_owned after shutdown still waits for in-flight readers to release before reporting TaskTerminated. Calling read_semaphore.close() at the end of run() would make the failure immediate.
  • Backend cancel-safety on deadline expiry — still worth soak-validating on the SQL backend.
  • Permit-attached-message invariant, cancel-select on acquire_many_owned, and FIFO drain fairness all hold on re-trace.

The permit/drain machinery itself is well-designed — this is specifically about the StreamLogs path not having been updated for the new invariant.
