refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline#57
Conversation
Moves semaphore permit acquisition to `ColdStorageHandle` so permits travel with read requests into the channel. The task runner splits into two concurrent subtasks: - **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers, wrapping each in a per-request deadline (default 5s). - **Writer**: consumes writes sequentially. Drain-before-write uses `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select so shutdown cannot hang on a stuck reader. The semaphore is now the single backpressure mechanism. The read channel is sized to match permit count, so `try_send` from a caller holding a permit is guaranteed to have capacity. New `ColdStorageError::Timeout` is returned to callers whose handler exceeds the deadline; dropping the handler future releases its permit back to the pool, so a stuck backend call self-heals. Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend` helper and four new regression cases: fairness under saturation, cancel during reader backpressure, cancel during write drain, and operation-deadline permit release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`UnifiedStorage::append_blocks` dispatches to cold asynchronously. With the cold task's dispatcher and writer now running on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read — the tests that assumed one were relying on an implementation detail that production code already polls around (see `components/crates/node-tests/src/context.rs`). Add a `wait_for_cold_height` helper matching the production pattern and use it in the two tests that issued a read immediately after `append_blocks`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[Claude Code] @rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands. @Fraser999 @Evalir requesting review. |
|
[Claude Code] Second-pass review surfaced a deadlock hazard that I'd like to flag before this merges. Critical:
|
Summary
Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:
acquire_many_owned(64), and wedges shutdown — same class of bug as the original, just a narrower failure envelope.Design
See
docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.mdon the design discussion thread (PR #56 comment).Permit-attached messages.
ColdStorageHandleacquires a semaphore permit before sending; the permit travels inPermittedReadRequestand is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match permit count sotry_sendon the handle side is infallible (modulo shutdown).Split task runner.
run_dispatcherpullsPermittedReadRequests and spawns handlers.run_writerconsumes writes sequentially, drains viaacquire_many_owned(64)wrapped in a cancel-select, then executes the write. Dispatcher runs continuously so permits attached to queued messages never strand during drain.Per-request deadline.
ColdStorageTask::with_read_deadline(Duration)(default 5s) wraps each non-stream handler intokio::time::timeout. On expiry the caller receivesColdStorageError::Timeout, a WARN is emitted with the operation variant, and the permit returns to the pool.Tests
crates/cold/tests/concurrency.rsexpanded with aGatedBackendhelper that blocks every read call on a test-controlled semaphore:reads_above_concurrency_cap_do_not_deadlock(carried over)write_after_saturating_reads_makes_progress(carried over)fairness_write_serves_before_later_readers(new) — verifies tokio FIFO fairness keeps the writer ahead of later readerscancel_during_reader_backpressure_shuts_down(new)cancel_during_write_drain_shuts_down(new) — would fail without the cancel-select on the writer's drainoperation_deadline_releases_permit(new) — verifies Timeout is returned and the permit rejoins the poolBehavioral note
UnifiedStorage::append_blocksdispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code atcomponents/crates/node-tests/src/context.rs:380-393already polls for cold to catch up; two in-repo unit tests (append_and_read_back,drain_above_empty_when_at_tip) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.Test plan
cargo test -p signet-cold(conformance + 6 concurrency tests)cargo test --workspacecargo +nightly fmt -- --checkcargo clippy --workspace --all-targets --all-features -- -D warningscargo clippy --workspace --all-targets --no-default-features -- -D warningsRUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-depssignet-cold+signet-storage, bumpinit4tech/node-components, rebuildsignet-sidecar:latest, redeploy to devmainnet, confirm no backpressure-induced crashes over a full day.🤖 Generated with Claude Code