fix(cold): replace read-arm TaskTracker backpressure with Semaphore#56
The cold storage task's run loop wedges once `MAX_CONCURRENT_READERS` (64) read handlers are in flight. The read arm's backpressure waits on `TaskTracker::wait()`, but that future only resolves when the tracker is both *closed* and *empty*. The read arm never closes the tracker (only the write arm does, during drain-before-write), so the inner `wait()` never completes, the outer `select!` never re-polls, and no further reads *or* writes are dispatched.

Replace the TaskTracker-based backpressure with a `Semaphore`:

- Each reader acquires one permit before being spawned; the permit is released when the spawned task completes or panics.
- Writes acquire all `MAX_CONCURRENT_READERS` permits via `acquire_many_owned`, which unblocks only after every in-flight reader has released its permit — preserving the existing drain-before-write invariant with no close/reopen dance.

`TaskTracker` is kept solely for graceful shutdown, which is where its close+wait semantics are actually wanted.

Add regression tests that spawn 256 concurrent reads (4× the in-flight cap) and verify both reads and an interleaved write complete within a 15 s guard. Before this change, both tests hit the guard; after, they pass in milliseconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**Code review:** No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code
Gonna have to sit down with the laptop to review this. Feel free to patch node to this branch in the meantime.
This does seem to fix things when patched locally and built into an image.
[Claude Code] @Fraser999 — could I get a second pair of eyes on this design spec? It extends the semaphore fix in this PR with a broader refactor. Landing this design would replace the current diff.

## Cold storage read/write permit refactor

### Context

PR #56's semaphore fix is correct and resolves the observed crash. During review, four related issues surfaced.
This spec describes a refactor that resolves all four in one change.

### Goals
### Non-goals
## Design

### Architecture

Replace the single-run-loop model with two concurrently running tasks:
Both tasks share one semaphore.

### Permit-attached-to-message
```rust
pub async fn get_header(&self, spec: HeaderSpecifier)
    -> ColdResult<Option<SealedHeader>>
{
    let permit = self.read_sem.clone().acquire_owned().await
        .map_err(|_| ColdStorageError::Cancelled)?;
    let (tx, rx) = oneshot::channel();
    self.read_sender
        .send(PermittedReadRequest {
            permit,
            req: ColdReadRequest::GetHeader { spec, resp: tx },
        })
        .await
        .map_err(|_| ColdStorageError::Cancelled)?;
    rx.await.map_err(|_| ColdStorageError::Cancelled)?
}
```

The permit travels in the message, is moved into the spawned handler, and is released when the handler completes.

Channel sizing:
[Claude Code] Spec is directionally right — the permit-attached-to-message design, split dispatcher/writer, and cancellable drain all look sound.

Core issue: tokio-level
## Summary
Fixes a deadlock in `signet_cold::ColdStorageTask` that wedges the entire cold-storage task the moment 64 concurrent reads are in flight. Once wedged, no reads or writes are dispatched; upstream RPCs stall, the write mpsc fills (~256 chain advances ≈ 51 min at 12 s/block), `try_send` in `dispatch_append_blocks` returns `Backpressure`, and the node crashes via `?` at `signet-node/src/node.rs:285`.

Observed in production on signet-sidecar running in the dev `mainnet` namespace: every ~50 min the pod crashed with `cold storage backpressure: channel full`; all cold-path RPCs (`eth_getBlockByNumber`, `eth_getLogs`, `eth_gasPrice`, `eth_feeHistory`) hung at the 60 s client timeout while hot-only RPCs (`eth_getBalance`, `eth_blockNumber`, `eth_chainId`) stayed sub-millisecond. Aurora sat idle at its min-pool 5 connections with zero `ReadIOPS`.

## The bug
In `crates/cold/src/task/runner.rs`, the read arm (before this change) waited on `TaskTracker::wait()`. As `tokio_util::task::TaskTracker::wait()` documents, the returned future resolves only once the tracker is both *closed* and *empty*.

The read arm never calls `close()` — only the write arm does, as part of its drain-before-write (`close(); wait(); reopen();`). So the instant the 65th concurrent read arrives, the inner `wait()` future has no path to completion. And a pending `.await` inside a `tokio::select!` arm does not cause the outer `select!` to re-poll — the outer future is still "inside" the read arm's body. The whole run loop is pinned there forever.

Observable fallout, all matching telemetry:
- New read requests pile up in the `read_receiver` mpsc; `send().await` on the RPC side blocks; clients see exactly-60 s hangs when they time out.
- Aurora shows zero `ReadIOPS` and a 100 % buffer cache ratio because nothing new queries it.
- Chain advances keep filling the `write_receiver` (256 slots). The chain advances every ~12 s ⇒ 256 × 12 s ≈ 51 min, after which `try_send(AppendBlocks)` returns `TrySendError::Full` → `ColdStorageError::Backpressure`, which bubbles through `self.storage.append_blocks(...)?` in `signet-node` and kills the process.
- `handle_new_head`/`commit` spans look fast in tracing because they measure the node-side `try_send`, which is fire-and-forget. They don't indicate the cold task is actually making progress.

## The fix
Replace the `TaskTracker`-based backpressure with a `Semaphore` sized to `MAX_CONCURRENT_READERS`.

Key properties:
- `acquire_owned()` wakes immediately when any sibling reader releases a permit on completion. No closed-tracker invariant required.
- `acquire_many_owned(64)` only resolves once every reader has released its permit, which is exactly the old `close() + wait()` semantics — minus the footgun.
- `TaskTracker` is retained solely for graceful shutdown, where `close() + wait()` semantics are what we want (the shutdown path explicitly closes the tracker after the main loop exits).
- Completion of the spawned task's future (including on handler panic) drops `_permit`, so the semaphore count is never leaked.

## Tests
Added `crates/cold/tests/concurrency.rs` with two regression tests:

- `reads_above_concurrency_cap_do_not_deadlock` — issues 256 concurrent reads (4× the in-flight cap) and asserts all 256 complete within a 15 s guard.
- `write_after_saturating_reads_makes_progress` — interleaves a write into the same 256-reader flood and asserts it completes within the same guard.

Both pass in milliseconds with this change. With the fix reverted, both reliably hit the 15 s deadlock guard.
## Test plan
- `cargo test -p signet-cold` (existing conformance + new concurrency tests)
- `cargo +nightly fmt -- --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- `RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps`
- The new concurrency tests fail against `origin/main`'s `runner.rs` and pass with this change.
- Deploy check: patch `signet-cold` + `signet-storage`, bump `init4tech/node-components`, rebuild `signet-sidecar:latest`, redeploy to `mainnet` dev, confirm the sidecar stops crashing every 51 min and `eth_getBlockByNumber` stops timing out.

🤖 Generated with Claude Code