docs(sqs): proposals for Phase 3.C (throttling) + 3.D (split-queue FIFO) #664
Conversation
Two design proposals for the remaining Phase 3 SQS items per docs/design/2026_04_24_proposed_sqs_compatible_adapter.md (Section 14, Phase 3 bullets). Both are explicitly called out as needing separate design docs before implementation; this lands them so implementation work has a reviewed architecture to start from.

**3.C — per-queue token-bucket throttling**, configured on queue meta, evaluated at the SQS adapter layer on the leader. Default-off, AWS-shape Throttling error envelope, per-leader buckets (no Raft per request). Proposal covers: bucket model, charging per verb / batch entry, configuration validation rules, AWS-compatibility posture (admin-only attribute, hidden from standard principals), multi-shard correctness, observability, alternatives, open questions.

**3.D — high-throughput FIFO** via per-MessageGroupId hash partitioning across multiple Raft groups. Multi-PR rollout plan with an explicit "gate of no return" called out at the data-plane PR. Existing single-partition FIFO queues stay byte-identical (PartitionCount=0 path). Proposal covers: data model and key encoding, routing, SendMessage / ReceiveMessage / PurgeQueue / DeleteMessage flows, shard router config syntax, reaper implications, migration story (no live re-partitioning), failure modes, mixed-version cluster behaviour, alternatives, open questions.

No code changes; review-only PR.
Code Review
This pull request introduces two design documents for the SQS adapter: one for per-queue throttling and another for split-queue FIFO partitioning. The throttling proposal aims to ensure tenant fairness and prevent resource exhaustion, while the split-queue FIFO design targets high-throughput requirements by distributing queues across multiple Raft groups. Review feedback identifies a critical backward compatibility risk in the proposed keyspace schema that would break existing data access. Additionally, there are concerns regarding potential performance bottlenecks from coarse-grained locking in the throttling store and a design flaw in the partitioned ReceiveMessage logic that could lead to 'false empty' responses for clients.
> A partition is identified by the tuple `(queueName, partitionIndex)` where `partitionIndex ∈ [0, N)`. Every existing message-keyspace key gains the `partitionIndex` as a fixed-width segment immediately after the queue name. Concretely, today's `!sqs|msg|data|<queue>|<gen>|<msgID>` becomes `!sqs|msg|data|<queue>|<partition>|<gen>|<msgID>`. The `<partition>` segment is `appendU64(partitionIndex)` (fixed 8 bytes BE), so:
>
> - Single-partition FIFOs (and all Standard queues) write `partitionIndex = 0` and read with the same prefix as today — the schema is byte-identical for `N=1`.
The proposed key format !sqs|msg|data|<queue>|<partition>|<gen>|<msgID> is not backward compatible with the existing format. Inserting the 8-byte <partition> segment shifts the offsets for <gen> and <msgID>, meaning the new code will fail to find existing messages on disk. This introduces a significant operational risk and prevents rolling upgrades. Per repository rules, the design must detail mitigation strategies (e.g., a temporary 'bridge' or 'proxy' mode) and strategies for zero-downtime cutovers (e.g., dual-write proxy or logical replication) to avoid service interruption.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
> `mu      sync.Mutex`
> `buckets map[bucketKey]*tokenBucket // (queueName, action) -> bucket`
Using a single sync.Mutex to protect the buckets map will likely become a performance bottleneck, as this lock must be acquired on every SendMessage and ReceiveMessage call. Per the general rules, frequently accessed fields on hot paths should prefer more granular locking or atomic operations to improve performance.
Suggested change: drop `mu sync.Mutex` and replace `buckets map[bucketKey]*tokenBucket // (queueName, action) -> bucket` with `buckets sync.Map // map[bucketKey]*tokenBucket`.
References
- For frequently accessed fields that require monotonic updates, prefer atomic operations or granular locking over mutexes to improve performance on hot paths.
> A consumer making one ReceiveMessage call sees results from **only the partitions whose leader is the local node**. For a deployment where every partition has its own leader, a single client's ReceiveMessage hits one partition's worth of messages per call. This matches AWS HT-FIFO behaviour: clients are expected to spread their consumer pool across partitions, and an idle partition does not block a busy one.
>
> For deployments that co-locate partitions on one Raft group, all partitions' leaders are the same node and a single ReceiveMessage scans all partitions in sequence. The scan is bounded by `MaxNumberOfMessages` and the existing per-partition page limit — no separate budget.
Restricting ReceiveMessage to only scan partitions led by the local node is a significant departure from the SQS model. In a multi-node deployment, this can lead to "false empty" responses if a client hits a node that is not the leader for partitions containing messages. This forces clients to implement cluster-aware round-robin logic, which leaks internal sharding details and breaks the SQS abstraction.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 341b0deb14
> b. If this node is not the leader for that partition, skip (the
>    consumer's next call will land on a different node and that
>    leader will serve its partition).
Proxy non-local partitions during ReceiveMessage
Step 2b explicitly skips partitions whose leader is not the local node, which means a consumer pinned to one endpoint will only ever drain that node’s leader-owned partitions while messages on other partitions can backlog indefinitely. This turns ReceiveMessage into node-local visibility rather than queue-wide visibility and breaks normal single-endpoint client deployments unless they add explicit endpoint round-robin behavior.
> | `adapter/sqs.go` | After `authorizeSQSRequest`, call `bucketStore.charge(queueName, action, count)`. On reject, write the `Throttling` envelope and return. |
> | `adapter/sqs_throttle_test.go` (new) | Unit tests for bucket math (edge cases: idle drift, burst, partial refill, batch over-charge, default-off). ~300 lines. |
> | `adapter/sqs_throttle_integration_test.go` (new) | End-to-end: configure a queue with low limits, send N messages back-to-back, confirm the (N+1)th gets `Throttling` with `Retry-After`. ~150 lines. |
> | `monitoring/registry.go` | Two new counter vectors: `sqs_throttled_requests_total{queue, action}` and `sqs_throttle_tokens_remaining{queue, action}`. |
Define tokens_remaining as a gauge metric
This section defines sqs_throttle_tokens_remaining as a counter, but token-bucket remaining capacity must go up and down over time. If implemented as a counter, the value can only increase and any dashboard based on “trending toward zero” becomes meaningless, masking real throttling pressure; this metric needs to be specified as a gauge.
Four issues raised by Codex + Gemini Code Assist on PR #664:

**3.D split-queue FIFO doc:**

1. Backward-compatibility risk in keyspace (Codex P1 + Gemini high). The original §3.1 said "every existing key gains a `<partition>` segment" but also "the schema is byte-identical for N=1" — those are contradictory; inserting bytes shifts every downstream offset and breaks readback of existing data on disk. Rewrote §3.1 to make the key shape conditional on whether the queue is partitioned: legacy / Standard queues use today's keys byte-identically, partitioned queues use a new prefix (`!sqs|msg|data|p|<queue>|<partition>|...`) with an explicit `p|` discriminator that cannot collide with the legacy prefix even when partition=0. Constructor sketch updated (see the sketch directly below this comment).
2. ReceiveMessage "false empty" on non-local-leader partitions (Codex P1 + Gemini medium). The original §4.2 had clients pinned to one node only seeing partitions whose leader is local — messages on other partitions backlog forever and clients hit empty replies that look like "queue is drained". Rewrote §4.2 so ReceiveMessage proxies the per-partition fanout to the right leader via the existing leader-proxy machinery (extended to take a partition argument). Added a partitionOrder rotation seeded by RequestId so successive calls don't bias toward partition 0. Co-located deployments still pay nothing; spread deployments pay one extra hop per non-local partition, only until MaxNumberOfMessages are collected.

**3.C per-queue throttling doc:**

3. Single sync.Mutex on the bucket store would be a hot-path contention point (Gemini medium). Switched §3.1 to a sync.Map for the bucket lookup (read-mostly access pattern) plus a per-bucket sync.Mutex for charge / refill. Cross-queue traffic never serialises on the same lock. Documented the three-step charge operation explicitly so readers see the locking story.
4. sqs_throttle_tokens_remaining must be a gauge, not a counter (Codex P2). Token budgets go up and down; a counter would mask exactly the depletion operators need to see. Updated both §4.1 (files-touched table) and §7 (operational dashboard) to call the metric a gauge. A counter is correct for the rejections total.

Docs-only PR; no code changes, no implementation gate.
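To make the conditional key shape concrete, here is a minimal sketch of the two constructors the rewritten §3.1 describes. It assumes Go, a `gen` value that fits in a `uint64`, and an `appendU64` helper with the fixed-width big-endian behaviour the proposal names; none of this is the repository's actual code.

```go
package keys

import "encoding/binary"

const (
	legacyMsgDataPrefix      = "!sqs|msg|data|"
	partitionedMsgDataPrefix = "!sqs|msg|data|p|" // "p|" discriminator cannot collide with a legacy queue name
)

// legacyMsgDataKey keeps today's byte layout for Standard and single-partition FIFO queues.
func legacyMsgDataKey(queue string, gen uint64, msgID string) []byte {
	k := []byte(legacyMsgDataPrefix + queue + "|")
	k = appendU64(k, gen)
	return append(append(k, '|'), msgID...)
}

// partitionedMsgDataKey adds the fixed-width partition segment under the new "p|" prefix.
func partitionedMsgDataKey(queue string, partition, gen uint64, msgID string) []byte {
	k := []byte(partitionedMsgDataPrefix + queue + "|")
	k = appendU64(k, partition)
	k = append(k, '|')
	k = appendU64(k, gen)
	return append(append(k, '|'), msgID...)
}

// appendU64 appends v as 8 bytes big-endian so keys stay fixed-width and sort lexicographically.
func appendU64(b []byte, v uint64) []byte {
	return binary.BigEndian.AppendUint64(b, v)
}
```

The call site would dispatch on `meta.PartitionCount > 1`, so existing data keeps its exact bytes and only partitioned queues ever see the new prefix.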
Pushed addressing all four review findings (Codex + Gemini converged): 3.D split-queue FIFO doc:
3.C per-queue throttling doc:
Docs-only changes; no implementation gate.

@claude review
Claude finished @bootjp's task in 3m 1s —— View job Review
All four issues from the Codex/Gemini review have been correctly addressed. New issues found below, one of them medium-severity. All four previous issues: resolved ✓3.D — Keyspace backward-compat (Gemini high + Codex P1) ✓ 3.D — ReceiveMessage false-empty (Gemini medium + Codex P1) ✓ 3.C — Hot-path lock contention (Gemini medium) ✓ 3.C — Metric type (Codex P2) ✓ New issues in the updated doc[Medium]
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6fa02faa35
> ## 8. Failure Modes and Edge Cases
>
> 1. **Consumer never sees a partition's leader**: a consumer that always lands on node A sees only the partitions whose leader is A. If partitions are spread evenly, this is `1/N` of the queue. Mitigation: AWS SDK consumers naturally distribute across endpoints; for elastickv, document that clients should round-robin endpoints when consuming a partitioned FIFO.
Remove node-local receive behavior from failure guidance
Fresh evidence: §8.1 explicitly says an endpoint-pinned consumer only sees 1/N of partitions, which contradicts §4.2’s required proxy fanout behavior. Keeping this guidance in the proposal makes the implementation ambiguous, and following it would regress ReceiveMessage to node-local visibility where messages on remote-leader partitions can backlog indefinitely for single-endpoint consumers.
> `if meta.PartitionCount <= 1 {`
> `    return 0`
> `}`
Align partition routing with perQueue throughput mode
Section 3.2 says FifoThroughputLimit=perQueue should effectively collapse routing to one partition, but the partitionFor pseudocode only checks PartitionCount and will hash across all partitions whenever PartitionCount > 1. If this is implemented as written, perQueue mode cannot behave as documented and will still distribute groups across partitions.
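A minimal Go sketch of the guard this comment asks for; the `partitionFor` name and the early-return behaviour come from the proposal, while the hash choice (FNV-1a) and the `queueMeta` field shapes are illustrative assumptions, not the adapter's real types.

```go
package adapter

import "hash/fnv"

type queueMeta struct {
	PartitionCount      uint64
	FifoThroughputLimit string // "perQueue" | "perMessageGroupId"
}

// partitionFor routes a message group to a partition index.
func partitionFor(meta queueMeta, messageGroupID string) uint64 {
	// perQueue throughput mode collapses routing to a single partition
	// regardless of PartitionCount, matching the documented §3.2 semantics.
	if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
		return 0
	}
	h := fnv.New64a()
	_, _ = h.Write([]byte(messageGroupID))
	return h.Sum64() % meta.PartitionCount
}
```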
> | JSON | HTTP 400, body `{"__type":"Throttling","message":"Rate exceeded for queue '<name>' action '<action>'"}`, header `x-amzn-ErrorType: Throttling`, header `Retry-After: 1` |
> | Query | HTTP 400, body `<ErrorResponse><Error><Type>Sender</Type><Code>Throttling</Code><Message>...</Message></Error><RequestId>...</RequestId></ErrorResponse>`, headers as above |
>
> `Retry-After: 1` is the conservative default — at the configured refill rate, one second is enough for at least one fresh token. A future iteration could compute the precise wait from `(1 - currentTokens) / refillRate` but the constant is enough for SDK backoff logic.
Compute Retry-After from refill rate
This claim assumes every bucket refills at least one token per second, but the proposed validator allows any positive refill rate. With configurations like RefillPerSecond=0.1, a Retry-After: 1 response guarantees another immediate throttle, creating avoidable retry pressure and misleading backoff behavior; the wait should be derived from the bucket’s refill rate (or the config should enforce a minimum of 1).
Five issues raised by Claude on PR #664 (one medium, four low — all docs-only edits to the partial doc trio):

**3.D split-queue FIFO doc:**

1. (Medium) §8.1 stale failure-mode description contradicted §4.2. The original §8.1 said "Consumer never sees a partition's leader → mitigation: round-robin endpoints" — but §4.2 was rewritten earlier in this PR to proxy non-local partitions server-side, eliminating the false-empty risk entirely. Implementors reading §8.1 in isolation would have built the wrong mental model and possibly suppressed the proxy logic. §8.1 now describes the actual remaining failure mode — proxy round-trip latency under spread deployments — and explains that single-partition / co-located deployments pay nothing.
2. (Low) §3.1 variadic sqsMsgDataKey was a footgun. The variadic form would silently accept extra args without compiler diagnosis. Replaced with two named constructors (legacyMsgDataKey / partitionedMsgDataKey) and an explicit dispatch at the call site based on meta.PartitionCount > 1.
3. (Low) §3.1 `p|` discriminator safety constraint was implicit. Added an explicit paragraph explaining why the prefix cannot collide with the legacy keyspace: queue names admit only [A-Za-z0-9_-] (+ optional .fifo suffix), the validator must continue to reject `|`, and base32 segments never start with the literal ASCII p. Future relaxations of the name validator have to revisit this prefix scheme.
4. (Low) §8.5 upgrade-gate mechanism was underspecified. Added the concrete protocol: each node's /sqs_health endpoint advertises capabilities: ["htfifo"] once this PR's code is in the binary; the catalog reads the live node set from the distribution layer's node registry, polls /sqs_health on each, and gates PartitionCount > 1 on every node reporting htfifo. Fail-closed on timeouts. References PR #644 as the prior art for the same pattern.

**3.C per-queue throttling doc:**

5. (Low) §3.4 Retry-After: 1 lied for slow-refill queues. A queue with SendRefillPerSecond = 0.1 needs 10 s for the next token; returning 1 burned 9 unnecessary client retries before the request would succeed. Replaced with a per-queue computation: ceil((1 - currentTokens) / refillRate), floored at 1 s to match Retry-After's integer-second granularity. Fast-refill queues (>= 1 RPS) preserve the existing Retry-After: 1 behaviour.

Docs-only PR; no code changes, no implementation gate.
Pushed addressing all five findings from your follow-up review (1 medium, 4 low): 3.D split-queue FIFO doc:
3.C per-queue throttling doc:
Docs-only changes; no implementation gate.

@claude review
Claude finished @bootjp's task in 5m 50s —— View job Review
All five issues from the last Claude review are correctly resolved. One pre-existing Codex P2 finding from the second-pass review was not in the trigger's scope but remains open; surfacing it here. Two low-severity findings are new. All five previous issues: resolved ✓3.D — §8.1 stale failure-mode description (medium) ✓ 3.D — §3.1 variadic 3.D — §3.1 3.D — §8.5 upgrade-gate mechanism (low) ✓ 3.C — §3.4 New findings[Medium]
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f289d6c291
> 2. On miss, build the bucket from queue meta and `LoadOrStore` it (one-shot insert race tolerated — both racers will agree on the same configuration).
> 3. Acquire the bucket's own `mu`, refill based on elapsed time, take or reject the requested tokens, release `mu`.
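A minimal sketch of that charge path in Go, assuming the `bucketStore` / `tokenBucket` / `bucketKey` names the proposal uses; the step-1 lock-free `Load` comes from the three-step description earlier in the thread, and the config struct and injected clock are illustrative assumptions rather than the adapter's real API.

```go
package adapter

import (
	"math"
	"sync"
	"time"
)

// throttleConfig is an assumed shape for one (capacity, refill) pair from queue meta.
type throttleConfig struct {
	Capacity        float64
	RefillPerSecond float64
}

type bucketKey struct {
	queue  string
	action string // "Send" | "Receive" | "*"
}

type tokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	capacity   float64
	refillRate float64 // tokens per second
	lastRefill time.Time
}

type bucketStore struct {
	buckets sync.Map         // map[bucketKey]*tokenBucket
	clock   func() time.Time // injected for tests
}

// charge takes n tokens for (queue, action); false means reject with the Throttling envelope.
func (s *bucketStore) charge(key bucketKey, cfg throttleConfig, n float64) bool {
	// Step 1: lock-free lookup on the hot path.
	v, ok := s.buckets.Load(key)
	if !ok {
		// Step 2: build from queue meta and LoadOrStore; the one-shot insert
		// race is benign because both racers derive the same config.
		fresh := &tokenBucket{
			tokens:     cfg.Capacity,
			capacity:   cfg.Capacity,
			refillRate: cfg.RefillPerSecond,
			lastRefill: s.clock(),
		}
		v, _ = s.buckets.LoadOrStore(key, fresh)
	}
	b := v.(*tokenBucket)

	// Step 3: per-bucket lock only, so cross-queue traffic never serialises here.
	b.mu.Lock()
	defer b.mu.Unlock()
	now := s.clock()
	b.tokens = math.Min(b.capacity, b.tokens+now.Sub(b.lastRefill).Seconds()*b.refillRate)
	b.lastRefill = now
	if b.tokens < n {
		return false
	}
	b.tokens -= n
	return true
}
```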
Invalidate cached buckets on throttle config updates
The design caches each (queue, action) bucket on first use via LoadOrStore, but it never defines a refresh/invalidation path when SetQueueAttributes changes throttle values. For active queues this means updated limits can stay stale until idle eviction or failover, so operators may keep over-throttling or under-throttling traffic long after they reconfigure the queue.
> `secondsToNextToken := math.Ceil((1.0 - currentTokens) / refillRate)`
> `retryAfter := max(1, int(secondsToNextToken)) // never less than 1`
Base Retry-After on requested token cost
The proposed Retry-After calculation always targets the next single token, but this proposal charges batch requests by entry count. When a large SendMessageBatch is throttled, clients can be told to retry in 1 second even though many more seconds are needed to refill enough tokens for that same batch, which guarantees repeated throttles and adds avoidable retry pressure.
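A minimal Go sketch of the batch-aware computation this comment asks for; the `computeRetryAfter` name and the integer-second floor are assumptions, and `requestedCount` is intended to be the same value the charge step uses (1 for single verbs, `len(Entries)` for batches).

```go
package adapter

import "math"

// computeRetryAfter returns a Retry-After value (in whole seconds) based on how
// many tokens the rejected request actually needs, not just the next single token.
func computeRetryAfter(requestedCount, currentTokens, refillRate float64) int {
	needed := requestedCount - currentTokens
	if needed <= 0 || refillRate <= 0 {
		return 1 // defensive floor; a zero refill rate is rejected by the validator
	}
	secs := int(math.Ceil(needed / refillRate))
	if secs < 1 {
		secs = 1 // Retry-After has integer-second granularity, never report 0
	}
	return secs
}
```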
> `kv/shard_router.go` today routes by queue name. With partitions, the routing key becomes `(queueName, partitionIndex)`. The existing `--raftSqsMap` flag accepts entries keyed by queue name; the syntax extends to accept a partition suffix:
>
>     --raftSqsMap "orders.fifo:8=group-7,group-8,group-9,group-10,group-11,group-12,group-13,group-14"
Split partition routing from raftSqsMap endpoint mapping
This section assigns queue/partition placement to --raftSqsMap, but that flag is currently defined as raftAddr=sqsAddr endpoint mapping for leader proxying. Reusing it for partition-to-group routing makes the config surface ambiguous and can break proxy address resolution, especially in the receive fanout flow that depends on correct leader endpoint lookup.
Three findings (1 medium, 2 low):
3.D split-queue FIFO doc:
1. (Medium) §3.3 partitionFor missed FifoThroughputLimit=perQueue.
§3.2 documents that perQueue "reduces routing to a single partition
regardless of PartitionCount" but the §3.3 pseudocode only checked
PartitionCount <= 1. A queue created with PartitionCount=8 +
FifoThroughputLimit=perQueue would hash across all 8 partitions,
directly contradicting the documented semantics. Updated the early-
return guard to:
if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
return 0
}
Codex P2 caught this in the second-pass review but it was outside
the scope of the previous fix iteration.
2. (Low) §4.1 step 5 still referenced the old variadic
sqsMsgDataKey(queue, partitionIndex, gen, msgID). §3.1 had already
moved to two named constructors (legacyMsgDataKey /
partitionedMsgDataKey) but §4.1's pseudocode still used the
variadic form, contradicting §3.1. Replaced with explicit if/else
dispatch through the named constructors so an implementor reading
§4.1 in isolation gets the same naming as §3.1.
3.C per-queue throttling doc:
3. (Low) §3.4 Retry-After formula's numerator was hardcoded to 1.0
regardless of charge size. SendMessageBatch of 10 entries against
a 0-token bucket with refillRate = 1.0 would have returned
Retry-After: 1 (insufficient) instead of 10. Updated formula to:
needed := float64(requestedCount) - currentTokens
secondsToNextRefill := math.Ceil(needed / refillRate)
retryAfter := max(1, int(secondsToNextRefill))
requestedCount is the same value §3.3's charge step uses
(1 for single, len(Entries) for batch).
Docs-only PR; no code changes, no implementation gate.
Pushed addressing all three findings from your third review (1 medium, 2 low): 3.D split-queue FIFO doc:
3.C per-queue throttling doc:
Docs-only changes.

@claude review
Claude finished @bootjp's task in 2m 9s —— View job Review
All three issues from the trigger are correctly resolved. Two Codex P1 findings from the previous review remain open and are surfaced below (they were not in scope of this push). Two new low-severity issues found. All three previous issues: resolved ✓3.D — §3.3 if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
return 0
}The comment explicitly documents that 3.D — §4.1 step 5 named constructor dispatch (low) ✓ 3.C — §3.4 Both cases are correctly handled: a Open Codex P1 findings from the previous round (not addressed in this push)These were raised against [Medium]
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d424d18d7c
> `queue  string`
> `action string // "Send" | "Receive" | "*"`
Key throttle buckets by queue generation
The bucket key is defined as only (queue, action), so a deleted/recreated queue with the same name will reuse the old in-memory bucket state until idle eviction, inheriting stale token balance and potentially stale limits. In practice this can throttle a brand-new queue immediately (or let it burst unexpectedly) after DeleteQueue/recreate, which breaks queue incarnation isolation; include generation (or another meta version) in the key or define explicit invalidation on queue lifecycle changes.
> 3. For each partitionIndex in partitionOrder, until MaxNumberOfMessages
>    are collected or every partition has been tried:
>    a. Resolve the leader for (queue, partitionIndex).
>    b. If this node is the leader: scan locally, deliver candidates.
>    c. Otherwise: forward the request to the leader-of-partition via
Bound per-partition proxy calls by shared wait budget
This flow fans out ReceiveMessage across partitions by forwarding requests partition-by-partition, but it does not require reducing WaitTimeSeconds across hops. If each forwarded call long-polls independently, an empty queue can take roughly PartitionCount × WaitTimeSeconds, which can exceed SQS-style wait expectations and cause client/LB timeouts; the design should require a single request-level deadline and pass only remaining budget to each partition probe.
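A Go sketch of the budgeted fanout this comment asks for. The `partitionReceiver` seam, `message` type, and function names are illustrative stand-ins, not the adapter's real interfaces; only the shared-deadline behaviour is the point.

```go
package adapter

import (
	"context"
	"time"
)

// message and partitionReceiver are assumed stand-ins for the adapter's real
// types; the design only needs the budgeting behaviour, not these exact seams.
type message struct{ ID, Body string }

type partitionReceiver interface {
	// Receive serves one partition (locally or via the leader proxy),
	// long-polling for at most the wait budget it is handed.
	Receive(ctx context.Context, queue string, partition uint64, max int, wait time.Duration) []message
}

// receiveFanout walks the rotated partition order under a single request-level
// deadline, handing each probe only the remaining budget.
func receiveFanout(ctx context.Context, r partitionReceiver, queue string, order []uint64, maxMessages int, waitBudget time.Duration) []message {
	deadline := time.Now().Add(waitBudget)
	var out []message
	for _, p := range order {
		if len(out) >= maxMessages {
			break
		}
		remaining := time.Until(deadline)
		if remaining <= 0 {
			break // an empty queue never blocks for PartitionCount × WaitTimeSeconds
		}
		out = append(out, r.Receive(ctx, queue, p, maxMessages-len(out), remaining)...)
	}
	return out
}
```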
Four findings (2 medium, 2 low):

**3.C per-queue throttling doc:**

1. (Medium) §3.1 missed bucket cache invalidation on SetQueueAttributes. The doc described LoadOrStore on first use but no path for refreshing the in-memory bucket when SetQueueAttributes changes the throttle config. An operator lowering a limit to stop a noisy tenant would wait up to the idle-eviction window (1h default) before the change took effect. Added a paragraph: SetQueueAttributes calls buckets.Delete(key) for every action key after the Raft commit so the next request rebuilds from the new meta. The race with LoadOrStore is benign (the rebuilt bucket starts at full capacity, same semantics as the failover case).
2. (Low) §4.3 forward-compat sentence used the wrong key for the per-partition bucket. "Sharded by MessageGroupId" was wrong — MessageGroupId is the input to partitionFor, not the bucket key. Using it directly would create one bucket per unique group value (unbounded, attacker-amplifiable). The bucket key is (queueName, partitionID), which is bounded by PartitionCount. Replaced the sentence with the correct keying and added the bucketKey-grows-a-partition-field note.

**3.D split-queue FIFO doc:**

3. (Medium) §5 reused --raftSqsMap for partition-to-Raft-group assignment. That flag already maps raftAddr=sqsAddr for proxyToLeader's endpoint resolution; overloading the same parser with partition assignments creates a parsing ambiguity that could silently produce the wrong proxy target in §4.2's fanout. Replaced with a new dedicated --sqsFifoPartitionMap flag. The existing --raftSqsMap is unchanged. Updated §11's rollout table accordingly.
4. (Low) §9 unit-test list was missing a case for FifoThroughputLimit=perQueue routing. The §3.3 short-circuit guard (added in the previous fix iteration) would silently regress without a test. Added: "FifoThroughputLimit=perQueue with PartitionCount=8 routes every group ID to partition 0."

Docs-only PR; no code changes, no implementation gate.
Pushed addressing all four findings from your fourth review (2 medium, 2 low): 3.C per-queue throttling doc:
3.D split-queue FIFO doc:
Docs-only changes.

@claude review
Claude finished @bootjp's task in 3m 26s —— View job Review
All four issues from the trigger are correctly resolved. Two Codex P1 findings first surfaced in my fourth-round review — and noted there as "not in scope of this push" — remain open in the current doc. One new low-severity finding. All four trigger fixes: resolved ✓3.C — §3.1 3.C — §4.3 bucket key 3.D — §5 3.D — §9 Still-open Codex P1 findings from
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f6bd33ab70
> - All four `Send*` / `Recv*` fields must be either both zero (disabled) or both positive.
> - Capacity ≥ refill (otherwise the bucket can never burst above the steady state).
> - A hard ceiling per queue (e.g. 100,000 RPS) so a typo (`SendCapacity = 1e9`) does not silently mean "no limit at all" but rejects with `InvalidAttributeValue`.
Require bucket capacity to cover max batch token cost
The validator currently only enforces paired positivity and capacity >= refill, but §3.3 charges batch APIs by len(Entries) (up to 10). With configs like SendCapacity=5, a valid SendMessageBatch of 10 can never succeed because the bucket can never accumulate 10 tokens, so clients get permanent throttling and misleading Retry-After values. Please add validation that per-action capacity is at least the maximum single-request charge (or change the charging rule) to avoid creating unrecoverable throttling states.
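A minimal Go sketch of a validator that adds the batch-floor rule this comment asks for. The struct mirrors the proposal's `sqsQueueThrottle`; the helper name, error strings, and the choice to exempt the `Default*` pair from the batch floor are illustrative assumptions.

```go
package adapter

import "fmt"

type sqsQueueThrottle struct {
	SendCapacity, SendRefillPerSecond       float64
	RecvCapacity, RecvRefillPerSecond       float64
	DefaultCapacity, DefaultRefillPerSecond float64
}

const (
	maxBatchCharge  = 10      // SendMessageBatch / DeleteMessageBatch max entries
	throttleCeiling = 100_000 // reject obvious typos instead of treating them as "no limit"
)

func validatePair(name string, capacity, refill float64, enforceBatchFloor bool) error {
	if (capacity == 0) != (refill == 0) {
		return fmt.Errorf("InvalidAttributeValue: %s capacity and refill must be both zero or both positive", name)
	}
	if capacity == 0 {
		return nil // pair disabled
	}
	if capacity < refill {
		return fmt.Errorf("InvalidAttributeValue: %s capacity must be >= refill", name)
	}
	if enforceBatchFloor && capacity < maxBatchCharge {
		return fmt.Errorf("InvalidAttributeValue: %s capacity must be >= %d (max single-request batch charge)", name, maxBatchCharge)
	}
	if capacity > throttleCeiling || refill > throttleCeiling {
		return fmt.Errorf("InvalidAttributeValue: %s exceeds hard ceiling %d", name, throttleCeiling)
	}
	return nil
}

func validateThrottleConfig(t sqsQueueThrottle) error {
	if err := validatePair("Send", t.SendCapacity, t.SendRefillPerSecond, true); err != nil {
		return err
	}
	if err := validatePair("Recv", t.RecvCapacity, t.RecvRefillPerSecond, true); err != nil {
		return err
	}
	return validatePair("Default", t.DefaultCapacity, t.DefaultRefillPerSecond, false)
}
```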
> 5. **Mixed-version cluster**: a rolling upgrade where some nodes have HT-FIFO and others don't. The new feature gates on the queue's `PartitionCount > 1` field, which is set at create time; old nodes that try to scan a partitioned queue's keyspace will simply not find anything (the prefix has changed). The catalog rejects `CreateQueue` with `PartitionCount > 1` until every node in the cluster reports the new feature flag.
>
> **The capability advertisement mechanism**: each node's existing `/sqs_health` endpoint (`adapter/sqs.go: serveSQSHealthz`) gains a new field in its JSON body — `capabilities: ["htfifo"]` once this PR's code is in the binary. The catalog's CreateQueue handler reads the live node set from the distribution layer's node registry (the same registry used by `proxyToLeader` to locate leaders), polls `/sqs_health` on each, and gates `PartitionCount > 1` on every node reporting the `htfifo` capability. Nodes that don't respond within a short timeout are treated as not-yet-upgraded — a deliberate fail-closed default so a network blip does not let a partitioned queue land in a partially-upgraded cluster. This mirrors the §3.3.2 admin-forwarding upgrade gate from the admin dashboard design (PR #644), which uses the same "all-nodes-must-report" pattern for `AdminForward`.
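A small Go sketch of what the advertisement could look like on the wire. Only the `capabilities` field name, the `"htfifo"` value, and the `serveSQSHealthz` handler name come from the text above; the response struct, its other fields, and the handler signature are assumptions.

```go
package adapter

import (
	"encoding/json"
	"net/http"
)

type sqsHealthResponse struct {
	Status       string   `json:"status"`
	Capabilities []string `json:"capabilities"` // includes "htfifo" once this PR's code is in the binary
}

// serveSQSHealthz advertises node capabilities so the catalog can gate
// CreateQueue(PartitionCount > 1) on every node reporting "htfifo".
func serveSQSHealthz(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(sqsHealthResponse{
		Status:       "ok",
		Capabilities: []string{"htfifo"},
	})
}
```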
Enforce HT-FIFO capability beyond queue creation
This guard only blocks CreateQueue when not all nodes advertise htfifo, but the same section notes older nodes will not see partitioned-keyspace data. That leaves a failure mode where a downgraded/replaced old node joins after creation, becomes leader for a partitioned queue, and serves false-empty reads/writes against the wrong prefix. The proposal should require a runtime safeguard (startup or leadership admission) that prevents non-HT-FIFO binaries from serving once any PartitionCount > 1 queue exists.
…conds, §3.4)

Three findings from the fifth-round Claude review on PR #664:

3.C per-queue throttling — DeleteQueue bucket lifecycle (medium): §3.1 now requires DeleteQueue to call buckets.Delete(key) for every bucketKey of the deleted queue (Send, Recv, Default), so a same-name recreation does not inherit stale token balance/limits for up to 1h. Alternative (generation in bucketKey) considered and rejected as more expensive than an explicit Delete.

3.D split-queue FIFO — WaitTimeSeconds budget across fanout (medium): §4.2 now threads a shared deadline through the partition fanout (step 3) and passes remainingWait per sub-call (step 4a/c/d). Without this, an empty queue with WaitTimeSeconds=20 + PartitionCount=8 would hold the connection for up to 160s. §8.1 also drops the contradictory "long-poll mitigates the cost" sentence — long-poll was actually the trigger for the bug.

3.D split-queue FIFO — §3.4 --raftSqsMap reference (low): rephrased to distinguish the router infrastructure (kv/shard_router.go, which already supports multi-group routing by key) from the configuration surface (the new --sqsFifoPartitionMap flag in §5). Eliminates the contradiction with §5.
Pushed addressing all three findings from your fifth-round review (2 medium, 1 low): 3.C per-queue throttling — §3.1
3.D split-queue FIFO — §4.2
3.D split-queue FIFO — §3.4
@claude review
Claude finished @bootjp's task in 3m 20s —— View job Review
All three fixes from the trigger are correctly applied. Two Codex P1 items raised in the sixth-round review remain open and are the only findings I would consider blocking before implementation PRs. One new low-severity finding. All three trigger fixes: resolved ✓3.C — §3.1 3.D — §4.2 3.D — §3.4 Still-open Codex P1 findings from
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 624d411a93
> No global lock is held during step 3; concurrent traffic on different queues runs in parallel.
>
> **Cache invalidation on `SetQueueAttributes`**: when an operator updates the throttle config via `SetQueueAttributes`, the handler — *after* the Raft commit that persists the new `sqsQueueThrottle` — calls `buckets.Delete(key)` for every `bucketKey` belonging to the updated queue (`Send`, `Recv`, `Default`). Without this step the in-memory bucket would keep enforcing the old limits until the idle-eviction sweep removes the stale entry (default 1 h window), defeating the operator's intent to throttle a noisy tenant in real time. The `LoadOrStore` race with the `Delete` call is benign: the next request rebuilds from the freshly-committed meta, and the rebuilt bucket starts at full capacity (same semantics as the failover case documented below). Claude P1 on PR #664 caught the gap.
Use canonical action keys when deleting throttle buckets
The design defines bucketKey.action as "Send" | "Receive" | "*", but this invalidation step deletes ("Send", "Recv", "Default"). If implemented as written, Receive and default buckets will not be removed on SetQueueAttributes, so updated limits can remain stale until idle eviction (and the same mismatch appears in the DeleteQueue path), which directly defeats the real-time reconfiguration behavior this section is trying to guarantee.
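A sketch of the invalidation step with the canonical action values this comment calls for, reusing the `bucketStore` / `bucketKey` shapes from the earlier throttling sketch in this thread; the `invalidateQueue` helper name and its call sites are assumptions.

```go
package adapter

// invalidateQueue drops every cached bucket for a queue. Intended to run after
// the Raft commit on SetQueueAttributes and on DeleteQueue, so the next request
// rebuilds from the freshly committed meta instead of waiting for idle eviction.
func (s *bucketStore) invalidateQueue(queue string) {
	// Canonical bucketKey action values from the struct, not the config-field prefixes.
	for _, action := range []string{"Send", "Receive", "*"} {
		s.buckets.Delete(bucketKey{queue: queue, action: action})
	}
}
```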
> `}`
>
> `PartitionCount` is **immutable after first SendMessage**. The validator on `SetQueueAttributes` rejects any change; operators who want a different partition count create a new queue. Why immutable: changing it would require re-hashing every existing message into a new partition, which (a) breaks ordering for in-flight messages of every group whose hash bucket changed, and (b) is a multi-second / multi-minute operation that cannot be expressed as one OCC transaction.
Keep FIFO routing attributes immutable after first send
This section only freezes PartitionCount, but routing also depends on FifoThroughputLimit (perQueue forces partition 0 in partitionFor). Allowing SetQueueAttributes to flip throughput mode after messages exist can move the same MessageGroupId to a different partition mid-stream, splitting one group across partitions and violating FIFO within-group ordering; the proposal should lock routing-affecting attributes after first send (or require queue recreation).
…rade, key)

Three findings from the sixth-round review on PR #664:

3.C per-queue throttling §3.2 — capacity bound on max batch charge (Codex P1, medium): the SetQueueAttributes validator now requires capacity >= max single-request charge. Send/Recv buckets must be >= 10 when non-zero (a full SendMessageBatch / DeleteMessageBatch is 10 entries). Without this rule, SendCapacity=5 makes every full batch permanently unserviceable — the bucket can never accumulate enough tokens.

3.D split-queue FIFO §8 item 5 — runtime safeguard for downgraded leaders (Codex P1, medium): the create-time gate only blocks new partitioned queues. Once one exists, a downgraded node lacking htfifo capability could be elected leader and produce false-empty reads + key-prefix corruption. Adds a node-admission rule: on startup + leadership acquisition, enumerate local-shard queues with PartitionCount > 1; if any and the binary lacks htfifo, refuse leadership and step down via TransferLeadership.

3.C per-queue throttling §3.1 — DeleteQueue paragraph bucketKey text (Claude low): "(queueName, partitionID, action)" contradicted the struct definition above it (no partitionID until Phase 3.D §4.3). Corrected to "(queueName, action) for Phase 3.C".
Pushed addressing all three findings from your sixth-round review (2 medium Codex P1 + 1 low Claude): 3.C per-queue throttling — §3.2 capacity ≥ max batch charge (Codex P1, medium):
3.D split-queue FIFO — §8 item 5 runtime safeguard for downgraded leaders (Codex P1, medium):
3.C per-queue throttling — §3.1
@claude review
Claude finished @bootjp's task in 3m 10s —— View job Review
All three fixes from this push are correctly applied. Two Codex P1 findings from the previous round's Codex review (on commit All three trigger fixes: resolved ✓3.C — §3.2 capacity ≥ max batch charge (Codex P1, medium) ✓ 3.D — §8 item 5 runtime safeguard for downgraded leaders (Codex P1, medium) ✓ 3.C — §3.1 Still-open Codex P1 findings from the commit
…immutability, test capacity)

Three findings from the seventh-round review on PR #664:

3.C per-queue throttling §3.1 — action key naming consistency (Codex P1, medium): bucketKey.action canonical values are "Send"|"Receive"|"*" but the cache-invalidation paragraphs were saying "Send", "Recv", "Default" (the JSON config field-name prefixes). An implementation following the struct would never match the wrong key names, leaving Receive and "*" buckets stale after config changes. Changed both invalidation paragraphs to use the struct vocabulary, and added a "Config-field -> bucket-action mapping" paragraph after the sqsQueueThrottle struct so the short-form / canonical-form distinction is explicit.

3.D split-queue FIFO §3.2 — FifoThroughputLimit and DeduplicationScope must also be immutable (Codex P1, medium): the prior text only marked PartitionCount as immutable. But changing FifoThroughputLimit (perMessageGroupId <-> perQueue) flips the §3.3 short-circuit on partition routing — in-flight messages stay where they are, new sends route differently, within-group FIFO ordering silently breaks. Changing DeduplicationScope reshapes the dedup key so the prior scope's de-duped messages can resurrect or new sends can be incorrectly suppressed. Extended the immutability sentence to cover all three, with one bullet per attribute explaining the failure mode.

3.C per-queue throttling §6 item 2 — test capacity inconsistent with new validator (Claude low): the integration test used SendCapacity=5 but §3.2's new fourth bullet requires SendCapacity >= 10. The test would have failed at setup, not on the 6th send. Bumped to SendCapacity=10 / 10 messages back-to-back / 11th throttled, with an inline note explaining why the value changed.
Pushed addressing all three findings from your seventh-round review (2 medium Codex P1 + 1 low Claude): 3.C per-queue throttling — §3.1 action key naming consistency (Codex P1, medium):
3.D split-queue FIFO — §3.2 immutability extended to FifoThroughputLimit + DeduplicationScope (Codex P1, medium):
3.C per-queue throttling — §6 item 2 test capacity (Claude low):
@claude review
…e, PR 2 dormancy gate)

Two Codex P1 findings from the tenth-round Codex review on commit 14b4d88 that were carried forward to today's eleventh-round Claude review:

3.D split-queue FIFO §3.2 / §4.1 — {PartitionCount > 1, DeduplicationScope = "queue"} validation moved to control plane (Codex P1, medium): the old §4.1 paragraph rejected this combination at SendMessage time, so an operator who mis-configured at CreateQueue would get a successful response and only discover the problem when every send failed — a created-but-unserviceable queue with no recovery short of DeleteQueue+CreateQueue. Added a new "Cross-attribute validation at CreateQueue and SetQueueAttributes" paragraph to §3.2 with the rejection rule (InvalidParameterValue plus the AWS-shaped reason "queue-scoped deduplication is incompatible with multi-partition FIFO because the dedup key cannot be globally unique across partitions without a cross-partition OCC transaction"). Reframed the §4.1 paragraph as a "cannot reach this code" pointer to the §3.2 gate so the runtime rejection is impossible.

3.D split-queue FIFO §11 — explicit dormancy gate for PR 2-4 (Codex P1, medium): "feature is dormant" was editorial intent, not a runtime guarantee. A cluster on PR 2-4 would accept CreateQueue(PartitionCount=4), dispatch every SendMessage against the legacy single-partition keyspace (partitionIndex=0 because the data plane has not been wired yet), then have those messages be invisible to the partition-aware fanout reader and reaper that land in PR 5. The PR 2 row now adds a temporary CreateQueue rejection ("PartitionCount > 1 requires HT-FIFO data plane — not yet enabled"); PR 3-4 rows note the gate is still in place; the PR 5 row explicitly removes it in the same commit that wires the fanout so the gate-and-lift land atomically. A new paragraph below the table documents why the gate exists and why atomic gate-and-lift makes the wrong-layout-data class of bug impossible.
Pushed addressing the two Codex P1 mediums you flagged as still open from the tenth-round Codex review on 3.D split-queue FIFO — control-plane gate for
3.D split-queue FIFO — explicit PR 2 dormancy gate (Codex P1, medium):
Both fixes lock down real production-time bugs the design would otherwise leave unspecified for the implementer to discover.

@claude review
Claude finished @bootjp's task in 3m 48s —— View job Review
…ion, PR 4 hook scope)

Two low-severity findings from the thirteenth-round Claude review on PR #664:

3.D split-queue FIFO §3.2 — cross-attribute rule at SetQueueAttributes is dead code (Claude low): the prior wording said the rule applies at both CreateQueue and SetQueueAttributes, but immutability fires first on SetQueueAttributes so the {PartitionCount > 1, DeduplicationScope = queue} combination can never reach the cross-attribute check on that path. Added a clarifying paragraph so the PR 2 author knows the SetQueueAttributes side is unreachable rather than missing test coverage.

3.D split-queue FIFO §11 PR 4 — "Mixed-version gate" expanded (Claude low): the prior label could be read as covering only the §8.5 capability advertisement (/sqs_health + catalog polling). The §8 leadership-refusal hook in kv/lease_state.go is a distinct implementation concern and §8 explicitly says it must land before PR 4 marks the binary htfifo-eligible. Expanded the parenthetical to enumerate both components so the PR 4 author cannot miss the runtime hook.
Pushed addressing both low findings from your thirteenth-round review: 3.D split-queue FIFO — §3.2 cross-attribute rule at
3.D split-queue FIFO — §11 PR 4 "Mixed-version gate" expanded (low):
@claude review
Claude finished @bootjp's task in 2m 5s —— View job Review
Both fixes from this push are correctly applied. No open items remain from any prior round. No new issues found. Both trigger fixes: resolved ✓3.D — §3.2 dead-code clarification for The reasoning is correct: any 3.D — §11 PR 4 "Mixed-version gate" expanded (low) ✓ Both the capability-advertisement side (§8.5, needed for the No open items from prior rounds ✓All Codex P1 and Claude-flagged items across all prior review rounds are resolved. No new findings ✓I read both docs end-to-end. One minor observation that does not warrant a fix request: line 115 still names Summary and merge recommendationBoth trigger fixes are cleanly applied, and the docs have no remaining open items from any review round. The proposal documents for Phase 3.C (per-queue throttling) and Phase 3.D (split-queue FIFO) are in excellent shape. Ready to merge. Implementation PR 2 can begin from this base. | Branch |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d246cc3e99
> **Cache invalidation on `SetQueueAttributes`**: when an operator updates the throttle config via `SetQueueAttributes`, the handler — *after* the Raft commit that persists the new `sqsQueueThrottle` — calls `buckets.Delete(key)` for every `bucketKey` belonging to the updated queue (`Send`, `Receive`, `*` — the canonical action values from the `bucketKey` struct above). Without this step the in-memory bucket would keep enforcing the old limits until the idle-eviction sweep removes the stale entry (default 1 h window), defeating the operator's intent to throttle a noisy tenant in real time. The `LoadOrStore` race with the `Delete` call is benign: the next request rebuilds from the freshly-committed meta, and the rebuilt bucket starts at full capacity (same semantics as the failover case documented below). Claude P1 on PR #664 caught the gap.
>
> **Cache invalidation on `DeleteQueue`**: when a queue is deleted, the handler — *after* the Raft commit that purges the queue meta — calls `buckets.Delete(key)` for every `bucketKey` belonging to the deleted queue (`Send`, `Receive`, `*` — the canonical action values from the `bucketKey` struct above), mirroring the `SetQueueAttributes` path above. Without this step, a `DeleteQueue` immediately followed by `CreateQueue` with the same name would inherit the previous incarnation's in-memory bucket (current token balance, capacity, refill rate) until the 1 h idle-eviction sweep removes the stale entry — so an operator using `DeleteQueue`+`CreateQueue` to reset a noisy queue's state would be surprised to see the old throttle still in effect. (Alternative considered: include the queue `generation` in `bucketKey` so old and new incarnations cannot collide structurally; the explicit `Delete` is cheaper and matches the `SetQueueAttributes` pattern, so for Phase 3.C we keep `bucketKey = (queueName, action)` per the struct definition above — the `partition` field is added later by Phase 3.D, see §4.3 — and document the lifecycle requirement here.) Claude P1 on PR #664 fifth-round review.
>
> The bucket map is per-process. On leader failover, a fresh bucket starts at full capacity on the new leader — there is no Raft replication of bucket state. **Why this is correct**: the worst-case behaviour of "fresh bucket on failover" is that a noisy queue gets one extra burst worth of bandwidth right after a leader change. Replicating bucket state would cost a Raft commit per token decrement, which would defeat the entire point of the token bucket. AWS's own rate limiter has the same property at region failover boundaries.
Invalidate throttle caches on leader changes
This design only deletes bucket entries in the SetQueueAttributes/DeleteQueue handler on the current leader, while the cache is explicitly per-process; a follower that previously led the queue can keep an old (queue, action) bucket until idle eviction and later reuse it after failover. In that scenario the new leader can enforce stale limits/token balance (or a prior queue incarnation) despite recent control-plane updates, so the documented “fresh bucket on failover” behavior is not guaranteed. Please add a leadership-time invalidation/version check (or keying that includes queue incarnation/config version) so failover cannot resurrect stale throttle state.
Implements the design from
docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md.
Schema (sqs_catalog.go):
- sqsQueueMeta gains an optional Throttle *sqsQueueThrottle field
with six float64 sub-fields (Send/Recv/Default x Capacity/RefillPerSecond).
- Six new ThrottleSendCapacity / ThrottleSendRefillPerSecond /
ThrottleRecvCapacity / ThrottleRecvRefillPerSecond /
ThrottleDefaultCapacity / ThrottleDefaultRefillPerSecond attributes
go through the standard sqsAttributeAppliers dispatch.
- validateThrottleConfig enforces the cross-field rules from the
design: each (capacity, refill) pair must be both-zero or
both-positive; capacity must be >= refill so the bucket can burst;
Send/Recv capacities must be >= 10 (the SendMessageBatch /
DeleteMessageBatch max charge) so a small capacity does not make
every full batch permanently unserviceable; per-field hard ceiling
of 100k. Validator runs once at the end of applyAttributes so a
multi-field update sees the post-apply state as a whole.
- queueMetaToAttributes surfaces the configured Throttle* fields so
GetQueueAttributes("All") round-trips. Extracted into
addThrottleAttributes to stay under the cyclop ceiling.
Bucket store (sqs_throttle.go, new):
- bucketStore: sync.Map[bucketKey]*tokenBucket; per-bucket sync.Mutex
on the hot path so cross-queue traffic never serialises on a
process-wide lock. Per Gemini medium on PR #664 the single-mutex
alternative was rejected. Lazy idle-evict sweep removes inactive
buckets after 1h to bound memory.
- charge(): lock-free Load, LoadOrStore on miss (race-tolerant -- both
racers compute identical capacity/refill from the same meta), then
per-bucket lock for refill + take + release. Refill is elapsed *
refillRate, capped at capacity. On reject, computes Retry-After
from the actual refillRate and the requestedCount per the §3.4
formula (numerator is requested count, not 1, so a batch verb does
not get told to retry in 1s when it really needs 10s).
- resolveActionConfig() handles the Default* fall-through: when only
DefaultCapacity is set, Send and Recv requests share the same
bucketKey{action:"*"} so Default behaves as one shared cap rather
than three independent quotas.
- invalidateQueue() drops every bucket belonging to a queue. Called
after the Raft commit on SetQueueAttributes / DeleteQueue so new
limits take effect on the very next request, not after the 1h
idle-evict sweep. Without this step the §3.1 cache-invalidation
contract fails and operators see stale enforcement.
Charging (sqs_messages.go, sqs_messages_batch.go):
- All seven message-plane handlers wired:
SendMessage / SendMessageBatch (Send bucket; batch charges by entry
count), ReceiveMessage / DeleteMessage / DeleteMessageBatch /
ChangeMessageVisibility / ChangeMessageVisibilityBatch (Receive
bucket).
- Throttle check sits OUTSIDE the OCC retry loop per §4.2 -- a
rejected request never reaches the coordinator, so the existing
sendMessageWithRetry et al. cannot busy-loop on a permanent
rate-limit failure.
- sendMessage extracted into prepareSendMessage + validateSend +
body to stay under the cyclop ceiling once the throttle branch
was added.
Error envelope (sqs.go):
- New sqsErrThrottling code + writeSQSThrottlingError helper that
writes 400 + AWS-shaped JSON body + x-amzn-ErrorType + Retry-After
header. The envelope is the same shape AWS uses; SDKs that key off
x-amzn-ErrorType handle it without changes.
Tests:
- adapter/sqs_throttle_test.go: 18 unit tests covering bucket math
(fresh capacity, refill elapsed, refill cap, batch reject preserves
partial credit, Retry-After uses requested count, sub-1-RPS floor,
per-action / per-queue / Default-fallthrough isolation,
invalidateQueue, concurrent -race, default-off short-circuit) and
the validator (nil/empty canonicalisation, both-zero-or-both-positive,
capacity >= 10 batch floor, Default* exempt, capacity >= refill,
parseThrottleFloat range checks, computeRetryAfter floor).
- adapter/sqs_throttle_integration_test.go: 8 end-to-end tests
covering the §6 testing strategy (default-off allows unbounded,
send/recv reject after capacity with correct envelope + Retry-After,
batch charges by entry count, SetQueueAttributes invalidation,
DeleteQueue + CreateQueue lifecycle invalidation, GetQueueAttributes
round-trip, validator rejects below batch min). Run with -race
clean.
Out of scope for this PR (deferred to a follow-up):
- Prometheus counter (sqs_throttled_requests_total) and gauge
(sqs_throttle_tokens_remaining) per the design's §4.1 monitoring/
registry.go entry. The wiring needs a new prometheus collector and
hooking through the Registry seam; punted to keep this PR focused
on the data-plane behaviour. The bucket store's chargeOutcome
already exposes tokensAfter so the gauge wiring is one site.
- §6.5 cross-protocol (Query) parity test: the Query protocol layer
lives on PR #662 and the throttle envelope already has the same
shape on both protocols (writeSQSError handles both), so the test
is a follow-up after #662 lands.
- §6.6 failover behaviour test: the §3.1 "fresh bucket on failover"
contract is implementation-correct (the bucket map is per-process,
no Raft state) but a 3-node failover test needs the cluster
scaffolding and is best added with the Jepsen workload.
@claude review
Claude finished @bootjp's task in 2m 25s —— View job Review
Both trigger fixes from last push: resolved ✓3.D — §3.2 dead-code clarification for 3.D — §11 PR 4 "Mixed-version gate" expanded (low) ✓ Still-open Codex P1 from the
…ini high/medium)

Six fixes from the first round of automated reviews on PR #679:

Codex P1 — bucket reconciliation on stale config (sqs_throttle.go): loadOrInit returned cached buckets without checking that capacity / refillRate still match the queue's current Throttle config. After a leadership change, a node retaining buckets from a prior leader term would keep enforcing the prior term's limits even after a new SetQueueAttributes had committed — the invalidation only runs on the leader that processed the commit, so a different leader's stale buckets survive. Now compares cap/refill on every Load hit and rebuilds (Delete + LoadOrStore) on mismatch.

Codex P1 — invalidate only on actual throttle change (sqs_catalog.go, sqs_throttle.go): cache invalidation in setQueueAttributes ran unconditionally after every successful commit, including unrelated-field updates and no-op writes. Result: any caller could silently restore a noisy tenant's burst capacity by writing a no-op SetQueueAttributes. Now gated on throttleAttributesPresent(in.Attributes), which checks the request for any Throttle* key. Bucket reconciliation above acts as the safety net if a future code path bypasses the gate.

Codex P2 — attributesEqual covers Throttle (sqs_catalog.go): CreateQueue idempotency relied on attributesEqual which did not include Throttle*. A re-create with different limits was treated as idempotent and silently kept the old limits. Now compares the full Throttle struct via throttleConfigEqual; baseAttributesEqual extracted to keep cyclop under the ceiling.

Gemini high — thread throttle through existing meta load (sqs_messages.go, sqs_throttle.go): chargeQueue did one Pebble read per request even though the hot-path handlers (sendMessage, receiveMessage) load the meta moments later. Added chargeQueueWithThrottle that takes pre-loaded throttle config; both hot-path handlers now load meta once and pass throttle in. The throttle check now sits AFTER the QueueDoesNotExist branch so a missing queue no longer consumes a token. Batch + delete handlers keep chargeQueue (one extra meta read) — low-QPS verbs where the simplification of not pulling meta out of the retry loop is worth the per-call cost.

Gemini high — move sweep off hot path (sqs.go, sqs_throttle.go): maybeSweep ran the O(N) sync.Map.Range on whichever request was unlucky enough to trigger the per-minute sweep, causing latency spikes on many-queue clusters. Replaced with runSweepLoop on a background ticker tied to s.reaperCtx (started in Run alongside the existing message reaper, cleaned up by the same reaperCancel in Stop). The hot-path charge() no longer calls into the sweep at all.

Gemini medium — cap retry-after duration (sqs_throttle.go): computeRetryAfter could compute a multi-day Retry-After (or worse, overflow time.Duration arithmetic) for pathologically small refillRate / large requested values. Capped at throttleRetryAfterCap (1h, matching the bucket idle-evict window). The cap is applied before the Duration multiplication so overflow is impossible.

New tests:
- TestBucketStore_ReconcilesBucketOnConfigChange pins the Codex P1 reconciliation contract.
- TestComputeRetryAfter_CapsAtMaximum pins the Gemini medium cap.
- TestThrottleAttributesPresent covers the request-gate helper used by the conditional invalidation.

All tests pass under -race; golangci-lint clean.
…est (PR #679 round 2) Two items from the round-2 Claude review on PR #679:

(1) bucketStore.sweepMu and bucketStore.lastSweep became dead code when the move from on-hot-path maybeSweep to the background runSweepLoop landed in round 1. The sweep() ticker is the only caller, so the serialisation primitive that protected against concurrent on-hot-path callers no longer has a job. Removed both fields and the sweepMu.Lock/lastSweep=now/sweepMu.Unlock block; sweep() now reads b.clock() inline. Field comment updated to describe the post-runSweepLoop design (single-goroutine driver).

(2) TestSQSServer_CatalogCreateIsIdempotent now exercises throttleConfigEqual via a fourth case: same name, same VisibilityTimeout, but the second create adds Throttle* attributes. attributesEqual must notice the diff and the call must reject as QueueNameExists. Without this case a future bug in throttleConfigEqual (e.g. always returning true) would slip past the existing VisibilityTimeout-only test.

Both findings noted in the round-2 Claude review on PR #679.
Round-3 Claude review on PR #679 caught two stale comments left over from the runSweepLoop refactor: throttleIdleEvictAfter said "no goroutine; lookups call sweep() opportunistically" -- both clauses are now false (runSweepLoop is the goroutine, hot-path charge() never calls sweep()). throttleEvictSweepEvery said "bounds how often the sweep runs from the hot path" -- the hot path no longer runs sweep at all; this constant is now just the background ticker interval. Both updated to describe the post-runSweepLoop design and to reinforce that the sweep cost is amortised across the goroutine ticker rather than concentrated on whichever request was unlucky enough to trigger it (which was the old behaviour Gemini high flagged in PR #679 round 1).
…(PR #679 round 4) Round-4 Claude review on PR #679 surfaced a TOCTOU race in the loadOrInit reconciliation path the round-1 fix introduced.

The race: two concurrent goroutines hit the same stale bucket. A detects mismatch, Delete + LoadOrStore stores freshA. B (which also Loaded the stale bucket before A's Delete) then runs its own Delete -- which under the previous unconditional Delete would evict freshA. B's LoadOrStore then stores freshB. Net effect: a charge() that obtained freshA via the in-flight LoadOrStore from A's path keeps charging a bucket that is no longer in the map, while subsequent requests get freshB at full capacity. Total tokens consumed across the mismatch window can exceed the new capacity by one bucket's worth.

Fix: replace the unconditional b.buckets.Delete(key) with b.buckets.CompareAndDelete(key, v) where v is the original stale pointer. CompareAndDelete is a no-op when the map already holds someone else's fresh bucket, so the racer that arrives second takes the winner's fresh bucket via LoadOrStore (which returns the existing entry) instead of orphaning it.

Test: TestBucketStore_ConcurrentReconciliationRespectsNewCapacity seeds a stale bucket (cfgOld, drained), then races 200 goroutines through a charge with cfgNew (capacity 50). Asserts exactly 50 successes -- a Delete-after-replace race would let some past the cap. Runs under -race; the new test plus the existing TestBucketStore_ConcurrentChargesPreserveCount and TestBucketStore_ReconcilesBucketOnConfigChange cover the concurrent-fresh-bucket, concurrent-stale-mismatch, and sequential-mismatch cases.
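A minimal sketch of the reconciliation ordering this fix describes, assuming simplified stand-in types; only the Load / CompareAndDelete / LoadOrStore sequencing is meant to mirror the commit, not the real bucketStore API.

```go
package adapter

import "sync"

type bucketKey struct {
	queue  string
	action string
}

type tokenBucket struct {
	mu         sync.Mutex
	capacity   float64
	refillRate float64
	tokens     float64
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
}

// loadOrInit returns a bucket whose config matches the caller's, rebuilding
// only the exact stale entry it observed so a concurrent racer's fresh bucket
// is never orphaned.
func (b *bucketStore) loadOrInit(key bucketKey, capacity, refill float64) *tokenBucket {
	for {
		if v, ok := b.buckets.Load(key); ok {
			bkt := v.(*tokenBucket)
			if bkt.capacity == capacity && bkt.refillRate == refill {
				return bkt // config unchanged: reuse the cached bucket
			}
			// Stale config: remove only the pointer we loaded. If another
			// goroutine already replaced it, this is a no-op and the
			// LoadOrStore below returns the winner's bucket.
			b.buckets.CompareAndDelete(key, v)
		}
		fresh := &tokenBucket{capacity: capacity, refillRate: refill, tokens: capacity}
		if actual, loaded := b.buckets.LoadOrStore(key, fresh); loaded {
			bkt := actual.(*tokenBucket)
			if bkt.capacity == capacity && bkt.refillRate == refill {
				return bkt
			}
			continue // raced with yet another stale entry; retry
		}
		return fresh
	}
}
```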
Implements PR 2 of the §11 multi-PR rollout from
docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md. The schema
fields land on sqsQueueMeta with the design's wire types; the
validator enforces the §3.2 cross-attribute rules; a temporary
CreateQueue gate rejects PartitionCount > 1 until PR 5 lifts the
gate atomically with the data-plane fanout.
Schema (sqs_catalog.go):
- sqsQueueMeta gains three optional fields:
* PartitionCount uint32 — number of FIFO partitions; 0/1 means
legacy single-partition; > 1 enables HT-FIFO.
* FifoThroughputLimit string — "perMessageGroupId" (default for
HT-FIFO) or "perQueue" (collapses every group to partition 0).
* DeduplicationScope string — "messageGroup" (default for HT-FIFO)
or "queue" (legacy single-window).
- attributesEqual extended via baseAttributesEqual + htfifoAttributesEqual
split (kept under cyclop ceiling).
- queueMetaToAttributes surfaces the configured HT-FIFO fields via
addHTFIFOAttributes so GetQueueAttributes("All") round-trips.
Routing (sqs_partitioning.go, new):
- partitionFor(meta, messageGroupId) uint32 implements §3.3:
FNV-1a over MessageGroupId & (PartitionCount - 1). Edge cases
documented in the godoc — PartitionCount 0/1 → 0, FifoThroughputLimit
perQueue short-circuits to 0, empty MessageGroupId → 0.
- The bitwise-mask optimisation requires PartitionCount be a power
of two; the validator enforces it. A future bug that leaks a
non-power-of-two value is caught by TestPartitionFor_PowerOfTwoMaskingMatchesMod.
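A sketch of the routing rules listed above, assuming a trimmed-down sqsQueueMeta with only the two fields the function reads; the real implementation later inlines the hash (see the follow-up sketch further down), but the edge-case handling is the same.

```go
package adapter

import "hash/fnv"

// Stand-in for the adapter's real queue meta; only the routing-relevant
// fields are shown.
type sqsQueueMeta struct {
	PartitionCount      uint32
	FifoThroughputLimit string // "perMessageGroupId" or "perQueue"
}

func partitionFor(meta sqsQueueMeta, messageGroupID string) uint32 {
	if meta.PartitionCount <= 1 {
		return 0 // legacy single-partition layout (0 and 1 are equivalent)
	}
	if meta.FifoThroughputLimit == "perQueue" {
		return 0 // per-queue throughput collapses every group to partition 0
	}
	if messageGroupID == "" {
		return 0
	}
	h := fnv.New64a()
	h.Write([]byte(messageGroupID))
	// PartitionCount is validator-enforced to be a power of two, so the
	// bitwise mask is equivalent to a modulo.
	return uint32(h.Sum64() & uint64(meta.PartitionCount-1))
}
```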
Validation (sqs_partitioning.go):
- validatePartitionConfig enforces the §3.2 cross-attribute rules:
* PartitionCount must be a power of two in [1, htfifoMaxPartitions=32].
* FifoThroughputLimit / DeduplicationScope are FIFO-only.
* {PartitionCount > 1, DeduplicationScope = "queue"} rejects with
InvalidParameterValue (incoherent params, vs InvalidAttributeValue
for malformed individual values) per §3.2 cross-attribute gate.
- Runs in parseAttributesIntoMeta (after resolveFifoQueueFlag, so
the IsFIFO check sees the post-resolution flag) and in
trySetQueueAttributesOnce (after applyAttributes; IsFIFO comes
from the loaded meta).
- validatePartitionImmutability is the §3.2 rule that PartitionCount,
FifoThroughputLimit, and DeduplicationScope are immutable from
CreateQueue. SetQueueAttributes is all-or-nothing: a request that
touches a mutable attribute alongside an attempted immutable
change rejects the whole request before persisting either.
- snapshotImmutableHTFIFO captures the pre-apply values so the
immutability check has a clean before/after pair.
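An illustrative sketch of the cross-attribute rules above, with error construction reduced to plain errors; the real validator wraps these into the AWS InvalidAttributeValue / InvalidParameterValue envelopes and takes the adapter's own meta type.

```go
package adapter

import "fmt"

const htfifoMaxPartitions = 32

func isPowerOfTwo(n uint32) bool {
	return n != 0 && n&(n-1) == 0
}

// validatePartitionConfig mirrors the §3.2 rules described above. isFIFO is
// the post-resolution FIFO flag; the error strings stand in for the adapter's
// wire-level error codes.
func validatePartitionConfig(isFIFO bool, partitionCount uint32, throughputLimit, dedupScope string) error {
	if !isFIFO && (throughputLimit != "" || dedupScope != "") {
		return fmt.Errorf("InvalidAttributeValue: FifoThroughputLimit and DeduplicationScope are FIFO-only")
	}
	if partitionCount > htfifoMaxPartitions || (partitionCount > 1 && !isPowerOfTwo(partitionCount)) {
		return fmt.Errorf("InvalidAttributeValue: PartitionCount must be a power of two in [1, %d]", htfifoMaxPartitions)
	}
	if partitionCount > 1 && dedupScope == "queue" {
		// Individually valid values that are incoherent together.
		return fmt.Errorf("InvalidParameterValue: PartitionCount > 1 cannot be combined with DeduplicationScope=queue")
	}
	return nil
}
```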
Dormancy gate (sqs_partitioning.go + sqs_catalog.go):
- validatePartitionDormancyGate is the §11 PR 2 temporary gate that
rejects CreateQueue with PartitionCount > 1 with
InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data
plane — not yet enabled"). The schema field exists in the meta
type but no partitioned data can land. Removed in PR 5 in the same
commit that wires the data-plane fanout — gate-and-lift is atomic
so a half-deployed cluster can never accept a partitioned queue
without the data plane to serve it.
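A minimal sketch of the temporary gate, assuming a plain error return; the quoted reason string follows the commit message, the plumbing around it is simplified.

```go
package adapter

import "errors"

// errPartitionedDataPlaneDisabled carries the reason string quoted above; the
// surrounding error envelope is simplified here.
var errPartitionedDataPlaneDisabled = errors.New(
	"InvalidAttributeValue: PartitionCount > 1 requires HT-FIFO data plane — not yet enabled")

// validatePartitionDormancyGate rejects partitioned queues until the data
// plane lands; 0 and 1 both mean the legacy single-partition layout.
func validatePartitionDormancyGate(partitionCount uint32) error {
	if partitionCount > 1 {
		return errPartitionedDataPlaneDisabled
	}
	return nil
}
```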
Tests:
- adapter/sqs_partitioning_test.go: 14 unit tests covering
partitionFor (legacy zero/one, perQueue short-circuit, empty
MessageGroupId, determinism, distribution within ±5% across 8
partitions on 100k samples, power-of-two masking < N), isPowerOfTwo,
validatePartitionConfig (power-of-two, max cap, FIFO-only,
cross-attr InvalidParameterValue, single-partition + queue-dedup OK),
validatePartitionDormancyGate (rejects > 1, allows 0/1),
validatePartitionImmutability (per-attribute change rejects, same-
value no-op succeeds), htfifoAttributesPresent.
- adapter/sqs_partitioning_integration_test.go: 7 end-to-end tests
against a real createNode cluster covering the wire surface:
dormancy gate rejects PartitionCount > 1 (and the gate's reason
surfaces to the operator), allows PartitionCount=1, validator
rejects non-power-of-two, FIFO-only rejection on Standard queue,
cross-attr {partitions, queue dedup} rejects, immutability rejects
SetQueueAttributes change with same-value no-op succeeds, all-or-
nothing — combined mutable+immutable rejects entirely without
persisting the mutable change, GetQueueAttributes round-trip.
- All tests pass under -race; golangci-lint clean.
Out of scope for this PR (§11 PR 3-8):
- PR 3: keyspace threading — partitionIndex through every sqsMsg*Key
constructor, defaulting to 0 so existing queues stay byte-identical.
Gate from PR 2 still in place.
- PR 4: routing layer + --sqsFifoPartitionMap flag + mixed-version
gate (§8.5 capability advertisement + §8 leadership-refusal hook
in kv/lease_state.go). Gate still in place.
- PR 5: send/receive partition fanout, receipt-handle v2, removes
the dormancy gate atomically with the data-plane fanout.
- PR 6: PurgeQueue/DeleteQueue partition iteration + tombstone +
reaper.
- PR 7: Jepsen HT-FIFO workload + metrics.
- PR 8: doc lifecycle bump.
Three medium-priority Gemini findings + one cyclop fix from the rebase onto the updated PR #679 throttling branch.

(1) sqsErrInvalidParameterValue removed (sqs_catalog.go, sqs_partitioning.go, sqs_partitioning_test.go): the constant duplicated sqsErrValidation which already carries "InvalidParameterValue". Reuses the existing constant; comment on the validator clarifies the choice.

(2) snapshotImmutableHTFIFO heap allocation gated on htfifoAttributesPresent (sqs_catalog.go): the snapshot is only needed when the request actually carries an HT-FIFO attribute. A SetQueueAttributes that touches only mutable attributes (e.g. VisibilityTimeout — the common path) now skips the allocation entirely. The validator short-circuit on preApply == nil keeps correctness identical: a request with no HT-FIFO attribute can never change an HT-FIFO field.

(3) partitionFor inlined FNV-1a (sqs_partitioning.go): the hash/fnv.New64a + h.Write([]byte(messageGroupID)) path heap-allocates the byte slice and pays the hash.Hash interface dispatch on every SendMessage. Replaced with a manual FNV-1a loop over the string — measurably faster on the routing hot path. MessageGroupId is capped at 128 chars by validation so the loop is bounded. (A sketch of the inlined loop follows below.)

(4) cyclop (trySetQueueAttributesOnce): the conditional preApply branch + immutability check + Throttle validator pushed cyclop to 12. Extracted applyAndValidateSetAttributes so the function stays under the ceiling; the helper is also useful for any future admin-surface callers that want the same apply+validate flow without the OCC dispatch.

All tests pass under -race; golangci-lint clean.
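As a rough illustration of item (3), an allocation-free FNV-1a over a string looks like this; the constants are the standard 64-bit FNV offset basis and prime, and the function name is illustrative rather than the adapter's.

```go
package adapter

const (
	fnvOffset64 uint64 = 14695981039346656037
	fnvPrime64  uint64 = 1099511628211
)

// fnv1aString hashes s with 64-bit FNV-1a without converting it to []byte,
// so the routing hot path avoids the copy and the hash.Hash interface call.
func fnv1aString(s string) uint64 {
	h := fnvOffset64
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= fnvPrime64
	}
	return h
}
```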
…eanup Three items from the round-2 Claude review on PR #681:

(1) Bug (latent, live in PR 5): validatePartitionConfig only guarded FifoThroughputLimit / DeduplicationScope as FIFO-only, not PartitionCount > 1. A Standard queue with PartitionCount=2 would slip past the validator after PR 5 lifts the dormancy gate. Added a guard inside the !meta.IsFIFO block: PartitionCount > 1 → InvalidAttributeValue("PartitionCount > 1 is only valid on FIFO queues"). PartitionCount 0 / 1 stay accepted on Standard (both mean "single-partition layout"). Test TestValidatePartitionConfig_StandardQueueRejectsHTFIFOAttrs extended to cover PartitionCount=2/4/8/16/32 → reject and PartitionCount=0/1 → accept.

(2) Low: TestSQSServer_HTFIFO_RejectsQueueScopedDedupOnPartitioned had two contradictory comments. The function-level doc said "Test sets PartitionCount=1 to bypass dormancy" but the test actually sends PartitionCount=2; the inline comment claimed "dormancy fires first" but validatePartitionConfig (inside parseAttributesIntoMeta) runs before validatePartitionDormancyGate, so the cross-attr rule fires first. Rewrote the doc to describe the actual control-plane order (validatePartitionConfig → cross-attr rejection → wire 400) and removed the misleading "dormancy fires first" assertion.

(3) Nit: replaced the custom 11-line contains/indexOf helpers with strings.Contains. Identical behaviour, idiomatic.

All findings noted in the round-2 Claude review on PR #681.
…2) (#681)

## Summary

Phase 3.D **PR 2** (schema + validators + dormancy gate) per the §11 multi-PR rollout in [`docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md`](https://github.com/bootjp/elastickv/blob/docs/sqs-phase3-proposals/docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md) (currently on PR #664). Stacks on **PR #679 (Phase 3.C)**: branched off `feat/sqs-throttling-phase3c` so the 3.D schema work shares the same base as the throttling implementation. Will be rebased once PR #679 merges to main.

- `sqsQueueMeta` gains `PartitionCount uint32`, `FifoThroughputLimit string`, `DeduplicationScope string`.
- `partitionFor(meta, messageGroupId) uint32` implements §3.3 — FNV-1a over `MessageGroupId`, masked by `PartitionCount-1` (validator-enforced power-of-two so the bitwise mask is equivalent to mod). Edge cases: `PartitionCount` 0/1 → 0, `FifoThroughputLimit=perQueue` short-circuits to 0, empty `MessageGroupId` → 0.
- `validatePartitionConfig` enforces the §3.2 cross-attribute rules: power-of-two `[1, htfifoMaxPartitions=32]`, `FifoThroughputLimit`/`DeduplicationScope` are FIFO-only, `{PartitionCount > 1, DeduplicationScope = "queue"}` rejects with `InvalidParameterValue` at control-plane time so the operator sees the error before the first `SendMessage`.
- `validatePartitionImmutability` enforces the §3.2 immutability rule (the three HT-FIFO fields are immutable from `CreateQueue` onward). `SetQueueAttributes` is **all-or-nothing** — a request that touches a mutable attribute alongside an attempted immutable change rejects the whole request before persisting either.
- `validatePartitionDormancyGate` is the §11 PR 2 temporary gate: `CreateQueue` with `PartitionCount > 1` rejects with `InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data plane — not yet enabled")`. The schema field exists in the meta type but no partitioned data can land. Removed in PR 5 in the same commit that wires the data-plane fanout — gate-and-lift is **atomic** so a half-deployed cluster can never accept a partitioned queue without the data plane to serve it.
- `GetQueueAttributes("All")` round-trips the configured HT-FIFO fields via `addHTFIFOAttributes`.

## Test plan

- [x] `adapter/sqs_partitioning_test.go` — 14 unit tests covering `partitionFor` (legacy zero/one, perQueue short-circuit, empty MessageGroupId, determinism, distribution within ±5% across 8 partitions on 100k samples, power-of-two masking always < N), `isPowerOfTwo`, `validatePartitionConfig` (power-of-two, max cap, FIFO-only, cross-attr `InvalidParameterValue`, single-partition + queue-dedup OK), `validatePartitionDormancyGate` (rejects > 1, allows 0/1, gate reason surfaces to the operator), `validatePartitionImmutability` (per-attribute change rejects, same-value no-op succeeds), `htfifoAttributesPresent`.
- [x] `adapter/sqs_partitioning_integration_test.go` — 7 end-to-end tests: dormancy gate rejects `PartitionCount > 1` (and the gate's "not yet enabled" message surfaces), allows `PartitionCount=1`, validator rejects non-power-of-two, FIFO-only rejection on Standard queue, cross-attr `{PartitionCount=2, DeduplicationScope=queue}` rejects, immutability `SetQueueAttributes` change rejects with same-value no-op succeeding, all-or-nothing combined mutable+immutable rejects without persisting the mutable change, `GetQueueAttributes` round-trip omits unset fields.
- [x] All tests pass under `-race`.
- [x] `golangci-lint run ./adapter/...` → 0 issues.

## Self-review (5 lenses)

1. **Data loss** — Schema fields land on `sqsQueueMeta` and are persisted via the existing OCC `Put` on `sqsQueueMetaKey`. The dormancy gate runs at `CreateQueue` admission time; once PR 5 lifts the gate, no schema migration is needed (existing queues have `PartitionCount=0` which is equivalent to 1). The keyspace stays untouched in this PR (PR 3 threads `partitionIndex` through `sqsMsg*Key`).
2. **Concurrency / distributed failures** — Validation is purely function-of-input; no shared state. The immutability check loads the meta in the same `nextTxnReadTS` snapshot that `applyAttributes` uses, so a concurrent `SetQueueAttributes` from another writer is caught by the existing OCC `StartTS + ReadKeys` fence (already in place for `setQueueAttributesWithRetry`). Dormancy gate runs at `CreateQueue` admission so a concurrent gate-lift mid-request is impossible.
3. **Performance** — `partitionFor` is one FNV-1a hash + one bitwise mask; constant time. `validatePartitionConfig` is field-comparison only, runs once per `CreateQueue` / `SetQueueAttributes`. `addHTFIFOAttributes` adds at most three map writes and only when fields are non-empty. Hot path `SendMessage` / `ReceiveMessage` is unaffected by this PR (no routing changes — those land in PR 5).
4. **Data consistency** — Cross-attribute rule `{partitions, queue dedup}` rejects at control-plane time so an operator can never end up with a created-but-unserviceable queue (the runtime path that would have surfaced the same error at first send is unreachable). Immutability rule prevents mid-life routing-key reconfiguration that would silently violate within-group FIFO ordering.
5. **Test coverage** — 14 unit + 7 integration tests cover the validator's full decision tree (each rule has a dedicated table-driven case), the dormancy gate's accept/reject paths with the operator-visible reason, the immutability all-or-nothing semantics, and `partitionFor`'s edge cases plus a 100k-sample distribution test that catches non-uniform routing immediately. Lint is clean.

## Out of scope (subsequent §11 rollout PRs)

- **PR 3** — keyspace threading: `partitionIndex` through every `sqsMsg*Key` constructor, defaulting to 0 so existing queues stay byte-identical. Dormancy gate from this PR remains in place.
- **PR 4** — routing layer + `--sqsFifoPartitionMap` flag + mixed-version gate (§8.5 capability advertisement + §8 leadership-refusal hook in `kv/lease_state.go`). Dormancy gate still in place.
- **PR 5** — send/receive partition fanout + receipt-handle v2 codec. **Removes the dormancy gate atomically with the data-plane fanout** so the gate-and-lift land in one commit.
- **PR 6** — `PurgeQueue` / `DeleteQueue` partition iteration + tombstone schema update + reaper.
- **PR 7** — Jepsen HT-FIFO workload + metrics.
- **PR 8** — partial-doc lifecycle bump.

## Branch base

Branched off `feat/sqs-throttling-phase3c` (PR #679) which itself branches off `docs/sqs-phase3-proposals` (PR #664). Will rebase against `main` after both ancestors merge.
…round 5) Two findings from the round-5 Codex review:

(1) P1 — Default* batch-capacity floor: validateThrottleConfig disabled the >=10 batch floor for ThrottleDefault* on the assumption that Default never serves a batch verb. But resolveActionConfig in sqs_throttle.go falls Send and Receive traffic through to the Default bucket whenever the corresponding Send*/Recv* pair is unset. So a config like ThrottleDefaultCapacity=5, ThrottleDefaultRefillPerSecond=1 was accepted at the validator, then made every full SendMessageBatch / DeleteMessageBatch (charge=10) permanently unserviceable because the Default bucket could never accumulate 10 tokens. Flipped the third validateThrottlePair call's requireBatchCapacity from false to true; updated the inline comment to spell out why Default* needs the floor (the design doc note in §3.2 about Default* being exempt was wrong about the fall-through). New unit test TestValidateThrottleConfig_DefaultBucketBatchFloor covers DefaultCapacity=1/5 reject and DefaultCapacity=10 accept.

(2) P2 — sweep eviction race: Earlier sweep computed idle under bucket.mu, released the lock, then unconditionally Deleted the entry. A concurrent charge() in that window could refill+take a token, then Delete evicted the just-used bucket and the next request got a fresh full-cap bucket — the just-taken token was effectively undone, allowing excess throughput. Fixed in two parts (see the sketch below):
- Hold bucket.mu across the Delete so the idle observation cannot be invalidated between check and delete. A concurrent charge() that wants the bucket either runs to completion before sweep acquires mu (sweep then sees the freshened lastRefill and skips the delete) or waits for sweep to release mu (charge then takes the bucket — but sweep has already removed it from the map, so the charge succeeds against an orphan bucket and only the next-after-charge request gets the full-cap rebuild — bounded one-token leak).
- CompareAndDelete(k, v) ensures sweep does not evict a replacement bucket inserted by invalidateQueue or a future reconciliation path.

Holding bucket.mu across sync.Map.Delete is safe — sync.Map.Load is wait-free on the read-only path and never blocks on bucket.mu, so there is no AB-BA cycle with charge().

New regression test TestBucketStore_SweepDoesNotEvictRecentlyUsedBucket backdates a bucket's lastRefill by 2h, advances the clock by 2h to force the bucket past the idle cutoff, then races sweep against a 50-iteration charge loop. After both finish the bucket must still be in the store — at least one charge interleaves with sweep's re-check window and sweep observes the freshened lastRefill.

All tests pass under -race; golangci-lint clean.
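A sketch of the sweep ordering from item (2), under the assumption of simplified bucket and store types; what matters here is that the idle check and the CompareAndDelete both happen while bucket.mu is held.

```go
package adapter

import (
	"sync"
	"time"
)

const throttleIdleEvictAfter = time.Hour

type tokenBucket struct {
	mu         sync.Mutex
	lastRefill time.Time
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
	clock   func() time.Time
}

func (b *bucketStore) sweep() {
	now := b.clock()
	b.buckets.Range(func(k, v any) bool {
		bkt := v.(*tokenBucket)
		bkt.mu.Lock()
		// A concurrent charge that freshened lastRefill before we acquired the
		// lock makes the bucket non-idle again, so it is skipped here.
		if now.Sub(bkt.lastRefill) >= throttleIdleEvictAfter {
			// Never evicts a replacement bucket inserted under the same key by
			// invalidateQueue or a reconciliation path.
			b.buckets.CompareAndDelete(k, v)
		}
		bkt.mu.Unlock()
		return true
	})
}
```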
… round 6) Round 5's sweep+CompareAndDelete fix closed half the eviction race — sweep can no longer evict a bucket whose lastRefill was freshened by a concurrent charge — but left a second window. Goroutines that loaded the bucket from sync.Map *before* sweep removed it from the map were still blocked on bucket.mu, and once sweep released mu they would charge the orphan bucket while later requests created a fresh full-capacity entry via loadOrInit. A burst aligned with a sweep tick could therefore consume up to 2x the configured capacity.

Three changes plug the window (see the sketch below):
- tokenBucket gains an `evicted bool` field, protected by mu.
- sweep, invalidateQueue, and the loadOrInit reconciliation path all set evicted=true under mu *after* successfully removing the bucket from the map (CompareAndDelete success or LoadAndDelete return).
- charge() loops on a chargeBucket helper that checks evicted under mu and signals retry; the caller drops the orphan reference and loops back through loadOrInit, which converges on whatever live bucket the map now holds.

CreateQueue now calls invalidateQueue on a genuine commit (tryCreateQueueOnce post-Dispatch) — DeleteQueue invalidated, but a sendMessage holding pre-delete meta could recreate a bucket between DeleteQueue's invalidate and the next CreateQueue commit, so a same-name recreate would inherit used token state. Invalidating after the genuine create resets the bucket regardless of in-flight traffic to the prior incarnation. The idempotent-return path (existing queue with identical attributes) deliberately skips invalidation.

Tests:
- TestBucketStore_SweepRaceDoesNotInflateBudget: -race stress test that runs 4 sweeps against 64 chargers on a single backdated bucket; total successful charges must stay <= capacity. The old code path could yield up to 2x capacity.
- TestBucketStore_OrphanedBucketRetriesToLiveEntry: deterministic exercise of the evicted-flag retry — sweep evicts, the next charge allocates a fresh bucket via loadOrInit.
- TestBucketStore_InvalidateMarksOrphanEvicted: pins the round-6 invalidateQueue change (LoadAndDelete + flag set under mu).
- The pre-existing TestBucketStore_SweepDoesNotEvictRecentlyUsedBucket is replaced; the prior assertion (stillThere) was vacuously true because the for-range 50 charge loop would always recreate the bucket via loadOrInit after any eviction (Claude Low on round 5).

go test -race -count=20 of the new tests passes; full SQS test package passes under -race; golangci-lint clean.

Round-6 review pulled four other items not actioned:
- Codex P2 "CreateQueue idempotency must compare Throttle" — already addressed: attributesEqual at sqs_catalog.go:726 calls baseAttributesEqual + throttleConfigEqual + htfifoAttributesEqual.
- Gemini high "chargeQueue performs redundant storage read" — already addressed: hot-path handlers (sendMessage, receiveMessage) call chargeQueueWithThrottle to skip the second meta load.
- Gemini high "maybeSweep on hot path" — already addressed: runSweepLoop runs sweep on a background ticker wired to reaperCtx.
- Gemini medium "cap retryAfter at 10ms to prevent overflow" — already addressed: throttleRetryAfterCap=1h cap before multiply prevents overflow; 10ms cap is inappropriate for SQS rate-limit semantics (clients would retry constantly under contention).
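A sketch of the evicted-flag retry described above, with heavily simplified stand-ins for loadOrInit and the bucket math; the point is the retry signal, not the adapter's real charge API.

```go
package adapter

import "sync"

type bucketKey struct{ queue, action string }

type tokenBucket struct {
	mu      sync.Mutex
	tokens  float64
	evicted bool
}

type bucketStore struct{ buckets sync.Map }

// loadOrInit returns the live bucket for key, creating one at full capacity
// if none exists (reconciliation against stale config omitted here).
func (b *bucketStore) loadOrInit(key bucketKey, capacity float64) *tokenBucket {
	if v, ok := b.buckets.Load(key); ok {
		return v.(*tokenBucket)
	}
	v, _ := b.buckets.LoadOrStore(key, &tokenBucket{tokens: capacity})
	return v.(*tokenBucket)
}

// charge spends n tokens, retrying through loadOrInit whenever it observes a
// bucket that was evicted after it was loaded.
func (b *bucketStore) charge(key bucketKey, capacity, n float64) bool {
	for {
		bkt := b.loadOrInit(key, capacity)
		ok, retry := bkt.take(n)
		if !retry {
			return ok
		}
	}
}

func (t *tokenBucket) take(n float64) (ok, retry bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.evicted {
		return false, true // orphan: caller must re-resolve the live bucket
	}
	if t.tokens < n {
		return false, false
	}
	t.tokens -= n
	return true, false
}
```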
Round 6's invalidateQueue used LoadAndDelete-then-lock — a charger
that loaded the pointer pre-LoadAndDelete could win the mu race and
charge the just-removed bucket while evicted was still false (Codex
P2 on round 6). The fix mirrors sweep's ordering:
1. Load the pointer from the map (without deleting).
2. Acquire bucket.mu.
3. CompareAndDelete with the loaded v (rejects a replacement
bucket inserted by a concurrent reconciliation).
4. Set evicted=true under mu.
5. Unlock.
A charger blocked on mu now sees evicted=true on entry and retries
against the live entry, instead of charging the orphan.
Replaced the over-strict TestBucketStore_InvalidateRace... with
TestBucketStore_InvalidateUnderConcurrencyIsRaceFree: invalidate's
reset-to-full-capacity is the *intended* semantics of an invalidation
event (DeleteQueue / SetQueueAttributes / CreateQueue all want the
new policy applied to a fresh bucket), so the post-invalidate fresh
bucket can legitimately absorb up to capacity additional tokens —
that 2x window is structural, not a bug. Asserting successful-charge
count <= capacity would also pass on the buggy code path under most
schedulings, so the assertion would not differentiate old vs new.
The deterministic TestBucketStore_InvalidateMarksOrphanEvicted
already pins the actual mechanism (evicted=true under mu on dropped
bucket); the new stress test exists to surface any -race finding the
new lock ordering might introduce.
go test -race -count=20 of every TestBucketStore_* test passes;
golangci-lint clean.
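The same five-step ordering as a sketch, reusing the simplified types from the earlier sketches; the step numbers in the comments match the list above, and the action-key iteration is an assumption about how per-queue buckets are enumerated.

```go
package adapter

import "sync"

type bucketKey struct{ queue, action string }

type tokenBucket struct {
	mu      sync.Mutex
	evicted bool
}

type bucketStore struct{ buckets sync.Map }

// invalidateQueue drops every cached bucket for the queue so the next request
// rebuilds from the freshly committed config. CompareAndDelete only removes
// the exact pointer that was loaded, so a replacement bucket inserted by a
// concurrent reconciliation survives untouched.
func (b *bucketStore) invalidateQueue(queue string, actions []string) {
	for _, action := range actions {
		key := bucketKey{queue: queue, action: action}
		v, ok := b.buckets.Load(key) // 1. load without deleting
		if !ok {
			continue
		}
		bkt := v.(*tokenBucket)
		bkt.mu.Lock() // 2. acquire bucket.mu
		if b.buckets.CompareAndDelete(key, v) { // 3. remove only the loaded pointer
			bkt.evicted = true // 4. chargers blocked on mu now retry on the live entry
		}
		bkt.mu.Unlock() // 5. release
	}
}
```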
… round 6.2) validatePartitionConfig documents PartitionCount=0 (unset) and =1 (explicit single partition) as semantically identical legacy / single-partition routing, but htfifoAttributesEqual was using strict numeric equality. So a queue created without PartitionCount (stored as 0) followed by a CreateQueue retry that explicitly passes PartitionCount=1 was rejected as "different attributes" — broken idempotency for any provisioning flow that normalises defaults differently between attempts. (Codex P2 on PR #679 round 6.1.)

Fix: normalisePartitionCount maps 0 -> 1 before comparison; values >1 pass through unchanged so HT-FIFO partition counts still require exact match. The on-disk schema is unchanged.

Tests:
- TestHTFIFOAttributesEqual_PartitionCountZeroAndOneEquivalent — pins (0, 1) equality, symmetry, and that real divergence (2 vs 0, 2 vs 1) still rejects.
- TestNormalisePartitionCount — pins the helper directly.

go test -race ./adapter/... pass; golangci-lint clean.
#679 round 6.3) Round 6.2 normalised the CreateQueue idempotency path (htfifoAttributesEqual) but left the SetQueueAttributes immutability path (validatePartitionImmutability) using strict equality. Same 0 == 1 gap (Claude Low on round 6.2): 1. CreateQueue with no PartitionCount -> stored as 0. 2. SetQueueAttributes(PartitionCount=1) -> requested.PartitionCount = 1. 3. validatePartitionImmutability: 0 != 1 -> rejects "PartitionCount is immutable", even though the partition layout is unchanged. Fix: validatePartitionImmutability now normalises both sides through normalisePartitionCount before comparison, mirroring the round-6.2 fix. The narrow gap surfaced under the §11 PR 2 dormancy gate (SetQueueAttributes does not call the gate, so this path is reachable on existing queues), so a fix is correct now rather than deferring to PR 5. Test: - TestValidatePartitionImmutability_PartitionCountZeroAndOneEquivalent table-driven over all (stored, requested) ∈ {0,1,2,4,8} pairs; no-op pairs accept, real-change pairs reject. Reverting either call to strict equality flips the relevant rows. go test -race ./adapter/... pass; golangci-lint clean.
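A sketch of the normalisation helper shared by rounds 6.2 and 6.3; the helper name matches the commits, while the comparison wrapper is an illustration of how both call sites can use it, not the adapter's actual code.

```go
package adapter

// normalisePartitionCount treats 0 (attribute unset) and 1 (explicit single
// partition) as the same single-partition layout; larger counts are returned
// unchanged so real HT-FIFO configurations still need an exact match.
func normalisePartitionCount(n uint32) uint32 {
	if n == 0 {
		return 1
	}
	return n
}

// partitionCountEquivalent is the comparison both the CreateQueue idempotency
// check and the SetQueueAttributes immutability check can share (illustrative
// wrapper, not the adapter's actual call sites).
func partitionCountEquivalent(stored, requested uint32) bool {
	return normalisePartitionCount(stored) == normalisePartitionCount(requested)
}
```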
## Summary

Phase 3.C implementation: per-queue rate limiting via token-bucket throttling. Implements the design from [`docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md`](https://github.com/bootjp/elastickv/blob/docs/sqs-phase3-proposals/docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md) (currently on PR #664).

- Token-bucket store on `*SQSServer`: `sync.Map[bucketKey]*tokenBucket`, per-bucket `sync.Mutex` so cross-queue traffic never serialises on a process-wide lock. Lazy idle-evict (1h) bounds memory.
- New `sqsQueueMeta.Throttle *sqsQueueThrottle` field with six float64 sub-fields (Send/Recv/Default × Capacity/RefillPerSecond), wired through the existing `sqsAttributeAppliers` dispatch as `ThrottleSendCapacity` / `ThrottleSendRefillPerSecond` / etc.
- Validator (`validateThrottleConfig`) enforces the §3.2 cross-attribute rules: each (capacity, refill) pair both-zero or both-positive; capacity ≥ refill; Send/Recv capacities ≥ 10 (the SendMessageBatch / DeleteMessageBatch max charge); per-field hard ceiling 100k.
- All seven message-plane handlers wired (SendMessage[Batch], ReceiveMessage, DeleteMessage[Batch], ChangeMessageVisibility[Batch]). Throttle check sits **outside** the OCC retry loop per §4.2 — a rejected request never reaches the coordinator.
- On rejection: `400 Throttling` with `x-amzn-ErrorType` + AWS-shaped JSON envelope + `Retry-After` header computed from the §3.4 formula (numerator is requested count, not 1, so a batch verb does not get told to retry in 1s when it really needs 10s).
- Cache invalidation on `SetQueueAttributes` and `DeleteQueue` (§3.1): drops every bucket belonging to the queue after the Raft commit so new limits take effect on the very next request, not after the 1h idle-evict sweep.
- `GetQueueAttributes("All")` round-trips Throttle* fields so SDKs can confirm the config landed.

## Test plan

- [x] `adapter/sqs_throttle_test.go` — 18 unit tests:
  - bucket math: fresh capacity, refill elapsed, refill cap-at-capacity, batch reject preserves partial credit, Retry-After uses requested count, sub-1-RPS floor;
  - isolation: per-action, per-queue, Default-fallthrough shares one bucket;
  - lifecycle: `invalidateQueue` drops all action keys;
  - concurrency: `-race` clean, exactly-capacity successes;
  - default-off short-circuit;
  - validator: nil/empty canonicalisation, both-zero-or-both-positive, capacity ≥ 10 batch floor, Default* exempt, capacity ≥ refill, `parseThrottleFloat` range checks, `computeRetryAfter` floor.
- [x] `adapter/sqs_throttle_integration_test.go` — 8 end-to-end tests against a real `createNode` cluster:
  - default-off allows unbounded;
  - send/recv reject after capacity with correct envelope + `Retry-After`;
  - batch charges by entry count;
  - `SetQueueAttributes` invalidation (raise-and-immediate-success);
  - `DeleteQueue` + `CreateQueue` lifecycle invalidation;
  - `GetQueueAttributes` round-trip;
  - validator rejects below batch min.
- [x] `golangci-lint run` clean.
- [x] `go test -race -run "TestBucketStore|TestValidateThrottleConfig|TestParseThrottleFloat|TestComputeRetryAfter|TestSQSServer_Throttle" ./adapter/...` clean.

## Self-review (5 lenses)

1. **Data loss** — Throttle config is on `sqsQueueMeta`, persisted via the existing `setQueueAttributesWithRetry` OCC commit. Bucket state is per-process and never written to storage, by design (§3.1: replicating bucket state would cost a Raft commit per token decrement, defeating the bucket). On leader failover the new leader rebuilds at full capacity — same as the AWS region-failover semantic. No write path can lose throttle config.
2. **Concurrency** — Per-bucket `sync.Mutex`, never held across `sync.Map` ops. `LoadOrStore` race on first insert is safe (both racers compute identical config from the same meta snapshot). Concurrent `SetQueueAttributes` + in-flight charge: the charge may briefly use the old bucket; the next request rebuilds from the fresh meta. `-race` test pins the count invariant: exactly `capacity` successes out of N concurrent goroutines, never more.
3. **Performance** — Hot path on unconfigured queues is one `nil` check (`Throttle == nil`) → return. On configured queues: one `sync.Map.Load` (lock-free), one mutex acquire on the bucket. No global lock. `sync.Map`'s read-mostly optimisation matches the access pattern. The lazy idle-evict sweep runs at most once per minute from the hot path so a many-queue cluster does not pay an O(N) cost per request.
4. **Data consistency** — Throttling sits outside OCC; a rejected request never touches the coordinator (§4.2). The OCC retry loop in `sendMessageWithRetry` cannot busy-loop on a permanent rate-limit failure. Cache invalidation after the Raft commit (not before) so the rebuilt bucket reads the freshly committed config — there is no window where a request can read a stale bucket built from the old config.
5. **Test coverage** — 18 unit tests cover bucket math, isolation, lifecycle, concurrency, default-off; 8 integration tests cover the end-to-end wire path. Both pass under `-race`. Validator has a dedicated table-driven test per rule.

## Out of scope (deferred to follow-ups)

- **Prometheus metrics** (counter `sqs_throttled_requests_total{queue,action}` + gauge `sqs_throttle_tokens_remaining{queue,action}` per the design's §4.1 monitoring/registry.go entry). The bucket store's `chargeOutcome` already exposes `tokensAfter` so the gauge wiring is one site; the counter is a one-line increment after `writeSQSThrottlingError`. Punted to keep this PR focused on data-plane behaviour.
- **§6.5 cross-protocol (Query) parity test** — the Query protocol layer lives on PR #662 (separate branch); the throttle envelope already has the same shape on both protocols (`writeSQSError` handles both), so the test is a follow-up after #662 lands.
- **§6.6 failover behaviour test** — the §3.1 "fresh bucket on failover" contract is implementation-correct (the bucket map is per-process, no Raft state) but a 3-node failover test needs cluster scaffolding and is best added with the Jepsen workload.

## Branch base

Branched off [`docs/sqs-phase3-proposals`](https://github.com/bootjp/elastickv/tree/docs/sqs-phase3-proposals) (PR #664) so the design doc is part of the reviewable surface. Will rebase against `main` after #664 merges.
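To make the bucket math from the test-plan bullets concrete, here is a hedged sketch of refill-then-charge under a per-bucket mutex; field names are illustrative and the adapter's real chargeOutcome type is omitted.

```go
package adapter

import (
	"sync"
	"time"
)

type tokenBucket struct {
	mu         sync.Mutex
	capacity   float64
	refillRate float64 // tokens per second
	tokens     float64
	lastRefill time.Time
}

// tryCharge refills the bucket for the elapsed wall-clock time, caps at
// capacity, and then spends n tokens only if the whole charge fits; a
// rejected batch keeps its partial credit.
func (t *tokenBucket) tryCharge(now time.Time, n float64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	elapsed := now.Sub(t.lastRefill).Seconds()
	if elapsed > 0 {
		t.tokens += elapsed * t.refillRate
		if t.tokens > t.capacity {
			t.tokens = t.capacity
		}
		t.lastRefill = now
	}
	if t.tokens < n {
		return false // reject without partial consumption
	}
	t.tokens -= n
	return true
}
```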
…sals # Conflicts: # adapter/sqs_catalog.go
@claude review
Claude finished @bootjp's task in 5m 42s —— View job Review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2edf9d49d5
```go
type bucketKey struct {
	queue  string
	action string
}
```
Include queue generation in throttle bucket keys
Fresh evidence from this implementation: bucketKey is still only (queue, action), and reconciliation in loadOrInit only checks capacity/refill. If a queue is deleted/recreated with the same throttle values (or leadership moves to a node holding an old cached bucket), the old token balance is reused for the new incarnation because invalidation is process-local, so the recreated queue can start already drained (or unexpectedly bursty) instead of starting from a fresh full bucket.
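One hypothetical shape of this suggestion, purely illustrative and not code from the PR: key buckets on a per-incarnation generation so a recreated queue (or a stale cache on a new leader) starts from a fresh bucket.

```go
package adapter

// Hypothetical: a generation-aware bucket key. The generation would be bumped
// on every CreateQueue commit for the queue name, so a delete-and-recreate
// cycle can never reuse the previous incarnation's token balance. This mirrors
// the reviewer's proposal, not existing code.
type bucketKey struct {
	queue      string
	generation uint64
	action     string
}
```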

Summary
Docs-only PR. Two design proposals for the remaining Phase 3 SQS items per docs/design/2026_04_24_proposed_sqs_compatible_adapter.md Section 14 (Phase 3 bullets). Both items were explicitly called out as needing separate design docs before any implementation work; this PR lands those proposals so the implementation PRs have a reviewed architecture to build on.

3.C — Per-queue throttling and tenant fairness (proposal)

Per-queue token-bucket throttling configured on queue meta (no separate keyspace), evaluated at the SQS adapter layer on the leader (no Raft per request), surfaced as the AWS Throttling error envelope so SDK retry/backoff just works.

Key decisions:
- SetQueueAttributes.
- SendMessage.
- InvalidAttributeName on the Throttle* attributes (matches AWS behaviour for unknown attributes); the data-plane enforcement runs for everyone.

3.D — Split-queue FIFO (high-throughput FIFO) (proposal)

Per-MessageGroupId hash partitioning across multiple Raft groups, mirroring AWS High Throughput FIFO. Within-group ordering preserved; across-group throughput scales with the partition count.

Key decisions:
- PartitionCount = 0 path is the legacy layout; no migration runs implicitly).
- hash & (N-1) and future offline rebuilds stay tractable.
- MessageGroupId pinning all traffic to one partition is documented and accepted (the feature is for cooperative operators).

Test plan
_proposed_filename for a 1-character mismatch that is fine to leave for now).

Self-review
This is a docs-only PR; the 5-lens self-review collapses to:
Stacking
This PR is independent of #650, #659, and #662. Branched from current
main. Merge whenever ready — landing the proposal docs early lets reviewers comment on the architecture before the implementation PRs go up.