docs(sqs): proposals for Phase 3.C (throttling) + 3.D (split-queue FIFO) #664

Open

bootjp wants to merge 30 commits into main from docs/sqs-phase3-proposals

Conversation

bootjp (Owner) commented Apr 26, 2026

Summary

Docs-only PR. Two design proposals for the remaining Phase 3 SQS items per docs/design/2026_04_24_proposed_sqs_compatible_adapter.md Section 14 (Phase 3 bullets). Both items were explicitly called out as needing separate design docs before any implementation work; this PR lands those proposals so the implementation PRs have a reviewed architecture to build on.

3.C — Per-queue throttling and tenant fairness (proposal)

Per-queue token-bucket throttling configured on queue meta (no separate keyspace), evaluated at the SQS adapter layer on the leader (no Raft per request), surfaced as the AWS Throttling error envelope so SDK retry/backoff just works.

Key decisions:

  • Default-off. Existing queues are unaffected; operators opt in per queue via SetQueueAttributes.
  • Per-action buckets (Send / Receive / Default) so a slow consumer cannot pin the producer.
  • Per-leader buckets, no replication. Worst case on failover: one extra burst on the new leader. Acceptable per AWS-equivalent behaviour at region failover boundaries; replicating would cost a Raft commit per SendMessage.
  • Batch verbs charge by entry count, not call count, with all-or-nothing rejection (matches AWS).
  • Admin-only configuration plane. Standard SQS clients see InvalidAttributeName on the Throttle* attributes (matches AWS behaviour for unknown attributes); the data-plane enforcement runs for everyone.
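To make the bucket model concrete, here is a minimal Go sketch of the per-action bucket described above. It is illustrative only: `bucketKey` mirrors the doc excerpt quoted later in this thread, while the field names, the `charge` signature, and the Go 1.21 `min` builtin are assumptions, not the proposal's actual code.

```go
package throttle

import (
	"sync"
	"time"
)

type bucketKey struct {
	queue  string
	action string // "Send" | "Receive" | "*"
}

// tokenBucket refills continuously at refillPerSec up to capacity.
type tokenBucket struct {
	mu           sync.Mutex
	tokens       float64
	capacity     float64
	refillPerSec float64
	lastRefill   time.Time
}

// charge deducts n tokens all-or-nothing: n is 1 for single verbs and
// len(Entries) for batch verbs, matching the batch semantics above.
func (b *tokenBucket) charge(n float64, now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	elapsed := now.Sub(b.lastRefill).Seconds()
	b.tokens = min(b.capacity, b.tokens+elapsed*b.refillPerSec)
	b.lastRefill = now
	if b.tokens < n {
		return false // caller emits the AWS Throttling error envelope
	}
	b.tokens -= n
	return true
}
```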

3.D — Split-queue FIFO (high-throughput FIFO) (proposal)

Per-MessageGroupId hash partitioning across multiple Raft groups, mirroring AWS High Throughput FIFO. Within-group ordering preserved; across-group throughput scales with the partition count.

Key decisions:

  • Existing single-partition FIFO queues stay byte-identical (PartitionCount = 0 path is the legacy layout; no migration runs implicitly).
  • Power-of-two partition counts only (1, 2, 4, 8, 16, 32) so the routing step is hash & (N-1) and future offline rebuilds stay tractable.
  • Partition count is immutable after first SendMessage. Live re-partitioning would break ordering for in-flight messages of every group whose hash bucket changed; out of scope.
  • Multi-PR rollout plan with an explicit "gate of no return" called out at PR 5 (the data-plane PR). PRs 1–4 are reversible no-ops on data layout; once a partitioned FIFO holds real data, rollback means draining and recreating the queue.
  • FNV-1a hash (deterministic across processes / Go versions / architectures). Risk of attacker-controlled MessageGroupId pinning all traffic to one partition is documented and accepted (the feature is for cooperative operators).
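For illustration, a sketch of the routing step these decisions imply: FNV-1a over the MessageGroupId, masked because N is a power of two. The signature is hypothetical (the proposal's pseudocode reads queue meta), and the perQueue short-circuit anticipates the §3.3 guard settled on later in this thread.

```go
package fifo

import "hash/fnv"

// partitionFor maps a message group to a partition index. Partition
// counts are powers of two, so the reduction is a bit mask, not a modulo.
func partitionFor(partitionCount uint32, fifoThroughputLimit, messageGroupID string) uint32 {
	// Legacy single-partition queues and perQueue mode collapse to 0.
	if partitionCount <= 1 || fifoThroughputLimit == "perQueue" {
		return 0
	}
	h := fnv.New32a() // FNV-1a: deterministic across processes, Go versions, architectures
	_, _ = h.Write([]byte(messageGroupID))
	return h.Sum32() & (partitionCount - 1) // hash & (N-1)
}
```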

Test plan

Self-review

This is a docs-only PR; the 5-lens self-review collapses to:

  1. Data loss / Concurrency / Performance / Consistency — N/A, no code touched.
  2. Test coverage — N/A, no code touched. The proposals themselves include a Testing Strategy section (§6 / §9) so the implementation PRs have explicit acceptance criteria.

Stacking

This PR is independent of #650, #659, and #662. Branched from current main. Merge whenever ready — landing the proposal docs early lets reviewers comment on the architecture before the implementation PRs go up.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive design specification for queue-level throttling in SQS with configurable send/receive rate limits, error handling, and phased rollout plan
    • Added comprehensive design specification for FIFO queue partitioning enabling parallel message processing across partitions while preserving message ordering within partitions

Two design proposals for the remaining Phase 3 SQS items per
docs/design/2026_04_24_proposed_sqs_compatible_adapter.md (Section
14, Phase 3 bullets). Both are explicitly called out as needing
separate design docs before implementation; this lands them so
implementation work has a reviewed architecture to start from.

3.C — per-queue token-bucket throttling, configured on queue meta,
evaluated at the SQS adapter layer on the leader. Default-off,
AWS-shape Throttling error envelope, per-leader buckets (no Raft
per request). Proposal covers: bucket model, charging per verb /
batch entry, configuration validation rules, AWS-compatibility
posture (admin-only attribute, hidden from standard principals),
multi-shard correctness, observability, alternatives, open
questions.

3.D — high-throughput FIFO via per-MessageGroupId hash partitioning
across multiple Raft groups. Multi-PR rollout plan with an explicit
"gate of no return" called out at the data-plane PR. Existing
single-partition FIFO queues stay byte-identical (PartitionCount=0
path). Proposal covers: data model and key encoding, routing,
SendMessage / ReceiveMessage / PurgeQueue / DeleteMessage flows,
shard router config syntax, reaper implications, migration story
(no live re-partitioning), failure modes, mixed-version cluster
behaviour, alternatives, open questions.

No code changes; review-only PR.

coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 18 minutes and 12 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b710464a-f267-46a0-bb94-b04db59bc4d7

📥 Commits

Reviewing files that changed from the base of the PR and between 829cf97 and 2edf9d4.

📒 Files selected for processing (12)
  • adapter/sqs.go
  • adapter/sqs_catalog.go
  • adapter/sqs_catalog_test.go
  • adapter/sqs_messages.go
  • adapter/sqs_messages_batch.go
  • adapter/sqs_partitioning.go
  • adapter/sqs_partitioning_integration_test.go
  • adapter/sqs_partitioning_test.go
  • adapter/sqs_throttle.go
  • adapter/sqs_throttle_integration_test.go
  • adapter/sqs_throttle_test.go
  • docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md
📝 Walkthrough

Two new design documents propose enhancements to the elastickv SQS adapter: per-queue token-bucket throttling with leader-evaluated rate limiting, and HT-FIFO-style queue partitioning for .fifo queues with deterministic routing and partition-aware message handling.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SQS Design Proposals: docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md, docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md | Two design documents proposing SQS adapter enhancements: per-queue token-bucket throttling with Send/Recv independent buckets and queue-meta configuration; and HT-FIFO partitioning for .fifo queues with MessageGroupId-based routing, partition-aware scanning, receipt-handle versioning, and mixed-version leadership safeguards. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

🐰✨ Design docs dance through the digital spring,
Throttled buckets and partitions take wing,
HT-FIFO hops quicken the queue's steady beat,
SQS adventures make the release complete! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title directly and concisely describes the primary changes: two design proposals for Phase 3.C (per-queue throttling) and Phase 3.D (split-queue FIFO) SQS features. It is clear, specific, and accurately reflects the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces two design documents for the SQS adapter: one for per-queue throttling and another for split-queue FIFO partitioning. The throttling proposal aims to ensure tenant fairness and prevent resource exhaustion, while the split-queue FIFO design targets high-throughput requirements by distributing queues across multiple Raft groups. Review feedback identifies a critical backward compatibility risk in the proposed keyspace schema that would break existing data access. Additionally, there are concerns regarding potential performance bottlenecks from coarse-grained locking in the throttling store and a design flaw in the partitioned ReceiveMessage logic that could lead to 'false empty' responses for clients.

Comment on lines +49 to +51
A partition is identified by the tuple `(queueName, partitionIndex)` where `partitionIndex ∈ [0, N)`. Every existing message-keyspace key gains the `partitionIndex` as a fixed-width segment immediately after the queue name. Concretely, today's `!sqs|msg|data|<queue>|<gen>|<msgID>` becomes `!sqs|msg|data|<queue>|<partition>|<gen>|<msgID>`. The `<partition>` segment is `appendU64(partitionIndex)` (fixed 8 bytes BE), so:

- Single-partition FIFOs (and all Standard queues) write `partitionIndex = 0` and read with the same prefix as today — the schema is byte-identical for `N=1`.

[High]

The proposed key format !sqs|msg|data|<queue>|<partition>|<gen>|<msgID> is not backward compatible with the existing format. Inserting the 8-byte <partition> segment shifts the offsets for <gen> and <msgID>, meaning the new code will fail to find existing messages on disk. This introduces a significant operational risk and prevents rolling upgrades. Per repository rules, the design must detail mitigation strategies (e.g., a temporary 'bridge' or 'proxy' mode) and strategies for zero-downtime cutovers (e.g., dual-write proxy or logical replication) to avoid service interruption.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Comment on lines +78 to +79
mu sync.Mutex
buckets map[bucketKey]*tokenBucket // (queueName, action) -> bucket

[Medium]

Using a single sync.Mutex to protect the buckets map will likely become a performance bottleneck, as this lock must be acquired on every SendMessage and ReceiveMessage call. Per the general rules, frequently accessed fields on hot paths should prefer more granular locking or atomic operations to improve performance.

Suggested change:
- mu sync.Mutex
- buckets map[bucketKey]*tokenBucket // (queueName, action) -> bucket
+ buckets sync.Map // map[bucketKey]*tokenBucket
References
  1. For frequently accessed fields that require monotonic updates, prefer atomic operations or granular locking over mutexes to improve performance on hot paths.

Comment on lines +154 to +157
A consumer making one ReceiveMessage call sees results from **only the partitions whose leader is the local node**. For a deployment where every partition has its own leader, a single client's ReceiveMessage hits one partition's worth of messages per call. This matches AWS HT-FIFO behaviour: clients are expected to spread their consumer pool across partitions, and an idle partition does not block a busy one.

For deployments that co-locate partitions on one Raft group, all partitions' leaders are the same node and a single ReceiveMessage scans all partitions in sequence. The scan is bounded by `MaxNumberOfMessages` and the existing per-partition page limit — no separate budget.


[Medium]

Restricting ReceiveMessage to only scan partitions led by the local node is a significant departure from the SQS model. In a multi-node deployment, this can lead to "false empty" responses if a client hits a node that is not the leader for partitions containing messages. This forces clients to implement cluster-aware round-robin logic, which leaks internal sharding details and breaks the SQS abstraction.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 341b0deb14


Comment on lines +146 to +148
b. If this node is not the leader for that partition, skip (the
consumer's next call will land on a different node and that
leader will serve its partition).

[P1] Proxy non-local partitions during ReceiveMessage

Step 2b explicitly skips partitions whose leader is not the local node, which means a consumer pinned to one endpoint will only ever drain that node’s leader-owned partitions while messages on other partitions can backlog indefinitely. This turns ReceiveMessage into node-local visibility rather than queue-wide visibility and breaks normal single-endpoint client deployments unless they add explicit endpoint round-robin behavior.


| `adapter/sqs.go` | After `authorizeSQSRequest`, call `bucketStore.charge(queueName, action, count)`. On reject, write the `Throttling` envelope and return. |
| `adapter/sqs_throttle_test.go` (new) | Unit tests for bucket math (edge cases: idle drift, burst, partial refill, batch over-charge, default-off). ~300 lines. |
| `adapter/sqs_throttle_integration_test.go` (new) | End-to-end: configure a queue with low limits, send N messages back-to-back, confirm the (N+1)th gets `Throttling` with `Retry-After`. ~150 lines. |
| `monitoring/registry.go` | Two new counter vectors: `sqs_throttled_requests_total{queue, action}` and `sqs_throttle_tokens_remaining{queue, action}`. |

[P2] Define tokens_remaining as a gauge metric

This section defines sqs_throttle_tokens_remaining as a counter, but token-bucket remaining capacity must go up and down over time. If implemented as a counter, the value can only increase and any dashboard based on “trending toward zero” becomes meaningless, masking real throttling pressure; this metric needs to be specified as a gauge.

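For reference, the distinction in prometheus/client_golang terms; a sketch using the metric names from the files-touched table, with registration omitted:

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	// Rejections only ever accumulate, so a counter is correct here.
	throttledRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "sqs_throttled_requests_total", Help: "Throttled SQS requests."},
		[]string{"queue", "action"},
	)
	// The token budget rises on refill and falls on charge: a gauge.
	throttleTokensRemaining = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "sqs_throttle_tokens_remaining", Help: "Current token-bucket level."},
		[]string{"queue", "action"},
	)
)
```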

Four issues raised by Codex + Gemini Code Assist on PR #664:

3.D split-queue FIFO doc:

1. Backward-compatibility risk in keyspace (Codex P1 + Gemini high).
   The original §3.1 said "every existing key gains a <partition>
   segment" but also "the schema is byte-identical for N=1" — those
   are contradictory; inserting bytes shifts every downstream offset
   and breaks readback of existing data on disk. Rewrote §3.1 to
   make the key shape conditional on whether the queue is
   partitioned: legacy / Standard queues use today's keys
   byte-identically, partitioned queues use a new prefix
   (!sqs|msg|data|p|<queue>|<partition>|...) with an explicit "p|"
   discriminator that cannot collide with the legacy prefix even
   when partition=0. Constructor sketch updated.

2. ReceiveMessage "false empty" on non-local-leader partitions
   (Codex P1 + Gemini medium). Original §4.2 had clients pinned to
   one node only seeing partitions whose leader is local — messages
   on other partitions backlog forever and clients hit empty replies
   that look like "queue is drained". Rewrote §4.2 so ReceiveMessage
   proxies the per-partition fanout to the right leader via the
   existing leader-proxy machinery (extended to take a partition
   argument). Added a partitionOrder rotation seeded by RequestId so
   successive calls don't bias toward partition 0. Co-located
   deployments still pay nothing; spread deployments pay one extra
   hop per non-local partition only until MaxNumberOfMessages are
   collected.

3.C per-queue throttling doc:

3. Single sync.Mutex on the bucket store would be a hot-path
   contention point (Gemini medium). Switched §3.1 to a sync.Map
   for the bucket lookup (read-mostly access pattern) plus a
   per-bucket sync.Mutex for charge / refill. Cross-queue traffic
   never serialises on the same lock. Documented the three-step
   charge operation explicitly so readers see the locking story.

4. sqs_throttle_tokens_remaining must be a gauge, not a counter
   (Codex P2). Token budgets go up and down; a counter would mask
   exactly the depletion operators need to see. Updated both §4.1
   (files-touched table) and §7 (operational dashboard) to call
   the metric a gauge. Counter is correct for the rejections total.

Docs-only PR; no code changes, no implementation gate.
bootjp (Owner, Author) commented Apr 26, 2026

Pushed addressing all four review findings (Codex + Gemini converged):

3.D split-queue FIFO doc:

  • Keyspace backward-compat (Codex P1 + Gemini High): §3.1 rewritten — legacy queues use today's keys byte-identically, partitioned queues use a new !sqs|msg|data|p|... prefix with an explicit p| discriminator that cannot collide with legacy keys.
  • ReceiveMessage false-empty (Codex P1 + Gemini Medium): §4.2 rewritten — proxies non-local-leader partition scans via the existing leader-proxy (extended to take a partition argument) so a consumer pinned to one endpoint still sees every partition's messages. Added partitionOrder rotation to avoid head-of-line bias.

3.C per-queue throttling doc:

  • Hot-path lock contention (Gemini Medium): §3.1 switched to sync.Map lookup + per-bucket sync.Mutex for charge/refill. Cross-queue traffic never serialises on the same lock. Three-step charge operation documented explicitly.
  • Metric type (Codex P2): sqs_throttle_tokens_remaining is now correctly a gauge (token budgets go up and down). sqs_throttled_requests_total stays a counter.
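A sketch of that three-step charge path, continuing the throttle sketch earlier in this thread (same `tokenBucket` and `bucketKey`); `queueMeta` and `newBucketFromMeta` are hypothetical stand-ins for the proposal's queue-meta plumbing:

```go
type queueMeta struct{ Capacity, RefillPerSec float64 }

func newBucketFromMeta(m queueMeta) *tokenBucket {
	return &tokenBucket{tokens: m.Capacity, capacity: m.Capacity,
		refillPerSec: m.RefillPerSec, lastRefill: time.Now()}
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
}

func (s *bucketStore) charge(key bucketKey, meta queueMeta, n float64) bool {
	// 1. Lock-free lookup on the read-mostly hot path.
	v, ok := s.buckets.Load(key)
	if !ok {
		// 2. One-shot insert race is benign: both racers build the
		//    bucket from the same committed queue meta.
		v, _ = s.buckets.LoadOrStore(key, newBucketFromMeta(meta))
	}
	// 3. Only this bucket's mutex serialises refill + take, so
	//    cross-queue traffic never contends on a shared lock.
	return v.(*tokenBucket).charge(n, time.Now())
}
```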

Docs-only changes; no implementation gate. +64/-22 lines across the two files.

@claude review

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 1s


Review

  • Read updated design documents
  • Verify fix: keyspace backward-compat (3.D)
  • Verify fix: ReceiveMessage false-empty (3.D)
  • Verify fix: hot-path lock contention (3.C)
  • Verify fix: metric type correction (3.C)
  • Overall assessment

All four issues from the Codex/Gemini review have been correctly addressed. New issues found below, one of them medium-severity.


All four previous issues: resolved ✓

3.D — Keyspace backward-compat (Gemini high + Codex P1)
The p| discriminator approach cleanly separates the new partitioned keyspace from legacy keys. Legacy paths stay byte-identical; partitioned queues get !sqs|msg|data|p|<queue>|<partition>|…. The reaper enumerating both prefixes is a good forward-compatibility detail.

3.D — ReceiveMessage false-empty (Gemini medium + Codex P1)
§4.2 now proxies non-local-leader partitions via the existing proxyToLeader machinery (extended with a partition argument) and the partitionOrder rotation seeded by RequestId is a nice head-of-line bias fix. The X-Elastickv-Receive-Partition: <k> header approach is clean and consistent with the existing proxy pattern.
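A RequestId-seeded rotation could be as simple as this sketch (purely illustrative; `hash/fnv` reused from the routing step, n assumed > 0):

```go
// partitionOrder returns all n partition indexes starting at an offset
// derived from the request ID, so scans from different requests do not
// all begin at partition 0.
func partitionOrder(requestID string, n uint32) []uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(requestID))
	start := h.Sum32() % n
	order := make([]uint32, 0, n)
	for i := uint32(0); i < n; i++ {
		order = append(order, (start+i)%n)
	}
	return order
}
```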

3.C — Hot-path lock contention (Gemini medium)
sync.Map for the bucket registry (lock-free reads) + per-bucket sync.Mutex for charge/refill is the right pattern. The three-step charge operation is well-documented. The race-on-insert via LoadOrStore is handled correctly (both racers agree on config).

3.C — Metric type (Codex P2)
sqs_throttle_tokens_remaining is correctly a gauge in both §4.1 (files-touched table) and §7 (operational section). Both sites updated.


New issues in the updated doc

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §8.1 — stale failure mode description

The fix to §4.2 eliminated the false-empty problem server-side (proxy fanout handles it), but §8.1 still says:

"Consumer never sees a partition's leader: a consumer that always lands on node A sees only the partitions whose leader is A … Mitigation: document that clients should round-robin endpoints."

This directly contradicts §4.2 — the "round-robin endpoints" mitigation is no longer correct or necessary for this failure mode, since the proxy now handles it transparently. An implementor reading §8.1 in isolation would build the wrong mental model and might even suppress the proxy logic thinking clients are expected to handle distribution themselves.

The failure mode to document now is the performance cost, not the correctness risk: proxy latency under spread deployments (one extra RTT per non-local partition), and the trade-off that a consumer asking for 1 message touches at most 1 remote leader while an "is queue empty?" poll costs N proxy round-trips.

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.1 — sqsMsgDataKey variadic signature is a footgun

The proposed constructor:

func sqsMsgDataKey(queueName string, gen uint64, messageID string, partitions ...uint32) []byte {

Using a variadic for the partition discriminator has two issues as implementation guidance:

  1. sqsMsgDataKey(q, gen, id, p0, p1) silently ignores p1 — a caller could accidentally pass the wrong count and the compiler won't catch it.
  2. Variadic calls allocate even for the zero-argument case in some Go versions, which matters on the legacy hot path.

The doc could instead name two distinct helpers (legacyMsgDataKey / partitionedMsgDataKey) with the thin dispatcher staying internal. The variadic form is fine as a unified constructor for now but the implementation PR should be aware of these edge cases.

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.1 — p| discriminator safety constraint is implicit

The doc correctly states the discriminator prevents collision, but doesn't state why it's safe: valid SQS queue names are alphanumeric + hyphen + underscore and cannot contain |, so no queue name can begin with p|, making the prefix unambiguous. One sentence in §3.1 would nail this closed explicitly — the implementation PR's name validator needs to enforce the | prohibition on queue names (which is standard SQS behaviour, but worth calling out as load-bearing here).
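Putting the two §3.1 suggestions together, a hypothetical sketch of the named constructors and the discriminator. The segment layout is approximated from the quoted key format; `appendU64` is the fixed-width big-endian helper the doc names, and the no-collision claim is the doc's (queue names reject `|`):

```go
import "encoding/binary"

func appendU64(b []byte, v uint64) []byte {
	return binary.BigEndian.AppendUint64(b, v) // fixed 8 bytes BE
}

// Legacy layout, byte-identical to today's keys:
// !sqs|msg|data|<queue>|<gen>|<msgID>
func legacyMsgDataKey(queue string, gen uint64, msgID string) []byte {
	k := append([]byte("!sqs|msg|data|"), queue...)
	k = appendU64(append(k, '|'), gen)
	return append(append(k, '|'), msgID...)
}

// Partitioned layout with the "p|" discriminator:
// !sqs|msg|data|p|<queue>|<partition>|<gen>|<msgID>
func partitionedMsgDataKey(queue string, partition uint32, gen uint64, msgID string) []byte {
	k := append([]byte("!sqs|msg|data|p|"), queue...)
	k = appendU64(append(k, '|'), uint64(partition))
	k = appendU64(append(k, '|'), gen)
	return append(append(k, '|'), msgID...)
}
```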

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §8.5 — upgrade gate mechanism underspecified

"The catalog rejects CreateQueue with PartitionCount > 1 until every node in the cluster reports the new feature flag" is the gate-of-no-return safety mechanism, but the doc doesn't describe how a node reports its feature flag or how the catalog queries all live nodes. The cross-reference to "§3.3.2 admin-forwarding upgrade gate from the admin dashboard design" is opaque without a link. Since this is the safety rail against accidentally creating partitioned queues in a mixed-version cluster, the mechanism deserves a sentence or two (e.g. "each node's health endpoint advertises a capabilities set; the catalog reads all registered nodes from the distribution layer's node registry and gates on all of them reporting HTFIFO").

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.4 — Retry-After: 1 lies for slow-refill queues

The doc justifies the constant Retry-After: 1 as "at the configured refill rate, one second is enough for at least one fresh token" — but this is only true when refillRate >= 1 token/s. A queue configured with SendRefillPerSecond = 0.1 needs 10 seconds for the next token; returning Retry-After: 1 causes 9 unnecessary 400 retries before the client finally gets through. The §3.4 table comment should note this edge case and suggest max(1, ceil(1/refillRate)) seconds, or cap the minimum SendRefillPerSecond at 1 to keep the constant valid.


Summary

The four original issues are cleanly resolved. The §8.1 stale failure mode is the only finding I'd call blocking — it directly contradicts §4.2 and gives implementors the wrong mental model. The remaining three are low-severity clarifications that would improve the implementation PRs' accuracy.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fa02faa35



## 8. Failure Modes and Edge Cases

1. **Consumer never sees a partition's leader**: a consumer that always lands on node A sees only the partitions whose leader is A. If partitions are spread evenly, this is `1/N` of the queue. Mitigation: AWS SDK consumers naturally distribute across endpoints; for elastickv, document that clients should round-robin endpoints when consuming a partitioned FIFO.

[P1] Remove node-local receive behavior from failure guidance

Fresh evidence: §8.1 explicitly says an endpoint-pinned consumer only sees 1/N of partitions, which contradicts §4.2’s required proxy fanout behavior. Keeping this guidance in the proposal makes the implementation ambiguous, and following it would regress ReceiveMessage to node-local visibility where messages on remote-leader partitions can backlog indefinitely for single-endpoint consumers.


Comment on lines +109 to +111
if meta.PartitionCount <= 1 {
    return 0
}

[P2] Align partition routing with perQueue throughput mode

Section 3.2 says FifoThroughputLimit=perQueue should effectively collapse routing to one partition, but the partitionFor pseudocode only checks PartitionCount and will hash across all partitions whenever PartitionCount > 1. If this is implemented as written, perQueue mode cannot behave as documented and will still distribute groups across partitions.


| JSON | HTTP 400, body `{"__type":"Throttling","message":"Rate exceeded for queue '<name>' action '<action>'"}`, header `x-amzn-ErrorType: Throttling`, header `Retry-After: 1` |
| Query | HTTP 400, body `<ErrorResponse><Error><Type>Sender</Type><Code>Throttling</Code><Message>...</Message></Error><RequestId>...</RequestId></ErrorResponse>`, headers as above |

`Retry-After: 1` is the conservative default — at the configured refill rate, one second is enough for at least one fresh token. A future iteration could compute the precise wait from `(1 - currentTokens) / refillRate` but the constant is enough for SDK backoff logic.
The reason will be displayed to describe this comment to others. Learn more.

[P2] Compute Retry-After from refill rate

This claim assumes every bucket refills at least one token per second, but the proposed validator allows any positive refill rate. With configurations like RefillPerSecond=0.1, a Retry-After: 1 response guarantees another immediate throttle, creating avoidable retry pressure and misleading backoff behavior; the wait should be derived from the bucket’s refill rate (or the config should enforce a minimum of 1).


Five issues raised by Claude on PR #664 (one medium, four low — all
docs-only edits to the proposal docs):

3.D split-queue FIFO doc:

1. (Medium) §8.1 stale failure-mode description contradicted §4.2.
   Original §8.1 said "Consumer never sees a partition's leader →
   mitigation: round-robin endpoints" — but §4.2 was rewritten earlier
   in this PR to proxy non-local partitions server-side, eliminating
   the false-empty risk entirely. Implementors reading §8.1 in
   isolation would have built the wrong mental model and possibly
   suppressed the proxy logic. §8.1 now describes the actual remaining
   failure mode — proxy round-trip latency under spread deployments —
   and explains that single-partition / co-located deployments pay
   nothing.

2. (Low) §3.1 variadic sqsMsgDataKey was a footgun. The variadic
   form would silently accept extra args without compiler diagnosis.
   Replaced with two named constructors (legacyMsgDataKey /
   partitionedMsgDataKey) and an explicit dispatch at the call site
   based on meta.PartitionCount > 1.

3. (Low) §3.1 p| discriminator safety constraint was implicit.
   Added an explicit paragraph explaining why the prefix cannot
   collide with the legacy keyspace: queue names admit only
   [A-Za-z0-9_-] (+ optional .fifo suffix), the validator must
   continue to reject |, and base32 segments never start with the
   literal ASCII p. Future relaxations of the name validator have
   to revisit this prefix scheme.

4. (Low) §8.5 upgrade-gate mechanism was underspecified. Added the
   concrete protocol: each node's /sqs_health endpoint advertises
   capabilities: ["htfifo"] once this PR's code is in the binary;
   the catalog reads the live node set from the distribution
   layer's node registry, polls /sqs_health on each, and gates
   PartitionCount > 1 on every node reporting htfifo. Fail-closed
   on timeouts. References PR #644 as the prior art for the same
   pattern.

3.C per-queue throttling doc:

5. (Low) §3.4 Retry-After: 1 lied for slow-refill queues. A queue
   with SendRefillPerSecond = 0.1 needs 10 s for the next token;
   returning 1 burned 9 unnecessary client retries before the
   request would succeed. Replaced with a per-queue computation:
   ceil((1 - currentTokens) / refillRate), floored at 1 s to
   match Retry-After's integer-second granularity. Fast-refill
   queues (>= 1 RPS) preserve the existing Retry-After: 1
   behaviour.

Docs-only PR; no code changes, no implementation gate.
bootjp (Owner, Author) commented Apr 26, 2026

Pushed addressing all five findings from your follow-up review (1 medium, 4 low):

3.D split-queue FIFO doc:

  • (medium) §8.1 stale failure-mode description rewritten — was contradicting §4.2's proxy fix. Now describes the actual remaining failure (proxy RTT cost under spread deployment) instead of the eliminated false-empty risk.
  • (low) §3.1 variadic sqsMsgDataKey replaced with two named constructors (legacyMsgDataKey / partitionedMsgDataKey) + explicit call-site dispatch — no silent argument loss.
  • (low) §3.1 p| discriminator safety made explicit: queue-name validator's | prohibition + base32-encoded segments never starting with literal p together guarantee no collision.
  • (low) §8.5 upgrade-gate mechanism now concrete: /sqs_health advertises capabilities: ["htfifo"]; catalog polls every registered node, fail-closed on timeout. References PR #644 (admin: AdminForward follower-side client + handler integration (P1, partial)) as prior art.

3.C per-queue throttling doc:

  • (low) §3.4 Retry-After now computed per-queue from ceil((1 - currentTokens) / refillRate) (floored at 1 s for HTTP integer-second granularity) instead of the constant 1. Slow-refill queues no longer cause busy-loops.

Docs-only changes; no implementation gate. +24/-13 lines net across the two files.

@claude review

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 5m 50s


Review

  • Read updated design documents
  • Verify fix: §8.1 stale failure-mode description (3.D) — medium
  • Verify fix: §3.1 named constructors replacing variadic (3.D) — low
  • Verify fix: §3.1 p| discriminator safety explicit (3.D) — low
  • Verify fix: §8.5 upgrade-gate mechanism concrete (3.D) — low
  • Verify fix: §3.4 Retry-After computed from refill rate (3.C) — low
  • Check for new issues
  • Overall assessment

All five issues from the last Claude review are correctly resolved. One pre-existing Codex P2 finding from the second-pass review was not in the trigger's scope but remains open; surfacing it here. Two low-severity findings are new.


All five previous issues: resolved ✓

3.D — §8.1 stale failure-mode description (medium)
The failure mode now reads "Proxy RTT under spread deployment" with the explicit note that §4.2 proxies them server-side so false-empty is gone. The N-proxy-round-trip pathological case (asking "is anything here?" against an empty queue) and the long-poll mitigation are both called out. No contradiction with §4.2.

3.D — §3.1 variadic sqsMsgDataKey (low)
Two distinct named constructors: legacyMsgDataKey(queueName, gen, messageID) and partitionedMsgDataKey(queueName, partition, gen, messageID). Call-site dispatch on meta.PartitionCount > 1. No variadic, no silent argument loss.

3.D — §3.1 p| discriminator safety (low)
The argument is now explicit: (1) queue names are restricted to [A-Za-z0-9_-] so | can never appear; (2) the generation field is base32-encoded so it uses only uppercase A-Z and digits 2–7, meaning base32 cannot produce |, which prevents any legacy-key prefix scan from matching a new partitioned-key scan prefix; (3) a note that the implementation PR's name validator must continue to reject |. The argument is present and ultimately correct.

3.D — §8.5 upgrade-gate mechanism (low)
Concrete now: /sqs_health advertises capabilities: ["htfifo"]; the catalog's CreateQueue handler polls every registered node from the distribution layer's node registry (same registry as proxyToLeader); fail-closed on timeout; explicit prior art reference to PR #644's AdminForward gate.
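A fail-closed capability poll in that shape could look like the following sketch; the node registry, health client, and function names are hypothetical:

```go
import (
	"context"
	"slices"
)

// clusterSupportsHTFIFO gates CreateQueue with PartitionCount > 1:
// every registered node must advertise "htfifo", and any error or
// timeout counts as "no" (fail closed).
func clusterSupportsHTFIFO(ctx context.Context, nodes []string,
	capabilities func(ctx context.Context, node string) ([]string, error)) bool {
	for _, node := range nodes {
		caps, err := capabilities(ctx, node)
		if err != nil {
			return false // fail closed on timeout or unreachable node
		}
		if !slices.Contains(caps, "htfifo") {
			return false // mixed-version cluster: reject partitioned CreateQueue
		}
	}
	return true
}
```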

3.C — §3.4 Retry-After formula (low)
Formula is ceil((1.0 - currentTokens) / refillRate) with max(1, ...) floor, inline example: refillRate=0.1 → Retry-After: 10. The "validator keeps refillRate > 0" divide-by-zero note is a good touch.


New findings

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.3 — FifoThroughputLimit=perQueue not implemented in partitionFor

This was flagged as Codex P2 in the second-pass review but was not in the trigger's scope (trigger addressed Claude's five findings only). It remains open in the doc.

§3.2 documents FifoThroughputLimit:

"the alternative 'perQueue' reduces the partition assignment to a single partition (effectively N=1)"

But §3.3's partitionFor pseudocode only gates on meta.PartitionCount <= 1:

if meta.PartitionCount <= 1 {
    return 0
}
// falls through to hash for any PartitionCount > 1

A queue created with PartitionCount=8 + FifoThroughputLimit=perQueue would hash across all 8 partitions, directly contradicting §3.2's documented semantics. The fix is one line:

// Fixed guard:
if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
    return 0
}

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §4.1 — pseudocode still uses the old sqsMsgDataKey name

§3.1 renamed the constructor, but §4.1 step 5 still references the old form:

5. Build the OCC OperationGroup with partition-aware keys:
     dataKey  = sqsMsgDataKey(queue, partitionIndex, gen, msgID)

An implementor reading §4.1 in isolation gets conflicting guidance vs. §3.1. The pseudocode should dispatch via the named constructors introduced in §3.1.

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.4 — Retry-After formula hardcodes 1.0 regardless of batch size

The fix correctly computes Retry-After from refillRate. But the formula numerator is always 1.0:

secondsToNextToken := math.Ceil((1.0 - currentTokens) / refillRate)

For a SendMessageBatch of 10 entries (charge = 10) against a bucket with 0 tokens and refillRate = 1.0, this gives Retry-After: 1 — but 10 seconds are needed for the batch to succeed. The client burns 9 retries before accumulating enough tokens. The formula should use the requested count instead of the constant 1.0:

secondsToNextToken := math.Ceil((float64(requestedCount) - currentTokens) / refillRate)

This matters most for large batches with slow-to-medium refill rates; for fast refill or single operations the constant-1 floor masks the gap anyway.


Summary

All five issues from the last review are cleanly fixed. The FifoThroughputLimit=perQueue routing gap (Codex P2 from the second-pass review, not previously in Claude's scope) is the only finding here I'd call blocking before the implementation PRs — it means the routing logic in §3.3 contradicts the semantics documented in §3.2. The other two are low-severity doc cleanups.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f289d6c291


Comment on lines +109 to +110
2. On miss, build the bucket from queue meta and `LoadOrStore` it (one-shot insert race tolerated — both racers will agree on the same configuration).
3. Acquire the bucket's own `mu`, refill based on elapsed time, take or reject the requested tokens, release `mu`.

[P1] Invalidate cached buckets on throttle config updates

The design caches each (queue, action) bucket on first use via LoadOrStore, but it never defines a refresh/invalidation path when SetQueueAttributes changes throttle values. For active queues this means updated limits can stay stale until idle eviction or failover, so operators may keep over-throttling or under-throttling traffic long after they reconfigure the queue.


Comment on lines +181 to +182
secondsToNextToken := math.Ceil((1.0 - currentTokens) / refillRate)
retryAfter := max(1, int(secondsToNextToken)) // never less than 1

[P2] Base Retry-After on requested token cost

The proposed Retry-After calculation always targets the next single token, but this proposal charges batch requests by entry count. When a large SendMessageBatch is throttled, clients can be told to retry in 1 second even though many more seconds are needed to refill enough tokens for that same batch, which guarantees repeated throttles and adds avoidable retry pressure.


Comment on lines +202 to +205
`kv/shard_router.go` today routes by queue name. With partitions, the routing key becomes `(queueName, partitionIndex)`. The existing `--raftSqsMap` flag accepts entries keyed by queue name; the syntax extends to accept a partition suffix:

```
--raftSqsMap "orders.fifo:8=group-7,group-8,group-9,group-10,group-11,group-12,group-13,group-14"
```

[P1] Split partition routing from raftSqsMap endpoint mapping

This section assigns queue/partition placement to --raftSqsMap, but that flag is currently defined as raftAddr=sqsAddr endpoint mapping for leader proxying. Reusing it for partition-to-group routing makes the config surface ambiguous and can break proxy address resolution, especially in the receive fanout flow that depends on correct leader endpoint lookup.


Three findings (1 medium, 2 low):

3.D split-queue FIFO doc:

1. (Medium) §3.3 partitionFor missed FifoThroughputLimit=perQueue.
   §3.2 documents that perQueue "reduces routing to a single partition
   regardless of PartitionCount" but the §3.3 pseudocode only checked
   PartitionCount <= 1. A queue created with PartitionCount=8 +
   FifoThroughputLimit=perQueue would hash across all 8 partitions,
   directly contradicting the documented semantics. Updated the early-
   return guard to:
       if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
           return 0
       }
   Codex P2 caught this in the second-pass review but it was outside
   the scope of the previous fix iteration.

2. (Low) §4.1 step 5 still referenced the old variadic
   sqsMsgDataKey(queue, partitionIndex, gen, msgID). §3.1 had already
   moved to two named constructors (legacyMsgDataKey /
   partitionedMsgDataKey) but §4.1's pseudocode still used the
   variadic form, contradicting §3.1. Replaced with explicit if/else
   dispatch through the named constructors so an implementor reading
   §4.1 in isolation gets the same naming as §3.1.

3.C per-queue throttling doc:

3. (Low) §3.4 Retry-After formula's numerator was hardcoded to 1.0
   regardless of charge size. SendMessageBatch of 10 entries against
   a 0-token bucket with refillRate = 1.0 would have returned
   Retry-After: 1 (insufficient) instead of 10. Updated formula to:
       needed := float64(requestedCount) - currentTokens
       secondsToNextRefill := math.Ceil(needed / refillRate)
       retryAfter := max(1, int(secondsToNextRefill))
   requestedCount is the same value §3.3's charge step uses
   (1 for single, len(Entries) for batch).

Docs-only PR; no code changes, no implementation gate.
bootjp (Owner, Author) commented Apr 26, 2026

Pushed addressing all three findings from your third review (1 medium, 2 low):

3.D split-queue FIFO doc:

  • (medium) §3.3 partitionFor now also short-circuits when FifoThroughputLimit == "perQueue" — fixes the contradiction with §3.2's documented semantics. Codex P2 from the second-pass review, now resolved.
  • (low) §4.1 step 5 pseudocode now dispatches through the §3.1 named constructors (legacyMsgDataKey / partitionedMsgDataKey) instead of the old variadic form. No more conflicting guidance between §3.1 and §4.1.

3.C per-queue throttling doc:

  • (low) §3.4 Retry-After formula now uses requestedCount instead of the hardcoded 1.0 numerator. SendMessageBatch of 10 against refillRate=1.0 correctly returns Retry-After: 10 instead of 1.

Docs-only changes; +18/-7 lines net.

@claude review

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 9s


Review

  • Read updated design documents
  • Verify fix: §3.3 partitionFor FifoThroughputLimit check (medium)
  • Verify fix: §4.1 step 5 named constructor dispatch (low)
  • Verify fix: §3.4 Retry-After uses requestedCount (low)
  • Check for new issues (including open Codex P1 findings)
  • Overall assessment

All three issues from the trigger are correctly resolved. Two Codex P1 findings from the previous review remain open and are surfaced below (they were not in scope of this push). Two new low-severity issues found.


All three previous issues: resolved ✓

3.D — §3.3 FifoThroughputLimit=perQueue routing (medium)
The guard is now:

if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
    return 0
}

The comment explicitly documents that perQueue collapses all groups to partition 0 regardless of PartitionCount, and calls out the Codex P2 catch. The contradiction with §3.2 is resolved.

3.D — §4.1 step 5 named constructor dispatch (low)
Step 5 now uses the explicit if meta.PartitionCount > 1 dispatch branching to partitionedMsgDataKey / legacyMsgDataKey (and analogues for visKey / groupKey). No more reference to the old variadic sqsMsgDataKey. §3.1 and §4.1 are now consistent.

3.C — §3.4 Retry-After uses requestedCount (low)
Formula is now:

needed              := float64(requestedCount) - currentTokens
secondsToNextRefill := math.Ceil(needed / refillRate)
retryAfter          := max(1, int(secondsToNextRefill))

Both cases are correctly handled: a SendMessageBatch of 10 against refillRate=1.0 gives Retry-After: 10; a single SendMessage against refillRate=0.1 also gives Retry-After: 10. The floor-1 and divide-by-zero notes are preserved.
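A runnable check of that final rule, covering the two cases above plus the floor (illustrative only; `retryAfter` is not the doc's code):

```go
package main

import (
	"fmt"
	"math"
)

func retryAfter(requestedCount int, currentTokens, refillRate float64) int {
	needed := float64(requestedCount) - currentTokens
	secs := int(math.Ceil(needed / refillRate)) // validator keeps refillRate > 0
	return max(1, secs)                         // integer seconds, never < 1
}

func main() {
	fmt.Println(retryAfter(10, 0, 1.0)) // batch of 10 vs refillRate=1.0 -> 10
	fmt.Println(retryAfter(1, 0, 0.1))  // single send, slow refill 0.1/s -> 10
	fmt.Println(retryAfter(1, 0.5, 2))  // fast refill -> floored at 1
}
```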


Open Codex P1 findings from the previous round (not addressed in this push)

These were raised against f289d6c291 and are still open in the current doc.

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — bucket cache not invalidated on SetQueueAttributes

§3.1 describes LoadOrStore on first use: "build the bucket from queue meta and LoadOrStore it." Once a bucket is stored, the doc describes no path for invalidating or refreshing it when SetQueueAttributes changes the throttle config. A queue created with SendRefillPerSecond = 10, then lowered to 1 via SetQueueAttributes, continues to allow 10 RPS until the idle-eviction sweep removes the old bucket entry (default 1h window). An operator who lowers limits to stop a noisy tenant would see the effect delayed by up to an hour.

The fix is a sentence in §3.1: SetQueueAttributes must call buckets.Delete(key) for all action-keys of the updated queue after committing the meta change to Raft, so the next request re-initialises the bucket from the new config.

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §5 — --raftSqsMap reuse conflicts with endpoint-address semantics

§5 proposes extending --raftSqsMap with a new queueName:N=group1,...,groupN syntax. The existing flag maps raftAddr=sqsAddr (endpoint routing for proxyToLeader). Reusing the same flag for a completely different concern — partition-to-Raft-group assignment — makes the config surface ambiguous: a single flag entry must now be parsed as either an address mapping or a partition assignment, and the proxyToLeader fanout in §4.2 depends on correctly resolving the per-partition leader's SQS endpoint from this same flag. Mixing the two intents in one parsing path risks a subtle bug where a partitioned-queue config entry is accidentally interpreted as a raftAddr=sqsAddr pair (or vice versa), silently producing the wrong proxy target.

A new flag (e.g. --raftSqsFifoMap) or a clearly separated config section avoids the parsing collision and lets the two concerns evolve independently.


New findings

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §9 — no test case for FifoThroughputLimit=perQueue

§9's unit test list covers PartitionCount <= 1 and empty MessageGroupId fast-paths but does not include a test that verifies FifoThroughputLimit=perQueue routes to partition 0 even when PartitionCount = 8. The perQueue short-circuit is the fix this push added to §3.3 — without an explicit test it could trivially regress. One test case: partitionFor({PartitionCount: 8, FifoThroughputLimit: "perQueue"}, "any-group-id") == 0.

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §4.3 — "sharded by MessageGroupId" is the wrong key

§4.3 describes the forward-compatibility path for Phase 3.D:

"each partition gets its own bucket, sharded by MessageGroupId"

The bucket key should be (queueName, partitionID), not (queueName, MessageGroupId). MessageGroupId is the input to partitionFor — the partition index it maps to is what identifies which leader owns the bucket. Using MessageGroupId as the key would create one bucket per group value (unbounded, attacker-amplifiable map size); using partitionID creates exactly PartitionCount buckets per queue (bounded by design). The §4.3 sentence should read "each partition gets its own bucket, keyed by (queueName, partitionID)."


Summary

All three fixes from this push are cleanly applied. The two Codex P1 findings from the previous review — bucket cache invalidation on SetQueueAttributes (3.C) and --raftSqsMap parsing ambiguity (3.D) — remain the only items I'd consider blocking before implementation PRs. Both are design-level gaps that implementation code will have to resolve anyway; better to get the doc right first. The two new findings are low-severity doc corrections.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d424d18d7c


Comment on lines +93 to +94
queue string
action string // "Send" | "Receive" | "*"

[P1] Key throttle buckets by queue generation

The bucket key is defined as only (queue, action), so a deleted/recreated queue with the same name will reuse the old in-memory bucket state until idle eviction, inheriting stale token balance and potentially stale limits. In practice this can throttle a brand-new queue immediately (or let it burst unexpectedly) after DeleteQueue/recreate, which breaks queue incarnation isolation; include generation (or another meta version) in the key or define explicit invalidation on queue lifecycle changes.

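The incarnation-safe key the finding suggests is a one-field change; a sketch, assuming queue meta already carries a generation that bumps on recreate:

```go
type bucketKey struct {
	queue      string
	generation uint64 // isolates bucket state across DeleteQueue/recreate
	action     string // "Send" | "Receive" | "*"
}
```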

Comment on lines +185 to +189
3. For each partitionIndex in partitionOrder, until MaxNumberOfMessages
   are collected or every partition has been tried:
   a. Resolve the leader for (queue, partitionIndex).
   b. If this node is the leader: scan locally, deliver candidates.
   c. Otherwise: forward the request to the leader-of-partition via

[P1] Bound per-partition proxy calls by shared wait budget

This flow fans out ReceiveMessage across partitions by forwarding requests partition-by-partition, but it does not require reducing WaitTimeSeconds across hops. If each forwarded call long-polls independently, an empty queue can take roughly PartitionCount × WaitTimeSeconds, which can exceed SQS-style wait expectations and cause client/LB timeouts; the design should require a single request-level deadline and pass only remaining budget to each partition probe.

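One way to express the shared budget this finding asks for; a sketch with hypothetical types (`Message`, the per-partition `probe`), not the proposal's actual §4.2 flow:

```go
import (
	"context"
	"time"
)

type Message struct{ ID, Body string }

// receiveWithBudget fans out across partitions under one request-level
// deadline: each probe gets only the remaining budget, so an empty
// queue costs at most WaitTimeSeconds in total, not
// PartitionCount x WaitTimeSeconds.
func receiveWithBudget(ctx context.Context, wait time.Duration, partitions []uint32,
	maxMessages int,
	probe func(ctx context.Context, partition uint32, remaining time.Duration) ([]Message, error),
) ([]Message, error) {
	deadline := time.Now().Add(wait)
	ctx, cancel := context.WithDeadline(ctx, deadline)
	defer cancel()

	var out []Message
	for _, p := range partitions {
		remaining := time.Until(deadline)
		if remaining <= 0 || len(out) >= maxMessages {
			break
		}
		msgs, err := probe(ctx, p, remaining)
		if err != nil {
			return out, err
		}
		out = append(out, msgs...)
	}
	return out, nil
}
```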

Four findings (2 medium, 2 low):

3.C per-queue throttling doc:

1. (Medium) §3.1 missed bucket cache invalidation on SetQueueAttributes.
   The doc described LoadOrStore on first use but no path for refreshing
   the in-memory bucket when SetQueueAttributes changes the throttle
   config. An operator lowering a limit to stop a noisy tenant would
   wait up to the idle-eviction window (1h default) before the change
   took effect. Added a paragraph: SetQueueAttributes calls
   buckets.Delete(key) for every action key after the Raft commit so
   the next request rebuilds from the new meta. The race with
   LoadOrStore is benign (the rebuilt bucket starts at full capacity,
   same semantics as the failover case).

2. (Low) §4.3 forward-compat sentence used the wrong key for the
   per-partition bucket. "sharded by MessageGroupId" was wrong —
   MessageGroupId is the input to partitionFor, not the bucket key.
   Using it directly would create one bucket per unique group value
   (unbounded, attacker-amplifiable). The bucket key is
   (queueName, partitionID) which is bounded by PartitionCount.
   Replaced the sentence with the correct keying and added the
   bucketKey-grows-a-partition-field note.

3.D split-queue FIFO doc:

3. (Medium) §5 reused --raftSqsMap for partition-to-Raft-group
   assignment. That flag already maps raftAddr=sqsAddr for
   proxyToLeader's endpoint resolution; overloading the same parser
   with partition assignments creates a parsing ambiguity that could
   silently produce the wrong proxy target in §4.2's fanout. Replaced
   with a new dedicated --sqsFifoPartitionMap flag. The existing
   --raftSqsMap is unchanged. Updated §11's rollout table accordingly.

4. (Low) §9 unit-test list was missing a case for
   FifoThroughputLimit=perQueue routing. The §3.3 short-circuit
   guard (added in the previous fix iteration) would silently
   regress without a test. Added: "FifoThroughputLimit=perQueue with
   PartitionCount=8 routes every group ID to partition 0."

Docs-only PR; no code changes, no implementation gate.
bootjp (Owner, Author) commented Apr 26, 2026

Pushed addressing all four findings from your fourth review (2 medium, 2 low):

3.C per-queue throttling doc:

  • (medium) §3.1 now requires SetQueueAttributes to call buckets.Delete(key) after the Raft commit so a config change takes effect on the next request, not after the 1h idle-eviction window. Race with LoadOrStore is benign (rebuilt bucket starts at full capacity, same as failover).
  • (low) §4.3 forward-compat fixed: bucket key for split-FIFO is (queueName, partitionID), not MessageGroupId — the latter would create unbounded buckets per group value.

3.D split-queue FIFO doc:

  • (medium) §5 now uses a new dedicated --sqsFifoPartitionMap flag for partition-to-Raft-group assignment instead of overloading --raftSqsMap (which keeps doing endpoint mapping for proxyToLeader). Eliminates the parsing ambiguity. §11 rollout table updated.
  • (low) §9 unit test list now includes FifoThroughputLimit=perQueue with PartitionCount=8 routes every group to partition 0 to lock down the §3.3 short-circuit guard against regression.

Docs-only changes; +9/-6 lines net.

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 26s


Review

  • Read updated design documents
  • Verify fix: §3.1 SetQueueAttributes bucket invalidation (3.C) — medium
  • Verify fix: §4.3 bucket key (queueName, partitionID) vs MessageGroupId (3.C) — low
  • Verify fix: §5 --sqsFifoPartitionMap dedicated flag (3.D) — medium
  • Verify fix: §9 FifoThroughputLimit=perQueue test case (3.D) — low
  • Check for remaining open Codex P1 items from prior rounds
  • Check for new issues
  • Overall assessment

All four issues from the trigger are correctly resolved. Two Codex P1 findings that first surfaced in my fourth-round review — and were noted there as "not in scope of this push" — remain open in the current doc. One new low-severity finding.


All four trigger fixes: resolved ✓

3.C — §3.1 SetQueueAttributes bucket cache invalidation (medium)
The paragraph is explicit and accurate: after the Raft commit, the handler calls buckets.Delete(key) for all three action-keys (Send, Recv, Default). The race with LoadOrStore is correctly noted as benign (fresh bucket starts at full capacity, same as failover). The 1h idle-eviction context makes the urgency of the invalidation clear.

3.C — §4.3 bucket key (queueName, partitionID) (low)
The corrected key is explicit, the MessageGroupId-as-key footgun is called out with the reason (unbounded attacker-amplifiable map), and the bucketKey struct is noted to gain a partition uint32 field when Phase 3.D lands. Consistent with §3.1's struct definition.

3.D — §5 --sqsFifoPartitionMap dedicated flag (medium)
The new flag is introduced with clear syntax (orders.fifo:8=group-7,...group-14), the separation from --raftSqsMap's endpoint-mapping role is stated explicitly and the reason (parsing ambiguity → wrong proxy target) is documented. The rollout table (§11, PR 4) also references the new flag by name.

3.D — §9 FifoThroughputLimit=perQueue test case (low)
The bullet is present, includes the concrete call form (FifoThroughputLimit = "perQueue" with PartitionCount = 8 → partition 0 for every group ID), and the regression rationale is spelled out ("the perQueue branch is a one-line guard that could easily be dropped during a refactor"). Good.


Still-open Codex P1 findings from d424d18d7c (surfaced in my last review, not addressed in this push)

These were called out as "not in scope of this push" in my fourth-round review summary but were not deferred to a follow-up issue — they are design-level gaps the implementation PRs will have to resolve.

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — DeleteQueue does not invalidate the bucket cache

§3.1 now correctly invalidates buckets on SetQueueAttributes, but says nothing about DeleteQueue. A queue deleted and recreated with the same name inherits the old in-memory bucket's token balance and — if the new queue was created without throttle config — the old bucket's capacity/refillRate as well, until the idle-eviction sweep (1h default). An operator who uses DeleteQueue+CreateQueue to reset a noisy queue's state gets a surprise: the throttle limits from the deleted queue follow the new one for up to an hour.

The fix is one sentence: "DeleteQueue calls buckets.Delete for all action-keys of the queue (same as SetQueueAttributes), so a recreated same-name queue starts with a clean bucket." Alternatively, include generation in bucketKey so old and new incarnations never share a bucket entry. Fix this →

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §4.2 — WaitTimeSeconds not bounded across the partition fanout

The fanout iterates partitions sequentially until MaxNumberOfMessages are collected. If each proxied sub-call independently long-polls for WaitTimeSeconds, an empty partitioned queue with WaitTimeSeconds = 20 and PartitionCount = 8 holds the connection for up to 160 s before returning empty — well past any reasonable client or load-balancer timeout. The doc notes "FIFO consumers are expected to use long-poll (WaitTimeSeconds)" as a mitigant for the N-round-trip cost, which inadvertently encourages the very configuration that creates the problem.

§4.2 should specify how the wait budget is managed: either (a) a single top-level deadline is threaded through as remaining_wait = WaitTimeSeconds - elapsed_so_far and passed to each sub-call, so the total wait is bounded by WaitTimeSeconds; or (b) partitions are probed in parallel and the first non-empty result short-circuits. Option (a) matches how AWS's internal partition fanout works; option (b) is simpler to implement but changes the per-partition polling semantics. Either choice should be explicit. Fix this →
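To make option (a) concrete, here is a minimal Go sketch of the shared-deadline fanout; receiveFromPartition is a hypothetical stand-in for the proxied per-partition long-poll, and nothing below is the design doc's actual pseudocode:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type message struct{ body string }

// receiveFromPartition stands in for one proxied long-poll against a
// single partition's Raft group; an empty partition consumes up to
// wait before returning nothing.
func receiveFromPartition(ctx context.Context, queue string, partition, want int, wait time.Duration) ([]message, error) {
	time.Sleep(wait)
	return nil, nil
}

// receiveFanout probes partitions in order but threads one shared
// deadline through every hop, so total wall-clock wait is bounded by
// the client's WaitTimeSeconds regardless of PartitionCount.
func receiveFanout(ctx context.Context, queue string, partitions []int, want int, wait time.Duration) ([]message, error) {
	deadline := time.Now().Add(wait)
	var out []message
	for _, p := range partitions {
		remaining := time.Until(deadline)
		if remaining < 0 {
			remaining = 0 // budget spent: later probes degrade to short polls
		}
		msgs, err := receiveFromPartition(ctx, queue, p, want-len(out), remaining)
		if err != nil {
			return nil, err
		}
		out = append(out, msgs...)
		if len(out) >= want {
			break
		}
	}
	return out, nil
}

func main() {
	start := time.Now()
	_, _ = receiveFanout(context.Background(), "orders.fifo", []int{0, 1, 2, 3, 4, 5, 6, 7}, 10, 2*time.Second)
	fmt.Println("empty queue, 8 partitions, total wait:", time.Since(start).Round(time.Second)) // ~2s, not 16s
}
```

With this shape, the first empty probe burns most of the budget and every later probe degrades toward a short poll, so the client's total wait never exceeds its requested WaitTimeSeconds.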


New finding

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.4 — reference to --raftSqsMap for partition routing contradicts §5

§3.4 states: "The existing kv/shard_router.go plus the multi-shard configuration flags (--raftSqsMap) already support this — each (queueName, partition) pair becomes its own routing key."

--raftSqsMap maps raftAddr=sqsAddr (endpoint pairs for proxyToLeader); it does not map (queueName, partition) → Raft group. §5 is explicit that partition-to-Raft-group assignment requires the new --sqsFifoPartitionMap flag and that --raftSqsMap is unchanged by this design. An implementor reading §3.4 would conclude partition routing is free with the existing flag, reach §5, and be confused about what needs to be added. The §3.4 sentence should say that while the router infrastructure (kv/shard_router.go) supports multi-group routing today, the partition-to-group assignment configuration requires the new --sqsFifoPartitionMap flag described in §5. Fix this →


Summary

All four trigger fixes are cleanly applied. The two Codex P1 items that have been open since d424d18d7cDeleteQueue bucket lifecycle and WaitTimeSeconds budget management across the fanout — are the only findings I'd call blocking before implementation PRs start. Both are design decisions the implementation code has to embody either way; resolving them in the doc now prevents the PR author from having to reverse a committed approach. The §3.4/§5 inconsistency is low-severity cleanup.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6bd33ab70


Comment on lines +153 to +155
- All four `Send*` / `Recv*` fields must be either both zero (disabled) or both positive.
- Capacity ≥ refill (otherwise the bucket can never burst above the steady state).
- A hard ceiling per queue (e.g. 100,000 RPS) so a typo (`SendCapacity = 1e9`) does not silently mean "no limit at all" but rejects with `InvalidAttributeValue`.

P1: Require bucket capacity to cover max batch token cost

The validator currently only enforces paired positivity and capacity >= refill, but §3.3 charges batch APIs by len(Entries) (up to 10). With configs like SendCapacity=5, a valid SendMessageBatch of 10 can never succeed because the bucket can never accumulate 10 tokens, so clients get permanent throttling and misleading Retry-After values. Please add validation that per-action capacity is at least the maximum single-request charge (or change the charging rule) to avoid creating unrecoverable throttling states.


Comment on lines +273 to +275
5. **Mixed-version cluster**: a rolling upgrade where some nodes have HT-FIFO and others don't. The new feature gates on the queue's `PartitionCount > 1` field, which is set at create time; old nodes that try to scan a partitioned queue's keyspace will simply not find anything (the prefix has changed). The catalog rejects `CreateQueue` with `PartitionCount > 1` until every node in the cluster reports the new feature flag.

**The capability advertisement mechanism**: each node's existing `/sqs_health` endpoint (`adapter/sqs.go: serveSQSHealthz`) gains a new field in its JSON body — `capabilities: ["htfifo"]` once this PR's code is in the binary. The catalog's CreateQueue handler reads the live node set from the distribution layer's node registry (the same registry used by `proxyToLeader` to locate leaders), polls `/sqs_health` on each, and gates `PartitionCount > 1` on every node reporting the `htfifo` capability. Nodes that don't respond within a short timeout are treated as not-yet-upgraded — a deliberate fail-closed default so a network blip does not let a partitioned queue land in a partially-upgraded cluster. This mirrors the §3.3.2 admin-forwarding upgrade gate from the admin dashboard design (PR #644), which uses the same "all-nodes-must-report" pattern for `AdminForward`.

P1: Enforce HT-FIFO capability beyond queue creation

This guard only blocks CreateQueue when not all nodes advertise htfifo, but the same section notes older nodes will not see partitioned-keyspace data. That leaves a failure mode where a downgraded/replaced old node joins after creation, becomes leader for a partitioned queue, and serves false-empty reads/writes against the wrong prefix. The proposal should require a runtime safeguard (startup or leadership admission) that prevents non-HT-FIFO binaries from serving once any PartitionCount > 1 queue exists.
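For reference, a rough sketch of the create-time capability poll the quoted paragraph describes. The capabilities JSON field comes from the quote; allNodesAdvertise, the address scheme, and the 500 ms timeout are assumptions:

```go
package sqs

import (
	"encoding/json"
	"net/http"
	"time"
)

type sqsHealth struct {
	Capabilities []string `json:"capabilities"`
}

// allNodesAdvertise returns true only if every node answers
// /sqs_health within the timeout AND lists the capability; a silent
// or slow node counts as not-yet-upgraded (fail closed).
func allNodesAdvertise(nodes []string, capability string) bool {
	client := &http.Client{Timeout: 500 * time.Millisecond}
	for _, addr := range nodes {
		resp, err := client.Get("http://" + addr + "/sqs_health")
		if err != nil {
			return false
		}
		var h sqsHealth
		err = json.NewDecoder(resp.Body).Decode(&h)
		resp.Body.Close()
		if err != nil {
			return false
		}
		found := false
		for _, c := range h.Capabilities {
			if c == capability {
				found = true
			}
		}
		if !found {
			return false
		}
	}
	return true
}
```

As the P1 notes, this check runs only inside CreateQueue, so by itself it cannot stop a downgraded node from later serving an already-created partitioned queue.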


…conds, §3.4)

Three findings from the fifth-round Claude review on PR #664:

3.C per-queue throttling — DeleteQueue bucket lifecycle (medium):
§3.1 now requires DeleteQueue to call buckets.Delete(key) for every
bucketKey of the deleted queue (Send, Recv, Default), so a same-name
recreation does not inherit stale token balance/limits for up to 1h.
Alternative (generation in bucketKey) considered and rejected as more
expensive than an explicit Delete.

3.D split-queue FIFO — WaitTimeSeconds budget across fanout (medium):
§4.2 now threads a shared deadline through the partition fanout
(step 3) and passes remainingWait per sub-call (step 4a/c/d). Without
this, an empty queue with WaitTimeSeconds=20 + PartitionCount=8 would
hold the connection for up to 160s. §8.1 also drops the contradictory
"long-poll mitigates the cost" sentence — long-poll was actually the
trigger for the bug.

3.D split-queue FIFO — §3.4 --raftSqsMap reference (low): rephrased
to distinguish the router infrastructure (kv/shard_router.go, which
already supports multi-group routing by key) from the configuration
surface (the new --sqsFifoPartitionMap flag in §5). Eliminates the
contradiction with §5.
@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed a commit addressing all three findings from your fifth-round review (2 medium, 1 low):

3.C per-queue throttling — §3.1 DeleteQueue bucket lifecycle (medium):

  • New paragraph immediately after the SetQueueAttributes invalidation block. DeleteQueue now also calls buckets.Delete(key) for every bucketKey of the deleted queue (Send, Recv, Default), so a DeleteQueue+CreateQueue with the same name no longer inherits the previous incarnation's token balance/limits for up to 1h. Documented the generation-in-bucketKey alternative and the reason for picking the explicit Delete (cheaper, mirrors the SetQueueAttributes pattern).

3.D split-queue FIFO — §4.2 WaitTimeSeconds budget across fanout (medium):

  • Pseudocode reworked: step 3 sets deadline := start + WaitTimeSeconds (cap 20s); step 4a/4c/4d compute remainingWait := max(0, deadline - now()) and pass it as the per-sub-call long-poll budget. Total wall-clock wait is now bounded by the original WaitTimeSeconds regardless of PartitionCount. Without this, an empty queue with WaitTimeSeconds=20 + PartitionCount=8 would hold the connection for up to 160s.
  • Added a "Why the shared deadline matters" paragraph documenting the bug class and the rejected alternative (parallel probe with shared cancellation — rejected because it erases the partitionOrder rotation).
  • §8.1 dropped the contradictory "long-poll mitigates the RTT cost" sentence; long-poll was actually the trigger for the bug.

3.D split-queue FIFO — §3.4 --raftSqsMap reference (low):

  • Rephrased to distinguish the router infrastructure (kv/shard_router.go, which already supports multi-group routing by key) from the configuration surface (--sqsFifoPartitionMap, the new flag from §5). Calls out explicitly that conflating the two would parse partition assignments as endpoint pairs and route to the wrong leader. Aligns with §5.

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 20s


Review

  • Read updated design documents
  • Verify fix: §3.1 DeleteQueue bucket invalidation (3.C) — medium
  • Verify fix: §4.2 WaitTimeSeconds shared deadline across fanout (3.D) — medium
  • Verify fix: §3.4 --raftSqsMap reference cleanup (3.D) — low
  • Check open Codex P1 items from last round
  • Check for new issues
  • Overall assessment

All three fixes from the trigger are correctly applied. Two Codex P1 items raised in the sixth-round review remain open and are the only findings I would consider blocking before implementation PRs. One new low-severity finding.


All three trigger fixes: resolved ✓

3.C — §3.1 DeleteQueue bucket lifecycle (medium)
The paragraph is explicit and accurate: after the Raft commit that purges the queue meta, buckets.Delete(key) is called for all three action-keys. The race with LoadOrStore is correctly noted as benign. The alternative (generation-in-bucketKey) is documented and the reason for rejecting it is clear. The 1 h idle-eviction context makes the urgency obvious.

3.D — §4.2 WaitTimeSeconds shared deadline (medium)
Step 3 sets deadline := start + WaitTimeSeconds (capped at 20s); step 4a computes remainingWait := max(0, deadline - now()) and passes it to each sub-call. The "Why the shared deadline matters" paragraph correctly explains the 160 s pathological case that the fix eliminates, and the rejected parallel-probe alternative is documented with the reason (erases partitionOrder rotation semantics). §8.1 no longer contains the contradictory long-poll mitigation sentence.

3.D — §3.4 --raftSqsMap reference cleanup (low)
The distinction is now explicit: kv/shard_router.go already supports multi-group routing (no changes needed there), but the configuration surface for partition-to-group assignment is --sqsFifoPartitionMap (§5), not --raftSqsMap. The consequence of conflating the two ("parse partition assignments as endpoint pairs and route to the wrong leader") is called out directly.


Still-open Codex P1 findings from f6bd33ab70 (sixth-round Codex review)

These were raised in the commit-by-commit Codex review on PR #664 and have not yet been addressed.

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.2 — validator does not bound capacity to max single-request charge

The validator enforces capacity >= refill but not capacity >= max_single_request_charge. The maximum single-request charge from the Send bucket is 10 (a full SendMessageBatch). A queue configured with SendCapacity = 5 creates a state where every SendMessageBatch of > 5 entries is permanently unserviceable — the bucket can only hold 5 tokens and can never accumulate 10. The client receives Throttling forever, Retry-After grows proportionally, and there is no way to recover without reconfiguring the queue. The Retry-After formula (§3.4) correctly computes ceil((10 - 5) / refillRate) but that only tells the client "wait N seconds" before they'll be throttled again.

The fix is one validation rule: each capacity that covers a batch verb must be at least the maximum charge of any single request through that bucket. For Send: SendCapacity >= 10. For Recv: RecvCapacity >= 10 (DeleteMessageBatch is also up to 10). Document the constraint in §3.2 and reject with InvalidAttributeValue at SetQueueAttributes. Fix this →
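A minimal sketch of the combined §3.2 rules with the proposed bound added; the function shape and names are illustrative, not the doc's committed validator:

```go
package sqs

import "fmt"

const (
	maxBatchEntries = 10      // SendMessageBatch / DeleteMessageBatch max entries
	hardCeiling     = 100_000 // per-queue ceiling from §3.2
)

// validatePair checks one (capacity, refill) pair. coversBatchVerb is
// true for the Send and Recv buckets, whose verbs can charge up to 10
// tokens in a single request.
func validatePair(name string, capacity, refill float64, coversBatchVerb bool) error {
	switch {
	case capacity == 0 && refill == 0:
		return nil // throttling disabled for this action
	case capacity <= 0 || refill <= 0:
		return fmt.Errorf("InvalidAttributeValue: %s capacity and refill must be both zero or both positive", name)
	case capacity < refill:
		return fmt.Errorf("InvalidAttributeValue: %s capacity must be >= refill", name)
	case coversBatchVerb && capacity < maxBatchEntries:
		// Without this rule, SendCapacity=5 makes every 10-entry batch
		// permanently unserviceable: the bucket can never hold 10 tokens.
		return fmt.Errorf("InvalidAttributeValue: %s capacity must be >= %d, the max single-request charge", name, maxBatchEntries)
	case capacity > hardCeiling:
		return fmt.Errorf("InvalidAttributeValue: %s capacity exceeds the hard ceiling of %d", name, hardCeiling)
	}
	return nil
}
```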

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §8 item 5 — no runtime safeguard against a downgraded node becoming leader after HT-FIFO queue creation

§8 item 5 says "the catalog rejects CreateQueue with PartitionCount > 1 until every node reports htfifo" — a create-time gate. But the same item acknowledges "older nodes that try to scan a partitioned queue's keyspace will simply not find anything." There is no runtime protection against the following sequence:

  1. All nodes have htfifo; a partitioned queue orders.fifo is created.
  2. Node A is upgraded back (or replaced by a rollback image) and rejoins without htfifo.
  3. Node A is elected leader for the partition-3 Raft group.
  4. Node A's ReceiveMessage scans the old prefix (!sqs|msg|data|orders.fifo|...) and finds nothing — false-empty reads.
  5. Node A's SendMessage writes the old key format — data on partition 3 is written to the wrong prefix and the reaper won't find it under the new prefix.

The create-time gate does not cover this; it only prevents new partitioned queues from being created in a mixed-version cluster. Once a partitioned queue exists, any node that serves it must also understand the new prefix.

The proposal needs a runtime safeguard: a node that does not have htfifo should refuse to become leader (or refuse to serve the SQS handler) for a Raft group that contains any queue with PartitionCount > 1. A practical implementation: on startup (or on leadership acquisition), query the catalog for partitioned queues in the local shard; if any exist and the binary lacks htfifo, refuse leadership with an explicit log message. This is the node-admission variant of the create-time gate. Fix this →
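A minimal sketch of that admission check, with every name below illustrative (the real hook would live on the leadership-acquisition path in kv/lease_state.go):

```go
package sqs

import "log"

var binaryHasHTFIFO bool // false in a rolled-back binary

// Stubs standing in for the catalog lookup and the Raft call.
func partitionedQueuesInShard(shardID uint64) ([]string, error) { return nil, nil }
func transferLeadership(shardID uint64) error                   { return nil }

// refuseIfDowngraded is the node-admission variant of the create-time
// gate: a binary without htfifo steps down rather than serve a shard
// containing a partitioned queue under the wrong key prefix.
func refuseIfDowngraded(shardID uint64) error {
	if binaryHasHTFIFO {
		return nil
	}
	queues, err := partitionedQueuesInShard(shardID) // any queue with PartitionCount > 1
	if err != nil {
		return err
	}
	if len(queues) > 0 {
		log.Printf("refusing leadership of shard %d: %d partitioned queue(s) present, binary lacks htfifo", shardID, len(queues))
		return transferLeadership(shardID) // fail closed; the group may go leaderless
	}
	return nil
}
```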


New finding

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — bucketKey struct definition and DeleteQueue paragraph are inconsistent on whether partitionID is in the current key

The struct definition in §3.1 is:

```go
type bucketKey struct {
    queue  string
    action string  // "Send" | "Receive" | "*"
}
```

But the DeleteQueue paragraph (added in this latest push) says:

"so we keep bucketKey = (queueName, partitionID, action) for now"

partitionID is not in the current struct. §4.3 explicitly says the struct "grows a partition uint32 field" when Phase 3.D lands. The DeleteQueue paragraph should say (queueName, action)partitionID is a future addition. An implementor reading §3.1 end-to-end will see the struct definition conflict with the inline text, and if they include partitionID in Phase 3.C based on the paragraph, the §4.3 forward-compat note will be confusing. Fix this →


Summary

All three trigger fixes are cleanly applied. The two Codex P1 items from the sixth-round review — bucket capacity bound on max batch charge (3.C §3.2) and runtime safeguard for downgraded nodes post-HT-FIFO creation (3.D §8) — are the only findings I'd call blocking before the implementation PRs. Both represent design gaps that will have to be resolved in code regardless; better to settle the spec now. The bucketKey inconsistency is a low-severity cleanup.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 624d411a93



No global lock is held during step 3; concurrent traffic on different queues runs in parallel.

**Cache invalidation on `SetQueueAttributes`**: when an operator updates the throttle config via `SetQueueAttributes`, the handler — *after* the Raft commit that persists the new `sqsQueueThrottle` — calls `buckets.Delete(key)` for every `bucketKey` belonging to the updated queue (`Send`, `Recv`, `Default`). Without this step the in-memory bucket would keep enforcing the old limits until the idle-eviction sweep removes the stale entry (default 1 h window), defeating the operator's intent to throttle a noisy tenant in real time. The `LoadOrStore` race with the `Delete` call is benign: the next request rebuilds from the freshly-committed meta, and the rebuilt bucket starts at full capacity (same semantics as the failover case documented below). Claude P1 on PR #664 caught the gap.

P1: Use canonical action keys when deleting throttle buckets

The design defines bucketKey.action as "Send" | "Receive" | "*", but this invalidation step deletes ("Send", "Recv", "Default"). If implemented as written, Receive and default buckets will not be removed on SetQueueAttributes, so updated limits can remain stale until idle eviction (and the same mismatch appears in the DeleteQueue path), which directly defeats the real-time reconfiguration behavior this section is trying to guarantee.



`PartitionCount` is **immutable after first SendMessage**. The validator on `SetQueueAttributes` rejects any change; operators who want a different partition count create a new queue. Why immutable: changing it would require re-hashing every existing message into a new partition, which (a) breaks ordering for in-flight messages of every group whose hash bucket changed, and (b) is a multi-second / multi-minute operation that cannot be expressed as one OCC transaction.

P1: Keep FIFO routing attributes immutable after first send

This section only freezes PartitionCount, but routing also depends on FifoThroughputLimit (perQueue forces partition 0 in partitionFor). Allowing SetQueueAttributes to flip throughput mode after messages exist can move the same MessageGroupId to a different partition mid-stream, splitting one group across partitions and violating FIFO within-group ordering; the proposal should lock routing-affecting attributes after first send (or require queue recreation).


…rade, key)

Three findings from the sixth-round review on PR #664:

3.C per-queue throttling §3.2 — capacity bound on max batch charge
(Codex P1, medium): SetQueueAttributes validator now requires capacity
>= max single-request charge. Send/Recv buckets must be >= 10 when
non-zero (full SendMessageBatch / DeleteMessageBatch is 10 entries).
Without this rule, SendCapacity=5 makes every full batch permanently
unserviceable — the bucket can never accumulate enough tokens.

3.D split-queue FIFO §8 item 5 — runtime safeguard for downgraded
leaders (Codex P1, medium): create-time gate only blocks new
partitioned queues. Once one exists, a downgraded node lacking
htfifo capability could be elected leader and produce false-empty
reads + key-prefix corruption. Adds a node-admission rule:
on startup + leadership acquisition, enumerate local-shard queues
with PartitionCount > 1; if any and the binary lacks htfifo,
refuse leadership and step down via TransferLeadership.

3.C per-queue throttling §3.1 — DeleteQueue paragraph bucketKey
text (Claude low): "(queueName, partitionID, action)" contradicted
the struct definition above it (no partitionID until Phase 3.D §4.3).
Corrected to "(queueName, action) for Phase 3.C".
@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed addressing all three findings from your sixth-round review (2 medium Codex P1 + 1 low Claude):

3.C per-queue throttling — §3.2 capacity ≥ max batch charge (Codex P1, medium):

  • New fourth bullet on the SetQueueAttributes validator: when SendCapacity > 0 it must also be ≥ 10, and likewise RecvCapacity > 0 → ≥ 10. Justified inline against the §3.3 charging table (SendMessageBatch / DeleteMessageBatch charge up to 10 each). Documented the failure mode the rule prevents (permanently unserviceable batches with no recovery short of re-running SetQueueAttributes) and the rejection envelope (InvalidAttributeValue with the per-bucket minimum named so the operator sees the cause). Default* is exempt because no batch verb is in scope of the catch-all bucket today; explicit forward note for when one moves there.

3.D split-queue FIFO — §8 item 5 runtime safeguard for downgraded leaders (Codex P1, medium):

  • New paragraph after the capability advertisement mechanism. Describes the failure sequence the create-time gate misses (downgrade after a partitioned queue exists → false-empty reads + key-prefix corruption from the rolled-back leader). Specifies the safeguard: on startup and on leadership acquisition for an SQS Raft group, enumerate local-shard queues with PartitionCount > 1; if any exist and the binary lacks htfifo, refuse leadership with an explicit log line and TransferLeadership away (group becomes leaderless if no peer is willing — desired fail-closed behaviour). Hooks into the existing leadership-acquisition path in kv/lease_state.go so steady-state cost is zero. Explicitly required before the §11 PR 4 rollout step that marks the binary htfifo-eligible.

3.C per-queue throttling — §3.1 bucketKey text/struct consistency (Claude low):

  • The DeleteQueue paragraph said "we keep bucketKey = (queueName, partitionID, action) for now" but the struct definition above has only queue + action (no partitionID until Phase 3.D per §4.3). Corrected to "for Phase 3.C we keep bucketKey = (queueName, action) per the struct definition above — the partition field is added later by Phase 3.D, see §4.3".

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 10s


Review

  • Read updated design documents
  • Verify fix: §3.2 capacity ≥ max batch charge (3.C) — Codex P1 medium
  • Verify fix: §8 item 5 runtime safeguard for downgraded leaders (3.D) — Codex P1 medium
  • Verify fix: §3.1 bucketKey text/struct consistency (3.C) — Claude low
  • Check for new issues (including open Codex P1 from last round)
  • Overall assessment

All three fixes from this push are correctly applied. Two Codex P1 findings from the previous round's Codex review (on commit 624d411a93) were not in scope of this push and remain open. One new low-severity finding.


All three trigger fixes: resolved ✓

3.C — §3.2 capacity ≥ max batch charge (Codex P1, medium)
The fourth validation bullet is precise: SendCapacity > 0 → SendCapacity ≥ 10, RecvCapacity > 0 → RecvCapacity ≥ 10, justified by the §3.3 charging table. The failure mode (permanent unserviceability for full batches) is documented inline, the rejection envelope (InvalidAttributeValue) is named, and the Default* exemption is correctly scoped with an explicit forward note. The Codex P1 attribution is clear. Clean.

3.D — §8 item 5 runtime safeguard for downgraded leaders (Codex P1, medium)
The new paragraph accurately describes the failure sequence the create-time gate misses (downgrade after partitioned queue exists → false-empty reads + key-prefix corruption). The safeguard is concrete: catalog enumeration on startup + leadership acquisition, TransferLeadership on capability mismatch, fail-closed if no willing peer. The hook into kv/lease_state.go, the zero steady-state cost claim, and the ordering constraint relative to §11 PR 4 are all present and correct.

3.C — §3.1 bucketKey text/struct consistency (Claude low)
The DeleteQueue paragraph now reads "for Phase 3.C we keep bucketKey = (queueName, action) per the struct definition above — the partition field is added later by Phase 3.D, see §4.3." Consistent with the struct definition on lines 92–95 and the forward-compat note in §4.3. The previous partitionID mention is gone.


Still-open Codex P1 findings from the commit 624d411a93 Codex review

These were new findings added by Codex on the last commit round; they were not in scope of this push.

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — action key names in cache invalidation don't match struct definition

The bucketKey.action field is defined at line 94 as "Send" | "Receive" | "*". But both the SetQueueAttributes and DeleteQueue invalidation paragraphs (lines 114 and 116) delete keys for "Send", "Recv", "Default":

calls buckets.Delete(key) for every bucketKey belonging to the updated queue (Send, Recv, Default)

"Recv""Receive" and "Default""*". An implementation that follows the struct definition will never match the keys the invalidation code tries to delete, so Receive and Default buckets will remain stale after a config update — defeating the correctness guarantee the SetQueueAttributes invalidation paragraph is trying to establish. The fix is one-word changes in both paragraphs: use "Receive" and "*" (matching the struct), or update the struct comment to use "Recv" / "Default" and define those as the canonical values. Fix this →

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.2 — FifoThroughputLimit and DeduplicationScope are mutable after first send

Line 106 states only PartitionCount is immutable after first SendMessage. But routing in partitionFor also depends on FifoThroughputLimit — the perQueue short-circuit (added in an earlier round) collapses all groups to partition 0 regardless of PartitionCount. If an operator flips FifoThroughputLimit from perMessageGroupId to perQueue after messages exist, every MessageGroupId that was previously distributed across N partitions now routes to partition 0, splitting those groups across multiple partitions and violating within-group FIFO ordering for any in-flight messages. Similarly, DeduplicationScope affects how the dedup key is scoped; changing it live would either duplicate or suppress messages unexpectedly.

The fix is extending the immutability sentence: "PartitionCount, FifoThroughputLimit, and DeduplicationScope are all immutable after first SendMessage" — since all three affect routing or dedup semantics in ways that cannot be changed without a full drain. Fix this →
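To see why the attribute is routing-affecting, here is a sketch of the routing step: FNV-1a and the power-of-two mask come from the design, while the function shape is assumed:

```go
package sqs

import "hash/fnv"

// partitionFor routes a message group to a partition. The perQueue
// short-circuit pins every group to partition 0, so flipping
// FifoThroughputLimit after messages exist re-routes live groups and
// breaks within-group ordering.
func partitionFor(groupID, fifoThroughputLimit string, partitionCount uint32) uint32 {
	if fifoThroughputLimit == "perQueue" || partitionCount <= 1 {
		return 0
	}
	h := fnv.New32a() // FNV-1a: deterministic across processes and architectures
	h.Write([]byte(groupID))
	return h.Sum32() & (partitionCount - 1) // partitionCount is a power of two
}
```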


New finding

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §6 — integration test example uses an attribute value that §3.2 now rejects

§6 item 2 reads:

"Configure a queue with SendCapacity=5 SendRefillPerSecond=1. Send 5 messages back-to-back → all 200."

But §3.2's new fourth validation bullet requires SendCapacity ≥ 10 when non-zero. A SetQueueAttributes call with SendCapacity=5 would now return InvalidAttributeValue before the test even creates its first message — meaning the test as written would fail at setup, not at message 6. The example should use SendCapacity=10 (or higher) so the test is consistent with the validator it is supposed to be testing against. Fix this →


Summary

All three trigger fixes are cleanly applied. The two Codex P1 findings from the previous round's review — action key naming mismatch in cache invalidation (3.C §3.1) and FifoThroughputLimit / DeduplicationScope mutability (3.D §3.2) — are the only items I'd consider blocking before implementation PRs. Both are design-level decisions that implementation code would have to make regardless; getting the doc right now prevents the PRs from encoding the wrong choice. The integration test inconsistency is low-severity cleanup.

…immutability, test capacity)

Three findings from the seventh-round review on PR #664:

3.C per-queue throttling §3.1 -- action key naming consistency
(Codex P1, medium): bucketKey.action canonical values are
"Send"|"Receive"|"*" but cache invalidation paragraphs were saying
"Send","Recv","Default" (the JSON config field-name prefixes).
An implementation following the struct would never match the
wrong key names, leaving Receive and "*" buckets stale after
config changes. Changed both invalidation paragraphs to use the
struct vocabulary, and added a "Config-field -> bucket-action
mapping" paragraph after the sqsQueueThrottle struct so the
short-form / canonical-form distinction is explicit.

3.D split-queue FIFO §3.2 -- FifoThroughputLimit and
DeduplicationScope must also be immutable (Codex P1, medium):
prior text only marked PartitionCount as immutable. But changing
FifoThroughputLimit (perMessageGroupId<->perQueue) flips the §3.3
short-circuit on partition routing -- in-flight messages stay where
they are, new sends route differently, within-group FIFO ordering
silently breaks. Changing DeduplicationScope reshapes the dedup
key so the prior scope's de-duped messages can resurrect or new
sends can be incorrectly suppressed. Extended the immutability
sentence to cover all three with one bullet per attribute
explaining the failure mode.

3.C per-queue throttling §6 item 2 -- test capacity inconsistent
with new validator (Claude low): integration test used
SendCapacity=5 but §3.2's new fourth bullet requires
SendCapacity >= 10. Test would have failed at setup, not on the
6th send. Bumped to SendCapacity=10 / 10 messages back-to-back /
11th throttled, with an inline note explaining why the value
changed.
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed a commit addressing all three findings from your seventh-round review (2 medium Codex P1 + 1 low Claude):

3.C per-queue throttling — §3.1 action key naming consistency (Codex P1, medium):

  • Both invalidation paragraphs (SetQueueAttributes, DeleteQueue) now read (Send, Receive, * — the canonical action values from the bucketKey struct above). The prior (Send, Recv, Default) was the JSON config field-name vocabulary, not the in-memory map key — implementations following the struct would have left Receive and * buckets stale.
  • Added a new "Config-field → bucket-action mapping" paragraph immediately after the sqsQueueThrottle struct definition. Explicitly maps Send* → bucketKey{action:"Send"}, Recv* → bucketKey{action:"Receive"}, Default* → bucketKey{action:"*"}, and pins which vocabulary to use where (config-field for the JSON contract / GetQueueAttributes response; bucket-action for in-memory bucket-store ops). The implementation must apply the mapping after the SetQueueAttributes commit.

3.D split-queue FIFO — §3.2 immutability extended to FifoThroughputLimit + DeduplicationScope (Codex P1, medium):

  • Sentence rewritten: "PartitionCount, FifoThroughputLimit, and DeduplicationScope are all immutable after first SendMessage." Three bullets explain the failure mode for each:
    • PartitionCount — re-hashing every message, breaks in-flight ordering, multi-second op (existing reasoning).
    • FifoThroughputLimit — flipping perMessageGroupId ↔ perQueue activates/deactivates the §3.3 short-circuit; in-flight messages stay where they are, new sends for the same group route to a different partition, within-group FIFO is silently violated.
    • DeduplicationScope — changing scope reshapes the dedup key ((queue, dedupId) vs (queue, partitionId, dedupId) vs (queue, MessageGroupId, dedupId)); narrowing scope can resurrect previously-deduped duplicates, widening can suppress legitimately-new sends that collide with still-cached prior-scope entries.
  • Concluding sentence pins the unifying invariant: each attribute participates in a routing or dedup decision whose correctness depends on every existing message having been written under one consistent value, and live mutation creates incompatible "before"/"after" sets the runtime cannot reconcile without a full drain.
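A sketch of the dedup-key reshaping the third bullet describes; scope names and key layout here are illustrative, not the doc's committed encoding:

```go
package sqs

import "fmt"

// dedupKey shows why DeduplicationScope must be immutable: each scope
// yields a differently-shaped key, so entries written under one scope
// are invisible to lookups made under another.
func dedupKey(scope, queue string, partition uint32, groupID, dedupID string) string {
	switch scope {
	case "queue":
		return fmt.Sprintf("%s|%s", queue, dedupID)
	case "partition":
		return fmt.Sprintf("%s|%d|%s", queue, partition, dedupID)
	default: // per-message-group scope
		return fmt.Sprintf("%s|%s|%s", queue, groupID, dedupID)
	}
}
```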

3.C per-queue throttling — §6 item 2 test capacity (Claude low):

  • Bumped from SendCapacity=5 SendRefillPerSecond=1 / 5 messages / 6th throttled to SendCapacity=10 SendRefillPerSecond=1 / 10 messages / 11th throttled. Added a parenthetical inline note explaining why the value changed (the new §3.2 validator would have rejected SendCapacity=5 at setup).

@claude review

…e, PR 2 dormancy gate)

Two Codex P1 findings from the tenth-round Codex review on commit
14b4d88 that were carried forward to today's eleventh-round
Claude review:

3.D split-queue FIFO §3.2 / §4.1 -- {PartitionCount > 1,
DeduplicationScope = "queue"} validation moved to control plane
(Codex P1, medium): the old §4.1 paragraph rejected this
combination at SendMessage time, so an operator who mis-configured
at CreateQueue would get a successful response and only discover
the problem when every send failed -- a created-but-unserviceable
queue with no recovery short of DeleteQueue+CreateQueue. Added a
new "Cross-attribute validation at CreateQueue and
SetQueueAttributes" paragraph to §3.2 with the rejection rule
(InvalidParameterValue plus the AWS-shaped reason "queue-scoped
deduplication is incompatible with multi-partition FIFO because
the dedup key cannot be globally unique across partitions without
a cross-partition OCC transaction"). Reframed the §4.1 paragraph
as a "cannot reach this code" pointer to the §3.2 gate so the
runtime rejection is impossible.

3.D split-queue FIFO §11 -- explicit dormancy gate for PR 2-4
(Codex P1, medium): "feature is dormant" was editorial intent, not
a runtime guarantee. A cluster on PR 2-4 would accept
CreateQueue(PartitionCount=4), dispatch every SendMessage against
the legacy single-partition keyspace (partitionIndex=0 because the
data plane has not been wired yet), then have those messages be
invisible to the partition-aware fanout reader and reaper that
land in PR 5. PR 2 row now adds a temporary CreateQueue rejection
("PartitionCount > 1 requires HT-FIFO data plane -- not yet
enabled"); PR 3-4 rows note the gate is still in place; PR 5 row
explicitly removes it in the same commit that wires the fanout so
the gate-and-lift land atomically. New paragraph below the table
documents why the gate exists and why atomic gate-and-lift makes
the wrong-layout-data class of bug impossible.
@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed a commit addressing the two Codex P1 mediums you flagged as still open from the tenth-round Codex review on 14b4d88a82. You're right — both were genuinely missed and both need to land before PR 2.

3.D split-queue FIFO — control-plane gate for {PartitionCount > 1, DeduplicationScope = "queue"} (Codex P1, medium):

  • New "Cross-attribute validation at CreateQueue and SetQueueAttributes" paragraph in §3.2. Rejection envelope is InvalidParameterValue with the AWS-shaped reason: "queue-scoped deduplication is incompatible with multi-partition FIFO because the dedup key cannot be globally unique across partitions without a cross-partition OCC transaction."
  • Reframed the §4.1 paragraph from a runtime rejection to a "cannot reach this code path because §3.2 rejects at control-plane time" pointer. The runtime rejection is now impossible — the operator gets the validation error at CreateQueue where they can cheaply fix the config, not at the first SendMessage.

3.D split-queue FIFO — explicit PR 2 dormancy gate (Codex P1, medium):

  • PR 2 row now describes a temporary CreateQueue rejection: PartitionCount > 1 returns InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data plane — not yet enabled"). The schema field exists in the meta type but cannot land in production data.
  • PR 3 + 4 rows note the gate is still in place.
  • PR 5 row explicitly removes the gate in the same commit that wires the fanout — the gate-and-lift are atomic so a half-deployed cluster can never accept a partitioned queue without the data plane to serve it.
  • New "Why the temporary gate" paragraph below the rollout table walks through the failure sequence the gate prevents (PR 2–4 cluster accepts CreateQueue(PartitionCount=4)SendMessage writes to legacy !sqs|msg|data|<queue>|… prefix → PR 5 reader looks for partitioned !sqs|msg|data|p|<queue>|<partition>|… and finds nothing → silent message loss invisible to reaper too).

Both fixes lock down real production-time bugs the design would otherwise leave unspecified for the implementer to discover.
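Taken together, the two gates amount to a pair of checks in the CreateQueue path; a rough sketch, with htFIFODataPlaneEnabled and validateCreateQueue as illustrative names:

```go
package sqs

import "fmt"

// htFIFODataPlaneEnabled is flipped by the same PR 5 commit that
// wires the partition-aware fanout; until then it stays false.
var htFIFODataPlaneEnabled bool

func validateCreateQueue(partitionCount int, deduplicationScope string) error {
	// Cross-attribute gate from §3.2.
	if partitionCount > 1 && deduplicationScope == "queue" {
		return fmt.Errorf("InvalidParameterValue: queue-scoped deduplication is incompatible with multi-partition FIFO")
	}
	// Temporary dormancy gate for PR 2-4, removed atomically with the
	// PR 5 data plane so partitioned queue meta can never land without
	// a reader that understands the partitioned key layout.
	if partitionCount > 1 && !htFIFODataPlaneEnabled {
		return fmt.Errorf("InvalidAttributeValue: PartitionCount > 1 requires HT-FIFO data plane, not yet enabled")
	}
	return nil
}
```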

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 48s


Review

  • Read updated design documents
  • Verify fix: §3.2 cross-attribute validation for {PartitionCount > 1, DeduplicationScope = "queue"} (3.D) — Codex P1 medium
  • Verify fix: §11 PR 2 dormancy gate for PartitionCount > 1 (3.D) — Codex P1 medium
  • Check for new issues
  • Overall assessment

Both fixes from this push are correctly applied. No open Codex P1 items remain from any prior round. Two new low-severity findings.


Both trigger fixes: resolved ✓

3.D — §3.2 cross-attribute validation for {PartitionCount > 1, DeduplicationScope = "queue"} (Codex P1, medium)
Lines 115–116 contain the new "Cross-attribute validation at CreateQueue and SetQueueAttributes" paragraph. The rejection envelope is InvalidParameterValue with the AWS-shaped reason quoted verbatim; the failure mode it prevents (created-but-unserviceable queue with no recovery short of DeleteQueue+CreateQueue) is documented inline; the guidance that §4.x rejection paragraphs should be reframed as "cannot reach this code" notes is present. §4.1 (line 184) correctly describes the {DeduplicationScope = queue, PartitionCount > 1} combination as unreachable because the §3.2 control-plane gate prevents it — the runtime rejection that was previously in §4.1 is gone. Clean.

3.D — §11 PR 2 dormancy gate + "Why the temporary gate" paragraph (Codex P1, medium)
The PR 2 row (line 371) now carries the explicit temporary gate: CreateQueue rejects PartitionCount > 1 with InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data plane — not yet enabled") with an explicit note that the schema field exists in the meta type but cannot land in production data. PR 3 and 4 rows correctly state the gate remains in place. The PR 5 row explicitly removes the gate in the same commit as the data-plane fanout — the atomic gate-and-lift is the key correctness guarantee. The "Why the temporary gate" paragraph (line 379) accurately describes the failure sequence the gate prevents: a cluster on PR 2–4 accepts a CreateQueue(PartitionCount=4) → SendMessage writes to the legacy !sqs|msg|data|<queue>|… prefix → PR 5 fanout reader looks for !sqs|msg|data|p|<queue>|<partition>|… and finds nothing → silent message loss invisible to the reaper too. Clean.


No open items from prior rounds ✓

All Codex P1 and Claude-flagged items across all prior review rounds are now resolved. The two trigger fixes close the last open design gaps identified before PR 2 can start.


New findings

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.2 — cross-attribute rule at SetQueueAttributes is permanently unreachable

§3.2's cross-attribute validation paragraph says "the same validator that enforces immutability also rejects incoherent attribute combinations" and names SetQueueAttributes alongside CreateQueue. But both PartitionCount and DeduplicationScope are immutable — SetQueueAttributes cannot change either, so no SetQueueAttributes call can ever present the {PartitionCount > 1, DeduplicationScope = "queue"} combination. The cross-attribute rule is only ever triggered by CreateQueue. An implementor following the paragraph literally would write dead code in the SetQueueAttributes handler with no indication it is harmless dead code.

The fix is a single clarifying sentence: "In practice this rule is only reachable at CreateQueue time — SetQueueAttributes cannot present the combination because both PartitionCount and DeduplicationScope are immutable and will be rejected by the immutability check before the cross-attribute check runs." Fix this →

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §11 — PR 4 "Mixed-version gate" does not explicitly list the §8 runtime leadership-refusal safeguard

§8's "Runtime safeguard for downgraded leaders" paragraph closes with: "Implementations of Phase 3.D's PR set must include this safeguard before the rollout step that marks the binary htfifo-eligible (§11 PR 4)."

PR 4's rollout description (line 373) says: "Routing layer: kv/shard_router.go accepts the (queue, partition) key. New --sqsFifoPartitionMap flag. Mixed-version gate." The "Mixed-version gate" label likely covers the §8.5 capability advertisement mechanism (/sqs_health + catalog polling), but the leadership-refusal check (the on-startup and on-leadership-acquisition catalog query that triggers TransferLeadership if any partitioned queue exists on a non-htfifo node) is a distinct implementation concern from capability advertisement. An implementor writing PR 4 from the table alone could deliver the /sqs_health capability badge and the CreateQueue guard without implementing the kv/lease_state.go admission hook, and the table would appear satisfied.

The fix is a parenthetical in the PR 4 row: "Mixed-version gate (§8 capability advertisement via /sqs_health plus leadership-refusal hook in kv/lease_state.go for non-htfifo nodes that encounter a partitioned queue on startup or leadership acquisition)." Fix this →


Summary

Both trigger fixes are cleanly applied, and all open items from prior rounds are resolved. The two new findings are low-severity documentation clarifications that would help implementation PR authors avoid confusion — neither represents a design gap that blocks PR 2 from starting. The proposal docs are in excellent shape.

…ion, PR 4 hook scope)

Two low-severity findings from the thirteenth-round Claude review on
PR #664:

3.D split-queue FIFO §3.2 -- cross-attribute rule at
SetQueueAttributes is dead code (Claude low): the prior wording
said the rule applies at both CreateQueue and SetQueueAttributes,
but immutability fires first on SetQueueAttributes so the
{PartitionCount > 1, DeduplicationScope = queue} combination can
never reach the cross-attribute check on that path. Added a
clarifying paragraph so PR 2 author knows the SetQueueAttributes
side is unreachable rather than missing test coverage.

3.D split-queue FIFO §11 PR 4 -- "Mixed-version gate" expanded
(Claude low): the prior label could be read as covering only the
§8.5 capability advertisement (/sqs_health + catalog polling). The
§8 leadership-refusal hook in kv/lease_state.go is a distinct
implementation concern and §8 explicitly says it must land before
PR 4 marks the binary htfifo-eligible. Expanded the parenthetical
to enumerate both components so the PR 4 author cannot miss the
runtime hook.
@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed a commit addressing both low findings from your thirteenth-round review:

3.D split-queue FIFO — §3.2 cross-attribute rule at SetQueueAttributes is dead code (low):

  • New clarifying paragraph after the cross-attribute validation block: "In practice this rule is only reachable at CreateQueue time. SetQueueAttributes cannot present the {PartitionCount > 1, DeduplicationScope = "queue"} combination because both PartitionCount and DeduplicationScope are immutable: any change to either is caught by the immutability check (described in the next paragraph) before the cross-attribute check has the chance to run. Implementing the cross-attribute rule on the SetQueueAttributes path would therefore be dead code — it is harmless to add for symmetry, but PR 2's author should know it is unreachable rather than spend time hunting for the test case that hits it."

3.D split-queue FIFO — §11 PR 4 "Mixed-version gate" expanded (low):

  • PR 4 row's "Mixed-version gate" parenthetical now explicitly lists both components required before the binary is marked htfifo-eligible: (1) §8.5 capability advertisement via /sqs_health + catalog polling for CreateQueue gating, and (2) the §8 leadership-refusal hook in kv/lease_state.go that calls TransferLeadership when a non-htfifo binary discovers a partitioned queue in its shard on startup or leadership acquisition. PR 4 author cannot now satisfy the table by delivering only the capability badge.

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 5s


Review

  • Read updated design documents
  • Verify fix: §3.2 dead-code clarification for SetQueueAttributes cross-attribute rule (3.D) — low
  • Verify fix: §11 PR 4 "Mixed-version gate" expanded to enumerate both components (3.D) — low
  • Check for new issues
  • Overall assessment

Both fixes from this push are correctly applied. No open items remain from any prior round. No new issues found.


Both trigger fixes: resolved ✓

3.D — §3.2 dead-code clarification for SetQueueAttributes cross-attribute rule (low)
Line 117 now reads: "In practice this rule is only reachable at CreateQueue time. SetQueueAttributes cannot present the {PartitionCount > 1, DeduplicationScope = "queue"} combination because both PartitionCount and DeduplicationScope are immutable: any change to either is caught by the immutability check (described in the next paragraph) before the cross-attribute check has the chance to run. Implementing the cross-attribute rule on the SetQueueAttributes path would therefore be dead code — it is harmless to add for symmetry, but PR 2's author should know it is unreachable rather than spend time hunting for the test case that hits it."

The reasoning is correct: any SetQueueAttributes call that would present the combination must carry a differing DeduplicationScope value (since the stored value is set at CreateQueue and cannot be "queue" while PartitionCount > 1 per the cross-attribute rule), and that differing value is caught by the immutability check before the cross-attribute check runs. PR 2's author now has explicit guidance not to burn time on dead code.

3.D — §11 PR 4 "Mixed-version gate" expanded (low)
Line 375 now reads: "Mixed-version gate (§8.5 capability advertisement via /sqs_health + catalog polling for CreateQueue gating, and the §8 leadership-refusal hook in kv/lease_state.go that calls TransferLeadership when a non-htfifo binary discovers a partitioned queue in its shard on startup or leadership acquisition — both components are required before the binary is marked htfifo-eligible)."

Both the capability-advertisement side (§8.5, needed for the CreateQueue gate) and the runtime leadership-admission side (the kv/lease_state.go hook, needed to prevent a downgraded node from serving partitioned queues) are explicitly enumerated, and the "both components are required" note removes any ambiguity about scope for the PR 4 author.


No open items from prior rounds ✓

All Codex P1 and Claude-flagged items across all prior review rounds are resolved.


No new findings ✓

I read both docs end-to-end. One minor observation that does not warrant a fix request: line 115 still names SetQueueAttributes alongside CreateQueue in the cross-attribute validation section heading ("Cross-attribute validation at CreateQueue and SetQueueAttributes"), while line 117 immediately explains that the SetQueueAttributes path is dead code. The two are consistent — document both paths, then explain one is unreachable — and an implementor reading both lines gets the right mental model. The alternative would be to narrow the heading to just CreateQueue, but the current form is clear enough and changing it would touch only a cosmetic detail.


Summary and merge recommendation

Both trigger fixes are cleanly applied, and the docs have no remaining open items from any review round. The proposal documents for Phase 3.C (per-queue throttling) and Phase 3.D (split-queue FIFO) are in excellent shape. Ready to merge. Implementation PR 2 can begin from this base.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d246cc3e99


Comment on lines +114 to +118
**Cache invalidation on `SetQueueAttributes`**: when an operator updates the throttle config via `SetQueueAttributes`, the handler — *after* the Raft commit that persists the new `sqsQueueThrottle` — calls `buckets.Delete(key)` for every `bucketKey` belonging to the updated queue (`Send`, `Receive`, `*` — the canonical action values from the `bucketKey` struct above). Without this step the in-memory bucket would keep enforcing the old limits until the idle-eviction sweep removes the stale entry (default 1 h window), defeating the operator's intent to throttle a noisy tenant in real time. The `LoadOrStore` race with the `Delete` call is benign: the next request rebuilds from the freshly-committed meta, and the rebuilt bucket starts at full capacity (same semantics as the failover case documented below). Claude P1 on PR #664 caught the gap.

**Cache invalidation on `DeleteQueue`**: when a queue is deleted, the handler — *after* the Raft commit that purges the queue meta — calls `buckets.Delete(key)` for every `bucketKey` belonging to the deleted queue (`Send`, `Receive`, `*` — the canonical action values from the `bucketKey` struct above), mirroring the `SetQueueAttributes` path above. Without this step, a `DeleteQueue` immediately followed by `CreateQueue` with the same name would inherit the previous incarnation's in-memory bucket (current token balance, capacity, refill rate) until the 1 h idle-eviction sweep removes the stale entry — so an operator using `DeleteQueue`+`CreateQueue` to reset a noisy queue's state would be surprised to see the old throttle still in effect. (Alternative considered: include the queue `generation` in `bucketKey` so old and new incarnations cannot collide structurally; the explicit `Delete` is cheaper and matches the `SetQueueAttributes` pattern, so for Phase 3.C we keep `bucketKey = (queueName, action)` per the struct definition above — the `partition` field is added later by Phase 3.D, see §4.3 — and document the lifecycle requirement here.) Claude P1 on PR #664 fifth-round review.

The bucket map is per-process. On leader failover, a fresh bucket starts at full capacity on the new leader — there is no Raft replication of bucket state. **Why this is correct**: the worst-case behaviour of "fresh bucket on failover" is that a noisy queue gets one extra burst worth of bandwidth right after a leader change. Replicating bucket state would cost a Raft commit per token decrement, which would defeat the entire point of the token bucket. AWS's own rate limiter has the same property at region failover boundaries.

P1: Invalidate throttle caches on leader changes

This design only deletes bucket entries in the SetQueueAttributes/DeleteQueue handler on the current leader, while the cache is explicitly per-process; a follower that previously led the queue can keep an old (queue, action) bucket until idle eviction and later reuse it after failover. In that scenario the new leader can enforce stale limits/token balance (or a prior queue incarnation) despite recent control-plane updates, so the documented “fresh bucket on failover” behavior is not guaranteed. Please add a leadership-time invalidation/version check (or keying that includes queue incarnation/config version) so failover cannot resurrect stale throttle state.
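One way to honour the "fresh bucket on failover" contract would be to wipe the per-process map whenever this node acquires SQS leadership; a minimal sketch, with all names illustrative:

```go
package sqs

import "sync"

type throttleBuckets struct{ m sync.Map }

// onLeadershipAcquired clears every cached bucket so an entry kept
// from a previous term (or a previous queue incarnation) can never be
// reused; each queue rebuilds from freshly-read meta on first use.
func (b *throttleBuckets) onLeadershipAcquired() {
	b.m.Range(func(key, _ any) bool {
		b.m.Delete(key)
		return true
	})
}
```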


Implements the design from
docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md.

Schema (sqs_catalog.go):
- sqsQueueMeta gains an optional Throttle *sqsQueueThrottle field
  with six float64 sub-fields (Send/Recv/Default x Capacity/RefillPerSecond).
- Six new ThrottleSendCapacity / ThrottleSendRefillPerSecond /
  ThrottleRecvCapacity / ThrottleRecvRefillPerSecond /
  ThrottleDefaultCapacity / ThrottleDefaultRefillPerSecond attributes
  go through the standard sqsAttributeAppliers dispatch.
- validateThrottleConfig enforces the cross-field rules from the
  design: each (capacity, refill) pair must be both-zero or
  both-positive; capacity must be >= refill so the bucket can burst;
  Send/Recv capacities must be >= 10 (the SendMessageBatch /
  DeleteMessageBatch max charge) so a small capacity does not make
  every full batch permanently unserviceable; per-field hard ceiling
  of 100k. Validator runs once at the end of applyAttributes so a
  multi-field update sees the post-apply state as a whole (see the sketch after this list).
- queueMetaToAttributes surfaces the configured Throttle* fields so
  GetQueueAttributes("All") round-trips. Extracted into
  addThrottleAttributes to stay under the cyclop ceiling.
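
A sketch of the per-pair rule referenced in the validator bullet above, under assumed names; the signature and error text are illustrative, and per the round-5 fix recorded further down, `requireBatchCapacity` ends up true for the Default pair as well:

```go
package throttle

import "fmt"

const (
	throttleMaxValue    = 100_000 // per-field hard ceiling
	throttleBatchCharge = 10      // SendMessageBatch / DeleteMessageBatch max charge
)

// validateThrottlePair checks one (capacity, refill) pair against the
// cross-field rules listed above.
func validateThrottlePair(name string, capacity, refill float64, requireBatchCapacity bool) error {
	switch {
	case capacity == 0 && refill == 0:
		return nil // pair unset: this action stays unthrottled
	case capacity <= 0 || refill <= 0:
		return fmt.Errorf("%s: capacity and refill must be both zero or both positive", name)
	case capacity < refill:
		return fmt.Errorf("%s: capacity must be >= refill so the bucket can burst", name)
	case requireBatchCapacity && capacity < throttleBatchCharge:
		return fmt.Errorf("%s: capacity below the max batch charge %d", name, throttleBatchCharge)
	case capacity > throttleMaxValue || refill > throttleMaxValue:
		return fmt.Errorf("%s: value exceeds hard ceiling %d", name, throttleMaxValue)
	}
	return nil
}
```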

Bucket store (sqs_throttle.go, new):
- bucketStore: sync.Map[bucketKey]*tokenBucket; per-bucket sync.Mutex
  on the hot path so cross-queue traffic never serialises on a
  process-wide lock. Per Gemini medium on PR #664 the single-mutex
  alternative was rejected. Lazy idle-evict sweep removes inactive
  buckets after 1h to bound memory.
- charge(): lock-free Load, LoadOrStore on miss (race-tolerant -- both
  racers compute identical capacity/refill from the same meta), then
  per-bucket lock for refill + take + release. Refill is elapsed *
  refillRate, capped at capacity. On reject, computes Retry-After
  from the actual refillRate and the requestedCount per the §3.4
  formula (numerator is requested count, not 1, so a batch verb does
  not get told to retry in 1s when it really needs 10s; see the sketch after this list).
- resolveActionConfig() handles the Default* fall-through: when only
  DefaultCapacity is set, Send and Recv requests share the same
  bucketKey{action:"*"} so Default behaves as one shared cap rather
  than three independent quotas.
- invalidateQueue() drops every bucket belonging to a queue. Called
  after the Raft commit on SetQueueAttributes / DeleteQueue so new
  limits take effect on the very next request, not after the 1h
  idle-evict sweep. Without this step the §3.1 cache-invalidation
  contract fails and operators see stale enforcement.
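
A minimal sketch of the refill-and-take arithmetic the charge() bullet describes; the type and method names are assumptions, while the refill formula, the batch-reject semantics, and the §3.4 numerator follow the text:

```go
package throttle

import (
	"sync"
	"time"
)

type tokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	capacity   float64
	refillRate float64 // tokens per second
	lastRefill time.Time
}

// charge refills by elapsed wall time, then takes n tokens or rejects
// without taking any (a rejected batch preserves its partial credit).
func (t *tokenBucket) charge(n float64, now time.Time) (ok bool, retryAfter time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()

	// Refill is elapsed * refillRate, capped at capacity.
	t.tokens += now.Sub(t.lastRefill).Seconds() * t.refillRate
	if t.tokens > t.capacity {
		t.tokens = t.capacity
	}
	t.lastRefill = now

	if t.tokens < n {
		// §3.4: the numerator is the requested count, so a 10-entry
		// batch is told to wait for 10 tokens, not 1.
		wait := (n - t.tokens) / t.refillRate
		return false, time.Duration(wait * float64(time.Second))
	}
	t.tokens -= n
	return true, 0
}
```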

Charging (sqs_messages.go, sqs_messages_batch.go):
- All seven message-plane handlers wired:
  SendMessage / SendMessageBatch (Send bucket; batch charges by entry
  count), ReceiveMessage / DeleteMessage / DeleteMessageBatch /
  ChangeMessageVisibility / ChangeMessageVisibilityBatch (Receive
  bucket).
- Throttle check sits OUTSIDE the OCC retry loop per §4.2 -- a
  rejected request never reaches the coordinator, so the existing
  sendMessageWithRetry et al. cannot busy-loop on a permanent
  rate-limit failure.
- sendMessage extracted into prepareSendMessage + validateSend +
  body to stay under the cyclop ceiling once the throttle branch
  was added.

Error envelope (sqs.go):
- New sqsErrThrottling code + writeSQSThrottlingError helper that
  writes 400 + AWS-shaped JSON body + x-amzn-ErrorType + Retry-After
  header. The envelope is the same shape AWS uses; SDKs that key off
  x-amzn-ErrorType handle it without changes.
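
A hedged sketch of that rejection path: the handler name and the JSON field names are placeholders rather than the exact AWS envelope; only the 400 status and the two headers follow the text above:

```go
package throttle

import (
	"encoding/json"
	"math"
	"net/http"
	"strconv"
	"time"
)

// writeThrottlingError sketches the rejection path: 400 status plus
// x-amzn-ErrorType and Retry-After headers, then a JSON body whose
// field names here are placeholders only.
func writeThrottlingError(w http.ResponseWriter, retryAfter time.Duration) {
	w.Header().Set("x-amzn-ErrorType", "ThrottlingException")
	w.Header().Set("Retry-After", strconv.Itoa(int(math.Ceil(retryAfter.Seconds()))))
	w.WriteHeader(http.StatusBadRequest)
	_ = json.NewEncoder(w).Encode(map[string]string{
		"__type":  "ThrottlingException", // placeholder shape
		"message": "Rate exceeded",
	})
}
```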

Tests:
- adapter/sqs_throttle_test.go: 18 unit tests covering bucket math
  (fresh capacity, refill elapsed, refill cap, batch reject preserves
  partial credit, Retry-After uses requested count, sub-1-RPS floor,
  per-action / per-queue / Default-fallthrough isolation,
  invalidateQueue, concurrent -race, default-off short-circuit) and
  the validator (nil/empty canonicalisation, both-zero-or-both-positive,
  capacity >= 10 batch floor, Default* exempt, capacity >= refill,
  parseThrottleFloat range checks, computeRetryAfter floor).
- adapter/sqs_throttle_integration_test.go: 8 end-to-end tests
  covering the §6 testing strategy (default-off allows unbounded,
  send/recv reject after capacity with correct envelope + Retry-After,
  batch charges by entry count, SetQueueAttributes invalidation,
  DeleteQueue + CreateQueue lifecycle invalidation, GetQueueAttributes
  round-trip, validator rejects below batch min). Run with -race
  clean.

Out of scope for this PR (deferred to a follow-up):
- Prometheus counter (sqs_throttled_requests_total) and gauge
  (sqs_throttle_tokens_remaining) per the design's §4.1 monitoring/
  registry.go entry. The wiring needs a new prometheus collector and
  hooking through the Registry seam; punted to keep this PR focused
  on the data-plane behaviour. The bucket store's chargeOutcome
  already exposes tokensAfter so the gauge wiring is one site.
- §6.5 cross-protocol (Query) parity test: the Query protocol layer
  lives on PR #662 and the throttle envelope already has the same
  shape on both protocols (writeSQSError handles both), so the test
  is a follow-up after #662 lands.
- §6.6 failover behaviour test: the §3.1 "fresh bucket on failover"
  contract is implementation-correct (the bucket map is per-process,
  no Raft state) but a 3-node failover test needs the cluster
  scaffolding and is best added with the Jepsen workload.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 2m 25s.


Review

  • Read current state of both design documents
  • Verify last two Claude low fixes from previous push
  • Check open Codex P1 from latest Codex review (d246cc3e99)
  • Check for new issues
  • Overall assessment

Both trigger fixes from last push: resolved ✓

3.D — §3.2 dead-code clarification for SetQueueAttributes cross-attribute rule (low)
Line 117 correctly explains that the cross-attribute {PartitionCount > 1, DeduplicationScope = "queue"} rule is dead code on the SetQueueAttributes path (the immutability check fires first), and that an implementor adding it for symmetry should know it is unreachable. PR 2 author now has the right mental model.

3.D — §11 PR 4 "Mixed-version gate" expanded (low)
Line 375 explicitly names both required components: (1) the §8.5 capability advertisement + catalog polling gate for CreateQueue, and (2) the §8 kv/lease_state.go leadership-refusal hook for downgraded nodes. Both are tagged as required before the binary is marked htfifo-eligible. The PR 4 author cannot satisfy the table by delivering only one of the two.


Still-open Codex P1 from the d246cc3e99 review

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — "fresh bucket on failover" is incorrect for nodes that were previously leaders

The doc states (line 118):

"The bucket map is per-process. On leader failover, a fresh bucket starts at full capacity on the new leader — there is no Raft replication of bucket state. Why this is correct: the worst-case behaviour of 'fresh bucket on failover' is that a noisy queue gets one extra burst worth of bandwidth right after a leader change."

This claim is only true when the failover target has never been the leader for that queue. When a node cycles back to leadership (A → B → A), node A retains its in-memory bucket from its previous leadership term and uses it on re-election. The Codex P1 scenario:

  1. Node A is leader; queue Q is configured with SendCapacity=10, SendRefillPerSecond=1. Node A handles 10 messages → bucket is at 0 tokens.
  2. Operator calls SetQueueAttributes on node B (the new leader), raising SendCapacity=100, SendRefillPerSecond=100. Node B deletes its own bucket entry for (Q, Send) and lazily rebuilds from the new meta.
  3. Node A (now follower) still has its old bucket at 0 tokens in its local sync.Map — the buckets.Delete call ran on node B's process, not node A's.
  4. Node B fails; node A becomes leader again.
  5. Node A loads its stale bucket (0 tokens, old 10-token capacity) from sync.Map. Every SendMessage is throttled with Retry-After: 10 — until the idle-eviction sweep fires (up to 1 h later).

The "one extra burst" characterisation in the doc applies only to the case of first-time leadership. The actual worst case under the Codex scenario is up to 1 hour of incorrect throttle enforcement — not a burst, the opposite.

The natural fix is to move bucket invalidation from the leader's HTTP handler to the Raft FSM apply path for the SetQueueAttributes entry. The FSM apply runs on every node (leaders and followers alike), so:

  • When SetQueueAttributes is committed, the FSM's apply on node A also calls buckets.Delete(key) for the updated queue.
  • When node A next becomes leader, its bucket cache is clean and lazily rebuilds from the freshly-committed meta.
  • The clock/bucket lifecycle behaviour is identical to the first-time-leader case.

An alternative is to flush the bucket store for all local-shard queues on leadership acquisition (a one-shot purge at the start of each new leader term). This is simpler to implement but slightly more aggressive. Either approach is acceptable; the doc should specify which one and update the "Why this is correct" paragraph to accurately describe what the guarantee is after the fix.


No new findings ✓

Both docs were read end-to-end. No new issues found beyond the Codex P1 above.


Summary

Both Claude lows from the previous push are correctly applied. The one open item is the Codex P1 from the last Codex review — the "fresh bucket on failover" claim and the bucket invalidation placement (leader HTTP handler vs. FSM apply). This is the only finding I'd consider blocking before the implementation PRs begin: it is a design decision that PR 2's author would otherwise have to resolve in code without guidance from the proposal doc. Once addressed, the docs are ready to merge and PR 2 can start.

bootjp added 15 commits April 27, 2026 19:48
…ini high/medium)

Six fixes from the first round of automated reviews on PR #679:

Codex P1 -- bucket reconciliation on stale config (sqs_throttle.go):
loadOrInit returned cached buckets without checking that capacity /
refillRate still match the queue's current Throttle config. After a
leadership change, a node retaining buckets from a prior leader term
would keep enforcing the prior term's limits even after a new
SetQueueAttributes had committed -- the invalidation only runs on the
leader that processed the commit, so a different leader's stale
buckets survive. Now compares cap/refill on every Load hit and
rebuilds (Delete + LoadOrStore) on mismatch.

Codex P1 -- invalidate only on actual throttle change (sqs_catalog.go,
sqs_throttle.go): cache invalidation in setQueueAttributes ran
unconditionally after every successful commit, including unrelated-
field updates and no-op writes. Result: any caller could silently
restore a noisy tenant's burst capacity by writing a no-op
SetQueueAttributes. Now gated on throttleAttributesPresent(in.Attributes)
which checks the request for any Throttle* key. Bucket reconciliation
above acts as the safety net if a future code path bypasses the gate.

Codex P2 -- attributesEqual covers Throttle (sqs_catalog.go):
CreateQueue idempotency relied on attributesEqual which did not
include Throttle*. A re-create with different limits was treated as
idempotent and silently kept the old limits. Now compares the full
Throttle struct via throttleConfigEqual; baseAttributesEqual extracted
to keep cyclop under the ceiling.

Gemini high -- thread throttle through existing meta load
(sqs_messages.go, sqs_throttle.go): chargeQueue did one Pebble read
per request even though the hot-path handlers (sendMessage,
receiveMessage) load the meta moments later. Added
chargeQueueWithThrottle that takes pre-loaded throttle config; both
hot-path handlers now load meta once and pass throttle in. Throttle
check now sits AFTER the QueueDoesNotExist branch so a missing queue
no longer consumes a token. Batch + delete handlers keep chargeQueue
(one extra meta read) -- low-QPS verbs where the simplification of
not pulling meta out of the retry loop is worth the per-call cost.

Gemini high -- move sweep off hot path (sqs.go, sqs_throttle.go):
maybeSweep ran the O(N) sync.Map.Range on whichever request was
unlucky enough to trigger the per-minute sweep, causing latency
spikes on many-queue clusters. Replaced with runSweepLoop on a
background ticker tied to s.reaperCtx (started in Run alongside the
existing message reaper, cleaned up by the same reaperCancel in
Stop). The hot-path charge() no longer calls into the sweep at all.

Gemini medium -- cap retry-after duration (sqs_throttle.go):
computeRetryAfter could compute a multi-day Retry-After (or worse,
overflow time.Duration arithmetic) for pathologically small
refillRate / large requested values. Capped at throttleRetryAfterCap
(1h, matching the bucket idle-evict window). Cap is applied before
the Duration multiplication so overflow is impossible.
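
A sketch of the capped computation with assumed names; the 1-second floor is an assumption (the unit tests mention a computeRetryAfter floor without pinning its value):

```go
package throttle

import "time"

// throttleRetryAfterCap matches the bucket idle-evict window.
const throttleRetryAfterCap = time.Hour

// computeRetryAfter caps the float seconds BEFORE converting to a
// Duration, so the multiplication below can never overflow.
func computeRetryAfter(requested, tokens, refillRate float64) time.Duration {
	secs := (requested - tokens) / refillRate
	if secs > throttleRetryAfterCap.Seconds() {
		secs = throttleRetryAfterCap.Seconds()
	}
	if secs < 1 {
		secs = 1 // assumed floor: never emit Retry-After: 0
	}
	return time.Duration(secs * float64(time.Second))
}
```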

New tests:
- TestBucketStore_ReconcilesBucketOnConfigChange pins the Codex P1
  reconciliation contract.
- TestComputeRetryAfter_CapsAtMaximum pins the Gemini medium cap.
- TestThrottleAttributesPresent covers the request-gate helper used
  by the conditional invalidation.

All tests pass under -race; golangci-lint clean.
…est (PR #679 round 2)

Two items from the round-2 Claude review on PR #679:

(1) bucketStore.sweepMu and bucketStore.lastSweep became dead code
when the move from on-hot-path maybeSweep to the background
runSweepLoop landed in round 1. The sweep() ticker is the only
caller, so the serialisation primitive that protected against
concurrent on-hot-path callers no longer has a job. Removed both
fields and the sweepMu.Lock/lastSweep=now/sweepMu.Unlock block;
sweep() now reads b.clock() inline. Field comment updated to
describe the post-runSweepLoop design (single-goroutine driver).

(2) TestSQSServer_CatalogCreateIsIdempotent now exercises
throttleConfigEqual via a fourth case: same name, same
VisibilityTimeout, but the second create adds Throttle* attributes.
attributesEqual must notice the diff and the call must reject as
QueueNameExists. Without this case a future bug in
throttleConfigEqual (e.g. always returning true) would slip past
the existing VisibilityTimeout-only test.

Both findings noted in the round-2 Claude review on PR #679.
Round-3 Claude review on PR #679 caught two stale comments left over
from the runSweepLoop refactor:

throttleIdleEvictAfter said "no goroutine; lookups call sweep()
opportunistically" -- both clauses are now false (runSweepLoop is
the goroutine, hot-path charge() never calls sweep()).

throttleEvictSweepEvery said "bounds how often the sweep runs from
the hot path" -- the hot path no longer runs sweep at all; this
constant is now just the background ticker interval.

Both updated to describe the post-runSweepLoop design and to
reinforce that the sweep cost is amortised across the goroutine
ticker rather than concentrated on whichever request was unlucky
enough to trigger it (which was the old behaviour Gemini high
flagged in PR #679 round 1).
…(PR #679 round 4)

Round-4 Claude review on PR #679 surfaced a TOCTOU race in the
loadOrInit reconciliation path the round-1 fix introduced.

The race: two concurrent goroutines hit the same stale bucket. A
detects mismatch, Delete + LoadOrStore stores freshA. B (which
also Loaded the stale bucket before A's Delete) then runs its own
Delete -- which under the previous unconditional Delete would
evict freshA. B's LoadOrStore then stores freshB. Net effect: a
charge() that obtained freshA via the in-flight LoadOrStore from
A's path keeps charging a bucket that is no longer in the map,
while subsequent requests get freshB at full capacity. Total
tokens consumed across the mismatch window can exceed the new
capacity by one bucket's worth.

Fix: replace the unconditional b.buckets.Delete(key) with
b.buckets.CompareAndDelete(key, v) where v is the original stale
pointer. CompareAndDelete is a no-op when the map already holds
someone else's fresh bucket, so the racer that arrives second
takes the winner's fresh bucket via LoadOrStore (which returns
the existing entry) instead of orphaning it.
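
A sketch of the reconciliation path with the fix applied (sync.Map.CompareAndDelete needs Go 1.20+); all names are assumed:

```go
package throttle

import "sync"

type bucketKey struct{ queue, action string }

type tokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	capacity   float64
	refillRate float64
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
}

// loadOrInit reconciles a cached bucket against the current config.
// CompareAndDelete(key, v) removes the entry only while the map still
// holds this exact pointer, so a racer's fresh bucket is never orphaned.
func (b *bucketStore) loadOrInit(key bucketKey, capacity, refill float64) *tokenBucket {
	if v, ok := b.buckets.Load(key); ok {
		bkt := v.(*tokenBucket)
		bkt.mu.Lock()
		matches := bkt.capacity == capacity && bkt.refillRate == refill
		bkt.mu.Unlock()
		if matches {
			return bkt
		}
		b.buckets.CompareAndDelete(key, v) // no-op if already replaced
	}
	fresh := &tokenBucket{capacity: capacity, refillRate: refill, tokens: capacity}
	actual, _ := b.buckets.LoadOrStore(key, fresh)
	return actual.(*tokenBucket)
}
```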

Test: TestBucketStore_ConcurrentReconciliationRespectsNewCapacity
seeds a stale bucket (cfgOld, drained), then races 200 goroutines
through a charge with cfgNew (capacity 50). Asserts exactly 50
successes -- a Delete-after-replace race would let some past the
cap. Runs under -race; the new test plus the existing
TestBucketStore_ConcurrentChargesPreserveCount and
TestBucketStore_ReconcilesBucketOnConfigChange cover the
concurrent-fresh-bucket, concurrent-stale-mismatch, and
sequential-mismatch cases.
Implements PR 2 of the §11 multi-PR rollout from
docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md. The schema
fields land on sqsQueueMeta with the design's wire types; the
validator enforces the §3.2 cross-attribute rules; a temporary
CreateQueue gate rejects PartitionCount > 1 until PR 5 lifts the
gate atomically with the data-plane fanout.

Schema (sqs_catalog.go):
- sqsQueueMeta gains three optional fields:
  * PartitionCount uint32 — number of FIFO partitions; 0/1 means
    legacy single-partition; > 1 enables HT-FIFO.
  * FifoThroughputLimit string — "perMessageGroupId" (default for
    HT-FIFO) or "perQueue" (collapses every group to partition 0).
  * DeduplicationScope string — "messageGroup" (default for HT-FIFO)
    or "queue" (legacy single-window).
- attributesEqual extended via baseAttributesEqual + htfifoAttributesEqual
  split (kept under cyclop ceiling).
- queueMetaToAttributes surfaces the configured HT-FIFO fields via
  addHTFIFOAttributes so GetQueueAttributes("All") round-trips.

Routing (sqs_partitioning.go, new):
- partitionFor(meta, messageGroupId) uint32 implements §3.3:
  FNV-1a over MessageGroupId & (PartitionCount - 1). Edge cases
  documented in the godoc — PartitionCount 0/1 → 0, FifoThroughputLimit
  perQueue short-circuits to 0, empty MessageGroupId → 0. A sketch follows this list.
- The bitwise-mask optimisation requires PartitionCount be a power
  of two; the validator enforces it. A future bug that leaks a
  non-power-of-two value is caught by TestPartitionFor_PowerOfTwoMaskingMatchesMod.
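
The routing sketch referenced in the first bullet, using the standard 64-bit FNV-1a constants and a trimmed stand-in for sqsQueueMeta; the inlined loop mirrors the hot-path change recorded in a later commit rather than the hash/fnv package:

```go
package adapter

const (
	fnvOffset64 uint64 = 14695981039346656037
	fnvPrime64  uint64 = 1099511628211
)

type sqsQueueMeta struct {
	PartitionCount      uint32
	FifoThroughputLimit string
}

// partitionFor mirrors the §3.3 routing described above. The inlined
// FNV-1a loop avoids the hash.Hash allocation on the send hot path.
func partitionFor(meta *sqsQueueMeta, messageGroupID string) uint32 {
	if meta == nil || meta.PartitionCount <= 1 {
		return 0 // legacy single-partition layout
	}
	if meta.FifoThroughputLimit == "perQueue" {
		return 0 // perQueue collapses every group to partition 0
	}
	if messageGroupID == "" {
		return 0
	}
	h := fnvOffset64
	for i := 0; i < len(messageGroupID); i++ {
		h ^= uint64(messageGroupID[i])
		h *= fnvPrime64
	}
	// The validator guarantees PartitionCount is a power of two, so
	// the mask is equivalent to h % PartitionCount.
	return uint32(h & uint64(meta.PartitionCount-1))
}
```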

Validation (sqs_partitioning.go):
- validatePartitionConfig enforces the §3.2 cross-attribute rules (sketched after this list):
  * PartitionCount must be a power of two in [1, htfifoMaxPartitions=32].
  * FifoThroughputLimit / DeduplicationScope are FIFO-only.
  * {PartitionCount > 1, DeduplicationScope = "queue"} rejects with
    InvalidParameterValue (incoherent params, vs InvalidAttributeValue
    for malformed individual values) per §3.2 cross-attribute gate.
- Runs in parseAttributesIntoMeta (after resolveFifoQueueFlag, so
  the IsFIFO check sees the post-resolution flag) and in
  trySetQueueAttributesOnce (after applyAttributes; IsFIFO comes
  from the loaded meta).
- validatePartitionImmutability is the §3.2 rule that PartitionCount,
  FifoThroughputLimit, and DeduplicationScope are immutable from
  CreateQueue. SetQueueAttributes is all-or-nothing: a request that
  touches a mutable attribute alongside an attempted immutable
  change rejects the whole request before persisting either.
- snapshotImmutableHTFIFO captures the pre-apply values so the
  immutability check has a clean before/after pair.
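
A sketch of the validator's decision tree; the parameter shape and error text are illustrative, not the real signatures:

```go
package adapter

import "errors"

const htfifoMaxPartitions = 32

func isPowerOfTwo(n uint32) bool {
	return n != 0 && n&(n-1) == 0
}

// validatePartitionConfig sketches the §3.2 rules named above.
func validatePartitionConfig(isFIFO bool, count uint32, throughputLimit, dedupScope string) error {
	if count > 1 && (count > htfifoMaxPartitions || !isPowerOfTwo(count)) {
		return errors.New("PartitionCount must be a power of two in [1, 32]")
	}
	if !isFIFO && (throughputLimit != "" || dedupScope != "") {
		return errors.New("FifoThroughputLimit / DeduplicationScope are FIFO-only")
	}
	if count > 1 && dedupScope == "queue" {
		// Incoherent params: InvalidParameterValue, per the cross-attribute gate.
		return errors.New("PartitionCount > 1 requires DeduplicationScope=messageGroup")
	}
	return nil
}
```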

Dormancy gate (sqs_partitioning.go + sqs_catalog.go):
- validatePartitionDormancyGate is the §11 PR 2 temporary gate that
  rejects CreateQueue with PartitionCount > 1 with
  InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data
  plane — not yet enabled"). The schema field exists in the meta
  type but no partitioned data can land. Removed in PR 5 in the same
  commit that wires the data-plane fanout — gate-and-lift is atomic
  so a half-deployed cluster can never accept a partitioned queue
  without the data plane to serve it.

Tests:
- adapter/sqs_partitioning_test.go: 14 unit tests covering
  partitionFor (legacy zero/one, perQueue short-circuit, empty
  MessageGroupId, determinism, distribution within ±5% across 8
  partitions on 100k samples, power-of-two masking < N), isPowerOfTwo,
  validatePartitionConfig (power-of-two, max cap, FIFO-only,
  cross-attr InvalidParameterValue, single-partition + queue-dedup OK),
  validatePartitionDormancyGate (rejects > 1, allows 0/1),
  validatePartitionImmutability (per-attribute change rejects, same-
  value no-op succeeds), htfifoAttributesPresent.
- adapter/sqs_partitioning_integration_test.go: 7 end-to-end tests
  against a real createNode cluster covering the wire surface:
  dormancy gate rejects PartitionCount > 1 (and the gate's reason
  surfaces to the operator), allows PartitionCount=1, validator
  rejects non-power-of-two, FIFO-only rejection on Standard queue,
  cross-attr {partitions, queue dedup} rejects, immutability rejects
  SetQueueAttributes change with same-value no-op succeeds, all-or-
  nothing — combined mutable+immutable rejects entirely without
  persisting the mutable change, GetQueueAttributes round-trip.
- All tests pass under -race; golangci-lint clean.

Out of scope for this PR (§11 PR 3-8):
- PR 3: keyspace threading — partitionIndex through every sqsMsg*Key
  constructor, defaulting to 0 so existing queues stay byte-identical.
  Gate from PR 2 still in place.
- PR 4: routing layer + --sqsFifoPartitionMap flag + mixed-version
  gate (§8.5 capability advertisement + §8 leadership-refusal hook
  in kv/lease_state.go). Gate still in place.
- PR 5: send/receive partition fanout, receipt-handle v2, removes
  the dormancy gate atomically with the data-plane fanout.
- PR 6: PurgeQueue/DeleteQueue partition iteration + tombstone +
  reaper.
- PR 7: Jepsen HT-FIFO workload + metrics.
- PR 8: doc lifecycle bump.
Three medium-priority Gemini findings + one cyclop fix from the
rebase onto the updated PR #679 throttling branch.

(1) sqsErrInvalidParameterValue removed (sqs_catalog.go,
sqs_partitioning.go, sqs_partitioning_test.go): the constant
duplicated sqsErrValidation which already carries
"InvalidParameterValue". Reuses the existing constant; comment on
the validator clarifies the choice.

(2) snapshotImmutableHTFIFO heap allocation gated on
htfifoAttributesPresent (sqs_catalog.go): the snapshot is only
needed when the request actually carries an HT-FIFO attribute. A
SetQueueAttributes that touches only mutable attributes (e.g.
VisibilityTimeout — the common path) now skips the allocation
entirely. The validator short-circuit on preApply == nil keeps
correctness identical: a request with no HT-FIFO attribute can
never change an HT-FIFO field.

(3) partitionFor inlined FNV-1a (sqs_partitioning.go): the
hash/fnv.New64a + h.Write([]byte(messageGroupID)) path heap-
allocates the byte slice and pays the hash.Hash interface dispatch
on every SendMessage. Replaced with a manual FNV-1a loop over the
string — measurably faster on the routing hot path. MessageGroupId
is capped at 128 chars by validation so the loop is bounded.

(4) cyclop (trySetQueueAttributesOnce): the conditional preApply
branch + immutability check + Throttle validator pushed cyclop to
12. Extracted applyAndValidateSetAttributes so the function stays
under the ceiling; the helper is also useful for any future
admin-surface callers that want the same apply+validate flow
without the OCC dispatch.

All tests pass under -race; golangci-lint clean.
…eanup

Three items from the round-2 Claude review on PR #681:

(1) Bug (latent, live in PR 5): validatePartitionConfig only guarded
FifoThroughputLimit / DeduplicationScope as FIFO-only, not
PartitionCount > 1. A Standard queue with PartitionCount=2 would
slip past the validator after PR 5 lifts the dormancy gate. Added
a guard inside the !meta.IsFIFO block: PartitionCount > 1 →
InvalidAttributeValue("PartitionCount > 1 is only valid on FIFO
queues"). PartitionCount 0 / 1 stay accepted on Standard (both
mean "single-partition layout"). Test
TestValidatePartitionConfig_StandardQueueRejectsHTFIFOAttrs
extended to cover PartitionCount=2/4/8/16/32 → reject and
PartitionCount=0/1 → accept.

(2) Low: TestSQSServer_HTFIFO_RejectsQueueScopedDedupOnPartitioned
had two contradictory comments. The function-level doc said "Test
sets PartitionCount=1 to bypass dormancy" but the test actually
sends PartitionCount=2; the inline comment claimed "dormancy fires
first" but validatePartitionConfig (inside parseAttributesIntoMeta)
runs before validatePartitionDormancyGate, so the cross-attr rule
fires first. Rewrote the doc to describe the actual control-plane
order (validatePartitionConfig → cross-attr rejection → wire 400)
and removed the misleading "dormancy fires first" assertion.

(3) Nit: replaced the custom 11-line contains/indexOf helpers with
strings.Contains. Identical behaviour, idiomatic.

All findings noted in the round-2 Claude review on PR #681.
…2) (#681)

## Summary

Phase 3.D **PR 2** (schema + validators + dormancy gate) per the §11
multi-PR rollout in
[`docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md`](https://github.com/bootjp/elastickv/blob/docs/sqs-phase3-proposals/docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md)
(currently on PR #664).

Stacks on **PR #679 (Phase 3.C)**: branched off
`feat/sqs-throttling-phase3c` so the 3.D schema work shares the same
base as the throttling implementation. Will be rebased once PR #679
merges to main.

- `sqsQueueMeta` gains `PartitionCount uint32`, `FifoThroughputLimit
string`, `DeduplicationScope string`.
- `partitionFor(meta, messageGroupId) uint32` implements §3.3 — FNV-1a
over `MessageGroupId`, masked by `PartitionCount-1` (validator-enforced
power-of-two so the bitwise mask is equivalent to mod). Edge cases:
`PartitionCount` 0/1 → 0, `FifoThroughputLimit=perQueue` short-circuits
to 0, empty `MessageGroupId` → 0.
- `validatePartitionConfig` enforces the §3.2 cross-attribute rules:
power-of-two `[1, htfifoMaxPartitions=32]`,
`FifoThroughputLimit`/`DeduplicationScope` are FIFO-only,
`{PartitionCount > 1, DeduplicationScope = "queue"}` rejects with
`InvalidParameterValue` at control-plane time so the operator sees the
error before the first `SendMessage`.
- `validatePartitionImmutability` enforces the §3.2 immutability rule
(the three HT-FIFO fields are immutable from `CreateQueue` onward).
`SetQueueAttributes` is **all-or-nothing** — a request that touches a
mutable attribute alongside an attempted immutable change rejects the
whole request before persisting either.
- `validatePartitionDormancyGate` is the §11 PR 2 temporary gate:
`CreateQueue` with `PartitionCount > 1` rejects with
`InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data plane —
not yet enabled")`. The schema field exists in the meta type but no
partitioned data can land. Removed in PR 5 in the same commit that wires
the data-plane fanout — gate-and-lift is **atomic** so a half-deployed
cluster can never accept a partitioned queue without the data plane to
serve it.
- `GetQueueAttributes("All")` round-trips the configured HT-FIFO fields
via `addHTFIFOAttributes`.

## Test plan

- [x] `adapter/sqs_partitioning_test.go` — 14 unit tests covering
`partitionFor` (legacy zero/one, perQueue short-circuit, empty
MessageGroupId, determinism, distribution within ±5% across 8 partitions
on 100k samples, power-of-two masking always < N), `isPowerOfTwo`,
`validatePartitionConfig` (power-of-two, max cap, FIFO-only, cross-attr
`InvalidParameterValue`, single-partition + queue-dedup OK),
`validatePartitionDormancyGate` (rejects > 1, allows 0/1, gate reason
surfaces to the operator), `validatePartitionImmutability`
(per-attribute change rejects, same-value no-op succeeds),
`htfifoAttributesPresent`.
- [x] `adapter/sqs_partitioning_integration_test.go` — 7 end-to-end
tests: dormancy gate rejects `PartitionCount > 1` (and the gate's "not
yet enabled" message surfaces), allows `PartitionCount=1`, validator
rejects non-power-of-two, FIFO-only rejection on Standard queue,
cross-attr `{PartitionCount=2, DeduplicationScope=queue}` rejects,
immutability `SetQueueAttributes` change rejects with same-value no-op
succeeding, all-or-nothing combined mutable+immutable rejects without
persisting the mutable change, `GetQueueAttributes` round-trip omits
unset fields.
- [x] All tests pass under `-race`.
- [x] `golangci-lint run ./adapter/...` → 0 issues.

## Self-review (5 lenses)

1. **Data loss** — Schema fields land on `sqsQueueMeta` and are
persisted via the existing OCC `Put` on `sqsQueueMetaKey`. The dormancy
gate runs at `CreateQueue` admission time; once PR 5 lifts the gate, no
schema migration is needed (existing queues have `PartitionCount=0`
which is equivalent to 1). The keyspace stays untouched in this PR (PR 3
threads `partitionIndex` through `sqsMsg*Key`).
2. **Concurrency / distributed failures** — Validation is purely
function-of-input; no shared state. The immutability check loads the
meta in the same `nextTxnReadTS` snapshot that `applyAttributes` uses,
so a concurrent `SetQueueAttributes` from another writer is caught by
the existing OCC `StartTS + ReadKeys` fence (already in place for
`setQueueAttributesWithRetry`). Dormancy gate runs at `CreateQueue`
admission so a concurrent gate-lift mid-request is impossible.
3. **Performance** — `partitionFor` is one FNV-1a hash + one bitwise
mask; constant time. `validatePartitionConfig` is field-comparison only,
runs once per `CreateQueue` / `SetQueueAttributes`.
`addHTFIFOAttributes` adds at most three map writes and only when fields
are non-empty. Hot path `SendMessage` / `ReceiveMessage` is unaffected
by this PR (no routing changes — those land in PR 5).
4. **Data consistency** — Cross-attribute rule `{partitions, queue
dedup}` rejects at control-plane time so an operator can never end up
with a created-but-unserviceable queue (the runtime path that would have
surfaced the same error at first send is unreachable). Immutability rule
prevents mid-life routing-key reconfiguration that would silently
violate within-group FIFO ordering.
5. **Test coverage** — 14 unit + 7 integration tests cover the
validator's full decision tree (each rule has a dedicated table-driven
case), the dormancy gate's accept/reject paths with the operator-visible
reason, the immutability all-or-nothing semantics, and `partitionFor`'s
edge cases plus a 100k-sample distribution test that catches non-uniform
routing immediately. Lint is clean.

## Out of scope (subsequent §11 rollout PRs)

- **PR 3** — keyspace threading: `partitionIndex` through every
`sqsMsg*Key` constructor, defaulting to 0 so existing queues stay
byte-identical. Dormancy gate from this PR remains in place.
- **PR 4** — routing layer + `--sqsFifoPartitionMap` flag +
mixed-version gate (§8.5 capability advertisement + §8
leadership-refusal hook in `kv/lease_state.go`). Dormancy gate still in
place.
- **PR 5** — send/receive partition fanout + receipt-handle v2 codec.
**Removes the dormancy gate atomically with the data-plane fanout** so
the gate-and-lift land in one commit.
- **PR 6** — `PurgeQueue` / `DeleteQueue` partition iteration +
tombstone schema update + reaper.
- **PR 7** — Jepsen HT-FIFO workload + metrics.
- **PR 8** — partial-doc lifecycle bump.

## Branch base

Branched off `feat/sqs-throttling-phase3c` (PR #679) which itself
branches off `docs/sqs-phase3-proposals` (PR #664). Will rebase against
`main` after both ancestors merge.
…round 5)

Two findings from the round-5 Codex review:

(1) P1 — Default* batch-capacity floor:
validateThrottleConfig disabled the >=10 batch floor for
ThrottleDefault* on the assumption that Default never serves a batch
verb. But resolveActionConfig in sqs_throttle.go falls Send and
Receive traffic through to the Default bucket whenever the
corresponding Send*/Recv* pair is unset. So a config like
ThrottleDefaultCapacity=5, ThrottleDefaultRefillPerSecond=1 was
accepted at the validator, then made every full SendMessageBatch /
DeleteMessageBatch (charge=10) permanently unserviceable because the
Default bucket could never accumulate 10 tokens. Flipped the third
validateThrottlePair call's requireBatchCapacity from false to true;
updated the inline comment to spell out why Default* needs the floor
(the design doc note in §3.2 about Default* being exempt was wrong
about the fall-through). New unit test
TestValidateThrottleConfig_DefaultBucketBatchFloor covers
DefaultCapacity=1/5 reject and DefaultCapacity=10 accept.

(2) P2 — sweep eviction race:
Earlier sweep computed idle under bucket.mu, released the lock,
then unconditionally Deleted the entry. A concurrent charge() in
that window could refill+take a token, then Delete evicted the
just-used bucket and the next request got a fresh full-cap bucket
— the just-taken token was effectively undone, allowing excess
throughput. Fixed in two parts:

- Hold bucket.mu across the Delete so the idle observation cannot
  be invalidated between check and delete. A concurrent charge()
  that wants the bucket either runs to completion before sweep
  acquires mu (sweep then sees the freshened lastRefill and skips
  the delete) or waits for sweep to release mu (charge then takes
  the bucket — but sweep has already removed it from the map, so
  the charge succeeds against an orphan bucket and only the
  next-after-charge request gets the full-cap rebuild — bounded
  one-token leak).
- CompareAndDelete(k, v) ensures sweep does not evict a
  replacement bucket inserted by invalidateQueue or a future
  reconciliation path.

Holding bucket.mu across sync.Map.Delete is safe — sync.Map.Load
is wait-free on the read-only path and never blocks on bucket.mu,
so there is no AB-BA cycle with charge().
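
A sketch of the resulting sweep ordering with assumed names; only the lock placement and the CompareAndDelete guard come from the text above:

```go
package throttle

import (
	"sync"
	"time"
)

const throttleIdleEvictAfter = time.Hour

type tokenBucket struct {
	mu         sync.Mutex
	lastRefill time.Time
}

type bucketStore struct {
	buckets sync.Map
}

// sweep keeps the idle observation and the map delete in one bucket.mu
// critical section; CompareAndDelete refuses to evict a replacement
// bucket inserted by invalidateQueue or a reconciliation path.
func (b *bucketStore) sweep(now time.Time) {
	b.buckets.Range(func(k, v any) bool {
		bkt := v.(*tokenBucket)
		bkt.mu.Lock()
		if now.Sub(bkt.lastRefill) >= throttleIdleEvictAfter {
			b.buckets.CompareAndDelete(k, v)
		}
		bkt.mu.Unlock()
		return true
	})
}
```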

New regression test TestBucketStore_SweepDoesNotEvictRecentlyUsedBucket
backdates a bucket's lastRefill by 2h, advances the clock by 2h to
force the bucket past the idle cutoff, then races sweep against a
50-iteration charge loop. After both finish the bucket must still
be in the store — at least one charge interleaves with sweep's
re-check window and sweep observes the freshened lastRefill.

All tests pass under -race; golangci-lint clean.
… round 6)

Round 5's sweep+CompareAndDelete fix closed half the eviction race —
sweep can no longer evict a bucket whose lastRefill was freshened by a
concurrent charge — but left a second window. Goroutines that loaded
the bucket from sync.Map *before* sweep removed it from the map were
still blocked on bucket.mu, and once sweep released mu they would
charge the orphan bucket while later requests created a fresh
full-capacity entry via loadOrInit. A burst aligned with a sweep tick
could therefore consume up to 2x the configured capacity.

Three changes plug the window:

- tokenBucket gains an `evicted bool` field, protected by mu.
- sweep, invalidateQueue, and the loadOrInit reconciliation path all
  set evicted=true under mu *after* successfully removing the bucket
  from the map (CompareAndDelete success or LoadAndDelete return).
- charge() loops on a chargeBucket helper that checks evicted under
  mu and signals retry; the caller drops the orphan reference and
  loops back through loadOrInit, which converges on whatever live
  bucket the map now holds (see the sketch after this list).
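
A sketch of the evicted-flag retry referenced above; the retry bound of 4 and the fail-open fallback are taken from a later review note, and refill/config handling is omitted for brevity:

```go
package throttle

import "sync"

type bucketKey struct{ queue, action string }

type tokenBucket struct {
	mu      sync.Mutex
	tokens  float64
	evicted bool
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
}

// loadOrInit is trimmed: a real bucket would start at full capacity
// computed from the queue meta.
func (b *bucketStore) loadOrInit(key bucketKey) *tokenBucket {
	fresh := &tokenBucket{}
	actual, _ := b.buckets.LoadOrStore(key, fresh)
	return actual.(*tokenBucket)
}

// tryCharge reports retry=true when the bucket was evicted between the
// caller's map Load and this mu acquisition.
func (t *tokenBucket) tryCharge(n float64) (ok, retry bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.evicted {
		return false, true
	}
	if t.tokens < n {
		return false, false
	}
	t.tokens -= n
	return true, false
}

// charge drops orphan references and converges on the live map entry.
func (b *bucketStore) charge(key bucketKey, n float64) bool {
	for i := 0; i < 4; i++ { // assumed retry bound
		ok, retry := b.loadOrInit(key).tryCharge(n)
		if !retry {
			return ok
		}
	}
	return true // fail-open rather than spin forever
}
```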

CreateQueue now calls invalidateQueue on a genuine commit
(tryCreateQueueOnce post-Dispatch) — DeleteQueue invalidated, but a
sendMessage holding pre-delete meta could recreate a bucket between
DeleteQueue's invalidate and the next CreateQueue commit, so a same-
name recreate would inherit used token state. Invalidating after the
genuine create resets the bucket regardless of in-flight traffic to
the prior incarnation. The idempotent-return path (existing queue
with identical attributes) deliberately skips invalidation.

Tests:

- TestBucketStore_SweepRaceDoesNotInflateBudget: -race stress test
  that runs 4 sweeps against 64 chargers on a single backdated
  bucket; total successful charges must stay <= capacity. The old
  code path could yield up to 2x capacity.
- TestBucketStore_OrphanedBucketRetriesToLiveEntry: deterministic
  exercise of the evicted-flag retry — sweep evicts, the next
  charge allocates a fresh bucket via loadOrInit.
- TestBucketStore_InvalidateMarksOrphanEvicted: pins the round-6
  invalidateQueue change (LoadAndDelete + flag set under mu).
- The pre-existing TestBucketStore_SweepDoesNotEvictRecentlyUsedBucket
  is replaced; the prior assertion (stillThere) was vacuously true
  because the for-range 50 charge loop would always recreate the
  bucket via loadOrInit after any eviction (Claude Low on round 5).

go test -race -count=20 of the new tests passes; full SQS test
package passes under -race; golangci-lint clean.

Round-6 review pulled four other items not actioned:

- Codex P2 "CreateQueue idempotency must compare Throttle" — already
  addressed: attributesEqual at sqs_catalog.go:726 calls
  baseAttributesEqual + throttleConfigEqual + htfifoAttributesEqual.
- Gemini high "chargeQueue performs redundant storage read" — already
  addressed: hot-path handlers (sendMessage, receiveMessage) call
  chargeQueueWithThrottle to skip the second meta load.
- Gemini high "maybeSweep on hot path" — already addressed:
  runSweepLoop runs sweep on a background ticker wired to reaperCtx.
- Gemini medium "cap retryAfter at 10ms to prevent overflow" —
  already addressed: throttleRetryAfterCap=1h cap before multiply
  prevents overflow; 10ms cap is inappropriate for SQS rate-limit
  semantics (clients would retry constantly under contention).
Round 6's invalidateQueue used LoadAndDelete-then-lock — a charger
that loaded the pointer pre-LoadAndDelete could win the mu race and
charge the just-removed bucket while evicted was still false (Codex
P2 on round 6). The fix mirrors sweep's ordering:

  1. Load the pointer from the map (without deleting).
  2. Acquire bucket.mu.
  3. CompareAndDelete with the loaded v (rejects a replacement
     bucket inserted by a concurrent reconciliation).
  4. Set evicted=true under mu.
  5. Unlock.

A charger blocked on mu now sees evicted=true on entry and retries
against the live entry, instead of charging the orphan.
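
The five steps, sketched for a single key with assumed names:

```go
package throttle

import "sync"

type bucketKey struct{ queue, action string }

type tokenBucket struct {
	mu      sync.Mutex
	evicted bool
}

type bucketStore struct {
	buckets sync.Map
}

// invalidate follows the numbered ordering above for one key.
func (b *bucketStore) invalidate(key bucketKey) {
	v, ok := b.buckets.Load(key) // 1. load the pointer, do not delete yet
	if !ok {
		return
	}
	bkt := v.(*tokenBucket)
	bkt.mu.Lock()                           // 2. acquire bucket.mu
	if b.buckets.CompareAndDelete(key, v) { // 3. rejects a replacement bucket
		bkt.evicted = true // 4. flag set under mu, so blocked chargers retry
	}
	bkt.mu.Unlock() // 5.
}
```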

Replaced the over-strict TestBucketStore_InvalidateRace... with
TestBucketStore_InvalidateUnderConcurrencyIsRaceFree: invalidate's
reset-to-full-capacity is the *intended* semantics of an invalidation
event (DeleteQueue / SetQueueAttributes / CreateQueue all want the
new policy applied to a fresh bucket), so the post-invalidate fresh
bucket can legitimately absorb up to capacity additional tokens —
that 2x window is structural, not a bug. Asserting successful-charge
count <= capacity would also pass on the buggy code path under most
schedulings, so the assertion would not differentiate old vs new.
The deterministic TestBucketStore_InvalidateMarksOrphanEvicted
already pins the actual mechanism (evicted=true under mu on dropped
bucket); the new stress test exists to surface any -race finding the
new lock ordering might introduce.

go test -race -count=20 of every TestBucketStore_* test passes;
golangci-lint clean.
… round 6.2)

validatePartitionConfig documents PartitionCount=0 (unset) and =1
(explicit single partition) as semantically identical legacy /
single-partition routing, but htfifoAttributesEqual was using strict
numeric equality. So a queue created without PartitionCount (stored
as 0) followed by a CreateQueue retry that explicitly passes
PartitionCount=1 was rejected as "different attributes" — broken
idempotency for any provisioning flow that normalises defaults
differently between attempts. (Codex P2 on PR #679 round 6.1.)

Fix: normalisePartitionCount maps 0 -> 1 before comparison; values >1
pass through unchanged so HT-FIFO partition counts still require
exact match. The on-disk schema is unchanged.
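
A sketch of the helper and the comparison it feeds; the wrapper name is illustrative:

```go
package adapter

// normalisePartitionCount maps the unset sentinel (0) to the explicit
// single-partition value (1); values above 1 pass through unchanged so
// real HT-FIFO counts still require an exact match.
func normalisePartitionCount(n uint32) uint32 {
	if n == 0 {
		return 1
	}
	return n
}

// Equality checks on the idempotency path then compare normalised
// values, so a retry passing PartitionCount=1 against a stored 0 is
// still idempotent.
func partitionCountUnchanged(stored, requested uint32) bool {
	return normalisePartitionCount(stored) == normalisePartitionCount(requested)
}
```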

Tests:

- TestHTFIFOAttributesEqual_PartitionCountZeroAndOneEquivalent —
  pins (0, 1) equality, symmetry, and that real divergence (2 vs 0,
  2 vs 1) still rejects.
- TestNormalisePartitionCount — pins the helper directly.

go test -race ./adapter/... pass; golangci-lint clean.
#679 round 6.3)

Round 6.2 normalised the CreateQueue idempotency path
(htfifoAttributesEqual) but left the SetQueueAttributes immutability
path (validatePartitionImmutability) using strict equality. Same
0 == 1 gap (Claude Low on round 6.2):

  1. CreateQueue with no PartitionCount -> stored as 0.
  2. SetQueueAttributes(PartitionCount=1) -> requested.PartitionCount = 1.
  3. validatePartitionImmutability: 0 != 1 -> rejects "PartitionCount
     is immutable", even though the partition layout is unchanged.

Fix: validatePartitionImmutability now normalises both sides through
normalisePartitionCount before comparison, mirroring the round-6.2
fix. The narrow gap surfaced under the §11 PR 2 dormancy gate
(SetQueueAttributes does not call the gate, so this path is
reachable on existing queues), so the fix lands now rather than
being deferred to PR 5.

Test:

- TestValidatePartitionImmutability_PartitionCountZeroAndOneEquivalent
  table-driven over all (stored, requested) ∈ {0,1,2,4,8} pairs;
  no-op pairs accept, real-change pairs reject. Reverting either
  call to strict equality flips the relevant rows.

go test -race ./adapter/... pass; golangci-lint clean.
## Summary

Phase 3.C implementation: per-queue rate limiting via token-bucket
throttling. Implements the design from
[`docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md`](https://github.com/bootjp/elastickv/blob/docs/sqs-phase3-proposals/docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md)
(currently on PR #664).

- Token-bucket store on `*SQSServer`: `sync.Map[bucketKey]*tokenBucket`,
per-bucket `sync.Mutex` so cross-queue traffic never serialises on a
process-wide lock. Lazy idle-evict (1h) bounds memory.
- New `sqsQueueMeta.Throttle *sqsQueueThrottle` field with six float64
sub-fields (Send/Recv/Default × Capacity/RefillPerSecond), wired through
the existing `sqsAttributeAppliers` dispatch as `ThrottleSendCapacity` /
`ThrottleSendRefillPerSecond` / etc.
- Validator (`validateThrottleConfig`) enforces the §3.2 cross-attribute
rules: each (capacity, refill) pair both-zero or both-positive; capacity
≥ refill; Send/Recv capacities ≥ 10 (the SendMessageBatch /
DeleteMessageBatch max charge); per-field hard ceiling 100k.
- All seven message-plane handlers wired (SendMessage[Batch],
ReceiveMessage, DeleteMessage[Batch], ChangeMessageVisibility[Batch]).
Throttle check sits **outside** the OCC retry loop per §4.2 — a rejected
request never reaches the coordinator.
- On rejection: `400 Throttling` with `x-amzn-ErrorType` + AWS-shaped
JSON envelope + `Retry-After` header computed from the §3.4 formula
(numerator is requested count, not 1, so a batch verb does not get told
to retry in 1s when it really needs 10s).
- Cache invalidation on `SetQueueAttributes` and `DeleteQueue` (§3.1):
drops every bucket belonging to the queue after the Raft commit so new
limits take effect on the very next request, not after the 1h idle-evict
sweep.
- `GetQueueAttributes("All")` round-trips Throttle* fields so SDKs can
confirm the config landed.

## Test plan

- [x] `adapter/sqs_throttle_test.go` — 18 unit tests:
- bucket math: fresh capacity, refill elapsed, refill cap-at-capacity,
batch reject preserves partial credit, Retry-After uses requested count,
sub-1-RPS floor;
- isolation: per-action, per-queue, Default-fallthrough shares one
bucket;
  - lifecycle: `invalidateQueue` drops all action keys;
  - concurrency: `-race` clean, exactly-capacity successes;
  - default-off short-circuit;
- validator: nil/empty canonicalisation, both-zero-or-both-positive,
capacity ≥ 10 batch floor, Default* exempt, capacity ≥ refill,
`parseThrottleFloat` range checks, `computeRetryAfter` floor.
- [x] `adapter/sqs_throttle_integration_test.go` — 8 end-to-end tests
against a real `createNode` cluster:
  - default-off allows unbounded;
- send/recv reject after capacity with correct envelope + `Retry-After`;
  - batch charges by entry count;
  - `SetQueueAttributes` invalidation (raise-and-immediate-success);
  - `DeleteQueue` + `CreateQueue` lifecycle invalidation;
  - `GetQueueAttributes` round-trip;
  - validator rejects below batch min.
- [x] `golangci-lint run` clean.
- [x] `go test -race -run
"TestBucketStore|TestValidateThrottleConfig|TestParseThrottleFloat|TestComputeRetryAfter|TestSQSServer_Throttle"
./adapter/...` clean.

## Self-review (5 lenses)

1. **Data loss** — Throttle config is on `sqsQueueMeta`, persisted via
the existing `setQueueAttributesWithRetry` OCC commit. Bucket state is
per-process and never written to storage, by design (§3.1: replicating
bucket state would cost a Raft commit per token decrement, defeating the
bucket). On leader failover the new leader rebuilds at full capacity —
same as the AWS region-failover semantic. No write path can lose
throttle config.
2. **Concurrency** — Per-bucket `sync.Mutex`, never held across
`sync.Map` ops. `LoadOrStore` race on first insert is safe (both racers
compute identical config from the same meta snapshot). Concurrent
`SetQueueAttributes` + in-flight charge: the charge may briefly use the
old bucket; the next request rebuilds from the fresh meta. `-race` test
pins the count invariant: exactly `capacity` successes out of N
concurrent goroutines, never more.
3. **Performance** — Hot path on unconfigured queues is one `nil` check
(`Throttle == nil`) → return. On configured queues: one `sync.Map.Load`
(lock-free), one mutex acquire on the bucket. No global lock.
`sync.Map`'s read-mostly optimisation matches the access pattern. The
lazy idle-evict sweep runs at most once per minute from the hot path so
a many-queue cluster does not pay an O(N) cost per request.
4. **Data consistency** — Throttling sits outside OCC; a rejected
request never touches the coordinator (§4.2). The OCC retry loop in
`sendMessageWithRetry` cannot busy-loop on a permanent rate-limit
failure. Cache invalidation after the Raft commit (not before) so the
rebuilt bucket reads the freshly committed config — there is no window
where a request can read a stale bucket built from the old config.
5. **Test coverage** — 18 unit tests cover bucket math, isolation,
lifecycle, concurrency, default-off; 8 integration tests cover the
end-to-end wire path. Both pass under `-race`. Validator has a dedicated
table-driven test per rule.

## Out of scope (deferred to follow-ups)

- **Prometheus metrics** (counter
`sqs_throttled_requests_total{queue,action}` + gauge
`sqs_throttle_tokens_remaining{queue,action}` per the design's §4.1
monitoring/registry.go entry). The bucket store's `chargeOutcome`
already exposes `tokensAfter` so the gauge wiring is one site; the
counter is a one-line increment after `writeSQSThrottlingError`. Punted
to keep this PR focused on data-plane behaviour.
- **§6.5 cross-protocol (Query) parity test** — the Query protocol layer
lives on PR #662 (separate branch); the throttle envelope already has
the same shape on both protocols (`writeSQSError` handles both), so the
test is a follow-up after #662 lands.
- **§6.6 failover behaviour test** — the §3.1 "fresh bucket on failover"
contract is implementation-correct (the bucket map is per-process, no
Raft state) but a 3-node failover test needs cluster scaffolding and is
best added with the Jepsen workload.

## Branch base

Branched off
[`docs/sqs-phase3-proposals`](https://github.com/bootjp/elastickv/tree/docs/sqs-phase3-proposals)
(PR #664) so the design doc is part of the reviewable surface. Will
rebase against `main` after #664 merges.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 5m 42s.


Review

  • Read current state of both design documents
  • Check open Codex P1: "fresh bucket on failover" correctness (3.C §3.1)
  • Review implementation files added to the PR
  • Check for new issues
  • Overall assessment

Context

Since the last review round, two implementation PRs have landed on this branch: PR #679 (feat(sqs): per-queue throttling, Phase 3.C) and PR #681 (feat(sqs): HT-FIFO schema + validators + dormancy gate, Phase 3.D PR 2).

This changes the scope of this PR review from docs-only to docs + implementation.


Open Codex P1 from last round: RESOLVED in code, STILL OPEN in doc

Codex P1: §3.1 "fresh bucket on failover" is incorrect for re-elected nodes

The implementation resolves this via lazy config reconciliation in loadOrInit (adapter/sqs_throttle.go:272–328):

```go
// On Load hit, compare cached config to current meta config.
bucket.mu.Lock()
matches := bucket.capacity == capacity && bucket.refillRate == refill
bucket.mu.Unlock()
if matches {
    return bucket
}
// Config mismatch: stale bucket — CompareAndDelete + evicted=true,
// then fall through to LoadOrStore with a fresh full-capacity bucket.
```

When node A re-acquires leadership after a SetQueueAttributes commit on node B:

  1. Node A's charge() calls loadOrInit(queue, action, capacity=NEW, refill=NEW) where the config comes from the freshly-read meta.
  2. loadOrInit finds node A's stale bucket (capacity=OLD).
  3. Mismatch detected → stale bucket evicted (CompareAndDelete + evicted=true) → fresh full-capacity bucket under the new config.

The evicted=true flag (see tokenBucket.evicted) ensures any concurrent chargeBucket goroutine holding the stale bucket retries against the live entry (chargeBucket returns retry=true on evicted, charge loops up to 4 times).

However, the design doc §3.1 line 118 has not been updated:

"On leader failover, a fresh bucket starts at full capacity on the new leader — there is no Raft replication of bucket state. Why this is correct: the worst-case behaviour of 'fresh bucket on failover' is that a noisy queue gets one extra burst worth of bandwidth right after a leader change."

This claim is only accurate for first-time-leader failover. For re-elected nodes, the path is reconciliation, not a fresh bucket. The doc should describe the two paths:

  1. First-time leader (no existing bucket): loadOrInit miss → fresh bucket at full capacity under current meta config. ✓ (doc is correct for this case)
  2. Re-elected node with stale bucket: loadOrInit hit → config mismatch → old bucket evicted → fresh bucket at full capacity under the now-current meta config. ✓ (implementation is correct, doc doesn't describe this)

The "one extra burst" worst-case also needs updating: for a re-elected node with a stale exhausted bucket, the new bucket starts at full capacity of the current config (which may be higher or lower than the old config). Fix this →


Implementation review: LGTM with one low note

The implementation is well-structured and the eviction-race handling in particular (evicted flag, CompareAndDelete in sweep / invalidateQueue / loadOrInit) is carefully done. Specific observations:

sqs_throttle.go — correct throughout:

  • invalidateQueue (line 350) uses the lock-then-CompareAndDelete-then-flag pattern matching sweep. Consistent. ✓
  • runSweepLoop background goroutine wired to the server's context so it cleans up alongside the reaper. ✓
  • charge retry loop bounded at 4 with fail-open fallback. ✓
  • computeRetryAfter uses requested - current in the numerator (matching the PR history's §3.4 formula update). ✓
  • resolveActionConfig correctly shares the "*" bucket key for Default so all fall-through actions share one budget. ✓

sqs_throttle.go — one note [Low]:

resolveActionConfig returns action, 0, 0 as the no-throttle sentinel (its last line passes the raw action through). The caller in charge() checks if capacity == 0 || refill == 0 and returns early, so the first return value is never used in that path — functionally harmless. But returning the raw action string instead of a canonical bucket-action constant makes the sentinel ambiguous for future callers who might not read the callsite's guard. A cleaner return would be return bucketActionAny, 0, 0 or just the zero value. Not blocking, just a readability note.

sqs_partitioning.go — correct throughout:

  • partitionFor handles all edge cases (nil, PartitionCount ≤ 1, perQueue short-circuit, empty GroupId). Inlined FNV-1a avoids allocation vs. the hash.Hash interface path — the comment explains the perf motivation well. ✓
  • validatePartitionConfig has the FIFO-only guard for PartitionCount > 1 (PR #681 round 2 fix) and the {PartitionCount > 1, DeduplicationScope = "queue"} cross-attribute rejection. ✓
  • validatePartitionDormancyGate correctly rejects at PartitionCount > 1. Will be removed atomically with the PR 5 data-plane fanout. ✓
  • validatePartitionImmutability uses normalisePartitionCount for the 0/1 equivalence edge case (PR #679 round 6.2 fix). ✓

sqs_catalog.go invalidateQueue gate [Low]:

The setQueueAttributes handler gates invalidateQueue on throttleAttributesPresent(in.Attributes) (line 1384). The comment correctly notes this is "purely a hot-path optimisation" because loadOrInit reconciliation catches stale buckets from config changes on other nodes. This is correct — but the comment refers to loadOrInit reconciliation as a backstop, which is now the primary mechanism for the ex-follower-turned-leader scenario. The doc should describe both mechanisms and their relationship (explicit invalidation for the current leader's store, lazy reconciliation for all other nodes including future re-elected ones).


Summary

The Codex P1 from the last round is addressed in the implementation via loadOrInit reconciliation — but the design doc §3.1 has not been updated to reflect the two-path failover model (first-time-leader vs. re-elected-with-stale-bucket). This is the only finding I'd call blocking before this PR merges — the doc should accurately describe the mechanism that protects operators from stale throttle state after leadership changes.

The implementation itself is in good shape across both sqs_throttle.go and sqs_partitioning.go.



@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2edf9d49d5


Comment thread: adapter/sqs_throttle.go
Comment on lines +91 to +94
```go
type bucketKey struct {
	queue  string
	action string
}
```


P1: Include queue generation in throttle bucket keys

Fresh evidence from this implementation: bucketKey is still only (queue, action), and reconciliation in loadOrInit only checks capacity/refill. If a queue is deleted/recreated with the same throttle values (or leadership moves to a node holding an old cached bucket), the old token balance is reused for the new incarnation because invalidation is process-local, so the recreated queue can start already drained (or unexpectedly bursty) instead of starting from a fresh full bucket.

