docs(sqs): proposals for Phase 3.C (throttling) + 3.D (split-queue FIFO) #664

Open

bootjp wants to merge 30 commits into main from docs/sqs-phase3-proposals

Conversation

bootjp (Owner) commented Apr 26, 2026

Summary

Docs-only PR. Two design proposals for the remaining Phase 3 SQS items per docs/design/2026_04_24_proposed_sqs_compatible_adapter.md Section 14 (Phase 3 bullets). Both items were explicitly called out as needing separate design docs before any implementation work; this PR lands those proposals so the implementation PRs have a reviewed architecture to build on.

3.C — Per-queue throttling and tenant fairness (proposal)

Per-queue token-bucket throttling configured on queue meta (no separate keyspace), evaluated at the SQS adapter layer on the leader (no Raft per request), surfaced as the AWS Throttling error envelope so SDK retry/backoff just works.

Key decisions:

  • Default-off. Existing queues are unaffected; operators opt in per queue via SetQueueAttributes.
  • Per-action buckets (Send / Receive / Default) so a slow consumer cannot pin the producer.
  • Per-leader buckets, no replication. Worst case on failover: one extra burst on the new leader. Acceptable per AWS-equivalent behaviour at region failover boundaries; replicating would cost a Raft commit per SendMessage.
  • Batch verbs charge by entry count, not call count, with all-or-nothing rejection (matches AWS).
  • Admin-only configuration plane. Standard SQS clients see InvalidAttributeName on the Throttle* attributes (matches AWS behaviour for unknown attributes); the data-plane enforcement runs for everyone.
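To make the bucket model concrete, here is a minimal Go sketch of the per-action bucket described above. It is illustrative only: `bucketKey` mirrors the doc excerpt quoted later in this thread, while the field names, the `charge` signature, and the Go 1.21 `min` builtin are assumptions, not the proposal's actual code.

```go
package throttle

import (
	"sync"
	"time"
)

type bucketKey struct {
	queue  string
	action string // "Send" | "Receive" | "*"
}

// tokenBucket refills continuously at refillPerSec up to capacity.
type tokenBucket struct {
	mu           sync.Mutex
	tokens       float64
	capacity     float64
	refillPerSec float64
	lastRefill   time.Time
}

// charge deducts n tokens all-or-nothing: n is 1 for single verbs and
// len(Entries) for batch verbs, matching the batch semantics above.
func (b *tokenBucket) charge(n float64, now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	elapsed := now.Sub(b.lastRefill).Seconds()
	b.tokens = min(b.capacity, b.tokens+elapsed*b.refillPerSec)
	b.lastRefill = now
	if b.tokens < n {
		return false // caller emits the AWS Throttling error envelope
	}
	b.tokens -= n
	return true
}
```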

3.D — Split-queue FIFO (high-throughput FIFO) (proposal)

Per-MessageGroupId hash partitioning across multiple Raft groups, mirroring AWS High Throughput FIFO. Within-group ordering preserved; across-group throughput scales with the partition count.

Key decisions:

  • Existing single-partition FIFO queues stay byte-identical (PartitionCount = 0 path is the legacy layout; no migration runs implicitly).
  • Power-of-two partition counts only (1, 2, 4, 8, 16, 32) so the routing step is hash & (N-1) and future offline rebuilds stay tractable.
  • Partition count is immutable after first SendMessage. Live re-partitioning would break ordering for in-flight messages of every group whose hash bucket changed; out of scope.
  • Multi-PR rollout plan with an explicit "gate of no return" called out at PR 5 (the data-plane PR). PRs 1–4 are reversible no-ops on data layout; once a partitioned FIFO holds real data, rollback means draining and recreating the queue.
  • FNV-1a hash (deterministic across processes / Go versions / architectures). Risk of attacker-controlled MessageGroupId pinning all traffic to one partition is documented and accepted (the feature is for cooperative operators).
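For illustration, a sketch of the routing step these decisions imply: FNV-1a over the MessageGroupId, masked because N is a power of two. The signature is hypothetical (the proposal's pseudocode reads queue meta), and the perQueue short-circuit anticipates the §3.3 guard settled on later in this thread.

```go
package fifo

import "hash/fnv"

// partitionFor maps a message group to a partition index. Partition
// counts are powers of two, so the reduction is a bit mask, not a modulo.
func partitionFor(partitionCount uint32, fifoThroughputLimit, messageGroupID string) uint32 {
	// Legacy single-partition queues and perQueue mode collapse to 0.
	if partitionCount <= 1 || fifoThroughputLimit == "perQueue" {
		return 0
	}
	h := fnv.New32a() // FNV-1a: deterministic across processes, Go versions, architectures
	_, _ = h.Write([]byte(messageGroupID))
	return h.Sum32() & (partitionCount - 1) // hash & (N-1)
}
```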

Test plan

Self-review

This is a docs-only PR; the 5-lens self-review collapses to:

  1. Data loss / Concurrency / Performance / Consistency — N/A, no code touched.
  2. Test coverage — N/A, no code touched. The proposals themselves include a Testing Strategy section (§6 / §9) so the implementation PRs have explicit acceptance criteria.

Stacking

This PR is independent of #650, #659, and #662. Branched from current main. Merge whenever ready — landing the proposal docs early lets reviewers comment on the architecture before the implementation PRs go up.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive design specification for queue-level throttling in SQS with configurable send/receive rate limits, error handling, and phased rollout plan
    • Added comprehensive design specification for FIFO queue partitioning enabling parallel message processing across partitions while preserving message ordering within partitions

Two design proposals for the remaining Phase 3 SQS items per
docs/design/2026_04_24_proposed_sqs_compatible_adapter.md (Section
14, Phase 3 bullets). Both are explicitly called out as needing
separate design docs before implementation; this lands them so
implementation work has a reviewed architecture to start from.

3.C — per-queue token-bucket throttling, configured on queue meta,
evaluated at the SQS adapter layer on the leader. Default-off,
AWS-shape Throttling error envelope, per-leader buckets (no Raft
per request). Proposal covers: bucket model, charging per verb /
batch entry, configuration validation rules, AWS-compatibility
posture (admin-only attribute, hidden from standard principals),
multi-shard correctness, observability, alternatives, open
questions.

3.D — high-throughput FIFO via per-MessageGroupId hash partitioning
across multiple Raft groups. Multi-PR rollout plan with an explicit
"gate of no return" called out at the data-plane PR. Existing
single-partition FIFO queues stay byte-identical (PartitionCount=0
path). Proposal covers: data model and key encoding, routing,
SendMessage / ReceiveMessage / PurgeQueue / DeleteMessage flows,
shard router config syntax, reaper implications, migration story
(no live re-partitioning), failure modes, mixed-version cluster
behaviour, alternatives, open questions.

No code changes; review-only PR.

coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 18 minutes and 12 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b710464a-f267-46a0-bb94-b04db59bc4d7

📥 Commits

Reviewing files that changed from the base of the PR and between 829cf97 and 2edf9d4.

📒 Files selected for processing (12)
  • adapter/sqs.go
  • adapter/sqs_catalog.go
  • adapter/sqs_catalog_test.go
  • adapter/sqs_messages.go
  • adapter/sqs_messages_batch.go
  • adapter/sqs_partitioning.go
  • adapter/sqs_partitioning_integration_test.go
  • adapter/sqs_partitioning_test.go
  • adapter/sqs_throttle.go
  • adapter/sqs_throttle_integration_test.go
  • adapter/sqs_throttle_test.go
  • docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md
📝 Walkthrough

Two new design documents propose enhancements to the elastickv SQS adapter: per-queue token-bucket throttling with leader-evaluated rate limiting, and HT-FIFO-style queue partitioning for .fifo queues with deterministic routing and partition-aware message handling.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SQS Design Proposals: docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md, docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md | Two design documents proposing SQS adapter enhancements: per-queue token-bucket throttling with Send/Recv independent buckets and queue-meta configuration; and HT-FIFO partitioning for .fifo queues with MessageGroupId-based routing, partition-aware scanning, receipt-handle versioning, and mixed-version leadership safeguards. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

🐰✨ Design docs dance through the digital spring,
Throttled buckets and partitions take wing,
HT-FIFO hops quicken the queue's steady beat,
SQS adventures make the release complete! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title directly and concisely describes the primary changes: two design proposals for Phase 3.C (per-queue throttling) and Phase 3.D (split-queue FIFO) SQS features. It is clear, specific, and accurately reflects the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces two design documents for the SQS adapter: one for per-queue throttling and another for split-queue FIFO partitioning. The throttling proposal aims to ensure tenant fairness and prevent resource exhaustion, while the split-queue FIFO design targets high-throughput requirements by distributing queues across multiple Raft groups. Review feedback identifies a critical backward compatibility risk in the proposed keyspace schema that would break existing data access. Additionally, there are concerns regarding potential performance bottlenecks from coarse-grained locking in the throttling store and a design flaw in the partitioned ReceiveMessage logic that could lead to 'false empty' responses for clients.

Comment on lines +49 to +51
A partition is identified by the tuple `(queueName, partitionIndex)` where `partitionIndex ∈ [0, N)`. Every existing message-keyspace key gains the `partitionIndex` as a fixed-width segment immediately after the queue name. Concretely, today's `!sqs|msg|data|<queue>|<gen>|<msgID>` becomes `!sqs|msg|data|<queue>|<partition>|<gen>|<msgID>`. The `<partition>` segment is `appendU64(partitionIndex)` (fixed 8 bytes BE), so:

- Single-partition FIFOs (and all Standard queues) write `partitionIndex = 0` and read with the same prefix as today — the schema is byte-identical for `N=1`.

[High]

The proposed key format !sqs|msg|data|<queue>|<partition>|<gen>|<msgID> is not backward compatible with the existing format. Inserting the 8-byte <partition> segment shifts the offsets for <gen> and <msgID>, meaning the new code will fail to find existing messages on disk. This introduces a significant operational risk and prevents rolling upgrades. Per repository rules, the design must detail mitigation strategies (e.g., a temporary 'bridge' or 'proxy' mode) and strategies for zero-downtime cutovers (e.g., dual-write proxy or logical replication) to avoid service interruption.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Comment on lines +78 to +79
mu sync.Mutex
buckets map[bucketKey]*tokenBucket // (queueName, action) -> bucket

[Medium]

Using a single sync.Mutex to protect the buckets map will likely become a performance bottleneck, as this lock must be acquired on every SendMessage and ReceiveMessage call. Per the general rules, frequently accessed fields on hot paths should prefer more granular locking or atomic operations to improve performance.

Suggested change:
- mu sync.Mutex
- buckets map[bucketKey]*tokenBucket // (queueName, action) -> bucket
+ buckets sync.Map // map[bucketKey]*tokenBucket
References
  1. For frequently accessed fields that require monotonic updates, prefer atomic operations or granular locking over mutexes to improve performance on hot paths.

Comment on lines +154 to +157
A consumer making one ReceiveMessage call sees results from **only the partitions whose leader is the local node**. For a deployment where every partition has its own leader, a single client's ReceiveMessage hits one partition's worth of messages per call. This matches AWS HT-FIFO behaviour: clients are expected to spread their consumer pool across partitions, and an idle partition does not block a busy one.

For deployments that co-locate partitions on one Raft group, all partitions' leaders are the same node and a single ReceiveMessage scans all partitions in sequence. The scan is bounded by `MaxNumberOfMessages` and the existing per-partition page limit — no separate budget.


[Medium]

Restricting ReceiveMessage to only scan partitions led by the local node is a significant departure from the SQS model. In a multi-node deployment, this can lead to "false empty" responses if a client hits a node that is not the leader for partitions containing messages. This forces clients to implement cluster-aware round-robin logic, which leaks internal sharding details and breaks the SQS abstraction.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 341b0deb14


Comment on lines +146 to +148
b. If this node is not the leader for that partition, skip (the
consumer's next call will land on a different node and that
leader will serve its partition).

[P1] Proxy non-local partitions during ReceiveMessage

Step 2b explicitly skips partitions whose leader is not the local node, which means a consumer pinned to one endpoint will only ever drain that node’s leader-owned partitions while messages on other partitions can backlog indefinitely. This turns ReceiveMessage into node-local visibility rather than queue-wide visibility and breaks normal single-endpoint client deployments unless they add explicit endpoint round-robin behavior.


| `adapter/sqs.go` | After `authorizeSQSRequest`, call `bucketStore.charge(queueName, action, count)`. On reject, write the `Throttling` envelope and return. |
| `adapter/sqs_throttle_test.go` (new) | Unit tests for bucket math (edge cases: idle drift, burst, partial refill, batch over-charge, default-off). ~300 lines. |
| `adapter/sqs_throttle_integration_test.go` (new) | End-to-end: configure a queue with low limits, send N messages back-to-back, confirm the (N+1)th gets `Throttling` with `Retry-After`. ~150 lines. |
| `monitoring/registry.go` | Two new counter vectors: `sqs_throttled_requests_total{queue, action}` and `sqs_throttle_tokens_remaining{queue, action}`. |

[P2] Define tokens_remaining as a gauge metric

This section defines sqs_throttle_tokens_remaining as a counter, but token-bucket remaining capacity must go up and down over time. If implemented as a counter, the value can only increase and any dashboard based on “trending toward zero” becomes meaningless, masking real throttling pressure; this metric needs to be specified as a gauge.

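For reference, the distinction in prometheus/client_golang terms; a sketch using the metric names from the files-touched table, with registration omitted:

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	// Rejections only ever accumulate, so a counter is correct here.
	throttledRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "sqs_throttled_requests_total", Help: "Throttled SQS requests."},
		[]string{"queue", "action"},
	)
	// The token budget rises on refill and falls on charge: a gauge.
	throttleTokensRemaining = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "sqs_throttle_tokens_remaining", Help: "Current token-bucket level."},
		[]string{"queue", "action"},
	)
)
```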

Four issues raised by Codex + Gemini Code Assist on PR #664:

3.D split-queue FIFO doc:

1. Backward-compatibility risk in keyspace (Codex P1 + Gemini high).
   The original §3.1 said "every existing key gains a <partition>
   segment" but also "the schema is byte-identical for N=1" — those
   are contradictory; inserting bytes shifts every downstream offset
   and breaks readback of existing data on disk. Rewrote §3.1 to
   make the key shape conditional on whether the queue is
   partitioned: legacy / Standard queues use today's keys
   byte-identically, partitioned queues use a new prefix
   (!sqs|msg|data|p|<queue>|<partition>|...) with an explicit "p|"
   discriminator that cannot collide with the legacy prefix even
   when partition=0. Constructor sketch updated.

2. ReceiveMessage "false empty" on non-local-leader partitions
   (Codex P1 + Gemini medium). Original §4.2 had clients pinned to
   one node only seeing partitions whose leader is local — messages
   on other partitions backlog forever and clients hit empty replies
   that look like "queue is drained". Rewrote §4.2 so ReceiveMessage
   proxies the per-partition fanout to the right leader via the
   existing leader-proxy machinery (extended to take a partition
   argument). Added a partitionOrder rotation seeded by RequestId so
   successive calls don't bias toward partition 0. Co-located
   deployments still pay nothing; spread deployments pay one extra
   hop per non-local partition only until MaxNumberOfMessages are
   collected.

3.C per-queue throttling doc:

3. Single sync.Mutex on the bucket store would be a hot-path
   contention point (Gemini medium). Switched §3.1 to a sync.Map
   for the bucket lookup (read-mostly access pattern) plus a
   per-bucket sync.Mutex for charge / refill. Cross-queue traffic
   never serialises on the same lock. Documented the three-step
   charge operation explicitly so readers see the locking story.

4. sqs_throttle_tokens_remaining must be a gauge, not a counter
   (Codex P2). Token budgets go up and down; a counter would mask
   exactly the depletion operators need to see. Updated both §4.1
   (files-touched table) and §7 (operational dashboard) to call
   the metric a gauge. Counter is correct for the rejections total.

Docs-only PR; no code changes, no implementation gate.
bootjp (Owner, Author) commented Apr 26, 2026

Pushed addressing all four review findings (Codex + Gemini converged):

3.D split-queue FIFO doc:

  • Keyspace backward-compat (Codex P1 + Gemini High): §3.1 rewritten — legacy queues use today's keys byte-identically, partitioned queues use a new !sqs|msg|data|p|... prefix with an explicit p| discriminator that cannot collide with legacy keys.
  • ReceiveMessage false-empty (Codex P1 + Gemini Medium): §4.2 rewritten — proxies non-local-leader partition scans via the existing leader-proxy (extended to take a partition argument) so a consumer pinned to one endpoint still sees every partition's messages. Added partitionOrder rotation to avoid head-of-line bias.

3.C per-queue throttling doc:

  • Hot-path lock contention (Gemini Medium): §3.1 switched to sync.Map lookup + per-bucket sync.Mutex for charge/refill. Cross-queue traffic never serialises on the same lock. Three-step charge operation documented explicitly.
  • Metric type (Codex P2): sqs_throttle_tokens_remaining is now correctly a gauge (token budgets go up and down). sqs_throttled_requests_total stays a counter.
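A sketch of that three-step charge path, continuing the throttle sketch earlier in this thread (same `tokenBucket` and `bucketKey`); `queueMeta` and `newBucketFromMeta` are hypothetical stand-ins for the proposal's queue-meta plumbing:

```go
type queueMeta struct{ Capacity, RefillPerSec float64 }

func newBucketFromMeta(m queueMeta) *tokenBucket {
	return &tokenBucket{tokens: m.Capacity, capacity: m.Capacity,
		refillPerSec: m.RefillPerSec, lastRefill: time.Now()}
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
}

func (s *bucketStore) charge(key bucketKey, meta queueMeta, n float64) bool {
	// 1. Lock-free lookup on the read-mostly hot path.
	v, ok := s.buckets.Load(key)
	if !ok {
		// 2. One-shot insert race is benign: both racers build the
		//    bucket from the same committed queue meta.
		v, _ = s.buckets.LoadOrStore(key, newBucketFromMeta(meta))
	}
	// 3. Only this bucket's mutex serialises refill + take, so
	//    cross-queue traffic never contends on a shared lock.
	return v.(*tokenBucket).charge(n, time.Now())
}
```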

Docs-only changes; no implementation gate. +64/-22 lines across the two files.

@claude review

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 1s


Review

  • Read updated design documents
  • Verify fix: keyspace backward-compat (3.D)
  • Verify fix: ReceiveMessage false-empty (3.D)
  • Verify fix: hot-path lock contention (3.C)
  • Verify fix: metric type correction (3.C)
  • Overall assessment

All four issues from the Codex/Gemini review have been correctly addressed. New issues found below, one of them medium-severity.


All four previous issues: resolved ✓

3.D — Keyspace backward-compat (Gemini high + Codex P1)
The p| discriminator approach cleanly separates the new partitioned keyspace from legacy keys. Legacy paths stay byte-identical; partitioned queues get !sqs|msg|data|p|<queue>|<partition>|…. The reaper enumerating both prefixes is a good forward-compatibility detail.

3.D — ReceiveMessage false-empty (Gemini medium + Codex P1)
§4.2 now proxies non-local-leader partitions via the existing proxyToLeader machinery (extended with a partition argument) and the partitionOrder rotation seeded by RequestId is a nice head-of-line bias fix. The X-Elastickv-Receive-Partition: <k> header approach is clean and consistent with the existing proxy pattern.
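A RequestId-seeded rotation could be as simple as this sketch (purely illustrative; `hash/fnv` reused from the routing step, n assumed > 0):

```go
// partitionOrder returns all n partition indexes starting at an offset
// derived from the request ID, so scans from different requests do not
// all begin at partition 0.
func partitionOrder(requestID string, n uint32) []uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(requestID))
	start := h.Sum32() % n
	order := make([]uint32, 0, n)
	for i := uint32(0); i < n; i++ {
		order = append(order, (start+i)%n)
	}
	return order
}
```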

3.C — Hot-path lock contention (Gemini medium)
sync.Map for the bucket registry (lock-free reads) + per-bucket sync.Mutex for charge/refill is the right pattern. The three-step charge operation is well-documented. The race-on-insert via LoadOrStore is handled correctly (both racers agree on config).

3.C — Metric type (Codex P2)
sqs_throttle_tokens_remaining is correctly a gauge in both §4.1 (files-touched table) and §7 (operational section). Both sites updated.


New issues in the updated doc

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §8.1 — stale failure mode description

The fix to §4.2 eliminated the false-empty problem server-side (proxy fanout handles it), but §8.1 still says:

"Consumer never sees a partition's leader: a consumer that always lands on node A sees only the partitions whose leader is A … Mitigation: document that clients should round-robin endpoints."

This directly contradicts §4.2 — the "round-robin endpoints" mitigation is no longer correct or necessary for this failure mode, since the proxy now handles it transparently. An implementor reading §8.1 in isolation would build the wrong mental model and might even suppress the proxy logic thinking clients are expected to handle distribution themselves.

The failure mode to document now is the performance cost, not the correctness risk: proxy latency under spread deployments (one extra RTT per non-local partition), and the trade-off that a consumer asking for 1 message touches at most 1 remote leader while an "is queue empty?" poll costs N proxy round-trips.

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.1 — sqsMsgDataKey variadic signature is a footgun

The proposed constructor:

func sqsMsgDataKey(queueName string, gen uint64, messageID string, partitions ...uint32) []byte {

Using a variadic for the partition discriminator has two issues as implementation guidance:

  1. sqsMsgDataKey(q, gen, id, p0, p1) silently ignores p1 — a caller could accidentally pass the wrong count and the compiler won't catch it.
  2. Variadic calls allocate even for the zero-argument case in some Go versions, which matters on the legacy hot path.

The doc could instead name two distinct helpers (legacyMsgDataKey / partitionedMsgDataKey) with the thin dispatcher staying internal. The variadic form is fine as a unified constructor for now but the implementation PR should be aware of these edge cases.

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.1 — p| discriminator safety constraint is implicit

The doc correctly states the discriminator prevents collision, but doesn't state why it's safe: valid SQS queue names are alphanumeric + hyphen + underscore and cannot contain |, so no queue name can begin with p|, making the prefix unambiguous. One sentence in §3.1 would nail this closed explicitly — the implementation PR's name validator needs to enforce the | prohibition on queue names (which is standard SQS behaviour, but worth calling out as load-bearing here).
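Putting the two §3.1 suggestions together, a hypothetical sketch of the named constructors and the discriminator. The segment layout is approximated from the quoted key format; `appendU64` is the fixed-width big-endian helper the doc names, and the no-collision claim is the doc's (queue names reject `|`):

```go
import "encoding/binary"

func appendU64(b []byte, v uint64) []byte {
	return binary.BigEndian.AppendUint64(b, v) // fixed 8 bytes BE
}

// Legacy layout, byte-identical to today's keys:
// !sqs|msg|data|<queue>|<gen>|<msgID>
func legacyMsgDataKey(queue string, gen uint64, msgID string) []byte {
	k := append([]byte("!sqs|msg|data|"), queue...)
	k = appendU64(append(k, '|'), gen)
	return append(append(k, '|'), msgID...)
}

// Partitioned layout with the "p|" discriminator:
// !sqs|msg|data|p|<queue>|<partition>|<gen>|<msgID>
func partitionedMsgDataKey(queue string, partition uint32, gen uint64, msgID string) []byte {
	k := append([]byte("!sqs|msg|data|p|"), queue...)
	k = appendU64(append(k, '|'), uint64(partition))
	k = appendU64(append(k, '|'), gen)
	return append(append(k, '|'), msgID...)
}
```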

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §8.5 — upgrade gate mechanism underspecified

"The catalog rejects CreateQueue with PartitionCount > 1 until every node in the cluster reports the new feature flag" is the gate-of-no-return safety mechanism, but the doc doesn't describe how a node reports its feature flag or how the catalog queries all live nodes. The cross-reference to "§3.3.2 admin-forwarding upgrade gate from the admin dashboard design" is opaque without a link. Since this is the safety rail against accidentally creating partitioned queues in a mixed-version cluster, the mechanism deserves a sentence or two (e.g. "each node's health endpoint advertises a capabilities set; the catalog reads all registered nodes from the distribution layer's node registry and gates on all of them reporting HTFIFO").

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.4 — Retry-After: 1 lies for slow-refill queues

The doc justifies the constant Retry-After: 1 as "at the configured refill rate, one second is enough for at least one fresh token" — but this is only true when refillRate >= 1 token/s. A queue configured with SendRefillPerSecond = 0.1 needs 10 seconds for the next token; returning Retry-After: 1 causes 9 unnecessary 400 retries before the client finally gets through. The §3.4 table comment should note this edge case and suggest max(1, ceil(1/refillRate)) seconds, or cap the minimum SendRefillPerSecond at 1 to keep the constant valid.


Summary

The four original issues are cleanly resolved. The §8.1 stale failure mode is the only finding I'd call blocking — it directly contradicts §4.2 and gives implementors the wrong mental model. The remaining three are low-severity clarifications that would improve the implementation PRs' accuracy.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fa02faa35



## 8. Failure Modes and Edge Cases

1. **Consumer never sees a partition's leader**: a consumer that always lands on node A sees only the partitions whose leader is A. If partitions are spread evenly, this is `1/N` of the queue. Mitigation: AWS SDK consumers naturally distribute across endpoints; for elastickv, document that clients should round-robin endpoints when consuming a partitioned FIFO.

[P1] Remove node-local receive behavior from failure guidance

Fresh evidence: §8.1 explicitly says an endpoint-pinned consumer only sees 1/N of partitions, which contradicts §4.2’s required proxy fanout behavior. Keeping this guidance in the proposal makes the implementation ambiguous, and following it would regress ReceiveMessage to node-local visibility where messages on remote-leader partitions can backlog indefinitely for single-endpoint consumers.


Comment on lines +109 to +111
if meta.PartitionCount <= 1 {
    return 0
}

[P2] Align partition routing with perQueue throughput mode

Section 3.2 says FifoThroughputLimit=perQueue should effectively collapse routing to one partition, but the partitionFor pseudocode only checks PartitionCount and will hash across all partitions whenever PartitionCount > 1. If this is implemented as written, perQueue mode cannot behave as documented and will still distribute groups across partitions.


| JSON | HTTP 400, body `{"__type":"Throttling","message":"Rate exceeded for queue '<name>' action '<action>'"}`, header `x-amzn-ErrorType: Throttling`, header `Retry-After: 1` |
| Query | HTTP 400, body `<ErrorResponse><Error><Type>Sender</Type><Code>Throttling</Code><Message>...</Message></Error><RequestId>...</RequestId></ErrorResponse>`, headers as above |

`Retry-After: 1` is the conservative default — at the configured refill rate, one second is enough for at least one fresh token. A future iteration could compute the precise wait from `(1 - currentTokens) / refillRate` but the constant is enough for SDK backoff logic.
The reason will be displayed to describe this comment to others. Learn more.

[P2] Compute Retry-After from refill rate

This claim assumes every bucket refills at least one token per second, but the proposed validator allows any positive refill rate. With configurations like RefillPerSecond=0.1, a Retry-After: 1 response guarantees another immediate throttle, creating avoidable retry pressure and misleading backoff behavior; the wait should be derived from the bucket’s refill rate (or the config should enforce a minimum of 1).


Five issues raised by Claude on PR #664 (one medium, four low — all
docs-only edits to the proposal docs):

3.D split-queue FIFO doc:

1. (Medium) §8.1 stale failure-mode description contradicted §4.2.
   Original §8.1 said "Consumer never sees a partition's leader →
   mitigation: round-robin endpoints" — but §4.2 was rewritten earlier
   in this PR to proxy non-local partitions server-side, eliminating
   the false-empty risk entirely. Implementors reading §8.1 in
   isolation would have built the wrong mental model and possibly
   suppressed the proxy logic. §8.1 now describes the actual remaining
   failure mode — proxy round-trip latency under spread deployments —
   and explains that single-partition / co-located deployments pay
   nothing.

2. (Low) §3.1 variadic sqsMsgDataKey was a footgun. The variadic
   form would silently accept extra args without compiler diagnosis.
   Replaced with two named constructors (legacyMsgDataKey /
   partitionedMsgDataKey) and an explicit dispatch at the call site
   based on meta.PartitionCount > 1.

3. (Low) §3.1 p| discriminator safety constraint was implicit.
   Added an explicit paragraph explaining why the prefix cannot
   collide with the legacy keyspace: queue names admit only
   [A-Za-z0-9_-] (+ optional .fifo suffix), the validator must
   continue to reject |, and base32 segments never start with the
   literal ASCII p. Future relaxations of the name validator have
   to revisit this prefix scheme.

4. (Low) §8.5 upgrade-gate mechanism was underspecified. Added the
   concrete protocol: each node's /sqs_health endpoint advertises
   capabilities: ["htfifo"] once this PR's code is in the binary;
   the catalog reads the live node set from the distribution
   layer's node registry, polls /sqs_health on each, and gates
   PartitionCount > 1 on every node reporting htfifo. Fail-closed
   on timeouts. References PR #644 as the prior art for the same
   pattern.

3.C per-queue throttling doc:

5. (Low) §3.4 Retry-After: 1 lied for slow-refill queues. A queue
   with SendRefillPerSecond = 0.1 needs 10 s for the next token;
   returning 1 burned 9 unnecessary client retries before the
   request would succeed. Replaced with a per-queue computation:
   ceil((1 - currentTokens) / refillRate), floored at 1 s to
   match Retry-After's integer-second granularity. Fast-refill
   queues (>= 1 RPS) preserve the existing Retry-After: 1
   behaviour.

Docs-only PR; no code changes, no implementation gate.
bootjp (Owner, Author) commented Apr 26, 2026

Pushed addressing all five findings from your follow-up review (1 medium, 4 low):

3.D split-queue FIFO doc:

  • (medium) §8.1 stale failure-mode description rewritten — was contradicting §4.2's proxy fix. Now describes the actual remaining failure (proxy RTT cost under spread deployment) instead of the eliminated false-empty risk.
  • (low) §3.1 variadic sqsMsgDataKey replaced with two named constructors (legacyMsgDataKey / partitionedMsgDataKey) + explicit call-site dispatch — no silent argument loss.
  • (low) §3.1 p| discriminator safety made explicit: queue-name validator's | prohibition + base32-encoded segments never starting with literal p together guarantee no collision.
  • (low) §8.5 upgrade-gate mechanism now concrete: /sqs_health advertises capabilities: ["htfifo"]; catalog polls every registered node, fail-closed on timeout. References PR #644 (admin: AdminForward follower-side client + handler integration (P1, partial)) as prior art.

3.C per-queue throttling doc:

  • (low) §3.4 Retry-After now computed per-queue from ceil((1 - currentTokens) / refillRate) (floored at 1 s for HTTP integer-second granularity) instead of the constant 1. Slow-refill queues no longer cause busy-loops.

Docs-only changes; no implementation gate. +24/-13 lines net across the two files.

@claude review

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 5m 50s


Review

  • Read updated design documents
  • Verify fix: §8.1 stale failure-mode description (3.D) — medium
  • Verify fix: §3.1 named constructors replacing variadic (3.D) — low
  • Verify fix: §3.1 p| discriminator safety explicit (3.D) — low
  • Verify fix: §8.5 upgrade-gate mechanism concrete (3.D) — low
  • Verify fix: §3.4 Retry-After computed from refill rate (3.C) — low
  • Check for new issues
  • Overall assessment

All five issues from the last Claude review are correctly resolved. One pre-existing Codex P2 finding from the second-pass review was not in the trigger's scope but remains open; surfacing it here. Two low-severity findings are new.


All five previous issues: resolved ✓

3.D — §8.1 stale failure-mode description (medium)
The failure mode now reads "Proxy RTT under spread deployment" with the explicit note that §4.2 proxies them server-side so false-empty is gone. The N-proxy-round-trip pathological case (asking "is anything here?" against an empty queue) and the long-poll mitigation are both called out. No contradiction with §4.2.

3.D — §3.1 variadic sqsMsgDataKey (low)
Two distinct named constructors: legacyMsgDataKey(queueName, gen, messageID) and partitionedMsgDataKey(queueName, partition, gen, messageID). Call-site dispatch on meta.PartitionCount > 1. No variadic, no silent argument loss.

3.D — §3.1 p| discriminator safety (low)
The argument is now explicit: (1) queue names are restricted to [A-Za-z0-9_-] so | can never appear; (2) the generation field is base32-encoded so it uses only uppercase A-Z and digits 2–7, meaning base32 cannot produce |, which prevents any legacy-key prefix scan from matching a new partitioned-key scan prefix; (3) a note that the implementation PR's name validator must continue to reject |. The argument is present and ultimately correct.

3.D — §8.5 upgrade-gate mechanism (low)
Concrete now: /sqs_health advertises capabilities: ["htfifo"]; the catalog's CreateQueue handler polls every registered node from the distribution layer's node registry (same registry as proxyToLeader); fail-closed on timeout; explicit prior art reference to PR #644's AdminForward gate.
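A fail-closed capability poll in that shape could look like the following sketch; the node registry, health client, and function names are hypothetical:

```go
import (
	"context"
	"slices"
)

// clusterSupportsHTFIFO gates CreateQueue with PartitionCount > 1:
// every registered node must advertise "htfifo", and any error or
// timeout counts as "no" (fail closed).
func clusterSupportsHTFIFO(ctx context.Context, nodes []string,
	capabilities func(ctx context.Context, node string) ([]string, error)) bool {
	for _, node := range nodes {
		caps, err := capabilities(ctx, node)
		if err != nil {
			return false // fail closed on timeout or unreachable node
		}
		if !slices.Contains(caps, "htfifo") {
			return false // mixed-version cluster: reject partitioned CreateQueue
		}
	}
	return true
}
```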

3.C — §3.4 Retry-After formula (low)
Formula is ceil((1.0 - currentTokens) / refillRate) with max(1, ...) floor, inline example: refillRate=0.1 → Retry-After: 10. The "validator keeps refillRate > 0" divide-by-zero note is a good touch.


New findings

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.3 — FifoThroughputLimit=perQueue not implemented in partitionFor

This was flagged as Codex P2 in the second-pass review but was not in the trigger's scope (trigger addressed Claude's five findings only). It remains open in the doc.

§3.2 documents FifoThroughputLimit:

"the alternative 'perQueue' reduces the partition assignment to a single partition (effectively N=1)"

But §3.3's partitionFor pseudocode only gates on meta.PartitionCount <= 1:

if meta.PartitionCount <= 1 {
    return 0
}
// falls through to hash for any PartitionCount > 1

A queue created with PartitionCount=8 + FifoThroughputLimit=perQueue would hash across all 8 partitions, directly contradicting §3.2's documented semantics. The fix is one line:

// Fixed guard:
if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
    return 0
}

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §4.1 — pseudocode still uses the old sqsMsgDataKey name

§3.1 renamed the constructor, but §4.1 step 5 still references the old form:

5. Build the OCC OperationGroup with partition-aware keys:
     dataKey  = sqsMsgDataKey(queue, partitionIndex, gen, msgID)

An implementor reading §4.1 in isolation gets conflicting guidance vs. §3.1. The pseudocode should dispatch via the named constructors introduced in §3.1.

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.4 — Retry-After formula hardcodes 1.0 regardless of batch size

The fix correctly computes Retry-After from refillRate. But the formula numerator is always 1.0:

secondsToNextToken := math.Ceil((1.0 - currentTokens) / refillRate)

For a SendMessageBatch of 10 entries (charge = 10) against a bucket with 0 tokens and refillRate = 1.0, this gives Retry-After: 1 — but 10 seconds are needed for the batch to succeed. The client burns 9 retries before accumulating enough tokens. The formula should use the requested count instead of the constant 1.0:

secondsToNextToken := math.Ceil((float64(requestedCount) - currentTokens) / refillRate)

This matters most for large batches with slow-to-medium refill rates; for fast refill or single operations the constant-1 floor masks the gap anyway.


Summary

All five issues from the last review are cleanly fixed. The FifoThroughputLimit=perQueue routing gap (Codex P2 from the second-pass review, not previously in Claude's scope) is the only finding here I'd call blocking before the implementation PRs — it means the routing logic in §3.3 contradicts the semantics documented in §3.2. The other two are low-severity doc cleanups.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f289d6c291


Comment on lines +109 to +110
2. On miss, build the bucket from queue meta and `LoadOrStore` it (one-shot insert race tolerated — both racers will agree on the same configuration).
3. Acquire the bucket's own `mu`, refill based on elapsed time, take or reject the requested tokens, release `mu`.

[P1] Invalidate cached buckets on throttle config updates

The design caches each (queue, action) bucket on first use via LoadOrStore, but it never defines a refresh/invalidation path when SetQueueAttributes changes throttle values. For active queues this means updated limits can stay stale until idle eviction or failover, so operators may keep over-throttling or under-throttling traffic long after they reconfigure the queue.


Comment on lines +181 to +182
secondsToNextToken := math.Ceil((1.0 - currentTokens) / refillRate)
retryAfter := max(1, int(secondsToNextToken)) // never less than 1

[P2] Base Retry-After on requested token cost

The proposed Retry-After calculation always targets the next single token, but this proposal charges batch requests by entry count. When a large SendMessageBatch is throttled, clients can be told to retry in 1 second even though many more seconds are needed to refill enough tokens for that same batch, which guarantees repeated throttles and adds avoidable retry pressure.


Comment on lines +202 to +205
`kv/shard_router.go` today routes by queue name. With partitions, the routing key becomes `(queueName, partitionIndex)`. The existing `--raftSqsMap` flag accepts entries keyed by queue name; the syntax extends to accept a partition suffix:

```
--raftSqsMap "orders.fifo:8=group-7,group-8,group-9,group-10,group-11,group-12,group-13,group-14"
```

[P1] Split partition routing from raftSqsMap endpoint mapping

This section assigns queue/partition placement to --raftSqsMap, but that flag is currently defined as raftAddr=sqsAddr endpoint mapping for leader proxying. Reusing it for partition-to-group routing makes the config surface ambiguous and can break proxy address resolution, especially in the receive fanout flow that depends on correct leader endpoint lookup.


Three findings (1 medium, 2 low):

3.D split-queue FIFO doc:

1. (Medium) §3.3 partitionFor missed FifoThroughputLimit=perQueue.
   §3.2 documents that perQueue "reduces routing to a single partition
   regardless of PartitionCount" but the §3.3 pseudocode only checked
   PartitionCount <= 1. A queue created with PartitionCount=8 +
   FifoThroughputLimit=perQueue would hash across all 8 partitions,
   directly contradicting the documented semantics. Updated the early-
   return guard to:
       if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
           return 0
       }
   Codex P2 caught this in the second-pass review but it was outside
   the scope of the previous fix iteration.

2. (Low) §4.1 step 5 still referenced the old variadic
   sqsMsgDataKey(queue, partitionIndex, gen, msgID). §3.1 had already
   moved to two named constructors (legacyMsgDataKey /
   partitionedMsgDataKey) but §4.1's pseudocode still used the
   variadic form, contradicting §3.1. Replaced with explicit if/else
   dispatch through the named constructors so an implementor reading
   §4.1 in isolation gets the same naming as §3.1.

3.C per-queue throttling doc:

3. (Low) §3.4 Retry-After formula's numerator was hardcoded to 1.0
   regardless of charge size. SendMessageBatch of 10 entries against
   a 0-token bucket with refillRate = 1.0 would have returned
   Retry-After: 1 (insufficient) instead of 10. Updated formula to:
       needed := float64(requestedCount) - currentTokens
       secondsToNextRefill := math.Ceil(needed / refillRate)
       retryAfter := max(1, int(secondsToNextRefill))
   requestedCount is the same value §3.3's charge step uses
   (1 for single, len(Entries) for batch).

Docs-only PR; no code changes, no implementation gate.
bootjp (Owner, Author) commented Apr 26, 2026

Pushed addressing all three findings from your third review (1 medium, 2 low):

3.D split-queue FIFO doc:

  • (medium) §3.3 partitionFor now also short-circuits when FifoThroughputLimit == "perQueue" — fixes the contradiction with §3.2's documented semantics. Codex P2 from the second-pass review, now resolved.
  • (low) §4.1 step 5 pseudocode now dispatches through the §3.1 named constructors (legacyMsgDataKey / partitionedMsgDataKey) instead of the old variadic form. No more conflicting guidance between §3.1 and §4.1.

3.C per-queue throttling doc:

  • (low) §3.4 Retry-After formula now uses requestedCount instead of the hardcoded 1.0 numerator. SendMessageBatch of 10 against refillRate=1.0 correctly returns Retry-After: 10 instead of 1.

Docs-only changes; +18/-7 lines net.

@claude review

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 9s


Review

  • Read updated design documents
  • Verify fix: §3.3 partitionFor FifoThroughputLimit check (medium)
  • Verify fix: §4.1 step 5 named constructor dispatch (low)
  • Verify fix: §3.4 Retry-After uses requestedCount (low)
  • Check for new issues (including open Codex P1 findings)
  • Overall assessment

All three issues from the trigger are correctly resolved. Two Codex P1 findings from the previous review remain open and are surfaced below (they were not in scope of this push). Two new low-severity issues found.


All three previous issues: resolved ✓

3.D — §3.3 FifoThroughputLimit=perQueue routing (medium)
The guard is now:

if meta.PartitionCount <= 1 || meta.FifoThroughputLimit == "perQueue" {
    return 0
}

The comment explicitly documents that perQueue collapses all groups to partition 0 regardless of PartitionCount, and calls out the Codex P2 catch. The contradiction with §3.2 is resolved.

3.D — §4.1 step 5 named constructor dispatch (low)
Step 5 now uses the explicit if meta.PartitionCount > 1 dispatch branching to partitionedMsgDataKey / legacyMsgDataKey (and analogues for visKey / groupKey). No more reference to the old variadic sqsMsgDataKey. §3.1 and §4.1 are now consistent.

3.C — §3.4 Retry-After uses requestedCount (low)
Formula is now:

needed              := float64(requestedCount) - currentTokens
secondsToNextRefill := math.Ceil(needed / refillRate)
retryAfter          := max(1, int(secondsToNextRefill))

Both cases are correctly handled: a SendMessageBatch of 10 against refillRate=1.0 gives Retry-After: 10; a single SendMessage against refillRate=0.1 also gives Retry-After: 10. The floor-1 and divide-by-zero notes are preserved.
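A runnable check of that final rule, covering the two cases above plus the floor (illustrative only; `retryAfter` is not the doc's code):

```go
package main

import (
	"fmt"
	"math"
)

func retryAfter(requestedCount int, currentTokens, refillRate float64) int {
	needed := float64(requestedCount) - currentTokens
	secs := int(math.Ceil(needed / refillRate)) // validator keeps refillRate > 0
	return max(1, secs)                         // integer seconds, never < 1
}

func main() {
	fmt.Println(retryAfter(10, 0, 1.0)) // batch of 10 vs refillRate=1.0 -> 10
	fmt.Println(retryAfter(1, 0, 0.1))  // single send, slow refill 0.1/s -> 10
	fmt.Println(retryAfter(1, 0.5, 2))  // fast refill -> floored at 1
}
```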


Open Codex P1 findings from the previous round (not addressed in this push)

These were raised against f289d6c291 and are still open in the current doc.

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — bucket cache not invalidated on SetQueueAttributes

§3.1 describes LoadOrStore on first use: "build the bucket from queue meta and LoadOrStore it." Once a bucket is stored, the doc describes no path for invalidating or refreshing it when SetQueueAttributes changes the throttle config. A queue created with SendRefillPerSecond = 10, then lowered to 1 via SetQueueAttributes, continues to allow 10 RPS until the idle-eviction sweep removes the old bucket entry (default 1h window). An operator who lowers limits to stop a noisy tenant would see the effect delayed by up to an hour.

The fix is a sentence in §3.1: SetQueueAttributes must call buckets.Delete(key) for all action-keys of the updated queue after committing the meta change to Raft, so the next request re-initialises the bucket from the new config.

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §5 — --raftSqsMap reuse conflicts with endpoint-address semantics

§5 proposes extending --raftSqsMap with a new queueName:N=group1,...,groupN syntax. The existing flag maps raftAddr=sqsAddr (endpoint routing for proxyToLeader). Reusing the same flag for a completely different concern — partition-to-Raft-group assignment — makes the config surface ambiguous: a single flag entry must now be parsed as either an address mapping or a partition assignment, and the proxyToLeader fanout in §4.2 depends on correctly resolving the per-partition leader's SQS endpoint from this same flag. Mixing the two intents in one parsing path risks a subtle bug where a partitioned-queue config entry is accidentally interpreted as a raftAddr=sqsAddr pair (or vice versa), silently producing the wrong proxy target.

A new flag (e.g. --raftSqsFifoMap) or a clearly separated config section avoids the parsing collision and lets the two concerns evolve independently.


New findings

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §9 — no test case for FifoThroughputLimit=perQueue

§9's unit test list covers PartitionCount <= 1 and empty MessageGroupId fast-paths but does not include a test that verifies FifoThroughputLimit=perQueue routes to partition 0 even when PartitionCount = 8. The perQueue short-circuit is the fix this push added to §3.3 — without an explicit test it could trivially regress. One test case: partitionFor({PartitionCount: 8, FifoThroughputLimit: "perQueue"}, "any-group-id") == 0.

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §4.3 — "sharded by MessageGroupId" is the wrong key

§4.3 describes the forward-compatibility path for Phase 3.D:

"each partition gets its own bucket, sharded by MessageGroupId"

The bucket key should be (queueName, partitionID), not (queueName, MessageGroupId). MessageGroupId is the input to partitionFor — the partition index it maps to is what identifies which leader owns the bucket. Using MessageGroupId as the key would create one bucket per group value (unbounded, attacker-amplifiable map size); using partitionID creates exactly PartitionCount buckets per queue (bounded by design). The §4.3 sentence should read "each partition gets its own bucket, keyed by (queueName, partitionID)."


Summary

All three fixes from this push are cleanly applied. The two Codex P1 findings from the previous review — bucket cache invalidation on SetQueueAttributes (3.C) and --raftSqsMap parsing ambiguity (3.D) — remain the only items I'd consider blocking before implementation PRs. Both are design-level gaps that implementation code will have to resolve anyway; better to get the doc right first. The two new findings are low-severity doc corrections.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d424d18d7c


Comment on lines +93 to +94
queue string
action string // "Send" | "Receive" | "*"

[P1] Key throttle buckets by queue generation

The bucket key is defined as only (queue, action), so a deleted/recreated queue with the same name will reuse the old in-memory bucket state until idle eviction, inheriting stale token balance and potentially stale limits. In practice this can throttle a brand-new queue immediately (or let it burst unexpectedly) after DeleteQueue/recreate, which breaks queue incarnation isolation; include generation (or another meta version) in the key or define explicit invalidation on queue lifecycle changes.

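The incarnation-safe key the finding suggests is a one-field change; a sketch, assuming queue meta already carries a generation that bumps on recreate:

```go
type bucketKey struct {
	queue      string
	generation uint64 // isolates bucket state across DeleteQueue/recreate
	action     string // "Send" | "Receive" | "*"
}
```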

Comment on lines +185 to +189
3. For each partitionIndex in partitionOrder, until MaxNumberOfMessages
   are collected or every partition has been tried:
   a. Resolve the leader for (queue, partitionIndex).
   b. If this node is the leader: scan locally, deliver candidates.
   c. Otherwise: forward the request to the leader-of-partition via

[P1] Bound per-partition proxy calls by shared wait budget

This flow fans out ReceiveMessage across partitions by forwarding requests partition-by-partition, but it does not require reducing WaitTimeSeconds across hops. If each forwarded call long-polls independently, an empty queue can take roughly PartitionCount × WaitTimeSeconds, which can exceed SQS-style wait expectations and cause client/LB timeouts; the design should require a single request-level deadline and pass only remaining budget to each partition probe.

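One way to express the shared budget this finding asks for; a sketch with hypothetical types (`Message`, the per-partition `probe`), not the proposal's actual §4.2 flow:

```go
import (
	"context"
	"time"
)

type Message struct{ ID, Body string }

// receiveWithBudget fans out across partitions under one request-level
// deadline: each probe gets only the remaining budget, so an empty
// queue costs at most WaitTimeSeconds in total, not
// PartitionCount x WaitTimeSeconds.
func receiveWithBudget(ctx context.Context, wait time.Duration, partitions []uint32,
	maxMessages int,
	probe func(ctx context.Context, partition uint32, remaining time.Duration) ([]Message, error),
) ([]Message, error) {
	deadline := time.Now().Add(wait)
	ctx, cancel := context.WithDeadline(ctx, deadline)
	defer cancel()

	var out []Message
	for _, p := range partitions {
		remaining := time.Until(deadline)
		if remaining <= 0 || len(out) >= maxMessages {
			break
		}
		msgs, err := probe(ctx, p, remaining)
		if err != nil {
			return out, err
		}
		out = append(out, msgs...)
	}
	return out, nil
}
```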

Four findings (2 medium, 2 low):

3.C per-queue throttling doc:

1. (Medium) §3.1 missed bucket cache invalidation on SetQueueAttributes.
   The doc described LoadOrStore on first use but no path for refreshing
   the in-memory bucket when SetQueueAttributes changes the throttle
   config. An operator lowering a limit to stop a noisy tenant would
   wait up to the idle-eviction window (1h default) before the change
   took effect. Added a paragraph: SetQueueAttributes calls
   buckets.Delete(key) for every action key after the Raft commit so
   the next request rebuilds from the new meta. The race with
   LoadOrStore is benign (the rebuilt bucket starts at full capacity,
   same semantics as the failover case).

2. (Low) §4.3 forward-compat sentence used the wrong key for the
   per-partition bucket. "sharded by MessageGroupId" was wrong —
   MessageGroupId is the input to partitionFor, not the bucket key.
   Using it directly would create one bucket per unique group value
   (unbounded, attacker-amplifiable). The bucket key is
   (queueName, partitionID) which is bounded by PartitionCount.
   Replaced the sentence with the correct keying and added the
   bucketKey-grows-a-partition-field note.

3.D split-queue FIFO doc:

3. (Medium) §5 reused --raftSqsMap for partition-to-Raft-group
   assignment. That flag already maps raftAddr=sqsAddr for
   proxyToLeader's endpoint resolution; overloading the same parser
   with partition assignments creates a parsing ambiguity that could
   silently produce the wrong proxy target in §4.2's fanout. Replaced
   with a new dedicated --sqsFifoPartitionMap flag. The existing
   --raftSqsMap is unchanged. Updated §11's rollout table accordingly.

4. (Low) §9 unit-test list was missing a case for
   FifoThroughputLimit=perQueue routing. The §3.3 short-circuit
   guard (added in the previous fix iteration) would silently
   regress without a test. Added: "FifoThroughputLimit=perQueue with
   PartitionCount=8 routes every group ID to partition 0."

Docs-only PR; no code changes, no implementation gate.
bootjp (Owner, Author) commented Apr 26, 2026

Pushed addressing all four findings from your fourth review (2 medium, 2 low):

3.C per-queue throttling doc:

  • (medium) §3.1 now requires SetQueueAttributes to call buckets.Delete(key) after the Raft commit so a config change takes effect on the next request, not after the 1h idle-eviction window. Race with LoadOrStore is benign (rebuilt bucket starts at full capacity, same as failover).
  • (low) §4.3 forward-compat fixed: bucket key for split-FIFO is (queueName, partitionID), not MessageGroupId — the latter would create unbounded buckets per group value.

3.D split-queue FIFO doc:

  • (medium) §5 now uses a new dedicated --sqsFifoPartitionMap flag for partition-to-Raft-group assignment instead of overloading --raftSqsMap (which keeps doing endpoint mapping for proxyToLeader). Eliminates the parsing ambiguity. §11 rollout table updated.
  • (low) §9 unit test list now includes FifoThroughputLimit=perQueue with PartitionCount=8 routes every group to partition 0 to lock down the §3.3 short-circuit guard against regression.

Docs-only changes; +9/-6 lines net.

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 26s


Review

  • Read updated design documents
  • Verify fix: §3.1 SetQueueAttributes bucket invalidation (3.C) — medium
  • Verify fix: §4.3 bucket key (queueName, partitionID) vs MessageGroupId (3.C) — low
  • Verify fix: §5 --sqsFifoPartitionMap dedicated flag (3.D) — medium
  • Verify fix: §9 FifoThroughputLimit=perQueue test case (3.D) — low
  • Check for remaining open Codex P1 items from prior rounds
  • Check for new issues
  • Overall assessment

All four issues from the trigger are correctly resolved. Two Codex P1 findings that first surfaced in my fourth-round review — and were noted there as "not in scope of this push" — remain open in the current doc. One new low-severity finding.


All four trigger fixes: resolved ✓

3.C — §3.1 SetQueueAttributes bucket cache invalidation (medium)
The paragraph is explicit and accurate: after the Raft commit, the handler calls buckets.Delete(key) for all three action-keys (Send, Recv, Default). The race with LoadOrStore is correctly noted as benign (fresh bucket starts at full capacity, same as failover). The 1h idle-eviction context makes the urgency of the invalidation clear.

3.C — §4.3 bucket key (queueName, partitionID) (low)
The corrected key is explicit, the MessageGroupId-as-key footgun is called out with the reason (unbounded attacker-amplifiable map), and the bucketKey struct is noted to gain a partition uint32 field when Phase 3.D lands. Consistent with §3.1's struct definition.

3.D — §5 --sqsFifoPartitionMap dedicated flag (medium)
The new flag is introduced with clear syntax (orders.fifo:8=group-7,...group-14), the separation from --raftSqsMap's endpoint-mapping role is stated explicitly and the reason (parsing ambiguity → wrong proxy target) is documented. The rollout table (§11, PR 4) also references the new flag by name.

3.D — §9 FifoThroughputLimit=perQueue test case (low)
The bullet is present, includes the concrete call form (FifoThroughputLimit = "perQueue" with PartitionCount = 8 → partition 0 for every group ID), and the regression rationale is spelled out ("the perQueue branch is a one-line guard that could easily be dropped during a refactor"). Good.


Still-open Codex P1 findings from d424d18d7c (surfaced in my last review, not addressed in this push)

These were called out as "not in scope of this push" in my fourth-round review summary but were not deferred to a follow-up issue — they are design-level gaps the implementation PRs will have to resolve.

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — DeleteQueue does not invalidate the bucket cache

§3.1 now correctly invalidates buckets on SetQueueAttributes, but says nothing about DeleteQueue. A queue deleted and recreated with the same name inherits the old in-memory bucket's token balance and — if the new queue was created without throttle config — the old bucket's capacity/refillRate as well, until the idle-eviction sweep (1h default). An operator who uses DeleteQueue+CreateQueue to reset a noisy queue's state gets a surprise: the throttle limits from the deleted queue follow the new one for up to an hour.

The fix is one sentence: "DeleteQueue calls buckets.Delete for all action-keys of the queue (same as SetQueueAttributes), so a recreated same-name queue starts with a clean bucket." Alternatively, include generation in bucketKey so old and new incarnations never share a bucket entry. Fix this →

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §4.2 — WaitTimeSeconds not bounded across the partition fanout

The fanout iterates partitions sequentially until MaxNumberOfMessages are collected. If each proxied sub-call independently long-polls for WaitTimeSeconds, an empty partitioned queue with WaitTimeSeconds = 20 and PartitionCount = 8 holds the connection for up to 160 s before returning empty — well past any reasonable client or load-balancer timeout. The doc notes "FIFO consumers are expected to use long-poll (WaitTimeSeconds)" as a mitigant for the N-round-trip cost, which inadvertently encourages the very configuration that creates the problem.

§4.2 should specify how the wait budget is managed: either (a) a single top-level deadline is threaded through as remaining_wait = WaitTimeSeconds - elapsed_so_far and passed to each sub-call, so the total wait is bounded by WaitTimeSeconds; or (b) partitions are probed in parallel and the first non-empty result short-circuits. Option (a) matches how AWS's internal partition fanout works; option (b) is simpler to implement but changes the per-partition polling semantics. Either choice should be explicit. Fix this →
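To make option (a) concrete, here is a minimal Go sketch of the shared-deadline fanout; receiveFromPartition is a hypothetical stand-in for the proxied per-partition long-poll, and nothing below is the design doc's actual pseudocode:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type message struct{ body string }

// receiveFromPartition stands in for one proxied long-poll against a
// single partition's Raft group; an empty partition consumes up to
// wait before returning nothing.
func receiveFromPartition(ctx context.Context, queue string, partition, want int, wait time.Duration) ([]message, error) {
	time.Sleep(wait)
	return nil, nil
}

// receiveFanout probes partitions in order but threads one shared
// deadline through every hop, so total wall-clock wait is bounded by
// the client's WaitTimeSeconds regardless of PartitionCount.
func receiveFanout(ctx context.Context, queue string, partitions []int, want int, wait time.Duration) ([]message, error) {
	deadline := time.Now().Add(wait)
	var out []message
	for _, p := range partitions {
		remaining := time.Until(deadline)
		if remaining < 0 {
			remaining = 0 // budget spent: later probes degrade to short polls
		}
		msgs, err := receiveFromPartition(ctx, queue, p, want-len(out), remaining)
		if err != nil {
			return nil, err
		}
		out = append(out, msgs...)
		if len(out) >= want {
			break
		}
	}
	return out, nil
}

func main() {
	start := time.Now()
	_, _ = receiveFanout(context.Background(), "orders.fifo", []int{0, 1, 2, 3, 4, 5, 6, 7}, 10, 2*time.Second)
	fmt.Println("empty queue, 8 partitions, total wait:", time.Since(start).Round(time.Second)) // ~2s, not 16s
}
```

With this shape, the first empty probe burns most of the budget and every later probe degrades toward a short poll, so the client's total wait never exceeds its requested WaitTimeSeconds.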


New finding

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.4 — reference to --raftSqsMap for partition routing contradicts §5

§3.4 states: "The existing kv/shard_router.go plus the multi-shard configuration flags (--raftSqsMap) already support this — each (queueName, partition) pair becomes its own routing key."

--raftSqsMap maps raftAddr=sqsAddr (endpoint pairs for proxyToLeader); it does not map (queueName, partition) → Raft group. §5 is explicit that partition-to-Raft-group assignment requires the new --sqsFifoPartitionMap flag and that --raftSqsMap is unchanged by this design. An implementor reading §3.4 would conclude partition routing is free with the existing flag, reach §5, and be confused about what needs to be added. The §3.4 sentence should say that while the router infrastructure (kv/shard_router.go) supports multi-group routing today, the partition-to-group assignment configuration requires the new --sqsFifoPartitionMap flag described in §5. Fix this →


Summary

All four trigger fixes are cleanly applied. The two Codex P1 items that have been open since d424d18d7cDeleteQueue bucket lifecycle and WaitTimeSeconds budget management across the fanout — are the only findings I'd call blocking before implementation PRs start. Both are design decisions the implementation code has to embody either way; resolving them in the doc now prevents the PR author from having to reverse a committed approach. The §3.4/§5 inconsistency is low-severity cleanup.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6bd33ab70


Comment on lines +153 to +155
- All four `Send*` / `Recv*` fields must be either both zero (disabled) or both positive.
- Capacity ≥ refill (otherwise the bucket can never burst above the steady state).
- A hard ceiling per queue (e.g. 100,000 RPS) so a typo (`SendCapacity = 1e9`) does not silently mean "no limit at all" but rejects with `InvalidAttributeValue`.

P1: Require bucket capacity to cover max batch token cost

The validator currently only enforces paired positivity and capacity >= refill, but §3.3 charges batch APIs by len(Entries) (up to 10). With configs like SendCapacity=5, a valid SendMessageBatch of 10 can never succeed because the bucket can never accumulate 10 tokens, so clients get permanent throttling and misleading Retry-After values. Please add validation that per-action capacity is at least the maximum single-request charge (or change the charging rule) to avoid creating unrecoverable throttling states.


Comment on lines +273 to +275
5. **Mixed-version cluster**: a rolling upgrade where some nodes have HT-FIFO and others don't. The new feature gates on the queue's `PartitionCount > 1` field, which is set at create time; old nodes that try to scan a partitioned queue's keyspace will simply not find anything (the prefix has changed). The catalog rejects `CreateQueue` with `PartitionCount > 1` until every node in the cluster reports the new feature flag.

**The capability advertisement mechanism**: each node's existing `/sqs_health` endpoint (`adapter/sqs.go: serveSQSHealthz`) gains a new field in its JSON body — `capabilities: ["htfifo"]` once this PR's code is in the binary. The catalog's CreateQueue handler reads the live node set from the distribution layer's node registry (the same registry used by `proxyToLeader` to locate leaders), polls `/sqs_health` on each, and gates `PartitionCount > 1` on every node reporting the `htfifo` capability. Nodes that don't respond within a short timeout are treated as not-yet-upgraded — a deliberate fail-closed default so a network blip does not let a partitioned queue land in a partially-upgraded cluster. This mirrors the §3.3.2 admin-forwarding upgrade gate from the admin dashboard design (PR #644), which uses the same "all-nodes-must-report" pattern for `AdminForward`.

P1: Enforce HT-FIFO capability beyond queue creation

This guard only blocks CreateQueue when not all nodes advertise htfifo, but the same section notes older nodes will not see partitioned-keyspace data. That leaves a failure mode where a downgraded/replaced old node joins after creation, becomes leader for a partitioned queue, and serves false-empty reads/writes against the wrong prefix. The proposal should require a runtime safeguard (startup or leadership admission) that prevents non-HT-FIFO binaries from serving once any PartitionCount > 1 queue exists.
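For reference, a rough sketch of the create-time capability poll the quoted paragraph describes. The capabilities JSON field comes from the quote; allNodesAdvertise, the address scheme, and the 500 ms timeout are assumptions:

```go
package sqs

import (
	"encoding/json"
	"net/http"
	"time"
)

type sqsHealth struct {
	Capabilities []string `json:"capabilities"`
}

// allNodesAdvertise returns true only if every node answers
// /sqs_health within the timeout AND lists the capability; a silent
// or slow node counts as not-yet-upgraded (fail closed).
func allNodesAdvertise(nodes []string, capability string) bool {
	client := &http.Client{Timeout: 500 * time.Millisecond}
	for _, addr := range nodes {
		resp, err := client.Get("http://" + addr + "/sqs_health")
		if err != nil {
			return false
		}
		var h sqsHealth
		err = json.NewDecoder(resp.Body).Decode(&h)
		resp.Body.Close()
		if err != nil {
			return false
		}
		found := false
		for _, c := range h.Capabilities {
			if c == capability {
				found = true
			}
		}
		if !found {
			return false
		}
	}
	return true
}
```

As the P1 notes, this check runs only inside CreateQueue, so by itself it cannot stop a downgraded node from later serving an already-created partitioned queue.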


…conds, §3.4)

Three findings from the fifth-round Claude review on PR #664:

3.C per-queue throttling — DeleteQueue bucket lifecycle (medium):
§3.1 now requires DeleteQueue to call buckets.Delete(key) for every
bucketKey of the deleted queue (Send, Recv, Default), so a same-name
recreation does not inherit stale token balance/limits for up to 1h.
Alternative (generation in bucketKey) considered and rejected as more
expensive than an explicit Delete.

3.D split-queue FIFO — WaitTimeSeconds budget across fanout (medium):
§4.2 now threads a shared deadline through the partition fanout
(step 3) and passes remainingWait per sub-call (step 4a/c/d). Without
this, an empty queue with WaitTimeSeconds=20 + PartitionCount=8 would
hold the connection for up to 160s. §8.1 also drops the contradictory
"long-poll mitigates the cost" sentence — long-poll was actually the
trigger for the bug.

3.D split-queue FIFO — §3.4 --raftSqsMap reference (low): rephrased
to distinguish the router infrastructure (kv/shard_router.go, which
already supports multi-group routing by key) from the configuration
surface (the new --sqsFifoPartitionMap flag in §5). Eliminates the
contradiction with §5.
@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed a commit addressing all three findings from your fifth-round review (2 medium, 1 low):

3.C per-queue throttling — §3.1 DeleteQueue bucket lifecycle (medium):

  • New paragraph immediately after the SetQueueAttributes invalidation block. DeleteQueue now also calls buckets.Delete(key) for every bucketKey of the deleted queue (Send, Recv, Default), so a DeleteQueue+CreateQueue with the same name no longer inherits the previous incarnation's token balance/limits for up to 1h. Documented the generation-in-bucketKey alternative and the reason for picking the explicit Delete (cheaper, mirrors the SetQueueAttributes pattern).

3.D split-queue FIFO — §4.2 WaitTimeSeconds budget across fanout (medium):

  • Pseudocode reworked: step 3 sets deadline := start + WaitTimeSeconds (cap 20s); step 4a/4c/4d compute remainingWait := max(0, deadline - now()) and pass it as the per-sub-call long-poll budget. Total wall-clock wait is now bounded by the original WaitTimeSeconds regardless of PartitionCount. Without this, an empty queue with WaitTimeSeconds=20 + PartitionCount=8 would hold the connection for up to 160s.
  • Added a "Why the shared deadline matters" paragraph documenting the bug class and the rejected alternative (parallel probe with shared cancellation — rejected because it erases the partitionOrder rotation).
  • §8.1 dropped the contradictory "long-poll mitigates the RTT cost" sentence; long-poll was actually the trigger for the bug.

3.D split-queue FIFO — §3.4 --raftSqsMap reference (low):

  • Rephrased to distinguish the router infrastructure (kv/shard_router.go, which already supports multi-group routing by key) from the configuration surface (--sqsFifoPartitionMap, the new flag from §5). Calls out explicitly that conflating the two would parse partition assignments as endpoint pairs and route to the wrong leader. Aligns with §5.

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 20s


Review

  • Read updated design documents
  • Verify fix: §3.1 DeleteQueue bucket invalidation (3.C) — medium
  • Verify fix: §4.2 WaitTimeSeconds shared deadline across fanout (3.D) — medium
  • Verify fix: §3.4 --raftSqsMap reference cleanup (3.D) — low
  • Check open Codex P1 items from last round
  • Check for new issues
  • Overall assessment

All three fixes from the trigger are correctly applied. Two Codex P1 items raised in the sixth-round review remain open and are the only findings I would consider blocking before implementation PRs. One new low-severity finding.


All three trigger fixes: resolved ✓

3.C — §3.1 DeleteQueue bucket lifecycle (medium)
The paragraph is explicit and accurate: after the Raft commit that purges the queue meta, buckets.Delete(key) is called for all three action-keys. The race with LoadOrStore is correctly noted as benign. The alternative (generation-in-bucketKey) is documented and the reason for rejecting it is clear. The 1 h idle-eviction context makes the urgency obvious.

3.D — §4.2 WaitTimeSeconds shared deadline (medium)
Step 3 sets deadline := start + WaitTimeSeconds (capped at 20s); step 4a computes remainingWait := max(0, deadline - now()) and passes it to each sub-call. The "Why the shared deadline matters" paragraph correctly explains the 160 s pathological case that the fix eliminates, and the rejected parallel-probe alternative is documented with the reason (erases partitionOrder rotation semantics). §8.1 no longer contains the contradictory long-poll mitigation sentence.

3.D — §3.4 --raftSqsMap reference cleanup (low)
The distinction is now explicit: kv/shard_router.go already supports multi-group routing (no changes needed there), but the configuration surface for partition-to-group assignment is --sqsFifoPartitionMap (§5), not --raftSqsMap. The consequence of conflating the two ("parse partition assignments as endpoint pairs and route to the wrong leader") is called out directly.


Still-open Codex P1 findings from f6bd33ab70 (sixth-round Codex review)

These were raised in the commit-by-commit Codex review on PR #664 and have not yet been addressed.

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.2 — validator does not bound capacity to max single-request charge

The validator enforces capacity >= refill but not capacity >= max_single_request_charge. The maximum single-request charge from the Send bucket is 10 (a full SendMessageBatch). A queue configured with SendCapacity = 5 creates a state where every SendMessageBatch of > 5 entries is permanently unserviceable — the bucket can only hold 5 tokens and can never accumulate 10. The client receives Throttling forever, Retry-After grows proportionally, and there is no way to recover without reconfiguring the queue. The Retry-After formula (§3.4) correctly computes ceil((10 - 5) / refillRate) but that only tells the client "wait N seconds" before they'll be throttled again.

The fix is one validation rule: each capacity that covers a batch verb must be at least the maximum charge of any single request through that bucket. For Send: SendCapacity >= 10. For Recv: RecvCapacity >= 10 (DeleteMessageBatch is also up to 10). Document the constraint in §3.2 and reject with InvalidAttributeValue at SetQueueAttributes. Fix this →
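A minimal sketch of the combined §3.2 rules with the proposed bound added; the function shape and names are illustrative, not the doc's committed validator:

```go
package sqs

import "fmt"

const (
	maxBatchEntries = 10      // SendMessageBatch / DeleteMessageBatch max entries
	hardCeiling     = 100_000 // per-queue ceiling from §3.2
)

// validatePair checks one (capacity, refill) pair. coversBatchVerb is
// true for the Send and Recv buckets, whose verbs can charge up to 10
// tokens in a single request.
func validatePair(name string, capacity, refill float64, coversBatchVerb bool) error {
	switch {
	case capacity == 0 && refill == 0:
		return nil // throttling disabled for this action
	case capacity <= 0 || refill <= 0:
		return fmt.Errorf("InvalidAttributeValue: %s capacity and refill must be both zero or both positive", name)
	case capacity < refill:
		return fmt.Errorf("InvalidAttributeValue: %s capacity must be >= refill", name)
	case coversBatchVerb && capacity < maxBatchEntries:
		// Without this rule, SendCapacity=5 makes every 10-entry batch
		// permanently unserviceable: the bucket can never hold 10 tokens.
		return fmt.Errorf("InvalidAttributeValue: %s capacity must be >= %d, the max single-request charge", name, maxBatchEntries)
	case capacity > hardCeiling:
		return fmt.Errorf("InvalidAttributeValue: %s capacity exceeds the hard ceiling of %d", name, hardCeiling)
	}
	return nil
}
```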

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §8 item 5 — no runtime safeguard against a downgraded node becoming leader after HT-FIFO queue creation

§8 item 5 says "the catalog rejects CreateQueue with PartitionCount > 1 until every node reports htfifo" — a create-time gate. But the same item acknowledges "older nodes that try to scan a partitioned queue's keyspace will simply not find anything." There is no runtime protection against the following sequence:

  1. All nodes have htfifo; a partitioned queue orders.fifo is created.
  2. Node A is upgraded back (or replaced by a rollback image) and rejoins without htfifo.
  3. Node A is elected leader for the partition-3 Raft group.
  4. Node A's ReceiveMessage scans the old prefix (!sqs|msg|data|orders.fifo|...) and finds nothing — false-empty reads.
  5. Node A's SendMessage writes the old key format — data on partition 3 is written to the wrong prefix and the reaper won't find it under the new prefix.

The create-time gate does not cover this; it only prevents new partitioned queues from being created in a mixed-version cluster. Once a partitioned queue exists, any node that serves it must also understand the new prefix.

The proposal needs a runtime safeguard: a node that does not have htfifo should refuse to become leader (or refuse to serve the SQS handler) for a Raft group that contains any queue with PartitionCount > 1. A practical implementation: on startup (or on leadership acquisition), query the catalog for partitioned queues in the local shard; if any exist and the binary lacks htfifo, refuse leadership with an explicit log message. This is the node-admission variant of the create-time gate. Fix this →
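A minimal sketch of that admission check, with every name below illustrative (the real hook would live on the leadership-acquisition path in kv/lease_state.go):

```go
package sqs

import "log"

var binaryHasHTFIFO bool // false in a rolled-back binary

// Stubs standing in for the catalog lookup and the Raft call.
func partitionedQueuesInShard(shardID uint64) ([]string, error) { return nil, nil }
func transferLeadership(shardID uint64) error                   { return nil }

// refuseIfDowngraded is the node-admission variant of the create-time
// gate: a binary without htfifo steps down rather than serve a shard
// containing a partitioned queue under the wrong key prefix.
func refuseIfDowngraded(shardID uint64) error {
	if binaryHasHTFIFO {
		return nil
	}
	queues, err := partitionedQueuesInShard(shardID) // any queue with PartitionCount > 1
	if err != nil {
		return err
	}
	if len(queues) > 0 {
		log.Printf("refusing leadership of shard %d: %d partitioned queue(s) present, binary lacks htfifo", shardID, len(queues))
		return transferLeadership(shardID) // fail closed; the group may go leaderless
	}
	return nil
}
```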


New finding

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — bucketKey struct definition and DeleteQueue paragraph are inconsistent on whether partitionID is in the current key

The struct definition in §3.1 is:

```go
type bucketKey struct {
    queue  string
    action string  // "Send" | "Receive" | "*"
}
```

But the DeleteQueue paragraph (added in this latest push) says:

"so we keep bucketKey = (queueName, partitionID, action) for now"

partitionID is not in the current struct. §4.3 explicitly says the struct "grows a partition uint32 field" when Phase 3.D lands. The DeleteQueue paragraph should say (queueName, action)partitionID is a future addition. An implementor reading §3.1 end-to-end will see the struct definition conflict with the inline text, and if they include partitionID in Phase 3.C based on the paragraph, the §4.3 forward-compat note will be confusing. Fix this →


Summary

All three trigger fixes are cleanly applied. The two Codex P1 items from the sixth-round review — bucket capacity bound on max batch charge (3.C §3.2) and runtime safeguard for downgraded nodes post-HT-FIFO creation (3.D §8) — are the only findings I'd call blocking before the implementation PRs. Both represent design gaps that will have to be resolved in code regardless; better to settle the spec now. The bucketKey inconsistency is a low-severity cleanup.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 624d411a93



No global lock is held during step 3; concurrent traffic on different queues runs in parallel.

**Cache invalidation on `SetQueueAttributes`**: when an operator updates the throttle config via `SetQueueAttributes`, the handler — *after* the Raft commit that persists the new `sqsQueueThrottle` — calls `buckets.Delete(key)` for every `bucketKey` belonging to the updated queue (`Send`, `Recv`, `Default`). Without this step the in-memory bucket would keep enforcing the old limits until the idle-eviction sweep removes the stale entry (default 1 h window), defeating the operator's intent to throttle a noisy tenant in real time. The `LoadOrStore` race with the `Delete` call is benign: the next request rebuilds from the freshly-committed meta, and the rebuilt bucket starts at full capacity (same semantics as the failover case documented below). Claude P1 on PR #664 caught the gap.

P1: Use canonical action keys when deleting throttle buckets

The design defines bucketKey.action as "Send" | "Receive" | "*", but this invalidation step deletes ("Send", "Recv", "Default"). If implemented as written, Receive and default buckets will not be removed on SetQueueAttributes, so updated limits can remain stale until idle eviction (and the same mismatch appears in the DeleteQueue path), which directly defeats the real-time reconfiguration behavior this section is trying to guarantee.



`PartitionCount` is **immutable after first SendMessage**. The validator on `SetQueueAttributes` rejects any change; operators who want a different partition count create a new queue. Why immutable: changing it would require re-hashing every existing message into a new partition, which (a) breaks ordering for in-flight messages of every group whose hash bucket changed, and (b) is a multi-second / multi-minute operation that cannot be expressed as one OCC transaction.

P1: Keep FIFO routing attributes immutable after first send

This section only freezes PartitionCount, but routing also depends on FifoThroughputLimit (perQueue forces partition 0 in partitionFor). Allowing SetQueueAttributes to flip throughput mode after messages exist can move the same MessageGroupId to a different partition mid-stream, splitting one group across partitions and violating FIFO within-group ordering; the proposal should lock routing-affecting attributes after first send (or require queue recreation).


…rade, key)

Three findings from the sixth-round review on PR #664:

3.C per-queue throttling §3.2 — capacity bound on max batch charge
(Codex P1, medium): SetQueueAttributes validator now requires capacity
>= max single-request charge. Send/Recv buckets must be >= 10 when
non-zero (full SendMessageBatch / DeleteMessageBatch is 10 entries).
Without this rule, SendCapacity=5 makes every full batch permanently
unserviceable — the bucket can never accumulate enough tokens.

3.D split-queue FIFO §8 item 5 — runtime safeguard for downgraded
leaders (Codex P1, medium): create-time gate only blocks new
partitioned queues. Once one exists, a downgraded node lacking
htfifo capability could be elected leader and produce false-empty
reads + key-prefix corruption. Adds a node-admission rule:
on startup + leadership acquisition, enumerate local-shard queues
with PartitionCount > 1; if any and the binary lacks htfifo,
refuse leadership and step down via TransferLeadership.

3.C per-queue throttling §3.1 — DeleteQueue paragraph bucketKey
text (Claude low): "(queueName, partitionID, action)" contradicted
the struct definition above it (no partitionID until Phase 3.D §4.3).
Corrected to "(queueName, action) for Phase 3.C".
@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed addressing all three findings from your sixth-round review (2 medium Codex P1 + 1 low Claude):

3.C per-queue throttling — §3.2 capacity ≥ max batch charge (Codex P1, medium):

  • New fourth bullet on the SetQueueAttributes validator: when SendCapacity > 0 it must also be ≥ 10, and likewise RecvCapacity > 0 → ≥ 10. Justified inline against the §3.3 charging table (SendMessageBatch / DeleteMessageBatch charge up to 10 each). Documented the failure mode the rule prevents (permanently unserviceable batches with no recovery short of re-running SetQueueAttributes) and the rejection envelope (InvalidAttributeValue with the per-bucket minimum named so the operator sees the cause). Default* is exempt because no batch verb is in scope of the catch-all bucket today; explicit forward note for when one moves there.

3.D split-queue FIFO — §8 item 5 runtime safeguard for downgraded leaders (Codex P1, medium):

  • New paragraph after the capability advertisement mechanism. Describes the failure sequence the create-time gate misses (downgrade after a partitioned queue exists → false-empty reads + key-prefix corruption from the rolled-back leader). Specifies the safeguard: on startup and on leadership acquisition for an SQS Raft group, enumerate local-shard queues with PartitionCount > 1; if any exist and the binary lacks htfifo, refuse leadership with an explicit log line and TransferLeadership away (group becomes leaderless if no peer is willing — desired fail-closed behaviour). Hooks into the existing leadership-acquisition path in kv/lease_state.go so steady-state cost is zero. Explicitly required before the §11 PR 4 rollout step that marks the binary htfifo-eligible.

3.C per-queue throttling — §3.1 bucketKey text/struct consistency (Claude low):

  • The DeleteQueue paragraph said "we keep bucketKey = (queueName, partitionID, action) for now" but the struct definition above has only queue + action (no partitionID until Phase 3.D per §4.3). Corrected to "for Phase 3.C we keep bucketKey = (queueName, action) per the struct definition above — the partition field is added later by Phase 3.D, see §4.3".

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 10s


Review

  • Read updated design documents
  • Verify fix: §3.2 capacity ≥ max batch charge (3.C) — Codex P1 medium
  • Verify fix: §8 item 5 runtime safeguard for downgraded leaders (3.D) — Codex P1 medium
  • Verify fix: §3.1 bucketKey text/struct consistency (3.C) — Claude low
  • Check for new issues (including open Codex P1 from last round)
  • Overall assessment

All three fixes from this push are correctly applied. Two Codex P1 findings from the previous round's Codex review (on commit 624d411a93) were not in scope of this push and remain open. One new low-severity finding.


All three trigger fixes: resolved ✓

3.C — §3.2 capacity ≥ max batch charge (Codex P1, medium)
The fourth validation bullet is precise: SendCapacity > 0 → SendCapacity ≥ 10, RecvCapacity > 0 → RecvCapacity ≥ 10, justified by the §3.3 charging table. The failure mode (permanent unserviceability for full batches) is documented inline, the rejection envelope (InvalidAttributeValue) is named, and the Default* exemption is correctly scoped with an explicit forward note. The Codex P1 attribution is clear. Clean.

3.D — §8 item 5 runtime safeguard for downgraded leaders (Codex P1, medium)
The new paragraph accurately describes the failure sequence the create-time gate misses (downgrade after partitioned queue exists → false-empty reads + key-prefix corruption). The safeguard is concrete: catalog enumeration on startup + leadership acquisition, TransferLeadership on capability mismatch, fail-closed if no willing peer. The hook into kv/lease_state.go, the zero steady-state cost claim, and the ordering constraint relative to §11 PR 4 are all present and correct.

3.C — §3.1 bucketKey text/struct consistency (Claude low)
The DeleteQueue paragraph now reads "for Phase 3.C we keep bucketKey = (queueName, action) per the struct definition above — the partition field is added later by Phase 3.D, see §4.3." Consistent with the struct definition on lines 92–95 and the forward-compat note in §4.3. The previous partitionID mention is gone.


Still-open Codex P1 findings from the commit 624d411a93 Codex review

These were new findings added by Codex on the last commit round; they were not in scope of this push.

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — action key names in cache invalidation don't match struct definition

The bucketKey.action field is defined at line 94 as "Send" | "Receive" | "*". But both the SetQueueAttributes and DeleteQueue invalidation paragraphs (lines 114 and 116) delete keys for "Send", "Recv", "Default":

calls buckets.Delete(key) for every bucketKey belonging to the updated queue (Send, Recv, Default)

"Recv""Receive" and "Default""*". An implementation that follows the struct definition will never match the keys the invalidation code tries to delete, so Receive and Default buckets will remain stale after a config update — defeating the correctness guarantee the SetQueueAttributes invalidation paragraph is trying to establish. The fix is one-word changes in both paragraphs: use "Receive" and "*" (matching the struct), or update the struct comment to use "Recv" / "Default" and define those as the canonical values. Fix this →

[Medium] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.2 — FifoThroughputLimit and DeduplicationScope are mutable after first send

Line 106 states only PartitionCount is immutable after first SendMessage. But routing in partitionFor also depends on FifoThroughputLimit — the perQueue short-circuit (added in an earlier round) collapses all groups to partition 0 regardless of PartitionCount. If an operator flips FifoThroughputLimit from perMessageGroupId to perQueue after messages exist, every MessageGroupId that was previously distributed across N partitions now routes to partition 0, splitting those groups across multiple partitions and violating within-group FIFO ordering for any in-flight messages. Similarly, DeduplicationScope affects how the dedup key is scoped; changing it live would either duplicate or suppress messages unexpectedly.

The fix is extending the immutability sentence: "PartitionCount, FifoThroughputLimit, and DeduplicationScope are all immutable after first SendMessage" — since all three affect routing or dedup semantics in ways that cannot be changed without a full drain. Fix this →
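To see why the attribute is routing-affecting, here is a sketch of the routing step: FNV-1a and the power-of-two mask come from the design, while the function shape is assumed:

```go
package sqs

import "hash/fnv"

// partitionFor routes a message group to a partition. The perQueue
// short-circuit pins every group to partition 0, so flipping
// FifoThroughputLimit after messages exist re-routes live groups and
// breaks within-group ordering.
func partitionFor(groupID, fifoThroughputLimit string, partitionCount uint32) uint32 {
	if fifoThroughputLimit == "perQueue" || partitionCount <= 1 {
		return 0
	}
	h := fnv.New32a() // FNV-1a: deterministic across processes and architectures
	h.Write([]byte(groupID))
	return h.Sum32() & (partitionCount - 1) // partitionCount is a power of two
}
```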


New finding

[Low] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §6 — integration test example uses an attribute value that §3.2 now rejects

§6 item 2 reads:

"Configure a queue with SendCapacity=5 SendRefillPerSecond=1. Send 5 messages back-to-back → all 200."

But §3.2's new fourth validation bullet requires SendCapacity ≥ 10 when non-zero. A SetQueueAttributes call with SendCapacity=5 would now return InvalidAttributeValue before the test even creates its first message — meaning the test as written would fail at setup, not at message 6. The example should use SendCapacity=10 (or higher) so the test is consistent with the validator it is supposed to be testing against. Fix this →


Summary

All three trigger fixes are cleanly applied. The two Codex P1 findings from the previous round's review — action key naming mismatch in cache invalidation (3.C §3.1) and FifoThroughputLimit / DeduplicationScope mutability (3.D §3.2) — are the only items I'd consider blocking before implementation PRs. Both are design-level decisions that implementation code would have to make regardless; getting the doc right now prevents the PRs from encoding the wrong choice. The integration test inconsistency is low-severity cleanup.

…immutability, test capacity)

Three findings from the seventh-round review on PR #664:

3.C per-queue throttling §3.1 -- action key naming consistency
(Codex P1, medium): bucketKey.action canonical values are
"Send"|"Receive"|"*" but cache invalidation paragraphs were saying
"Send","Recv","Default" (the JSON config field-name prefixes).
An implementation following the struct would never match the
wrong key names, leaving Receive and "*" buckets stale after
config changes. Changed both invalidation paragraphs to use the
struct vocabulary, and added a "Config-field -> bucket-action
mapping" paragraph after the sqsQueueThrottle struct so the
short-form / canonical-form distinction is explicit.

3.D split-queue FIFO §3.2 -- FifoThroughputLimit and
DeduplicationScope must also be immutable (Codex P1, medium):
prior text only marked PartitionCount as immutable. But changing
FifoThroughputLimit (perMessageGroupId<->perQueue) flips the §3.3
short-circuit on partition routing -- in-flight messages stay where
they are, new sends route differently, within-group FIFO ordering
silently breaks. Changing DeduplicationScope reshapes the dedup
key so the prior scope's de-duped messages can resurrect or new
sends can be incorrectly suppressed. Extended the immutability
sentence to cover all three with one bullet per attribute
explaining the failure mode.

3.C per-queue throttling §6 item 2 -- test capacity inconsistent
with new validator (Claude low): integration test used
SendCapacity=5 but §3.2's new fourth bullet requires
SendCapacity >= 10. Test would have failed at setup, not on the
6th send. Bumped to SendCapacity=10 / 10 messages back-to-back /
11th throttled, with an inline note explaining why the value
changed.
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed a commit addressing all three findings from your seventh-round review (2 medium Codex P1 + 1 low Claude):

3.C per-queue throttling — §3.1 action key naming consistency (Codex P1, medium):

  • Both invalidation paragraphs (SetQueueAttributes, DeleteQueue) now read (Send, Receive, * — the canonical action values from the bucketKey struct above). The prior (Send, Recv, Default) was the JSON config field-name vocabulary, not the in-memory map key — implementations following the struct would have left Receive and * buckets stale.
  • Added a new "Config-field → bucket-action mapping" paragraph immediately after the sqsQueueThrottle struct definition. Explicitly maps Send* → bucketKey{action:"Send"}, Recv* → bucketKey{action:"Receive"}, Default* → bucketKey{action:"*"}, and pins which vocabulary to use where (config-field for the JSON contract / GetQueueAttributes response; bucket-action for in-memory bucket-store ops). The implementation must apply the mapping after the SetQueueAttributes commit.

3.D split-queue FIFO — §3.2 immutability extended to FifoThroughputLimit + DeduplicationScope (Codex P1, medium):

  • Sentence rewritten: "PartitionCount, FifoThroughputLimit, and DeduplicationScope are all immutable after first SendMessage." Three bullets explain the failure mode for each:
    • PartitionCount — re-hashing every message, breaks in-flight ordering, multi-second op (existing reasoning).
    • FifoThroughputLimit — flipping perMessageGroupId ↔ perQueue activates/deactivates the §3.3 short-circuit; in-flight messages stay where they are, new sends for the same group route to a different partition, within-group FIFO is silently violated.
    • DeduplicationScope — changing scope reshapes the dedup key ((queue, dedupId) vs (queue, partitionId, dedupId) vs (queue, MessageGroupId, dedupId)); narrowing scope can resurrect previously-deduped duplicates, widening can suppress legitimately-new sends that collide with still-cached prior-scope entries.
  • Concluding sentence pins the unifying invariant: each attribute participates in a routing or dedup decision whose correctness depends on every existing message having been written under one consistent value, and live mutation creates incompatible "before"/"after" sets the runtime cannot reconcile without a full drain.
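A sketch of the dedup-key reshaping the third bullet describes; scope names and key layout here are illustrative, not the doc's committed encoding:

```go
package sqs

import "fmt"

// dedupKey shows why DeduplicationScope must be immutable: each scope
// yields a differently-shaped key, so entries written under one scope
// are invisible to lookups made under another.
func dedupKey(scope, queue string, partition uint32, groupID, dedupID string) string {
	switch scope {
	case "queue":
		return fmt.Sprintf("%s|%s", queue, dedupID)
	case "partition":
		return fmt.Sprintf("%s|%d|%s", queue, partition, dedupID)
	default: // per-message-group scope
		return fmt.Sprintf("%s|%s|%s", queue, groupID, dedupID)
	}
}
```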

3.C per-queue throttling — §6 item 2 test capacity (Claude low):

  • Bumped from SendCapacity=5 SendRefillPerSecond=1 / 5 messages / 6th throttled to SendCapacity=10 SendRefillPerSecond=1 / 10 messages / 11th throttled. Added a parenthetical inline note explaining why the value changed (the new §3.2 validator would have rejected SendCapacity=5 at setup).

@claude review

…e, PR 2 dormancy gate)

Two Codex P1 findings from the tenth-round Codex review on commit
14b4d88 that were carried forward to today's eleventh-round
Claude review:

3.D split-queue FIFO §3.2 / §4.1 -- {PartitionCount > 1,
DeduplicationScope = "queue"} validation moved to control plane
(Codex P1, medium): the old §4.1 paragraph rejected this
combination at SendMessage time, so an operator who mis-configured
at CreateQueue would get a successful response and only discover
the problem when every send failed -- a created-but-unserviceable
queue with no recovery short of DeleteQueue+CreateQueue. Added a
new "Cross-attribute validation at CreateQueue and
SetQueueAttributes" paragraph to §3.2 with the rejection rule
(InvalidParameterValue plus the AWS-shaped reason "queue-scoped
deduplication is incompatible with multi-partition FIFO because
the dedup key cannot be globally unique across partitions without
a cross-partition OCC transaction"). Reframed the §4.1 paragraph
as a "cannot reach this code" pointer to the §3.2 gate so the
runtime rejection is impossible.

3.D split-queue FIFO §11 -- explicit dormancy gate for PR 2-4
(Codex P1, medium): "feature is dormant" was editorial intent, not
a runtime guarantee. A cluster on PR 2-4 would accept
CreateQueue(PartitionCount=4), dispatch every SendMessage against
the legacy single-partition keyspace (partitionIndex=0 because the
data plane has not been wired yet), then have those messages be
invisible to the partition-aware fanout reader and reaper that
land in PR 5. PR 2 row now adds a temporary CreateQueue rejection
("PartitionCount > 1 requires HT-FIFO data plane -- not yet
enabled"); PR 3-4 rows note the gate is still in place; PR 5 row
explicitly removes it in the same commit that wires the fanout so
the gate-and-lift land atomically. New paragraph below the table
documents why the gate exists and why atomic gate-and-lift makes
the wrong-layout-data class of bug impossible.
@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed a commit addressing the two Codex P1 mediums you flagged as still open from the tenth-round Codex review on 14b4d88a82. You're right — both were genuinely missed and both need to land before PR 2.

3.D split-queue FIFO — control-plane gate for {PartitionCount > 1, DeduplicationScope = "queue"} (Codex P1, medium):

  • New "Cross-attribute validation at CreateQueue and SetQueueAttributes" paragraph in §3.2. Rejection envelope is InvalidParameterValue with the AWS-shaped reason: "queue-scoped deduplication is incompatible with multi-partition FIFO because the dedup key cannot be globally unique across partitions without a cross-partition OCC transaction."
  • Reframed the §4.1 paragraph from a runtime rejection to a "cannot reach this code path because §3.2 rejects at control-plane time" pointer. The runtime rejection is now impossible — the operator gets the validation error at CreateQueue where they can cheaply fix the config, not at the first SendMessage.

3.D split-queue FIFO — explicit PR 2 dormancy gate (Codex P1, medium):

  • PR 2 row now describes a temporary CreateQueue rejection: PartitionCount > 1 returns InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data plane — not yet enabled"). The schema field exists in the meta type but cannot land in production data.
  • PR 3 + 4 rows note the gate is still in place.
  • PR 5 row explicitly removes the gate in the same commit that wires the fanout — the gate-and-lift are atomic so a half-deployed cluster can never accept a partitioned queue without the data plane to serve it.
  • New "Why the temporary gate" paragraph below the rollout table walks through the failure sequence the gate prevents (PR 2–4 cluster accepts CreateQueue(PartitionCount=4)SendMessage writes to legacy !sqs|msg|data|<queue>|… prefix → PR 5 reader looks for partitioned !sqs|msg|data|p|<queue>|<partition>|… and finds nothing → silent message loss invisible to reaper too).

Both fixes lock down real production-time bugs the design would otherwise leave unspecified for the implementer to discover.
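Taken together, the two gates amount to a pair of checks in the CreateQueue path; a rough sketch, with htFIFODataPlaneEnabled and validateCreateQueue as illustrative names:

```go
package sqs

import "fmt"

// htFIFODataPlaneEnabled is flipped by the same PR 5 commit that
// wires the partition-aware fanout; until then it stays false.
var htFIFODataPlaneEnabled bool

func validateCreateQueue(partitionCount int, deduplicationScope string) error {
	// Cross-attribute gate from §3.2.
	if partitionCount > 1 && deduplicationScope == "queue" {
		return fmt.Errorf("InvalidParameterValue: queue-scoped deduplication is incompatible with multi-partition FIFO")
	}
	// Temporary dormancy gate for PR 2-4, removed atomically with the
	// PR 5 data plane so partitioned queue meta can never land without
	// a reader that understands the partitioned key layout.
	if partitionCount > 1 && !htFIFODataPlaneEnabled {
		return fmt.Errorf("InvalidAttributeValue: PartitionCount > 1 requires HT-FIFO data plane, not yet enabled")
	}
	return nil
}
```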

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 48s


Review

  • Read updated design documents
  • Verify fix: §3.2 cross-attribute validation for {PartitionCount > 1, DeduplicationScope = "queue"} (3.D) — Codex P1 medium
  • Verify fix: §11 PR 2 dormancy gate for PartitionCount > 1 (3.D) — Codex P1 medium
  • Check for new issues
  • Overall assessment

Both fixes from this push are correctly applied. No open Codex P1 items remain from any prior round. Two new low-severity findings.


Both trigger fixes: resolved ✓

3.D — §3.2 cross-attribute validation for {PartitionCount > 1, DeduplicationScope = "queue"} (Codex P1, medium)
Lines 115–116 contain the new "Cross-attribute validation at CreateQueue and SetQueueAttributes" paragraph. The rejection envelope is InvalidParameterValue with the AWS-shaped reason quoted verbatim; the failure mode it prevents (created-but-unserviceable queue with no recovery short of DeleteQueue+CreateQueue) is documented inline; the guidance that §4.x rejection paragraphs should be reframed as "cannot reach this code" notes is present. §4.1 (line 184) correctly describes the {DeduplicationScope = queue, PartitionCount > 1} combination as unreachable because the §3.2 control-plane gate prevents it — the runtime rejection that was previously in §4.1 is gone. Clean.

3.D — §11 PR 2 dormancy gate + "Why the temporary gate" paragraph (Codex P1, medium)
The PR 2 row (line 371) now carries the explicit temporary gate: CreateQueue rejects PartitionCount > 1 with InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data plane — not yet enabled") with an explicit note that the schema field exists in the meta type but cannot land in production data. PR 3 and 4 rows correctly state the gate remains in place. The PR 5 row explicitly removes the gate in the same commit as the data-plane fanout — the atomic gate-and-lift is the key correctness guarantee. The "Why the temporary gate" paragraph (line 379) accurately describes the failure sequence the gate prevents: a cluster on PR 2–4 accepts a CreateQueue(PartitionCount=4) → SendMessage writes to the legacy !sqs|msg|data|<queue>|… prefix → PR 5 fanout reader looks for !sqs|msg|data|p|<queue>|<partition>|… and finds nothing → silent message loss invisible to the reaper too. Clean.


No open items from prior rounds ✓

All Codex P1 and Claude-flagged items across all prior review rounds are now resolved. The two trigger fixes close the last open design gaps identified before PR 2 can start.


New findings

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §3.2 — cross-attribute rule at SetQueueAttributes is permanently unreachable

§3.2's cross-attribute validation paragraph says "the same validator that enforces immutability also rejects incoherent attribute combinations" and names SetQueueAttributes alongside CreateQueue. But both PartitionCount and DeduplicationScope are immutable — SetQueueAttributes cannot change either, so no SetQueueAttributes call can ever present the {PartitionCount > 1, DeduplicationScope = "queue"} combination. The cross-attribute rule is only ever triggered by CreateQueue. An implementor following the paragraph literally would write dead code in the SetQueueAttributes handler with no indication it is harmless dead code.

The fix is a single clarifying sentence: "In practice this rule is only reachable at CreateQueue time — SetQueueAttributes cannot present the combination because both PartitionCount and DeduplicationScope are immutable and will be rejected by the immutability check before the cross-attribute check runs." Fix this →

[Low] docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md §11 — PR 4 "Mixed-version gate" does not explicitly list the §8 runtime leadership-refusal safeguard

§8's "Runtime safeguard for downgraded leaders" paragraph closes with: "Implementations of Phase 3.D's PR set must include this safeguard before the rollout step that marks the binary htfifo-eligible (§11 PR 4)."

PR 4's rollout description (line 373) says: "Routing layer: kv/shard_router.go accepts the (queue, partition) key. New --sqsFifoPartitionMap flag. Mixed-version gate." The "Mixed-version gate" label likely covers the §8.5 capability advertisement mechanism (/sqs_health + catalog polling), but the leadership-refusal check (the on-startup and on-leadership-acquisition catalog query that triggers TransferLeadership if any partitioned queue exists on a non-htfifo node) is a distinct implementation concern from capability advertisement. An implementor writing PR 4 from the table alone could deliver the /sqs_health capability badge and the CreateQueue guard without implementing the kv/lease_state.go admission hook, and the table would appear satisfied.

The fix is a parenthetical in the PR 4 row: "Mixed-version gate (§8 capability advertisement via /sqs_health plus leadership-refusal hook in kv/lease_state.go for non-htfifo nodes that encounter a partitioned queue on startup or leadership acquisition)." Fix this →


Summary

Both trigger fixes are cleanly applied, and all open items from prior rounds are resolved. The two new findings are low-severity documentation clarifications that would help implementation PR authors avoid confusion — neither represents a design gap that blocks PR 2 from starting. The proposal docs are in excellent shape.

…ion, PR 4 hook scope)

Two low-severity findings from the thirteenth-round Claude review on
PR #664:

3.D split-queue FIFO §3.2 -- cross-attribute rule at
SetQueueAttributes is dead code (Claude low): the prior wording
said the rule applies at both CreateQueue and SetQueueAttributes,
but immutability fires first on SetQueueAttributes so the
{PartitionCount > 1, DeduplicationScope = queue} combination can
never reach the cross-attribute check on that path. Added a
clarifying paragraph so PR 2 author knows the SetQueueAttributes
side is unreachable rather than missing test coverage.

3.D split-queue FIFO §11 PR 4 -- "Mixed-version gate" expanded
(Claude low): the prior label could be read as covering only the
§8.5 capability advertisement (/sqs_health + catalog polling). The
§8 leadership-refusal hook in kv/lease_state.go is a distinct
implementation concern and §8 explicitly says it must land before
PR 4 marks the binary htfifo-eligible. Expanded the parenthetical
to enumerate both components so the PR 4 author cannot miss the
runtime hook.
@bootjp
Owner Author

bootjp commented Apr 26, 2026

Pushed a commit addressing both low findings from your thirteenth-round review:

3.D split-queue FIFO — §3.2 cross-attribute rule at SetQueueAttributes is dead code (low):

  • New clarifying paragraph after the cross-attribute validation block: "In practice this rule is only reachable at CreateQueue time. SetQueueAttributes cannot present the {PartitionCount > 1, DeduplicationScope = "queue"} combination because both PartitionCount and DeduplicationScope are immutable: any change to either is caught by the immutability check (described in the next paragraph) before the cross-attribute check has the chance to run. Implementing the cross-attribute rule on the SetQueueAttributes path would therefore be dead code — it is harmless to add for symmetry, but PR 2's author should know it is unreachable rather than spend time hunting for the test case that hits it."

3.D split-queue FIFO — §11 PR 4 "Mixed-version gate" expanded (low):

  • PR 4 row's "Mixed-version gate" parenthetical now explicitly lists both components required before the binary is marked htfifo-eligible: (1) §8.5 capability advertisement via /sqs_health + catalog polling for CreateQueue gating, and (2) the §8 leadership-refusal hook in kv/lease_state.go that calls TransferLeadership when a non-htfifo binary discovers a partitioned queue in its shard on startup or leadership acquisition. PR 4 author cannot now satisfy the table by delivering only the capability badge.

@claude review

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 5s


Review

  • Read updated design documents
  • Verify fix: §3.2 dead-code clarification for SetQueueAttributes cross-attribute rule (3.D) — low
  • Verify fix: §11 PR 4 "Mixed-version gate" expanded to enumerate both components (3.D) — low
  • Check for new issues
  • Overall assessment

Both fixes from this push are correctly applied. No open items remain from any prior round. No new issues found.


Both trigger fixes: resolved ✓

3.D — §3.2 dead-code clarification for SetQueueAttributes cross-attribute rule (low)
Line 117 now reads: "In practice this rule is only reachable at CreateQueue time. SetQueueAttributes cannot present the {PartitionCount > 1, DeduplicationScope = "queue"} combination because both PartitionCount and DeduplicationScope are immutable: any change to either is caught by the immutability check (described in the next paragraph) before the cross-attribute check has the chance to run. Implementing the cross-attribute rule on the SetQueueAttributes path would therefore be dead code — it is harmless to add for symmetry, but PR 2's author should know it is unreachable rather than spend time hunting for the test case that hits it."

The reasoning is correct: any SetQueueAttributes call that would present the combination must carry a differing DeduplicationScope value (since the stored value is set at CreateQueue and cannot be "queue" while PartitionCount > 1 per the cross-attribute rule), and that differing value is caught by the immutability check before the cross-attribute check runs. PR 2's author now has explicit guidance not to burn time on dead code.

3.D — §11 PR 4 "Mixed-version gate" expanded (low)
Line 375 now reads: "Mixed-version gate (§8.5 capability advertisement via /sqs_health + catalog polling for CreateQueue gating, and the §8 leadership-refusal hook in kv/lease_state.go that calls TransferLeadership when a non-htfifo binary discovers a partitioned queue in its shard on startup or leadership acquisition — both components are required before the binary is marked htfifo-eligible)."

Both the capability-advertisement side (§8.5, needed for the CreateQueue gate) and the runtime leadership-admission side (the kv/lease_state.go hook, needed to prevent a downgraded node from serving partitioned queues) are explicitly enumerated, and the "both components are required" note removes any ambiguity about scope for the PR 4 author.


No open items from prior rounds ✓

All Codex P1 and Claude-flagged items across all prior review rounds are resolved.


No new findings ✓

I read both docs end-to-end. One minor observation that does not warrant a fix request: line 115 still names SetQueueAttributes alongside CreateQueue in the cross-attribute validation section heading ("Cross-attribute validation at CreateQueue and SetQueueAttributes"), while line 117 immediately explains that the SetQueueAttributes path is dead code. The two are consistent — document both paths, then explain one is unreachable — and an implementor reading both lines gets the right mental model. The alternative would be to narrow the heading to just CreateQueue, but the current form is clear enough and changing it would touch only a cosmetic detail.


Summary and merge recommendation

Both trigger fixes are cleanly applied, and the docs have no remaining open items from any review round. The proposal documents for Phase 3.C (per-queue throttling) and Phase 3.D (split-queue FIFO) are in excellent shape. Ready to merge. Implementation PR 2 can begin from this base.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d246cc3e99


Comment on lines +114 to +118
**Cache invalidation on `SetQueueAttributes`**: when an operator updates the throttle config via `SetQueueAttributes`, the handler — *after* the Raft commit that persists the new `sqsQueueThrottle` — calls `buckets.Delete(key)` for every `bucketKey` belonging to the updated queue (`Send`, `Receive`, `*` — the canonical action values from the `bucketKey` struct above). Without this step the in-memory bucket would keep enforcing the old limits until the idle-eviction sweep removes the stale entry (default 1 h window), defeating the operator's intent to throttle a noisy tenant in real time. The `LoadOrStore` race with the `Delete` call is benign: the next request rebuilds from the freshly-committed meta, and the rebuilt bucket starts at full capacity (same semantics as the failover case documented below). Claude P1 on PR #664 caught the gap.

**Cache invalidation on `DeleteQueue`**: when a queue is deleted, the handler — *after* the Raft commit that purges the queue meta — calls `buckets.Delete(key)` for every `bucketKey` belonging to the deleted queue (`Send`, `Receive`, `*` — the canonical action values from the `bucketKey` struct above), mirroring the `SetQueueAttributes` path above. Without this step, a `DeleteQueue` immediately followed by `CreateQueue` with the same name would inherit the previous incarnation's in-memory bucket (current token balance, capacity, refill rate) until the 1 h idle-eviction sweep removes the stale entry — so an operator using `DeleteQueue`+`CreateQueue` to reset a noisy queue's state would be surprised to see the old throttle still in effect. (Alternative considered: include the queue `generation` in `bucketKey` so old and new incarnations cannot collide structurally; the explicit `Delete` is cheaper and matches the `SetQueueAttributes` pattern, so for Phase 3.C we keep `bucketKey = (queueName, action)` per the struct definition above — the `partition` field is added later by Phase 3.D, see §4.3 — and document the lifecycle requirement here.) Claude P1 on PR #664 fifth-round review.

The bucket map is per-process. On leader failover, a fresh bucket starts at full capacity on the new leader — there is no Raft replication of bucket state. **Why this is correct**: the worst-case behaviour of "fresh bucket on failover" is that a noisy queue gets one extra burst worth of bandwidth right after a leader change. Replicating bucket state would cost a Raft commit per token decrement, which would defeat the entire point of the token bucket. AWS's own rate limiter has the same property at region failover boundaries.

P1: Invalidate throttle caches on leader changes

This design only deletes bucket entries in the SetQueueAttributes/DeleteQueue handler on the current leader, while the cache is explicitly per-process; a follower that previously led the queue can keep an old (queue, action) bucket until idle eviction and later reuse it after failover. In that scenario the new leader can enforce stale limits/token balance (or a prior queue incarnation) despite recent control-plane updates, so the documented “fresh bucket on failover” behavior is not guaranteed. Please add a leadership-time invalidation/version check (or keying that includes queue incarnation/config version) so failover cannot resurrect stale throttle state.
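One way to honour the "fresh bucket on failover" contract would be to wipe the per-process map whenever this node acquires SQS leadership; a minimal sketch, with all names illustrative:

```go
package sqs

import "sync"

type throttleBuckets struct{ m sync.Map }

// onLeadershipAcquired clears every cached bucket so an entry kept
// from a previous term (or a previous queue incarnation) can never be
// reused; each queue rebuilds from freshly-read meta on first use.
func (b *throttleBuckets) onLeadershipAcquired() {
	b.m.Range(func(key, _ any) bool {
		b.m.Delete(key)
		return true
	})
}
```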


Implements the design from
docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md.

Schema (sqs_catalog.go):
- sqsQueueMeta gains an optional Throttle *sqsQueueThrottle field
  with six float64 sub-fields (Send/Recv/Default x Capacity/RefillPerSecond).
- Six new ThrottleSendCapacity / ThrottleSendRefillPerSecond /
  ThrottleRecvCapacity / ThrottleRecvRefillPerSecond /
  ThrottleDefaultCapacity / ThrottleDefaultRefillPerSecond attributes
  go through the standard sqsAttributeAppliers dispatch.
- validateThrottleConfig enforces the cross-field rules from the
  design: each (capacity, refill) pair must be both-zero or
  both-positive; capacity must be >= refill so the bucket can burst;
  Send/Recv capacities must be >= 10 (the SendMessageBatch /
  DeleteMessageBatch max charge) so a small capacity does not make
  every full batch permanently unserviceable; per-field hard ceiling
  of 100k. Validator runs once at the end of applyAttributes so a
  multi-field update sees the post-apply state as a whole (see the sketch after this list).
- queueMetaToAttributes surfaces the configured Throttle* fields so
  GetQueueAttributes("All") round-trips. Extracted into
  addThrottleAttributes to stay under the cyclop ceiling.
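
A sketch of the per-pair rule referenced in the validator bullet above, under assumed names; the signature and error text are illustrative, and per the round-5 fix recorded further down, `requireBatchCapacity` ends up true for the Default pair as well:

```go
package throttle

import "fmt"

const (
	throttleMaxValue    = 100_000 // per-field hard ceiling
	throttleBatchCharge = 10      // SendMessageBatch / DeleteMessageBatch max charge
)

// validateThrottlePair checks one (capacity, refill) pair against the
// cross-field rules listed above.
func validateThrottlePair(name string, capacity, refill float64, requireBatchCapacity bool) error {
	switch {
	case capacity == 0 && refill == 0:
		return nil // pair unset: this action stays unthrottled
	case capacity <= 0 || refill <= 0:
		return fmt.Errorf("%s: capacity and refill must be both zero or both positive", name)
	case capacity < refill:
		return fmt.Errorf("%s: capacity must be >= refill so the bucket can burst", name)
	case requireBatchCapacity && capacity < throttleBatchCharge:
		return fmt.Errorf("%s: capacity below the max batch charge %d", name, throttleBatchCharge)
	case capacity > throttleMaxValue || refill > throttleMaxValue:
		return fmt.Errorf("%s: value exceeds hard ceiling %d", name, throttleMaxValue)
	}
	return nil
}
```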

Bucket store (sqs_throttle.go, new):
- bucketStore: sync.Map[bucketKey]*tokenBucket; per-bucket sync.Mutex
  on the hot path so cross-queue traffic never serialises on a
  process-wide lock. Per Gemini medium on PR #664 the single-mutex
  alternative was rejected. Lazy idle-evict sweep removes inactive
  buckets after 1h to bound memory.
- charge(): lock-free Load, LoadOrStore on miss (race-tolerant -- both
  racers compute identical capacity/refill from the same meta), then
  per-bucket lock for refill + take + release. Refill is elapsed *
  refillRate, capped at capacity. On reject, computes Retry-After
  from the actual refillRate and the requestedCount per the §3.4
  formula (numerator is requested count, not 1, so a batch verb does
  not get told to retry in 1s when it really needs 10s; see the sketch after this list).
- resolveActionConfig() handles the Default* fall-through: when only
  DefaultCapacity is set, Send and Recv requests share the same
  bucketKey{action:"*"} so Default behaves as one shared cap rather
  than three independent quotas.
- invalidateQueue() drops every bucket belonging to a queue. Called
  after the Raft commit on SetQueueAttributes / DeleteQueue so new
  limits take effect on the very next request, not after the 1h
  idle-evict sweep. Without this step the §3.1 cache-invalidation
  contract fails and operators see stale enforcement.
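
A minimal sketch of the refill-and-take arithmetic the charge() bullet describes; the type and method names are assumptions, while the refill formula, the batch-reject semantics, and the §3.4 numerator follow the text:

```go
package throttle

import (
	"sync"
	"time"
)

type tokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	capacity   float64
	refillRate float64 // tokens per second
	lastRefill time.Time
}

// charge refills by elapsed wall time, then takes n tokens or rejects
// without taking any (a rejected batch preserves its partial credit).
func (t *tokenBucket) charge(n float64, now time.Time) (ok bool, retryAfter time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()

	// Refill is elapsed * refillRate, capped at capacity.
	t.tokens += now.Sub(t.lastRefill).Seconds() * t.refillRate
	if t.tokens > t.capacity {
		t.tokens = t.capacity
	}
	t.lastRefill = now

	if t.tokens < n {
		// §3.4: the numerator is the requested count, so a 10-entry
		// batch is told to wait for 10 tokens, not 1.
		wait := (n - t.tokens) / t.refillRate
		return false, time.Duration(wait * float64(time.Second))
	}
	t.tokens -= n
	return true, 0
}
```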

Charging (sqs_messages.go, sqs_messages_batch.go):
- All seven message-plane handlers wired:
  SendMessage / SendMessageBatch (Send bucket; batch charges by entry
  count), ReceiveMessage / DeleteMessage / DeleteMessageBatch /
  ChangeMessageVisibility / ChangeMessageVisibilityBatch (Receive
  bucket).
- Throttle check sits OUTSIDE the OCC retry loop per §4.2 -- a
  rejected request never reaches the coordinator, so the existing
  sendMessageWithRetry et al. cannot busy-loop on a permanent
  rate-limit failure.
- sendMessage extracted into prepareSendMessage + validateSend +
  body to stay under the cyclop ceiling once the throttle branch
  was added.

Error envelope (sqs.go):
- New sqsErrThrottling code + writeSQSThrottlingError helper that
  writes 400 + AWS-shaped JSON body + x-amzn-ErrorType + Retry-After
  header. The envelope is the same shape AWS uses; SDKs that key off
  x-amzn-ErrorType handle it without changes.
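
A hedged sketch of that rejection path: the handler name and the JSON field names are placeholders rather than the exact AWS envelope; only the 400 status and the two headers follow the text above:

```go
package throttle

import (
	"encoding/json"
	"math"
	"net/http"
	"strconv"
	"time"
)

// writeThrottlingError sketches the rejection path: 400 status plus
// x-amzn-ErrorType and Retry-After headers, then a JSON body whose
// field names here are placeholders only.
func writeThrottlingError(w http.ResponseWriter, retryAfter time.Duration) {
	w.Header().Set("x-amzn-ErrorType", "ThrottlingException")
	w.Header().Set("Retry-After", strconv.Itoa(int(math.Ceil(retryAfter.Seconds()))))
	w.WriteHeader(http.StatusBadRequest)
	_ = json.NewEncoder(w).Encode(map[string]string{
		"__type":  "ThrottlingException", // placeholder shape
		"message": "Rate exceeded",
	})
}
```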

Tests:
- adapter/sqs_throttle_test.go: 18 unit tests covering bucket math
  (fresh capacity, refill elapsed, refill cap, batch reject preserves
  partial credit, Retry-After uses requested count, sub-1-RPS floor,
  per-action / per-queue / Default-fallthrough isolation,
  invalidateQueue, concurrent -race, default-off short-circuit) and
  the validator (nil/empty canonicalisation, both-zero-or-both-positive,
  capacity >= 10 batch floor, Default* exempt, capacity >= refill,
  parseThrottleFloat range checks, computeRetryAfter floor).
- adapter/sqs_throttle_integration_test.go: 8 end-to-end tests
  covering the §6 testing strategy (default-off allows unbounded,
  send/recv reject after capacity with correct envelope + Retry-After,
  batch charges by entry count, SetQueueAttributes invalidation,
  DeleteQueue + CreateQueue lifecycle invalidation, GetQueueAttributes
  round-trip, validator rejects below batch min). Run with -race
  clean.

Out of scope for this PR (deferred to a follow-up):
- Prometheus counter (sqs_throttled_requests_total) and gauge
  (sqs_throttle_tokens_remaining) per the design's §4.1 monitoring/
  registry.go entry. The wiring needs a new prometheus collector and
  hooking through the Registry seam; punted to keep this PR focused
  on the data-plane behaviour. The bucket store's chargeOutcome
  already exposes tokensAfter so the gauge wiring is one site.
- §6.5 cross-protocol (Query) parity test: the Query protocol layer
  lives on PR #662 and the throttle envelope already has the same
  shape on both protocols (writeSQSError handles both), so the test
  is a follow-up after #662 lands.
- §6.6 failover behaviour test: the §3.1 "fresh bucket on failover"
  contract is implementation-correct (the bucket map is per-process,
  no Raft state) but a 3-node failover test needs the cluster
  scaffolding and is best added with the Jepsen workload.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 2m 25s.


Review

  • Read current state of both design documents
  • Verify last two Claude low fixes from previous push
  • Check open Codex P1 from latest Codex review (d246cc3e99)
  • Check for new issues
  • Overall assessment

Both trigger fixes from last push: resolved ✓

3.D — §3.2 dead-code clarification for SetQueueAttributes cross-attribute rule (low)
Line 117 correctly explains that the cross-attribute {PartitionCount > 1, DeduplicationScope = "queue"} rule is dead code on the SetQueueAttributes path (the immutability check fires first), and that an implementor adding it for symmetry should know it is unreachable. PR 2 author now has the right mental model.

3.D — §11 PR 4 "Mixed-version gate" expanded (low)
Line 375 explicitly names both required components: (1) the §8.5 capability advertisement + catalog polling gate for CreateQueue, and (2) the §8 kv/lease_state.go leadership-refusal hook for downgraded nodes. Both are tagged as required before the binary is marked htfifo-eligible. The PR 4 author cannot satisfy the table by delivering only one of the two.


Still-open Codex P1 from the d246cc3e99 review

[Medium] docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md §3.1 — "fresh bucket on failover" is incorrect for nodes that were previously leaders

The doc states (line 118):

"The bucket map is per-process. On leader failover, a fresh bucket starts at full capacity on the new leader — there is no Raft replication of bucket state. Why this is correct: the worst-case behaviour of 'fresh bucket on failover' is that a noisy queue gets one extra burst worth of bandwidth right after a leader change."

This claim is only true when the failover target has never been the leader for that queue. When a node cycles back to leadership (A → B → A), node A retains its in-memory bucket from its previous leadership term and uses it on re-election. The Codex P1 scenario:

  1. Node A is leader; queue Q is configured with SendCapacity=10, SendRefillPerSecond=1. Node A handles 10 messages → bucket is at 0 tokens.
  2. Operator calls SetQueueAttributes on node B (the new leader), raising SendCapacity=100, SendRefillPerSecond=100. Node B deletes its own bucket entry for (Q, Send) and lazily rebuilds from the new meta.
  3. Node A (now follower) still has its old bucket at 0 tokens in its local sync.Map — the buckets.Delete call ran on node B's process, not node A's.
  4. Node B fails; node A becomes leader again.
  5. Node A loads its stale bucket (0 tokens, old 10-token capacity) from sync.Map. Every SendMessage is throttled with Retry-After: 10 — until the idle-eviction sweep fires (up to 1 h later).

The "one extra burst" characterisation in the doc applies only to the case of first-time leadership. The actual worst case under the Codex scenario is up to 1 hour of incorrect throttle enforcement — not a burst, the opposite.

The natural fix is to move bucket invalidation from the leader's HTTP handler to the Raft FSM apply path for the SetQueueAttributes entry. The FSM apply runs on every node (leaders and followers alike), so:

  • When SetQueueAttributes is committed, the FSM's apply on node A also calls buckets.Delete(key) for the updated queue.
  • When node A next becomes leader, its bucket cache is clean and lazily rebuilds from the freshly-committed meta.
  • The clock/bucket lifecycle behaviour is identical to the first-time-leader case.

An alternative is to flush the bucket store for all local-shard queues on leadership acquisition (a one-shot purge at the start of each new leader term). This is simpler to implement but slightly more aggressive. Either approach is acceptable; the doc should specify which one and update the "Why this is correct" paragraph to accurately describe what the guarantee is after the fix.


No new findings ✓

Both docs were read end-to-end. No new issues found beyond the Codex P1 above.


Summary

Both Claude lows from the previous push are correctly applied. The one open item is the Codex P1 from the last Codex review — the "fresh bucket on failover" claim and the bucket invalidation placement (leader HTTP handler vs. FSM apply). This is the only finding I'd consider blocking before the implementation PRs begin: it is a design decision that PR 2's author would otherwise have to resolve in code without guidance from the proposal doc. Once addressed, the docs are ready to merge and PR 2 can start.

bootjp added 15 commits April 27, 2026 19:48
…ini high/medium)

Six fixes from the first round of automated reviews on PR #679:

Codex P1 -- bucket reconciliation on stale config (sqs_throttle.go):
loadOrInit returned cached buckets without checking that capacity /
refillRate still match the queue's current Throttle config. After a
leadership change, a node retaining buckets from a prior leader term
would keep enforcing the prior term's limits even after a new
SetQueueAttributes had committed -- the invalidation only runs on the
leader that processed the commit, so a different leader's stale
buckets survive. Now compares cap/refill on every Load hit and
rebuilds (Delete + LoadOrStore) on mismatch.

Codex P1 -- invalidate only on actual throttle change (sqs_catalog.go,
sqs_throttle.go): cache invalidation in setQueueAttributes ran
unconditionally after every successful commit, including unrelated-
field updates and no-op writes. Result: any caller could silently
restore a noisy tenant's burst capacity by writing a no-op
SetQueueAttributes. Now gated on throttleAttributesPresent(in.Attributes)
which checks the request for any Throttle* key. Bucket reconciliation
above acts as the safety net if a future code path bypasses the gate.

Codex P2 -- attributesEqual covers Throttle (sqs_catalog.go):
CreateQueue idempotency relied on attributesEqual which did not
include Throttle*. A re-create with different limits was treated as
idempotent and silently kept the old limits. Now compares the full
Throttle struct via throttleConfigEqual; baseAttributesEqual extracted
to keep cyclop under the ceiling.

Gemini high -- thread throttle through existing meta load
(sqs_messages.go, sqs_throttle.go): chargeQueue did one Pebble read
per request even though the hot-path handlers (sendMessage,
receiveMessage) load the meta moments later. Added
chargeQueueWithThrottle that takes pre-loaded throttle config; both
hot-path handlers now load meta once and pass throttle in. Throttle
check now sits AFTER the QueueDoesNotExist branch so a missing queue
no longer consumes a token. Batch + delete handlers keep chargeQueue
(one extra meta read) -- low-QPS verbs where the simplification of
not pulling meta out of the retry loop is worth the per-call cost.

Gemini high -- move sweep off hot path (sqs.go, sqs_throttle.go):
maybeSweep ran the O(N) sync.Map.Range on whichever request was
unlucky enough to trigger the per-minute sweep, causing latency
spikes on many-queue clusters. Replaced with runSweepLoop on a
background ticker tied to s.reaperCtx (started in Run alongside the
existing message reaper, cleaned up by the same reaperCancel in
Stop). The hot-path charge() no longer calls into the sweep at all.

Gemini medium -- cap retry-after duration (sqs_throttle.go):
computeRetryAfter could compute a multi-day Retry-After (or worse,
overflow time.Duration arithmetic) for pathologically small
refillRate / large requested values. Capped at throttleRetryAfterCap
(1h, matching the bucket idle-evict window). Cap is applied before
the Duration multiplication so overflow is impossible.
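
A sketch of the capped computation with assumed names; the 1-second floor is an assumption (the unit tests mention a computeRetryAfter floor without pinning its value):

```go
package throttle

import "time"

// throttleRetryAfterCap matches the bucket idle-evict window.
const throttleRetryAfterCap = time.Hour

// computeRetryAfter caps the float seconds BEFORE converting to a
// Duration, so the multiplication below can never overflow.
func computeRetryAfter(requested, tokens, refillRate float64) time.Duration {
	secs := (requested - tokens) / refillRate
	if secs > throttleRetryAfterCap.Seconds() {
		secs = throttleRetryAfterCap.Seconds()
	}
	if secs < 1 {
		secs = 1 // assumed floor: never emit Retry-After: 0
	}
	return time.Duration(secs * float64(time.Second))
}
```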

New tests:
- TestBucketStore_ReconcilesBucketOnConfigChange pins the Codex P1
  reconciliation contract.
- TestComputeRetryAfter_CapsAtMaximum pins the Gemini medium cap.
- TestThrottleAttributesPresent covers the request-gate helper used
  by the conditional invalidation.

All tests pass under -race; golangci-lint clean.
…est (PR #679 round 2)

Two items from the round-2 Claude review on PR #679:

(1) bucketStore.sweepMu and bucketStore.lastSweep became dead code
when the move from on-hot-path maybeSweep to the background
runSweepLoop landed in round 1. The sweep() ticker is the only
caller, so the serialisation primitive that protected against
concurrent on-hot-path callers no longer has a job. Removed both
fields and the sweepMu.Lock/lastSweep=now/sweepMu.Unlock block;
sweep() now reads b.clock() inline. Field comment updated to
describe the post-runSweepLoop design (single-goroutine driver).

(2) TestSQSServer_CatalogCreateIsIdempotent now exercises
throttleConfigEqual via a fourth case: same name, same
VisibilityTimeout, but the second create adds Throttle* attributes.
attributesEqual must notice the diff and the call must reject as
QueueNameExists. Without this case a future bug in
throttleConfigEqual (e.g. always returning true) would slip past
the existing VisibilityTimeout-only test.

Both findings noted in the round-2 Claude review on PR #679.
Round-3 Claude review on PR #679 caught two stale comments left over
from the runSweepLoop refactor:

throttleIdleEvictAfter said "no goroutine; lookups call sweep()
opportunistically" -- both clauses are now false (runSweepLoop is
the goroutine, hot-path charge() never calls sweep()).

throttleEvictSweepEvery said "bounds how often the sweep runs from
the hot path" -- the hot path no longer runs sweep at all; this
constant is now just the background ticker interval.

Both updated to describe the post-runSweepLoop design and to
reinforce that the sweep cost is amortised across the goroutine
ticker rather than concentrated on whichever request was unlucky
enough to trigger it (which was the old behaviour Gemini high
flagged in PR #679 round 1).
…(PR #679 round 4)

Round-4 Claude review on PR #679 surfaced a TOCTOU race in the
loadOrInit reconciliation path the round-1 fix introduced.

The race: two concurrent goroutines hit the same stale bucket. A
detects mismatch, Delete + LoadOrStore stores freshA. B (which
also Loaded the stale bucket before A's Delete) then runs its own
Delete -- which under the previous unconditional Delete would
evict freshA. B's LoadOrStore then stores freshB. Net effect: a
charge() that obtained freshA via the in-flight LoadOrStore from
A's path keeps charging a bucket that is no longer in the map,
while subsequent requests get freshB at full capacity. Total
tokens consumed across the mismatch window can exceed the new
capacity by one bucket's worth.

Fix: replace the unconditional b.buckets.Delete(key) with
b.buckets.CompareAndDelete(key, v) where v is the original stale
pointer. CompareAndDelete is a no-op when the map already holds
someone else's fresh bucket, so the racer that arrives second
takes the winner's fresh bucket via LoadOrStore (which returns
the existing entry) instead of orphaning it.
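
A sketch of the reconciliation path with the fix applied (sync.Map.CompareAndDelete needs Go 1.20+); all names are assumed:

```go
package throttle

import "sync"

type bucketKey struct{ queue, action string }

type tokenBucket struct {
	mu         sync.Mutex
	tokens     float64
	capacity   float64
	refillRate float64
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
}

// loadOrInit reconciles a cached bucket against the current config.
// CompareAndDelete(key, v) removes the entry only while the map still
// holds this exact pointer, so a racer's fresh bucket is never orphaned.
func (b *bucketStore) loadOrInit(key bucketKey, capacity, refill float64) *tokenBucket {
	if v, ok := b.buckets.Load(key); ok {
		bkt := v.(*tokenBucket)
		bkt.mu.Lock()
		matches := bkt.capacity == capacity && bkt.refillRate == refill
		bkt.mu.Unlock()
		if matches {
			return bkt
		}
		b.buckets.CompareAndDelete(key, v) // no-op if already replaced
	}
	fresh := &tokenBucket{capacity: capacity, refillRate: refill, tokens: capacity}
	actual, _ := b.buckets.LoadOrStore(key, fresh)
	return actual.(*tokenBucket)
}
```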

Test: TestBucketStore_ConcurrentReconciliationRespectsNewCapacity
seeds a stale bucket (cfgOld, drained), then races 200 goroutines
through a charge with cfgNew (capacity 50). Asserts exactly 50
successes -- a Delete-after-replace race would let some past the
cap. Runs under -race; the new test plus the existing
TestBucketStore_ConcurrentChargesPreserveCount and
TestBucketStore_ReconcilesBucketOnConfigChange cover the
concurrent-fresh-bucket, concurrent-stale-mismatch, and
sequential-mismatch cases.
Implements PR 2 of the §11 multi-PR rollout from
docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md. The schema
fields land on sqsQueueMeta with the design's wire types; the
validator enforces the §3.2 cross-attribute rules; a temporary
CreateQueue gate rejects PartitionCount > 1 until PR 5 lifts the
gate atomically with the data-plane fanout.

Schema (sqs_catalog.go):
- sqsQueueMeta gains three optional fields:
  * PartitionCount uint32 — number of FIFO partitions; 0/1 means
    legacy single-partition; > 1 enables HT-FIFO.
  * FifoThroughputLimit string — "perMessageGroupId" (default for
    HT-FIFO) or "perQueue" (collapses every group to partition 0).
  * DeduplicationScope string — "messageGroup" (default for HT-FIFO)
    or "queue" (legacy single-window).
- attributesEqual extended via baseAttributesEqual + htfifoAttributesEqual
  split (kept under cyclop ceiling).
- queueMetaToAttributes surfaces the configured HT-FIFO fields via
  addHTFIFOAttributes so GetQueueAttributes("All") round-trips.

Routing (sqs_partitioning.go, new):
- partitionFor(meta, messageGroupId) uint32 implements §3.3:
  FNV-1a over MessageGroupId & (PartitionCount - 1). Edge cases
  documented in the godoc — PartitionCount 0/1 → 0, FifoThroughputLimit
  perQueue short-circuits to 0, empty MessageGroupId → 0. A sketch follows this list.
- The bitwise-mask optimisation requires PartitionCount be a power
  of two; the validator enforces it. A future bug that leaks a
  non-power-of-two value is caught by TestPartitionFor_PowerOfTwoMaskingMatchesMod.
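
The routing sketch referenced in the first bullet, using the standard 64-bit FNV-1a constants and a trimmed stand-in for sqsQueueMeta; the inlined loop mirrors the hot-path change recorded in a later commit rather than the hash/fnv package:

```go
package adapter

const (
	fnvOffset64 uint64 = 14695981039346656037
	fnvPrime64  uint64 = 1099511628211
)

type sqsQueueMeta struct {
	PartitionCount      uint32
	FifoThroughputLimit string
}

// partitionFor mirrors the §3.3 routing described above. The inlined
// FNV-1a loop avoids the hash.Hash allocation on the send hot path.
func partitionFor(meta *sqsQueueMeta, messageGroupID string) uint32 {
	if meta == nil || meta.PartitionCount <= 1 {
		return 0 // legacy single-partition layout
	}
	if meta.FifoThroughputLimit == "perQueue" {
		return 0 // perQueue collapses every group to partition 0
	}
	if messageGroupID == "" {
		return 0
	}
	h := fnvOffset64
	for i := 0; i < len(messageGroupID); i++ {
		h ^= uint64(messageGroupID[i])
		h *= fnvPrime64
	}
	// The validator guarantees PartitionCount is a power of two, so
	// the mask is equivalent to h % PartitionCount.
	return uint32(h & uint64(meta.PartitionCount-1))
}
```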

Validation (sqs_partitioning.go):
- validatePartitionConfig enforces the §3.2 cross-attribute rules (sketched after this list):
  * PartitionCount must be a power of two in [1, htfifoMaxPartitions=32].
  * FifoThroughputLimit / DeduplicationScope are FIFO-only.
  * {PartitionCount > 1, DeduplicationScope = "queue"} rejects with
    InvalidParameterValue (incoherent params, vs InvalidAttributeValue
    for malformed individual values) per §3.2 cross-attribute gate.
- Runs in parseAttributesIntoMeta (after resolveFifoQueueFlag, so
  the IsFIFO check sees the post-resolution flag) and in
  trySetQueueAttributesOnce (after applyAttributes; IsFIFO comes
  from the loaded meta).
- validatePartitionImmutability is the §3.2 rule that PartitionCount,
  FifoThroughputLimit, and DeduplicationScope are immutable from
  CreateQueue. SetQueueAttributes is all-or-nothing: a request that
  touches a mutable attribute alongside an attempted immutable
  change rejects the whole request before persisting either.
- snapshotImmutableHTFIFO captures the pre-apply values so the
  immutability check has a clean before/after pair.
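
A sketch of the validator's decision tree; the parameter shape and error text are illustrative, not the real signatures:

```go
package adapter

import "errors"

const htfifoMaxPartitions = 32

func isPowerOfTwo(n uint32) bool {
	return n != 0 && n&(n-1) == 0
}

// validatePartitionConfig sketches the §3.2 rules named above.
func validatePartitionConfig(isFIFO bool, count uint32, throughputLimit, dedupScope string) error {
	if count > 1 && (count > htfifoMaxPartitions || !isPowerOfTwo(count)) {
		return errors.New("PartitionCount must be a power of two in [1, 32]")
	}
	if !isFIFO && (throughputLimit != "" || dedupScope != "") {
		return errors.New("FifoThroughputLimit / DeduplicationScope are FIFO-only")
	}
	if count > 1 && dedupScope == "queue" {
		// Incoherent params: InvalidParameterValue, per the cross-attribute gate.
		return errors.New("PartitionCount > 1 requires DeduplicationScope=messageGroup")
	}
	return nil
}
```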

Dormancy gate (sqs_partitioning.go + sqs_catalog.go):
- validatePartitionDormancyGate is the §11 PR 2 temporary gate that
  rejects CreateQueue with PartitionCount > 1 with
  InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data
  plane — not yet enabled"). The schema field exists in the meta
  type but no partitioned data can land. Removed in PR 5 in the same
  commit that wires the data-plane fanout — gate-and-lift is atomic
  so a half-deployed cluster can never accept a partitioned queue
  without the data plane to serve it.

Tests:
- adapter/sqs_partitioning_test.go: 14 unit tests covering
  partitionFor (legacy zero/one, perQueue short-circuit, empty
  MessageGroupId, determinism, distribution within ±5% across 8
  partitions on 100k samples, power-of-two masking < N), isPowerOfTwo,
  validatePartitionConfig (power-of-two, max cap, FIFO-only,
  cross-attr InvalidParameterValue, single-partition + queue-dedup OK),
  validatePartitionDormancyGate (rejects > 1, allows 0/1),
  validatePartitionImmutability (per-attribute change rejects, same-
  value no-op succeeds), htfifoAttributesPresent.
- adapter/sqs_partitioning_integration_test.go: 7 end-to-end tests
  against a real createNode cluster covering the wire surface:
  dormancy gate rejects PartitionCount > 1 (and the gate's reason
  surfaces to the operator), allows PartitionCount=1, validator
  rejects non-power-of-two, FIFO-only rejection on Standard queue,
  cross-attr {partitions, queue dedup} rejects, immutability rejects
  SetQueueAttributes change with same-value no-op succeeds, all-or-
  nothing — combined mutable+immutable rejects entirely without
  persisting the mutable change, GetQueueAttributes round-trip.
- All tests pass under -race; golangci-lint clean.

Out of scope for this PR (§11 PR 3-8):
- PR 3: keyspace threading — partitionIndex through every sqsMsg*Key
  constructor, defaulting to 0 so existing queues stay byte-identical.
  Gate from PR 2 still in place.
- PR 4: routing layer + --sqsFifoPartitionMap flag + mixed-version
  gate (§8.5 capability advertisement + §8 leadership-refusal hook
  in kv/lease_state.go). Gate still in place.
- PR 5: send/receive partition fanout, receipt-handle v2, removes
  the dormancy gate atomically with the data-plane fanout.
- PR 6: PurgeQueue/DeleteQueue partition iteration + tombstone +
  reaper.
- PR 7: Jepsen HT-FIFO workload + metrics.
- PR 8: doc lifecycle bump.
Three medium-priority Gemini findings + one cyclop fix from the
rebase onto the updated PR #679 throttling branch.

(1) sqsErrInvalidParameterValue removed (sqs_catalog.go,
sqs_partitioning.go, sqs_partitioning_test.go): the constant
duplicated sqsErrValidation which already carries
"InvalidParameterValue". Reuses the existing constant; comment on
the validator clarifies the choice.

(2) snapshotImmutableHTFIFO heap allocation gated on
htfifoAttributesPresent (sqs_catalog.go): the snapshot is only
needed when the request actually carries an HT-FIFO attribute. A
SetQueueAttributes that touches only mutable attributes (e.g.
VisibilityTimeout — the common path) now skips the allocation
entirely. The validator short-circuit on preApply == nil keeps
correctness identical: a request with no HT-FIFO attribute can
never change an HT-FIFO field.

(3) partitionFor inlined FNV-1a (sqs_partitioning.go): the
hash/fnv.New64a + h.Write([]byte(messageGroupID)) path heap-
allocates the byte slice and pays the hash.Hash interface dispatch
on every SendMessage. Replaced with a manual FNV-1a loop over the
string — measurably faster on the routing hot path. MessageGroupId
is capped at 128 chars by validation so the loop is bounded.

(4) cyclop (trySetQueueAttributesOnce): the conditional preApply
branch + immutability check + Throttle validator pushed cyclop to
12. Extracted applyAndValidateSetAttributes so the function stays
under the ceiling; the helper is also useful for any future
admin-surface callers that want the same apply+validate flow
without the OCC dispatch.

All tests pass under -race; golangci-lint clean.
…eanup

Three items from the round-2 Claude review on PR #681:

(1) Bug (latent, live in PR 5): validatePartitionConfig only guarded
FifoThroughputLimit / DeduplicationScope as FIFO-only, not
PartitionCount > 1. A Standard queue with PartitionCount=2 would
slip past the validator after PR 5 lifts the dormancy gate. Added
a guard inside the !meta.IsFIFO block: PartitionCount > 1 →
InvalidAttributeValue("PartitionCount > 1 is only valid on FIFO
queues"). PartitionCount 0 / 1 stay accepted on Standard (both
mean "single-partition layout"). Test
TestValidatePartitionConfig_StandardQueueRejectsHTFIFOAttrs
extended to cover PartitionCount=2/4/8/16/32 → reject and
PartitionCount=0/1 → accept.

(2) Low: TestSQSServer_HTFIFO_RejectsQueueScopedDedupOnPartitioned
had two contradictory comments. The function-level doc said "Test
sets PartitionCount=1 to bypass dormancy" but the test actually
sends PartitionCount=2; the inline comment claimed "dormancy fires
first" but validatePartitionConfig (inside parseAttributesIntoMeta)
runs before validatePartitionDormancyGate, so the cross-attr rule
fires first. Rewrote the doc to describe the actual control-plane
order (validatePartitionConfig → cross-attr rejection → wire 400)
and removed the misleading "dormancy fires first" assertion.

(3) Nit: replaced the custom 11-line contains/indexOf helpers with
strings.Contains. Identical behaviour, idiomatic.

All findings noted in the round-2 Claude review on PR #681.
…2) (#681)

## Summary

Phase 3.D **PR 2** (schema + validators + dormancy gate) per the §11
multi-PR rollout in
[`docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md`](https://github.com/bootjp/elastickv/blob/docs/sqs-phase3-proposals/docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md)
(currently on PR #664).

Stacks on **PR #679 (Phase 3.C)**: branched off
`feat/sqs-throttling-phase3c` so the 3.D schema work shares the same
base as the throttling implementation. Will be rebased once PR #679
merges to main.

- `sqsQueueMeta` gains `PartitionCount uint32`, `FifoThroughputLimit
string`, `DeduplicationScope string`.
- `partitionFor(meta, messageGroupId) uint32` implements §3.3 — FNV-1a
over `MessageGroupId`, masked by `PartitionCount-1` (validator-enforced
power-of-two so the bitwise mask is equivalent to mod). Edge cases:
`PartitionCount` 0/1 → 0, `FifoThroughputLimit=perQueue` short-circuits
to 0, empty `MessageGroupId` → 0.
- `validatePartitionConfig` enforces the §3.2 cross-attribute rules:
power-of-two `[1, htfifoMaxPartitions=32]`,
`FifoThroughputLimit`/`DeduplicationScope` are FIFO-only,
`{PartitionCount > 1, DeduplicationScope = "queue"}` rejects with
`InvalidParameterValue` at control-plane time so the operator sees the
error before the first `SendMessage`.
- `validatePartitionImmutability` enforces the §3.2 immutability rule
(the three HT-FIFO fields are immutable from `CreateQueue` onward).
`SetQueueAttributes` is **all-or-nothing** — a request that touches a
mutable attribute alongside an attempted immutable change rejects the
whole request before persisting either.
- `validatePartitionDormancyGate` is the §11 PR 2 temporary gate:
`CreateQueue` with `PartitionCount > 1` rejects with
`InvalidAttributeValue("PartitionCount > 1 requires HT-FIFO data plane —
not yet enabled")`. The schema field exists in the meta type but no
partitioned data can land. Removed in PR 5 in the same commit that wires
the data-plane fanout — gate-and-lift is **atomic** so a half-deployed
cluster can never accept a partitioned queue without the data plane to
serve it.
- `GetQueueAttributes("All")` round-trips the configured HT-FIFO fields
via `addHTFIFOAttributes`.

## Test plan

- [x] `adapter/sqs_partitioning_test.go` — 14 unit tests covering
`partitionFor` (legacy zero/one, perQueue short-circuit, empty
MessageGroupId, determinism, distribution within ±5% across 8 partitions
on 100k samples, power-of-two masking always < N), `isPowerOfTwo`,
`validatePartitionConfig` (power-of-two, max cap, FIFO-only, cross-attr
`InvalidParameterValue`, single-partition + queue-dedup OK),
`validatePartitionDormancyGate` (rejects > 1, allows 0/1, gate reason
surfaces to the operator), `validatePartitionImmutability`
(per-attribute change rejects, same-value no-op succeeds),
`htfifoAttributesPresent`.
- [x] `adapter/sqs_partitioning_integration_test.go` — 7 end-to-end
tests: dormancy gate rejects `PartitionCount > 1` (and the gate's "not
yet enabled" message surfaces), allows `PartitionCount=1`, validator
rejects non-power-of-two, FIFO-only rejection on Standard queue,
cross-attr `{PartitionCount=2, DeduplicationScope=queue}` rejects,
immutability `SetQueueAttributes` change rejects with same-value no-op
succeeding, all-or-nothing combined mutable+immutable rejects without
persisting the mutable change, `GetQueueAttributes` round-trip omits
unset fields.
- [x] All tests pass under `-race`.
- [x] `golangci-lint run ./adapter/...` → 0 issues.

## Self-review (5 lenses)

1. **Data loss** — Schema fields land on `sqsQueueMeta` and are
persisted via the existing OCC `Put` on `sqsQueueMetaKey`. The dormancy
gate runs at `CreateQueue` admission time; once PR 5 lifts the gate, no
schema migration is needed (existing queues have `PartitionCount=0`
which is equivalent to 1). The keyspace stays untouched in this PR (PR 3
threads `partitionIndex` through `sqsMsg*Key`).
2. **Concurrency / distributed failures** — Validation is purely
function-of-input; no shared state. The immutability check loads the
meta in the same `nextTxnReadTS` snapshot that `applyAttributes` uses,
so a concurrent `SetQueueAttributes` from another writer is caught by
the existing OCC `StartTS + ReadKeys` fence (already in place for
`setQueueAttributesWithRetry`). Dormancy gate runs at `CreateQueue`
admission so a concurrent gate-lift mid-request is impossible.
3. **Performance** — `partitionFor` is one FNV-1a hash + one bitwise
mask; constant time. `validatePartitionConfig` is field-comparison only,
runs once per `CreateQueue` / `SetQueueAttributes`.
`addHTFIFOAttributes` adds at most three map writes and only when fields
are non-empty. Hot path `SendMessage` / `ReceiveMessage` is unaffected
by this PR (no routing changes — those land in PR 5).
4. **Data consistency** — Cross-attribute rule `{partitions, queue
dedup}` rejects at control-plane time so an operator can never end up
with a created-but-unserviceable queue (the runtime path that would have
surfaced the same error at first send is unreachable). Immutability rule
prevents mid-life routing-key reconfiguration that would silently
violate within-group FIFO ordering.
5. **Test coverage** — 14 unit + 7 integration tests cover the
validator's full decision tree (each rule has a dedicated table-driven
case), the dormancy gate's accept/reject paths with the operator-visible
reason, the immutability all-or-nothing semantics, and `partitionFor`'s
edge cases plus a 100k-sample distribution test that catches non-uniform
routing immediately. Lint is clean.

## Out of scope (subsequent §11 rollout PRs)

- **PR 3** — keyspace threading: `partitionIndex` through every
`sqsMsg*Key` constructor, defaulting to 0 so existing queues stay
byte-identical. Dormancy gate from this PR remains in place.
- **PR 4** — routing layer + `--sqsFifoPartitionMap` flag +
mixed-version gate (§8.5 capability advertisement + §8
leadership-refusal hook in `kv/lease_state.go`). Dormancy gate still in
place.
- **PR 5** — send/receive partition fanout + receipt-handle v2 codec.
**Removes the dormancy gate atomically with the data-plane fanout** so
the gate-and-lift land in one commit.
- **PR 6** — `PurgeQueue` / `DeleteQueue` partition iteration +
tombstone schema update + reaper.
- **PR 7** — Jepsen HT-FIFO workload + metrics.
- **PR 8** — partial-doc lifecycle bump.

## Branch base

Branched off `feat/sqs-throttling-phase3c` (PR #679) which itself
branches off `docs/sqs-phase3-proposals` (PR #664). Will rebase against
`main` after both ancestors merge.
…round 5)

Two findings from the round-5 Codex review:

(1) P1 — Default* batch-capacity floor:
validateThrottleConfig disabled the >=10 batch floor for
ThrottleDefault* on the assumption that Default never serves a batch
verb. But resolveActionConfig in sqs_throttle.go falls Send and
Receive traffic through to the Default bucket whenever the
corresponding Send*/Recv* pair is unset. So a config like
ThrottleDefaultCapacity=5, ThrottleDefaultRefillPerSecond=1 was
accepted at the validator, then made every full SendMessageBatch /
DeleteMessageBatch (charge=10) permanently unserviceable because the
Default bucket could never accumulate 10 tokens. Flipped the third
validateThrottlePair call's requireBatchCapacity from false to true;
updated the inline comment to spell out why Default* needs the floor
(the design doc note in §3.2 about Default* being exempt was wrong
about the fall-through). New unit test
TestValidateThrottleConfig_DefaultBucketBatchFloor covers
DefaultCapacity=1/5 reject and DefaultCapacity=10 accept.

(2) P2 — sweep eviction race:
Earlier sweep computed idle under bucket.mu, released the lock,
then unconditionally Deleted the entry. A concurrent charge() in
that window could refill+take a token, then Delete evicted the
just-used bucket and the next request got a fresh full-cap bucket
— the just-taken token was effectively undone, allowing excess
throughput. Fixed in two parts:

- Hold bucket.mu across the Delete so the idle observation cannot
  be invalidated between check and delete. A concurrent charge()
  that wants the bucket either runs to completion before sweep
  acquires mu (sweep then sees the freshened lastRefill and skips
  the delete) or waits for sweep to release mu (charge then takes
  the bucket — but sweep has already removed it from the map, so
  the charge succeeds against an orphan bucket and only the
  next-after-charge request gets the full-cap rebuild — bounded
  one-token leak).
- CompareAndDelete(k, v) ensures sweep does not evict a
  replacement bucket inserted by invalidateQueue or a future
  reconciliation path.

Holding bucket.mu across sync.Map.Delete is safe — sync.Map.Load
is wait-free on the read-only path and never blocks on bucket.mu,
so there is no AB-BA cycle with charge().
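
A sketch of the resulting sweep ordering with assumed names; only the lock placement and the CompareAndDelete guard come from the text above:

```go
package throttle

import (
	"sync"
	"time"
)

const throttleIdleEvictAfter = time.Hour

type tokenBucket struct {
	mu         sync.Mutex
	lastRefill time.Time
}

type bucketStore struct {
	buckets sync.Map
}

// sweep keeps the idle observation and the map delete in one bucket.mu
// critical section; CompareAndDelete refuses to evict a replacement
// bucket inserted by invalidateQueue or a reconciliation path.
func (b *bucketStore) sweep(now time.Time) {
	b.buckets.Range(func(k, v any) bool {
		bkt := v.(*tokenBucket)
		bkt.mu.Lock()
		if now.Sub(bkt.lastRefill) >= throttleIdleEvictAfter {
			b.buckets.CompareAndDelete(k, v)
		}
		bkt.mu.Unlock()
		return true
	})
}
```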

New regression test TestBucketStore_SweepDoesNotEvictRecentlyUsedBucket
backdates a bucket's lastRefill by 2h, advances the clock by 2h to
force the bucket past the idle cutoff, then races sweep against a
50-iteration charge loop. After both finish the bucket must still
be in the store — at least one charge interleaves with sweep's
re-check window and sweep observes the freshened lastRefill.

All tests pass under -race; golangci-lint clean.
… round 6)

Round 5's sweep+CompareAndDelete fix closed half the eviction race —
sweep can no longer evict a bucket whose lastRefill was freshened by a
concurrent charge — but left a second window. Goroutines that loaded
the bucket from sync.Map *before* sweep removed it from the map were
still blocked on bucket.mu, and once sweep released mu they would
charge the orphan bucket while later requests created a fresh
full-capacity entry via loadOrInit. A burst aligned with a sweep tick
could therefore consume up to 2x the configured capacity.

Three changes plug the window:

- tokenBucket gains an `evicted bool` field, protected by mu.
- sweep, invalidateQueue, and the loadOrInit reconciliation path all
  set evicted=true under mu *after* successfully removing the bucket
  from the map (CompareAndDelete success or LoadAndDelete return).
- charge() loops on a chargeBucket helper that checks evicted under
  mu and signals retry; the caller drops the orphan reference and
  loops back through loadOrInit, which converges on whatever live
  bucket the map now holds (see the sketch after this list).
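
A sketch of the evicted-flag retry referenced above; the retry bound of 4 and the fail-open fallback are taken from a later review note, and refill/config handling is omitted for brevity:

```go
package throttle

import "sync"

type bucketKey struct{ queue, action string }

type tokenBucket struct {
	mu      sync.Mutex
	tokens  float64
	evicted bool
}

type bucketStore struct {
	buckets sync.Map // bucketKey -> *tokenBucket
}

// loadOrInit is trimmed: a real bucket would start at full capacity
// computed from the queue meta.
func (b *bucketStore) loadOrInit(key bucketKey) *tokenBucket {
	fresh := &tokenBucket{}
	actual, _ := b.buckets.LoadOrStore(key, fresh)
	return actual.(*tokenBucket)
}

// tryCharge reports retry=true when the bucket was evicted between the
// caller's map Load and this mu acquisition.
func (t *tokenBucket) tryCharge(n float64) (ok, retry bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.evicted {
		return false, true
	}
	if t.tokens < n {
		return false, false
	}
	t.tokens -= n
	return true, false
}

// charge drops orphan references and converges on the live map entry.
func (b *bucketStore) charge(key bucketKey, n float64) bool {
	for i := 0; i < 4; i++ { // assumed retry bound
		ok, retry := b.loadOrInit(key).tryCharge(n)
		if !retry {
			return ok
		}
	}
	return true // fail-open rather than spin forever
}
```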

CreateQueue now calls invalidateQueue on a genuine commit
(tryCreateQueueOnce post-Dispatch) — DeleteQueue invalidated, but a
sendMessage holding pre-delete meta could recreate a bucket between
DeleteQueue's invalidate and the next CreateQueue commit, so a same-
name recreate would inherit used token state. Invalidating after the
genuine create resets the bucket regardless of in-flight traffic to
the prior incarnation. The idempotent-return path (existing queue
with identical attributes) deliberately skips invalidation.

Tests:

- TestBucketStore_SweepRaceDoesNotInflateBudget: -race stress test
  that runs 4 sweeps against 64 chargers on a single backdated
  bucket; total successful charges must stay <= capacity. The old
  code path could yield up to 2x capacity.
- TestBucketStore_OrphanedBucketRetriesToLiveEntry: deterministic
  exercise of the evicted-flag retry — sweep evicts, the next
  charge allocates a fresh bucket via loadOrInit.
- TestBucketStore_InvalidateMarksOrphanEvicted: pins the round-6
  invalidateQueue change (LoadAndDelete + flag set under mu).
- The pre-existing TestBucketStore_SweepDoesNotEvictRecentlyUsedBucket
  is replaced; the prior assertion (stillThere) was vacuously true
  because the for-range 50 charge loop would always recreate the
  bucket via loadOrInit after any eviction (Claude Low on round 5).

go test -race -count=20 of the new tests passes; full SQS test
package passes under -race; golangci-lint clean.

Round-6 review pulled four other items not actioned:

- Codex P2 "CreateQueue idempotency must compare Throttle" — already
  addressed: attributesEqual at sqs_catalog.go:726 calls
  baseAttributesEqual + throttleConfigEqual + htfifoAttributesEqual.
- Gemini high "chargeQueue performs redundant storage read" — already
  addressed: hot-path handlers (sendMessage, receiveMessage) call
  chargeQueueWithThrottle to skip the second meta load.
- Gemini high "maybeSweep on hot path" — already addressed:
  runSweepLoop runs sweep on a background ticker wired to reaperCtx.
- Gemini medium "cap retryAfter at 10ms to prevent overflow" —
  already addressed: throttleRetryAfterCap=1h cap before multiply
  prevents overflow; 10ms cap is inappropriate for SQS rate-limit
  semantics (clients would retry constantly under contention).
Round 6's invalidateQueue used LoadAndDelete-then-lock — a charger
that loaded the pointer pre-LoadAndDelete could win the mu race and
charge the just-removed bucket while evicted was still false (Codex
P2 on round 6). The fix mirrors sweep's ordering:

  1. Load the pointer from the map (without deleting).
  2. Acquire bucket.mu.
  3. CompareAndDelete with the loaded v (rejects a replacement
     bucket inserted by a concurrent reconciliation).
  4. Set evicted=true under mu.
  5. Unlock.

A charger blocked on mu now sees evicted=true on entry and retries
against the live entry, instead of charging the orphan.
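
The five steps, sketched for a single key with assumed names:

```go
package throttle

import "sync"

type bucketKey struct{ queue, action string }

type tokenBucket struct {
	mu      sync.Mutex
	evicted bool
}

type bucketStore struct {
	buckets sync.Map
}

// invalidate follows the numbered ordering above for one key.
func (b *bucketStore) invalidate(key bucketKey) {
	v, ok := b.buckets.Load(key) // 1. load the pointer, do not delete yet
	if !ok {
		return
	}
	bkt := v.(*tokenBucket)
	bkt.mu.Lock()                           // 2. acquire bucket.mu
	if b.buckets.CompareAndDelete(key, v) { // 3. rejects a replacement bucket
		bkt.evicted = true // 4. flag set under mu, so blocked chargers retry
	}
	bkt.mu.Unlock() // 5.
}
```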

Replaced the over-strict TestBucketStore_InvalidateRace... with
TestBucketStore_InvalidateUnderConcurrencyIsRaceFree: invalidate's
reset-to-full-capacity is the *intended* semantics of an invalidation
event (DeleteQueue / SetQueueAttributes / CreateQueue all want the
new policy applied to a fresh bucket), so the post-invalidate fresh
bucket can legitimately absorb up to capacity additional tokens —
that 2x window is structural, not a bug. Asserting successful-charge
count <= capacity would also pass on the buggy code path under most
schedulings, so the assertion would not differentiate old vs new.
The deterministic TestBucketStore_InvalidateMarksOrphanEvicted
already pins the actual mechanism (evicted=true under mu on dropped
bucket); the new stress test exists to surface any -race finding the
new lock ordering might introduce.

go test -race -count=20 of every TestBucketStore_* test passes;
golangci-lint clean.
… round 6.2)

validatePartitionConfig documents PartitionCount=0 (unset) and =1
(explicit single partition) as semantically identical legacy /
single-partition routing, but htfifoAttributesEqual was using strict
numeric equality. So a queue created without PartitionCount (stored
as 0) followed by a CreateQueue retry that explicitly passes
PartitionCount=1 was rejected as "different attributes" — broken
idempotency for any provisioning flow that normalises defaults
differently between attempts. (Codex P2 on PR #679 round 6.1.)

Fix: normalisePartitionCount maps 0 -> 1 before comparison; values >1
pass through unchanged so HT-FIFO partition counts still require
exact match. The on-disk schema is unchanged.
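
A sketch of the helper and the comparison it feeds; the wrapper name is illustrative:

```go
package adapter

// normalisePartitionCount maps the unset sentinel (0) to the explicit
// single-partition value (1); values above 1 pass through unchanged so
// real HT-FIFO counts still require an exact match.
func normalisePartitionCount(n uint32) uint32 {
	if n == 0 {
		return 1
	}
	return n
}

// Equality checks on the idempotency path then compare normalised
// values, so a retry passing PartitionCount=1 against a stored 0 is
// still idempotent.
func partitionCountUnchanged(stored, requested uint32) bool {
	return normalisePartitionCount(stored) == normalisePartitionCount(requested)
}
```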

Tests:

- TestHTFIFOAttributesEqual_PartitionCountZeroAndOneEquivalent —
  pins (0, 1) equality, symmetry, and that real divergence (2 vs 0,
  2 vs 1) still rejects.
- TestNormalisePartitionCount — pins the helper directly.

go test -race ./adapter/... pass; golangci-lint clean.
#679 round 6.3)

Round 6.2 normalised the CreateQueue idempotency path
(htfifoAttributesEqual) but left the SetQueueAttributes immutability
path (validatePartitionImmutability) using strict equality. Same
0 == 1 gap (Claude Low on round 6.2):

  1. CreateQueue with no PartitionCount -> stored as 0.
  2. SetQueueAttributes(PartitionCount=1) -> requested.PartitionCount = 1.
  3. validatePartitionImmutability: 0 != 1 -> rejects "PartitionCount
     is immutable", even though the partition layout is unchanged.

Fix: validatePartitionImmutability now normalises both sides through
normalisePartitionCount before comparison, mirroring the round-6.2
fix. The narrow gap surfaced under the §11 PR 2 dormancy gate
(SetQueueAttributes does not call the gate, so this path is
reachable on existing queues), so the fix lands now rather than
being deferred to PR 5.

Test:

- TestValidatePartitionImmutability_PartitionCountZeroAndOneEquivalent
  table-driven over all (stored, requested) ∈ {0,1,2,4,8} pairs;
  no-op pairs accept, real-change pairs reject. Reverting either
  call to strict equality flips the relevant rows.

go test -race ./adapter/... pass; golangci-lint clean.
## Summary

Phase 3.C implementation: per-queue rate limiting via token-bucket
throttling. Implements the design from
[`docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md`](https://github.com/bootjp/elastickv/blob/docs/sqs-phase3-proposals/docs/design/2026_04_26_proposed_sqs_per_queue_throttling.md)
(currently on PR #664).

- Token-bucket store on `*SQSServer`: `sync.Map[bucketKey]*tokenBucket`,
per-bucket `sync.Mutex` so cross-queue traffic never serialises on a
process-wide lock. Lazy idle-evict (1h) bounds memory.
- New `sqsQueueMeta.Throttle *sqsQueueThrottle` field with six float64
sub-fields (Send/Recv/Default × Capacity/RefillPerSecond), wired through
the existing `sqsAttributeAppliers` dispatch as `ThrottleSendCapacity` /
`ThrottleSendRefillPerSecond` / etc.
- Validator (`validateThrottleConfig`) enforces the §3.2 cross-attribute
rules: each (capacity, refill) pair both-zero or both-positive; capacity
≥ refill; Send/Recv capacities ≥ 10 (the SendMessageBatch /
DeleteMessageBatch max charge); per-field hard ceiling 100k.
- All seven message-plane handlers wired (SendMessage[Batch],
ReceiveMessage, DeleteMessage[Batch], ChangeMessageVisibility[Batch]).
Throttle check sits **outside** the OCC retry loop per §4.2 — a rejected
request never reaches the coordinator.
- On rejection: `400 Throttling` with `x-amzn-ErrorType` + AWS-shaped
JSON envelope + `Retry-After` header computed from the §3.4 formula
(numerator is requested count, not 1, so a batch verb does not get told
to retry in 1s when it really needs 10s).
- Cache invalidation on `SetQueueAttributes` and `DeleteQueue` (§3.1):
drops every bucket belonging to the queue after the Raft commit so new
limits take effect on the very next request, not after the 1h idle-evict
sweep.
- `GetQueueAttributes("All")` round-trips Throttle* fields so SDKs can
confirm the config landed.

## Test plan

- [x] `adapter/sqs_throttle_test.go` — 18 unit tests:
- bucket math: fresh capacity, refill elapsed, refill cap-at-capacity,
batch reject preserves partial credit, Retry-After uses requested count,
sub-1-RPS floor;
- isolation: per-action, per-queue, Default-fallthrough shares one
bucket;
  - lifecycle: `invalidateQueue` drops all action keys;
  - concurrency: `-race` clean, exactly-capacity successes;
  - default-off short-circuit;
- validator: nil/empty canonicalisation, both-zero-or-both-positive,
capacity ≥ 10 batch floor, Default* exempt, capacity ≥ refill,
`parseThrottleFloat` range checks, `computeRetryAfter` floor.
- [x] `adapter/sqs_throttle_integration_test.go` — 8 end-to-end tests
against a real `createNode` cluster:
  - default-off allows unbounded;
- send/recv reject after capacity with correct envelope + `Retry-After`;
  - batch charges by entry count;
  - `SetQueueAttributes` invalidation (raise-and-immediate-success);
  - `DeleteQueue` + `CreateQueue` lifecycle invalidation;
  - `GetQueueAttributes` round-trip;
  - validator rejects below batch min.
- [x] `golangci-lint run` clean.
- [x] `go test -race -run
"TestBucketStore|TestValidateThrottleConfig|TestParseThrottleFloat|TestComputeRetryAfter|TestSQSServer_Throttle"
./adapter/...` clean.

## Self-review (5 lenses)

1. **Data loss** — Throttle config is on `sqsQueueMeta`, persisted via
the existing `setQueueAttributesWithRetry` OCC commit. Bucket state is
per-process and never written to storage, by design (§3.1: replicating
bucket state would cost a Raft commit per token decrement, defeating the
bucket). On leader failover the new leader rebuilds at full capacity —
same as the AWS region-failover semantic. No write path can lose
throttle config.
2. **Concurrency** — Per-bucket `sync.Mutex`, never held across
`sync.Map` ops. `LoadOrStore` race on first insert is safe (both racers
compute identical config from the same meta snapshot). Concurrent
`SetQueueAttributes` + in-flight charge: the charge may briefly use the
old bucket; the next request rebuilds from the fresh meta. `-race` test
pins the count invariant: exactly `capacity` successes out of N
concurrent goroutines, never more.
3. **Performance** — Hot path on unconfigured queues is one `nil` check
(`Throttle == nil`) → return. On configured queues: one `sync.Map.Load`
(lock-free), one mutex acquire on the bucket. No global lock.
`sync.Map`'s read-mostly optimisation matches the access pattern. The
lazy idle-evict sweep runs at most once per minute from the hot path so
a many-queue cluster does not pay an O(N) cost per request.
4. **Data consistency** — Throttling sits outside OCC; a rejected
request never touches the coordinator (§4.2). The OCC retry loop in
`sendMessageWithRetry` cannot busy-loop on a permanent rate-limit
failure. Cache invalidation after the Raft commit (not before) so the
rebuilt bucket reads the freshly committed config — there is no window
where a request can read a stale bucket built from the old config.
5. **Test coverage** — 18 unit tests cover bucket math, isolation,
lifecycle, concurrency, default-off; 8 integration tests cover the
end-to-end wire path. Both pass under `-race`. Validator has a dedicated
table-driven test per rule.

## Out of scope (deferred to follow-ups)

- **Prometheus metrics** (counter
`sqs_throttled_requests_total{queue,action}` + gauge
`sqs_throttle_tokens_remaining{queue,action}` per the design's §4.1
monitoring/registry.go entry). The bucket store's `chargeOutcome`
already exposes `tokensAfter` so the gauge wiring is one site; the
counter is a one-line increment after `writeSQSThrottlingError`. Punted
to keep this PR focused on data-plane behaviour.
- **§6.5 cross-protocol (Query) parity test** — the Query protocol layer
lives on PR #662 (separate branch); the throttle envelope already has
the same shape on both protocols (`writeSQSError` handles both), so the
test is a follow-up after #662 lands.
- **§6.6 failover behaviour test** — the §3.1 "fresh bucket on failover"
contract is implementation-correct (the bucket map is per-process, no
Raft state) but a 3-node failover test needs cluster scaffolding and is
best added with the Jepsen workload.

## Branch base

Branched off
[`docs/sqs-phase3-proposals`](https://github.com/bootjp/elastickv/tree/docs/sqs-phase3-proposals)
(PR #664) so the design doc is part of the reviewable surface. Will
rebase against `main` after #664 merges.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 5m 42s.


Review

  • Read current state of both design documents
  • Check open Codex P1: "fresh bucket on failover" correctness (3.C §3.1)
  • Review implementation files added to the PR
  • Check for new issues
  • Overall assessment

Context

Since the last review round, two implementation PRs have landed on this branch: PR #679 (feat(sqs): per-queue throttling, Phase 3.C) and PR #681 (feat(sqs): HT-FIFO schema + validators + dormancy gate, Phase 3.D PR 2).

This changes the scope of this PR review from docs-only to docs + implementation.


Open Codex P1 from last round: RESOLVED in code, STILL OPEN in doc

Codex P1: §3.1 "fresh bucket on failover" is incorrect for re-elected nodes

The implementation resolves this via lazy config reconciliation in loadOrInit (adapter/sqs_throttle.go:272–328):

```go
// On Load hit, compare cached config to current meta config.
bucket.mu.Lock()
matches := bucket.capacity == capacity && bucket.refillRate == refill
bucket.mu.Unlock()
if matches {
    return bucket
}
// Config mismatch: stale bucket — CompareAndDelete + evicted=true,
// then fall through to LoadOrStore with a fresh full-capacity bucket.
```

When node A re-acquires leadership after a SetQueueAttributes commit on node B:

  1. Node A's charge() calls loadOrInit(queue, action, capacity=NEW, refill=NEW) where the config comes from the freshly-read meta.
  2. loadOrInit finds node A's stale bucket (capacity=OLD).
  3. Mismatch detected → stale bucket evicted (CompareAndDelete + evicted=true) → fresh full-capacity bucket under the new config.

The evicted=true flag (see tokenBucket.evicted) ensures any concurrent chargeBucket goroutine holding the stale bucket retries against the live entry (chargeBucket returns retry=true on evicted, charge loops up to 4 times).

However, the design doc §3.1 line 118 has not been updated:

"On leader failover, a fresh bucket starts at full capacity on the new leader — there is no Raft replication of bucket state. Why this is correct: the worst-case behaviour of 'fresh bucket on failover' is that a noisy queue gets one extra burst worth of bandwidth right after a leader change."

This claim is only accurate for first-time-leader failover. For re-elected nodes, the path is reconciliation, not a fresh bucket. The doc should describe the two paths:

  1. First-time leader (no existing bucket): loadOrInit miss → fresh bucket at full capacity under current meta config. ✓ (doc is correct for this case)
  2. Re-elected node with stale bucket: loadOrInit hit → config mismatch → old bucket evicted → fresh bucket at full capacity under the now-current meta config. ✓ (implementation is correct, doc doesn't describe this)

The "one extra burst" worst-case also needs updating: for a re-elected node with a stale exhausted bucket, the new bucket starts at full capacity of the current config (which may be higher or lower than the old config). Fix this →


Implementation review: LGTM with one low note

The implementation is well-structured and the eviction-race handling in particular (evicted flag, CompareAndDelete in sweep / invalidateQueue / loadOrInit) is carefully done. Specific observations:

sqs_throttle.go — correct throughout:

  • invalidateQueue (line 350) uses the lock-then-CompareAndDelete-then-flag pattern matching sweep. Consistent. ✓
  • runSweepLoop background goroutine wired to the server's context so it cleans up alongside the reaper. ✓
  • charge retry loop bounded at 4 with fail-open fallback. ✓
  • computeRetryAfter uses requested - current in the numerator (matching the PR history's §3.4 formula update). ✓
  • resolveActionConfig correctly shares the "*" bucket key for Default so all fall-through actions share one budget. ✓

sqs_throttle.go — one note [Low]:

resolveActionConfig returns action, 0, 0 as the no-throttle sentinel (its last line passes the raw action through). The caller in charge() checks if capacity == 0 || refill == 0 and returns early, so the first return value is never used in that path — functionally harmless. But returning the raw action string instead of a canonical bucket-action constant makes the sentinel ambiguous for future callers who might not read the callsite's guard. A cleaner return would be return bucketActionAny, 0, 0 or just the zero value. Not blocking, just a readability note.

sqs_partitioning.go — correct throughout:

  • partitionFor handles all edge cases (nil, PartitionCount ≤ 1, perQueue short-circuit, empty GroupId). Inlined FNV-1a avoids allocation vs. the hash.Hash interface path — the comment explains the perf motivation well. ✓
  • validatePartitionConfig has the FIFO-only guard for PartitionCount > 1 (PR #681 round 2 fix) and the {PartitionCount > 1, DeduplicationScope = "queue"} cross-attribute rejection. ✓
  • validatePartitionDormancyGate correctly rejects at PartitionCount > 1. Will be removed atomically with the PR 5 data-plane fanout. ✓
  • validatePartitionImmutability uses normalisePartitionCount for the 0/1 equivalence edge case (PR #679 round 6.2 fix). ✓

sqs_catalog.go invalidateQueue gate [Low]:

The setQueueAttributes handler gates invalidateQueue on throttleAttributesPresent(in.Attributes) (line 1384). The comment correctly notes this is "purely a hot-path optimisation" because loadOrInit reconciliation catches stale buckets from config changes on other nodes. This is correct — but the comment refers to loadOrInit reconciliation as a backstop, which is now the primary mechanism for the ex-follower-turned-leader scenario. The doc should describe both mechanisms and their relationship (explicit invalidation for the current leader's store, lazy reconciliation for all other nodes including future re-elected ones).


Summary

The Codex P1 from the last round is addressed in the implementation via loadOrInit reconciliation — but the design doc §3.1 has not been updated to reflect the two-path failover model (first-time-leader vs. re-elected-with-stale-bucket). This is the only finding I'd call blocking before this PR merges — the doc should accurately describe the mechanism that protects operators from stale throttle state after leadership changes.

The implementation itself is in good shape across both sqs_throttle.go and sqs_partitioning.go.



@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2edf9d49d5


Comment thread: adapter/sqs_throttle.go
Comment on lines +91 to +94
```go
type bucketKey struct {
	queue  string
	action string
}
```


P1: Include queue generation in throttle bucket keys

Fresh evidence from this implementation: bucketKey is still only (queue, action), and reconciliation in loadOrInit only checks capacity/refill. If a queue is deleted/recreated with the same throttle values (or leadership moves to a node holding an old cached bucket), the old token balance is reused for the new incarnation because invalidation is process-local, so the recreated queue can start already drained (or unexpectedly bursty) instead of starting from a fresh full bucket.

