Deployment: Staging deploy workflow for EC2 via SSM#770
Conversation
|
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough
Adds a new GitHub Actions workflow to deploy staging to an EC2 instance via SSM (assume-role, send a RunShellScript command that pulls, builds, runs migrations, and starts containers), updates the production workflow to tag images and run Alembic migrations in a one-off ECS Fargate task, and adds a staging docker-compose file with backend and Celery services.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant GH as GitHub Actions
    participant STS as AWS STS (AssumeRole)
    participant SSM as AWS SSM (SendCommand / Wait)
    participant EC2 as EC2 Instance (ssm-agent + Docker)
    GH->>STS: Assume role using STAGING_EC2_DEPLOY_ROLE_ARN
    STS-->>GH: Temporary credentials
    GH->>SSM: SendCommand (AWS-RunShellScript) to STAGING_EC2_INSTANCE_ID (script)
    SSM-->>GH: CommandId
    SSM->>EC2: Deliver RunShellScript to ssm-agent
    EC2->>EC2: Execute script: git fetch/pull, docker compose build, run migration container, docker compose up -d, docker image prune
    EC2-->>SSM: Command execution status, stdout/stderr
    GH->>SSM: aws ssm wait command-executed (CommandId)
    SSM-->>GH: Final invocation status and logs
```
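In workflow terms, the flow above corresponds to a job roughly like the following. This is a minimal sketch: the trigger branch, action version, region, and the shortened deploy script are illustrative assumptions; only the variable names (`STAGING_EC2_DEPLOY_ROLE_ARN`, `STAGING_EC2_INSTANCE_ID`) and the `/data/kaapi-backend` path come from this PR.

```yaml
# Sketch of the staging EC2 deploy job described above (not the committed YAML).
name: cd-staging-ec2
on:
  push:
    branches: [feat/staging-deployment-ec2]   # trigger branch assumed
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC token for the assume-role step
      contents: read
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.STAGING_EC2_DEPLOY_ROLE_ARN }}
          aws-region: ap-south-1               # region assumed
      - name: Send deploy script via SSM
        id: ssm
        run: |
          # The real command runs the full git pull / build / migrate / up sequence;
          # it is abbreviated here.
          CMD_ID=$(aws ssm send-command \
            --instance-ids "${{ vars.STAGING_EC2_INSTANCE_ID }}" \
            --document-name "AWS-RunShellScript" \
            --parameters commands='["set -eux","cd /data/kaapi-backend && git pull && docker compose up -d --build"]' \
            --query 'Command.CommandId' --output text)
          echo "cmd_id=$CMD_ID" >> "$GITHUB_OUTPUT"
      # A follow-up step waits on the command; see the wait/fetch discussion below.
```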
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 2
🧹 Nitpick comments (1)
.github/workflows/cd-staging-ec2.yml (1)
44-56: Always fetch SSM output, even when the waiter fails.

If `aws ssm wait command-executed` exits non-zero, the step can stop before printing Stdout/Stderr, which makes failed deploys harder to debug. Capture the waiter status, fetch the invocation output, then exit with the original status.

🔎 Proposed diagnostic improvement

```diff
       - name: Wait for SSM command to finish
         env:
           INSTANCE_ID: ${{ vars.STAGING_EC2_INSTANCE_ID }}
           CMD_ID: ${{ steps.ssm.outputs.cmd_id }}
         run: |
-          aws ssm wait command-executed \
+          set +e
+          aws ssm wait command-executed \
             --command-id "$CMD_ID" \
             --instance-id "$INSTANCE_ID"
+          WAIT_STATUS=$?

           aws ssm get-command-invocation \
             --command-id "$CMD_ID" \
             --instance-id "$INSTANCE_ID" \
             --query '{Status:Status,Stdout:StandardOutputContent,Stderr:StandardErrorContent}' \
             --output json
+          GET_STATUS=$?
+
+          if [ "$WAIT_STATUS" -ne 0 ]; then
+            exit "$WAIT_STATUS"
+          fi
+          exit "$GET_STATUS"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cd-staging-ec2.yml around lines 44 - 56, The step currently runs aws ssm wait command-executed and then aws ssm get-command-invocation, but if the waiter exits non-zero the step can stop before printing Stdout/Stderr; update the run block to capture the waiter's exit status (e.g., save $? into a variable after running aws ssm wait command-executed using the env vars INSTANCE_ID and CMD_ID), then always call aws ssm get-command-invocation with the same --command-id "$CMD_ID" and --instance-id "$INSTANCE_ID" to print Status/Stdout/Stderr, and finally exit with the saved waiter exit code so the job preserves the original failure status.
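Applied in full, the wait step would read roughly as follows. This is a sketch assembled from the diff above; the step placement and indentation within the job are assumed.

```yaml
      - name: Wait for SSM command to finish
        env:
          INSTANCE_ID: ${{ vars.STAGING_EC2_INSTANCE_ID }}
          CMD_ID: ${{ steps.ssm.outputs.cmd_id }}
        run: |
          set +e
          aws ssm wait command-executed \
            --command-id "$CMD_ID" \
            --instance-id "$INSTANCE_ID"
          WAIT_STATUS=$?

          # Always print the invocation output, even if the waiter failed.
          aws ssm get-command-invocation \
            --command-id "$CMD_ID" \
            --instance-id "$INSTANCE_ID" \
            --query '{Status:Status,Stdout:StandardOutputContent,Stderr:StandardErrorContent}' \
            --output json
          GET_STATUS=$?

          # Preserve the original failure status from the waiter.
          if [ "$WAIT_STATUS" -ne 0 ]; then
            exit "$WAIT_STATUS"
          fi
          exit "$GET_STATUS"
```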
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/cd-staging-ec2.yml:
- Line 38: Replace the docker-compose flag --abort-on-container-exit with
--exit-code-from prestart in the long commands parameter so the prestart
service's exit code is propagated and will cause the SSM command to fail; update
the commands string that runs docker compose (the parameter containing "docker
compose --profile prestart up prestart --build ...") to use "--exit-code-from
prestart" (which implies abort-on-container-exit) so failures in the prestart
service prevent the subsequent "docker compose up -d" step from running while
keeping the surrounding "set -eux" behavior.
- Around line 3-6: The workflow is checking out origin main instead of the
commit that triggered the run; update the checkout step (where actions/checkout
is used) to use the triggering commit instead of hardcoding main — e.g., replace
any ref: main with ref: ${{ github.sha }} (or remove ref to let the action
default to the event commit) and ensure fetch-depth/refs are set appropriately
(fetch-depth: 0 if you need full history) so the deployed revision matches the
triggering push on feat/staging-deployment-ec2.
---
Nitpick comments:
In @.github/workflows/cd-staging-ec2.yml:
- Around line 44-56: The step currently runs aws ssm wait command-executed and
then aws ssm get-command-invocation, but if the waiter exits non-zero the step
can stop before printing Stdout/Stderr; update the run block to capture the
waiter's exit status (e.g., save $? into a variable after running aws ssm wait
command-executed using the env vars INSTANCE_ID and CMD_ID), then always call
aws ssm get-command-invocation with the same --command-id "$CMD_ID" and
--instance-id "$INSTANCE_ID" to print Status/Stdout/Stderr, and finally exit
with the saved waiter exit code so the job preserves the original failure
status.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 47e35146-1215-4326-bf15-ffdfec3b6967
📒 Files selected for processing (1)
.github/workflows/cd-staging-ec2.yml
```
            --instance-ids "$INSTANCE_ID" \
            --document-name "AWS-RunShellScript" \
            --comment "Deploy kaapi-backend @ ${GIT_SHA::7}" \
            --parameters commands='["set -eux","sudo -iu ubuntu bash -lc \"cd /data/kaapi-backend && git fetch --all && git pull origin main && docker compose --profile prestart up prestart --build --abort-on-container-exit && docker compose up -d --build --remove-orphans && docker image prune -f\""]' \
```
🧩 Analysis chain
🏁 Script executed:
```shell
cat -n .github/workflows/cd-staging-ec2.yml | head -50
```
Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 2142
🌐 Web query:
Docker compose up --exit-code-from documentation
💡 Result:
The --exit-code-from option in docker compose up returns the exit code of the selected service container, implying --abort-on-container-exit which stops all containers if the specified one stops. Usage: docker compose up --exit-code-from SERVICE [OPTIONS] [SERVICE...]. This is from the official Docker documentation for docker compose up. It is incompatible with --detach (-d). Without this flag, docker compose up exits with code 1 on errors or 0 on SIGINT/SIGTERM interruption, regardless of individual container exit codes.
Citations:
- 1: https://docs.docker.com/reference/cli/docker/compose/up/
- 2: https://docs.docker.com/compose/reference/up
- 3: https://docs.docker.com/engine/reference/commandline/compose_up/
Replace --abort-on-container-exit with --exit-code-from prestart to propagate prestart failures.
When prestart fails, the deploy gate should exit with its exit code. The --exit-code-from prestart flag returns the prestart service's exit code and implies --abort-on-container-exit, ensuring failed migrations or prestart scripts fail the SSM command before the new app starts. Combined with set -eux, this guarantees the subsequent docker compose up -d won't execute on prestart failures.
Proposed fix
```diff
-            --parameters commands='["set -eux","sudo -iu ubuntu bash -lc \"cd /data/kaapi-backend && git fetch --all && git pull origin main && docker compose --profile prestart up prestart --build --abort-on-container-exit && docker compose up -d --build --remove-orphans && docker image prune -f\""]' \
+            --parameters commands='["set -eux","sudo -iu ubuntu bash -lc \"cd /data/kaapi-backend && git fetch --all && git pull origin main && docker compose --profile prestart up prestart --build --exit-code-from prestart && docker compose up -d --build --remove-orphans && docker image prune -f\""]' \
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-            --parameters commands='["set -eux","sudo -iu ubuntu bash -lc \"cd /data/kaapi-backend && git fetch --all && git pull origin main && docker compose --profile prestart up prestart --build --abort-on-container-exit && docker compose up -d --build --remove-orphans && docker image prune -f\""]' \
+            --parameters commands='["set -eux","sudo -iu ubuntu bash -lc \"cd /data/kaapi-backend && git fetch --all && git pull origin main && docker compose --profile prestart up prestart --build --exit-code-from prestart && docker compose up -d --build --remove-orphans && docker image prune -f\""]' \
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/cd-staging-ec2.yml at line 38, Replace the docker-compose
flag --abort-on-container-exit with --exit-code-from prestart in the long
commands parameter so the prestart service's exit code is propagated and will
cause the SSM command to fail; update the commands string that runs docker
compose (the parameter containing "docker compose --profile prestart up prestart
--build ...") to use "--exit-code-from prestart" (which implies
abort-on-container-exit) so failures in the prestart service prevent the
subsequent "docker compose up -d" step from running while keeping the
surrounding "set -eux" behavior.
Codecov Report
✅ All modified and coverable lines are covered by tests.
```
            --instance-ids "$INSTANCE_ID" \
            --document-name "AWS-RunShellScript" \
            --comment "Deploy kaapi-backend kaapi-staging" \
            --parameters commands='["set -eux","sudo -iu ubuntu bash -lc \"cd /data/kaapi-backend && git fetch --all && git pull origin main && docker compose build && docker compose run --rm backend uv run alembic upgrade head && docker compose up -d --remove-orphans && docker image prune -f\""]' \
```
Update this script only for the backend and celery processes.
Actionable comments posted: 3
🧹 Nitpick comments (2)
.github/workflows/cd-production.yml (2)
66-72: Exit-code check is fragile when the task never started.

If the migration task fails before the container runs (image pull failure, ENI/networking issues, capacity), `containers[?name==…].exitCode` returns `None`. The current check correctly fails the job ("None" != "0"), but the diagnostic message is misleading and you lose the real reason. Consider also surfacing `stopCode`/`stoppedReason` and the container's `reason` for faster triage.

♻️ Proposed change

```diff
           EXIT_CODE=$(aws ecs describe-tasks --cluster $CLUSTER --tasks "$TASK_ARN" \
             --query "tasks[0].containers[?name=='$CONTAINER_NAME'].exitCode | [0]" --output text)

           if [ "$EXIT_CODE" != "0" ]; then
-            echo "Migration failed with exit code $EXIT_CODE"
+            echo "Migration failed with exit code $EXIT_CODE"
+            aws ecs describe-tasks --cluster $CLUSTER --tasks "$TASK_ARN" \
+              --query "tasks[0].{stopCode:stopCode,stoppedReason:stoppedReason,containerReason:containers[0].reason}" \
+              --output table
             exit 1
           fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cd-production.yml around lines 66 - 72, The current EXIT_CODE check treats a missing exitCode ("None") the same as a non-zero exit, losing real failure reasons; update the aws ecs describe-tasks call to also fetch task.stopCode and task.stoppedReason and the container's reason (e.g., containers[?name=='$CONTAINER_NAME'].reason) alongside containers[?name=='$CONTAINER_NAME'].exitCode, then change the logic that reads EXIT_CODE to detect a "None"/empty exitCode and print a clear diagnostic containing stopCode, stoppedReason and container reason (falling back to exitCode when present) before failing; reference variables: EXIT_CODE, CLUSTER, TASK_ARN, CONTAINER_NAME and fields stopCode, stoppedReason, containers[].reason.
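For reference, the resulting check could look like this as a workflow step. This is a sketch only: the step name is invented, and it reuses the existing `$CLUSTER`, `$TASK_ARN`, and `$CONTAINER_NAME` variables from the surrounding job.

```yaml
      - name: Check migration task result
        run: |
          # Exit code of the migration container ("None" if it never ran).
          EXIT_CODE=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
            --query "tasks[0].containers[?name=='$CONTAINER_NAME'].exitCode | [0]" --output text)

          if [ "$EXIT_CODE" != "0" ]; then
            echo "Migration failed with exit code $EXIT_CODE"
            # Surface why the task stopped (image pull failure, ENI issue, etc.).
            aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
              --query "tasks[0].{stopCode:stopCode,stoppedReason:stoppedReason,containerReason:containers[0].reason}" \
              --output table
            exit 1
          fi
```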
33-40: Production image is only tagged `latest`; the release tag is discarded.

`TAG: ${{ github.ref_name }}` is defined but never used; only `:latest` is built and pushed. This loses traceability of which `vX.Y.Z` is actually running in production and makes rollbacks awkward (you cannot `docker pull <repo>:v1.2.3`). Also, the migration step on lines 53-61 reuses the existing task definition (which presumably points to `:latest`), so whether the new image is migrated depends entirely on the task def; pinning would make this deterministic.

♻️ Proposed change

```diff
         run: |
-          docker build -t $REGISTRY/$REPOSITORY:latest ./backend
-          docker push $REGISTRY/$REPOSITORY:latest
+          docker build -t $REGISTRY/$REPOSITORY:latest -t $REGISTRY/$REPOSITORY:$TAG ./backend
+          docker push $REGISTRY/$REPOSITORY:latest
+          docker push $REGISTRY/$REPOSITORY:$TAG
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cd-production.yml around lines 33 - 40, The workflow currently builds and pushes only :latest; update the docker build/push commands to also tag and push the release tag (use the existing TAG variable) so the image is published as $REGISTRY/$REPOSITORY:$TAG (and optionally also :latest), i.e. run docker build -t $REGISTRY/$REPOSITORY:$TAG ... and docker push $REGISTRY/$REPOSITORY:$TAG (plus push :latest if wanted). Then change the migration/deployment steps that reference the task definition to pin the image to $REGISTRY/$REPOSITORY:$TAG (update the task definition JSON or the ECS update step to set the container image to the $TAGged image) so migrations and service updates use the exact release image instead of an implicit :latest.
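A full build-and-push step along those lines might look like this. It is a sketch: the `REGISTRY` and `REPOSITORY` wiring (ECR login step id, repository name) are illustrative assumptions, not values taken from the workflow.

```yaml
      - name: Build, tag, and push image
        env:
          REGISTRY: ${{ steps.login-ecr.outputs.registry }}   # assumed ECR login step id
          REPOSITORY: kaapi-backend                            # illustrative repository name
          TAG: ${{ github.ref_name }}
        run: |
          # One build, two tags: an immutable release tag plus latest.
          docker build -t "$REGISTRY/$REPOSITORY:latest" -t "$REGISTRY/$REPOSITORY:$TAG" ./backend
          docker push "$REGISTRY/$REPOSITORY:latest"
          docker push "$REGISTRY/$REPOSITORY:$TAG"
```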
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/cd-production.yml:
- Around line 53-61: The run-task invocation passes --network-configuration
twice, which causes the first derived NETWORK_CONFIG to be ignored; fix by using
a single network configuration source: either remove the inline
awsvpcConfiguration JSON and reuse the existing $NETWORK_CONFIG (and if needed
add assignPublicIp=ENABLED into how $NETWORK_CONFIG is built), or drop
$NETWORK_CONFIG and keep the inline awsvpcConfiguration but validate that
secrets.SUBNET_IDS and secrets.SECURITY_GROUP_IDS exist; update the aws ecs
run-task command (TASK_ARN assignment) to include only one
--network-configuration and ensure CONTAINER_NAME and the chosen network config
are consistent.
- Around line 42-46: The workflow's migration step is pointing at staging
resources (env vars CLUSTER, SERVICE, TASK_DEF) so `alembic upgrade head` will
run against staging; update these env vars in cd-production.yml to reference the
correct production cluster/service/task used by the Deploy to ECS step (replace
the `-staging-` suffix with the production resource names or the exact task
family name used in the Deploy to ECS step) so migrations run against the
production DB before deployment.
- Around line 42-74: The "Run database migrations" step uses staging resource
names; change the env values CLUSTER, SERVICE, TASK_DEF in that step from `${{
vars.AWS_RESOURCE_PREFIX }}-staging-cluster`, `-staging-service`,
`-staging-task` to the production names `${{ vars.AWS_RESOURCE_PREFIX
}}-cluster`, `${{ vars.AWS_RESOURCE_PREFIX }}-service`, `${{
vars.AWS_RESOURCE_PREFIX }}-task`; after updating, verify that
secrets.SUBNET_IDS and secrets.SECURITY_GROUP_IDS are present and
comma-separated for production, ensure the production TASK_DEF's container image
tag matches the image pushed by the build (or is pinned, e.g., :latest as
intended), and confirm the IAM role
`arn:aws:iam::024209611402:role/github-action-role` has ecs:RunTask,
ecs:DescribeTasks, ecs:DescribeServices, ecs:DescribeTaskDefinition and
iam:PassRole permissions.
---
Nitpick comments:
In @.github/workflows/cd-production.yml:
- Around line 66-72: The current EXIT_CODE check treats a missing exitCode
("None") the same as a non-zero exit, losing real failure reasons; update the
aws ecs describe-tasks call to also fetch task.stopCode and task.stoppedReason
and the container's reason (e.g., containers[?name=='$CONTAINER_NAME'].reason)
alongside containers[?name=='$CONTAINER_NAME'].exitCode, then change the logic
that reads EXIT_CODE to detect a "None"/empty exitCode and print a clear
diagnostic containing stopCode, stoppedReason and container reason (falling back
to exitCode when present) before failing; reference variables: EXIT_CODE,
CLUSTER, TASK_ARN, CONTAINER_NAME and fields stopCode, stoppedReason,
containers[].reason.
- Around line 33-40: The workflow currently builds and pushes only :latest;
update the docker build/push commands to also tag and push the release tag (use
the existing TAG variable) so the image is published as
$REGISTRY/$REPOSITORY:$TAG (and optionally also :latest), i.e. run docker build
-t $REGISTRY/$REPOSITORY:$TAG ... and docker push $REGISTRY/$REPOSITORY:$TAG
(plus push :latest if wanted). Then change the migration/deployment steps that
reference the task definition to pin the image to $REGISTRY/$REPOSITORY:$TAG
(update the task definition JSON or the ECS update step to set the container
image to the $TAGged image) so migrations and service updates use the exact
release image instead of an implicit :latest.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 44088606-9df7-491b-a4f7-889a87c659ec
📒 Files selected for processing (1)
.github/workflows/cd-production.yml
Actionable comments posted: 1
🧹 Nitpick comments (2)
.github/workflows/cd-production.yml (1)
52-52: Hard-coding `containerDefinitions[0]` is brittle if a sidecar is ever added.

Picking the first container by index works today but silently breaks the moment a log-router/sidecar (FireLens, Datadog agent, etc.) is added to the task definition: `run-task` would then issue `alembic upgrade head` to the wrong container and the migration would either fail or, worse, succeed as a no-op. Consider matching by a known name (e.g. `backend`) or by `essential: true`:

```diff
-          CONTAINER_NAME=$(aws ecs describe-task-definition --task-definition $TASK_DEF --query 'taskDefinition.containerDefinitions[0].name' --output text)
+          CONTAINER_NAME=$(aws ecs describe-task-definition --task-definition $TASK_DEF \
+            --query "taskDefinition.containerDefinitions[?name=='backend'].name | [0]" --output text)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cd-production.yml at line 52, The current CONTAINER_NAME extraction uses taskDefinition.containerDefinitions[0].name which breaks if a sidecar is added; update the aws ecs describe-task-definition --query to select the intended container by a stable identifier (e.g. match name == 'backend' or filter essential==`true`) instead of index 0 so run-task executes alembic against the real app container (use a JMESPath filter such as taskDefinition.containerDefinitions[?name=='backend'].name | [0] or taskDefinition.containerDefinitions[?essential==`true`].name | [0] when setting CONTAINER_NAME).

.github/workflows/cd-staging-ec2.yml (1)
53-57: `StandardOutputContent`/`StandardErrorContent` are truncated; surface the CloudWatch log location too.

`get-command-invocation` truncates `StandardOutputContent`/`StandardErrorContent` to ~2500 chars, which is easy to overflow with a `docker compose build`. Since you've already enabled `--cloud-watch-output-config CloudWatchOutputEnabled=true` on line 38, also echo the log group/stream so failures are debuggable from the workflow run without round-tripping to the AWS console:

```diff
           aws ssm get-command-invocation \
             --command-id "$CMD_ID" \
             --instance-id "$INSTANCE_ID" \
             --query '{Status:Status,Stdout:StandardOutputContent,Stderr:StandardErrorContent}' \
             --output json
+
+          echo "Full output: CloudWatch log group /aws/ssm/AWS-RunShellScript, stream ${CMD_ID}/${INSTANCE_ID}/aws-runShellScript/stdout (and .../stderr)"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cd-staging-ec2.yml around lines 53 - 57, get-command-invocation currently only prints truncated StandardOutputContent/StandardErrorContent; update the aws ssm get-command-invocation call to also return and echo the CloudWatch log location so failures are debuggable without visiting the console (you already enable --cloud-watch-output-config CloudWatchOutputEnabled=true). Modify the --query used in the get-command-invocation invocation to include the CloudWatch output/log group and stream fields (from the get-command-invocation response object such as the CloudWatch output config/log identifiers) so the workflow prints the CloudWatch log group/stream (or URL) alongside Status/Stdout/Stderr.
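Rolled into the output-fetching step, the handling might read roughly like this. It is a sketch that assumes the default SSM CloudWatch log group, since no custom group appears to be configured in this PR.

```yaml
      - name: Fetch SSM command output
        env:
          INSTANCE_ID: ${{ vars.STAGING_EC2_INSTANCE_ID }}
          CMD_ID: ${{ steps.ssm.outputs.cmd_id }}
        run: |
          # Truncated to ~2500 chars per stream; fine for a quick look.
          aws ssm get-command-invocation \
            --command-id "$CMD_ID" \
            --instance-id "$INSTANCE_ID" \
            --query '{Status:Status,Stdout:StandardOutputContent,Stderr:StandardErrorContent}' \
            --output json

          # Full logs land in CloudWatch because CloudWatchOutputEnabled=true.
          echo "Full output: log group /aws/ssm/AWS-RunShellScript,"
          echo "stream ${CMD_ID}/${INSTANCE_ID}/aws-runShellScript/stdout (and .../stderr)"
```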
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/cd-staging-ec2.yml:
- Around line 47-59: Replace the unreliable aws ssm waiter with a manual polling
loop that repeatedly calls aws ssm get-command-invocation using the existing
CMD_ID and INSTANCE_ID until the invocation Status is one of the terminal values
(Success, Failed, Cancelled, TimedOut); implement a sleep between polls (e.g.,
5s) and enforce a maximum timeout equal to the SSM document executionTimeout
(3600s) to avoid infinite waiting, then exit with code 0 on Success and non‑zero
on Failed/Cancelled/TimedOut (propagate that exit code) instead of using the
WAIT_EXIT waiter approach.
---
Nitpick comments:
In @.github/workflows/cd-production.yml:
- Line 52: The current CONTAINER_NAME extraction uses
taskDefinition.containerDefinitions[0].name which breaks if a sidecar is added;
update the aws ecs describe-task-definition --query to select the intended
container by a stable identifier (e.g. match name == 'backend' or filter
essential==`true`) instead of index 0 so run-task executes alembic against the
real app container (use a JMESPath filter such as
taskDefinition.containerDefinitions[?name=='backend'].name | [0] or
taskDefinition.containerDefinitions[?essential==`true`].name | [0] when setting
CONTAINER_NAME).
In @.github/workflows/cd-staging-ec2.yml:
- Around line 53-57: get-command-invocation currently only prints truncated
StandardOutputContent/StandardErrorContent; update the aws ssm
get-command-invocation call to also return and echo the CloudWatch log location
so failures are debuggable without visiting the console (you already enable
--cloud-watch-output-config CloudWatchOutputEnabled=true). Modify the --query
used in the get-command-invocation invocation to include the CloudWatch
output/log group and stream fields (from the get-command-invocation response
object such as the CloudWatch output config/log identifiers) so the workflow
prints the CloudWatch log group/stream (or URL) alongside Status/Stdout/Stderr.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 815aae9e-5812-4105-90c9-cb8d941d4110
📒 Files selected for processing (3)
- .github/workflows/cd-production.yml
- .github/workflows/cd-staging-ec2.yml
- docker-compose.staging.yml
✅ Files skipped from review due to trivial changes (1)
- docker-compose.staging.yml
Issue: #794
Summary
- `docker-compose.staging.yml` includes only `backend` and `celery_worker`, since staging uses external RDS / EC2-hosted Redis & RabbitMQ. A minimal sketch of this compose file follows below.
- Renamed `cd-staging.yml` → `cd-staging-ecs.yml` to make room for the new EC2-based variant.
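As described above, the staging compose file might look roughly like this. Service names follow the PR description; the build context, env wiring, and Celery module path are illustrative assumptions, not copied from the PR.

```yaml
# Sketch of docker-compose.staging.yml per the PR description.
# Only backend and celery_worker run here; Postgres (RDS), Redis, and
# RabbitMQ are external, so they are reached via environment variables.
services:
  backend:
    build: ./backend            # build context assumed
    env_file: .env              # points at external RDS / Redis / RabbitMQ
    ports:
      - "8000:8000"
    restart: unless-stopped

  celery_worker:
    build: ./backend
    command: celery -A app.worker worker --loglevel=info   # module path assumed
    env_file: .env
    restart: unless-stopped
```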
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
- Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.