fix(frontend_multi_user): unblock /viewplan and gunicorn control server #643
Merged
Two issues were causing gunicorn worker timeouts in frontend_multi_user.
1. /viewplan handler was loading the deferred ~800KB
generated_report_html column, then handing it to make_response as a
single string. The deferred load coincided with a 134-second postgres
checkpoint in one captured incident, so the worker sat past the 120s
gunicorn timeout and got SIGABRT'd. The traceback showed gunicorn's
recv() because that is just where the abort interrupted the worker
between requests; the slow request had been /viewplan.
Now:
- Pre-check uses the cheap has_generated_report_html column_property
so the deferred column is not loaded merely to test for null.
- The deferred load is timed and logged, so any future timeout points
at DB load vs. response send.
- The response is streamed in 64KB chunks instead of one big send.
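The 64KB chunked send in the last bullet can be sketched as a plain generator handed to Flask's `Response` instead of `make_response(big_string)`; `iter_chunks` and `report_html` are illustrative names, not the actual handler code:

```python
CHUNK_SIZE = 64 * 1024  # 64KB, matching the chunk size described above


def iter_chunks(body: str, size: int = CHUNK_SIZE):
    """Yield successive size-byte slices of an already-loaded string."""
    data = body.encode("utf-8")
    for start in range(0, len(data), size):
        yield data[start:start + size]


# In the handler, the generator replaces the single big send, e.g.:
#   return Response(iter_chunks(report_html), mimetype="text/html")
```

Flask streams a generator body chunk by chunk, so the worker is writing to the socket incrementally rather than buffering one ~800KB string into a single send.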
2. Gunicorn 25.x writes control-server state to <cwd>/.gunicorn at
master startup, resolved before --chdir takes effect, so it lands in
/app — which is root-owned in the image. The master logged
"Permission denied: '/app/.gunicorn'" on every restart. Pre-create
the directory with appuser ownership in the Dockerfile.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a gunicorn worker times out, the traceback is always its idle recv() loop and tells us nothing about which request consumed the 120s. Add two layers of request logging so the next timeout names the URL:
- Gunicorn --access-logfile - emits one access line per completed request.
- Flask before_request/after_request hooks emit "→ METHOD path" on arrival and "← STATUS METHOD path durationMs" on completion.
The arrival log fires before the handler runs, so a hung handler's URL is captured even if the response never lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
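A minimal sketch of the two hooks, assuming a standard Flask app; the logger name `request_log` is illustrative, while the arrow-prefixed format follows the commit message:

```python
import logging
import time

from flask import Flask, g, request

app = Flask(__name__)
log = logging.getLogger("request_log")


@app.before_request
def log_arrival():
    # Fires before the handler runs, so a hung handler's URL is
    # already on record even if the response never completes.
    g.request_start = time.monotonic()
    log.info("→ %s %s", request.method, request.path)


@app.after_request
def log_completion(response):
    duration_ms = (time.monotonic() - g.request_start) * 1000
    log.info("← %s %s %s %.0fms",
             response.status_code, request.method, request.path, duration_ms)
    return response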
Workaround for repeated sync-worker timeouts firing exactly 120s after a successful large response (zip download, /viewplan).
Root cause: the sync worker waits in recv() on a kept-alive TCP connection that Cloudflare's edge holds open without ever sending the next request; the idle wait counts against --timeout, so the master kills the worker.
--keep-alive 0 forces the connection to close immediately after each response, so the worker returns to accept() instead of sitting in recv(). Trivial perf cost; eliminates the timeout class entirely.
Full rationale, alternatives, and revisit conditions are documented in the Dockerfile next to the CMD.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
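For illustration, the relevant settings expressed as a gunicorn.conf.py rather than CMD flags; only the keep-alive value is the change described here, and the bind address is an assumption:

```python
# gunicorn.conf.py: config-file equivalent of the CMD flags above.
bind = "0.0.0.0:8000"  # assumed; not specified in this change

timeout = 120   # the sync-worker timeout that was firing
keepalive = 0   # close each connection right after the response, so the
                # worker returns to accept() instead of idling in recv()
                # on an edge-held keep-alive socket
accesslog = "-" # one access line per completed request, to stdout
```

The actual change lives in the Dockerfile CMD; a config file is just the file-based spelling of the same knobs.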
Each /healthcheck currently produces three log lines: Flask before hook, Flask after hook, gunicorn access log. Healthchecks fire every 10s from the docker healthcheck, so this dominates the console.
Drop the Flask before/after lines for /healthcheck specifically; the gunicorn access line still records each one. If healthchecks ever start hanging, the missing access-log line plus the WORKER TIMEOUT pattern remain a clear signal, so we keep the diagnostic value while losing the noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
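The exclusion can be a small guard at the top of each hook; a minimal sketch, assuming hooks like those described in the request-logging commit (`SILENCED_PATHS` is a hypothetical name):

```python
import logging
import time

from flask import Flask, g, request

app = Flask(__name__)
log = logging.getLogger("request_log")

# Paths whose Flask-level lines we drop; gunicorn's access log still
# records each hit, so the diagnostic signal survives.
SILENCED_PATHS = {"/healthcheck"}


@app.before_request
def log_arrival():
    g.request_start = time.monotonic()
    if request.path in SILENCED_PATHS:
        return
    log.info("→ %s %s", request.method, request.path)


@app.after_request
def log_completion(response):
    if request.path not in SILENCED_PATHS:
        duration_ms = (time.monotonic() - g.request_start) * 1000
        log.info("← %s %s %s %.0fms",
                 response.status_code, request.method, request.path, duration_ms)
    return response
```

Timing is still recorded for silenced paths, so lifting the guard later needs no other change.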
Summary
`/viewplan` was hanging gunicorn workers past the 120s timeout. The handler loads the deferred `generated_report_html` column (~800KB typical) and hands it to `make_response` as one big string. In one captured incident this lazy load overlapped a 134s postgres checkpoint, so the worker silently sat on the DB read until the master killed it. The "no URI read" traceback was misleading; that's just where SIGABRT interrupted the worker after the slow request returned.

Separately, gunicorn 25.x writes control-server state to `<cwd>/.gunicorn` at master startup, before `--chdir` is applied, so it landed in `/app`, which is root-owned in the image, hence `Permission denied: '/app/.gunicorn'` on every restart.

Changes
frontend_multi_user/src/plan_routes.py:
- Add a `has_generated_report_html` column_property so the deferred TEXT column isn't loaded just to test for null.
- Time and log the deferred load: `ViewPlan: loaded report for plan_id=… in N.NNs (B bytes)`.
- Stream the response in 64KB chunks instead of `make_response(big_string)`.

frontend_multi_user/Dockerfile: pre-create `/app/.gunicorn` owned by `appuser`.

Notes
Streaming/chunking won't make a slow DB read fast. The new timing log is so the next timeout cleanly attributes time to DB load vs. response send. If it confirms the DB read is the bottleneck, the right follow-up is to move report HTML out of `plans.generated_report_html` and onto disk/object storage.

Test plan
- Build the `frontend_multi_user` image and confirm no more `Permission denied: '/app/.gunicorn'` in master logs
- Hit `/viewplan` and confirm the new `ViewPlan: loaded report …` log line appears