Skip to content

fix(frontend_multi_user): unblock /viewplan and gunicorn control server#643

Merged
neoneye merged 4 commits intomainfrom
fix/viewplan-stream-and-gunicorn-perms
Apr 28, 2026
Merged

fix(frontend_multi_user): unblock /viewplan and gunicorn control server#643
neoneye merged 4 commits intomainfrom
fix/viewplan-stream-and-gunicorn-perms

Conversation

@neoneye
Copy link
Copy Markdown
Member

@neoneye neoneye commented Apr 28, 2026

Summary

  • /viewplan was hanging gunicorn workers past the 120s timeout. The handler loads the deferred generated_report_html column (~800KB typical) and hands it to make_response as one big string. In one captured incident this lazy load overlapped a 134s postgres checkpoint, so the worker silently sat on the DB read until the master killed it. The "no URI read" traceback was misleading — that's just where SIGABRT interrupted the worker after the slow request returned.
  • Gunicorn 25.x writes its control-server state to <cwd>/.gunicorn at master startup, before --chdir is applied, so it landed in /app — root-owned in the image, hence Permission denied: '/app/.gunicorn' on every restart.

Changes

  • frontend_multi_user/src/plan_routes.py:
    • First check now uses the cheap has_generated_report_html column_property so the deferred TEXT column isn't loaded just to test for null.
    • Deferred load is timed and logged: ViewPlan: loaded report for plan_id=… in N.NNs (B bytes).
    • Response is streamed in 64KB chunks via a generator instead of make_response(big_string).
  • frontend_multi_user/Dockerfile: pre-create /app/.gunicorn owned by appuser.

Notes

Streaming/chunking won't make a slow DB read fast. The new timing log is so the next timeout cleanly attributes time to DB load vs. response send. If it confirms the DB read is the bottleneck, the right follow-up is to move report HTML out of plans.generated_report_html and onto disk/object storage.

Test plan

  • Rebuild frontend_multi_user image and confirm no more Permission denied: '/app/.gunicorn' in master logs
  • Open a completed plan via /viewplan and confirm the new ViewPlan: loaded report … log line appears
  • Confirm the report renders end-to-end with chunked streaming

neoneye and others added 4 commits April 28, 2026 17:28
Two issues were causing gunicorn worker timeouts in frontend_multi_user.

1. /viewplan handler was loading the deferred ~800KB
   generated_report_html column, then handing it to make_response as a
   single string. The deferred load coincided with a 134-second postgres
   checkpoint in one captured incident, so the worker sat past the 120s
   gunicorn timeout and got SIGABRT'd. The traceback showed gunicorn's
   recv() because that is just where the abort interrupted the worker
   between requests; the slow request had been /viewplan.

   Now:
   - Pre-check uses the cheap has_generated_report_html column_property
     so the deferred column is not loaded merely to test for null.
   - The deferred load is timed and logged, so any future timeout points
     at DB load vs. response send.
   - The response is streamed in 64KB chunks instead of one big send.

2. Gunicorn 25.x writes control-server state to <cwd>/.gunicorn at
   master startup, resolved before --chdir takes effect, so it lands in
   /app — which is root-owned in the image. The master logged
   "Permission denied: '/app/.gunicorn'" on every restart. Pre-create
   the directory with appuser ownership in the Dockerfile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a gunicorn worker times out, the traceback is always its idle
recv() loop and tells us nothing about which request consumed the
120s. Add two layers of request logging so the next timeout names
the URL:

- Gunicorn --access-logfile - emits one access line per completed
  request.
- Flask before_request/after_request hooks emit "→ METHOD path" on
  arrival and "← STATUS METHOD path durationMs" on completion. The
  arrival log fires before the handler runs, so a hung handler's
  URL is captured even if the response never lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workaround for repeated sync-worker timeouts firing exactly 120s after
a successful large response (zip download, /viewplan). Root cause is
the sync worker waiting in recv() on a kept-alive TCP connection that
Cloudflare's edge holds open without ever sending the next request;
the idle wait counts against --timeout so the master kills the worker.

--keep-alive 0 forces the connection to close immediately after each
response, so the worker returns to accept() instead of sitting in
recv(). Trivial perf cost; eliminates the timeout class entirely.

Full rationale, alternatives, and revisit conditions documented in the
Dockerfile next to the CMD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each /healthcheck currently produces three log lines: Flask before
hook, Flask after hook, gunicorn access log. Healthchecks fire every
10s from the docker healthcheck, so this dominates the console.

Drop the Flask before/after lines for /healthcheck specifically; the
gunicorn access line still records each one. If healthchecks ever
start hanging the missing access-log line plus the WORKER TIMEOUT
pattern remain a clear signal, so we keep diagnostic value while
losing the noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@neoneye neoneye merged commit fdc9367 into main Apr 28, 2026
3 checks passed
@neoneye neoneye deleted the fix/viewplan-stream-and-gunicorn-perms branch April 28, 2026 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant