fix(frontend_multi_user): unblock /viewplan and gunicorn control server #643
Merged
Two issues were causing gunicorn worker timeouts in frontend_multi_user.
1. /viewplan handler was loading the deferred ~800KB
generated_report_html column, then handing it to make_response as a
single string. The deferred load coincided with a 134-second postgres
checkpoint in one captured incident, so the worker sat past the 120s
gunicorn timeout and got SIGABRT'd. The traceback showed gunicorn's
recv() because that is just where the abort interrupted the worker
between requests; the slow request had been /viewplan.
Now:
- Pre-check uses the cheap has_generated_report_html column_property
so the deferred column is not loaded merely to test for null.
- The deferred load is timed and logged, so any future timeout points
at DB load vs. response send.
- The response is streamed in 64KB chunks instead of one big send.
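The 64KB chunked send in the last bullet can be sketched as a plain generator handed to Flask's `Response` instead of `make_response(big_string)`; `iter_chunks` and `report_html` are illustrative names, not the actual handler code:

```python
CHUNK_SIZE = 64 * 1024  # 64KB, matching the chunk size described above


def iter_chunks(body: str, size: int = CHUNK_SIZE):
    """Yield successive size-byte slices of an already-loaded string."""
    data = body.encode("utf-8")
    for start in range(0, len(data), size):
        yield data[start:start + size]


# In the handler, the generator replaces the single big send, e.g.:
#   return Response(iter_chunks(report_html), mimetype="text/html")
```

Flask streams a generator body chunk by chunk, so the worker is writing to the socket incrementally rather than buffering one ~800KB string into a single send.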
2. Gunicorn 25.x writes control-server state to <cwd>/.gunicorn at
master startup, resolved before --chdir takes effect, so it lands in
/app — which is root-owned in the image. The master logged
"Permission denied: '/app/.gunicorn'" on every restart. Pre-create
the directory with appuser ownership in the Dockerfile.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a gunicorn worker times out, the traceback is always its idle recv() loop and tells us nothing about which request consumed the 120s. Add two layers of request logging so the next timeout names the URL:
- Gunicorn --access-logfile - emits one access line per completed request.
- Flask before_request/after_request hooks emit "→ METHOD path" on arrival and "← STATUS METHOD path durationMs" on completion.
The arrival log fires before the handler runs, so a hung handler's URL is captured even if the response never lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
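A minimal sketch of the two hooks, assuming a standard Flask app; the logger name `request_log` is illustrative, while the arrow-prefixed format follows the commit message:

```python
import logging
import time

from flask import Flask, g, request

app = Flask(__name__)
log = logging.getLogger("request_log")


@app.before_request
def log_arrival():
    # Fires before the handler runs, so a hung handler's URL is
    # already on record even if the response never completes.
    g.request_start = time.monotonic()
    log.info("→ %s %s", request.method, request.path)


@app.after_request
def log_completion(response):
    duration_ms = (time.monotonic() - g.request_start) * 1000
    log.info("← %s %s %s %.0fms",
             response.status_code, request.method, request.path, duration_ms)
    return response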
Workaround for repeated sync-worker timeouts firing exactly 120s after a successful large response (zip download, /viewplan).
Root cause: the sync worker waits in recv() on a kept-alive TCP connection that Cloudflare's edge holds open without ever sending the next request; the idle wait counts against --timeout, so the master kills the worker.
--keep-alive 0 forces the connection to close immediately after each response, so the worker returns to accept() instead of sitting in recv(). Trivial perf cost; eliminates the timeout class entirely.
Full rationale, alternatives, and revisit conditions are documented in the Dockerfile next to the CMD.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
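For illustration, the relevant settings expressed as a gunicorn.conf.py rather than CMD flags; only the keep-alive value is the change described here, and the bind address is an assumption:

```python
# gunicorn.conf.py: config-file equivalent of the CMD flags above.
bind = "0.0.0.0:8000"  # assumed; not specified in this change

timeout = 120   # the sync-worker timeout that was firing
keepalive = 0   # close each connection right after the response, so the
                # worker returns to accept() instead of idling in recv()
                # on an edge-held keep-alive socket
accesslog = "-" # one access line per completed request, to stdout
```

The actual change lives in the Dockerfile CMD; a config file is just the file-based spelling of the same knobs.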
Each /healthcheck currently produces three log lines: Flask before hook, Flask after hook, gunicorn access log. Healthchecks fire every 10s from the docker healthcheck, so this dominates the console.
Drop the Flask before/after lines for /healthcheck specifically; the gunicorn access line still records each one. If healthchecks ever start hanging, the missing access-log line plus the WORKER TIMEOUT pattern remain a clear signal, so we keep the diagnostic value while losing the noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
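The exclusion can be a small guard at the top of each hook; a minimal sketch, assuming hooks like those described in the request-logging commit (`SILENCED_PATHS` is a hypothetical name):

```python
import logging
import time

from flask import Flask, g, request

app = Flask(__name__)
log = logging.getLogger("request_log")

# Paths whose Flask-level lines we drop; gunicorn's access log still
# records each hit, so the diagnostic signal survives.
SILENCED_PATHS = {"/healthcheck"}


@app.before_request
def log_arrival():
    g.request_start = time.monotonic()
    if request.path in SILENCED_PATHS:
        return
    log.info("→ %s %s", request.method, request.path)


@app.after_request
def log_completion(response):
    if request.path not in SILENCED_PATHS:
        duration_ms = (time.monotonic() - g.request_start) * 1000
        log.info("← %s %s %s %.0fms",
                 response.status_code, request.method, request.path, duration_ms)
    return response
```

Timing is still recorded for silenced paths, so lifting the guard later needs no other change.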
Summary
`/viewplan` was hanging gunicorn workers past the 120s timeout. The handler loads the deferred `generated_report_html` column (~800KB typical) and hands it to `make_response` as one big string. In one captured incident this lazy load overlapped a 134s postgres checkpoint, so the worker silently sat on the DB read until the master killed it. The "no URI read" traceback was misleading; that's just where SIGABRT interrupted the worker after the slow request returned.

Separately, gunicorn 25.x writes control-server state to `<cwd>/.gunicorn` at master startup, before `--chdir` is applied, so it landed in `/app`, which is root-owned in the image, hence `Permission denied: '/app/.gunicorn'` on every restart.

Changes
frontend_multi_user/src/plan_routes.py:
- Add a `has_generated_report_html` column_property so the deferred TEXT column isn't loaded just to test for null.
- Time and log the deferred load: `ViewPlan: loaded report for plan_id=… in N.NNs (B bytes)`.
- Stream the response in 64KB chunks instead of `make_response(big_string)`.

frontend_multi_user/Dockerfile: pre-create `/app/.gunicorn` owned by `appuser`.

Notes
Streaming/chunking won't make a slow DB read fast. The new timing log is so the next timeout cleanly attributes time to DB load vs. response send. If it confirms the DB read is the bottleneck, the right follow-up is to move report HTML out of `plans.generated_report_html` and onto disk/object storage.

Test plan
- Build the `frontend_multi_user` image and confirm no more `Permission denied: '/app/.gunicorn'` in master logs
- Hit `/viewplan` and confirm the new `ViewPlan: loaded report …` log line appears