Add max_test_duration cap to heartbeat thread

In #384 we removed the heartbeat countdown to fix the poisoning bug, but that means a stuck test will heartbeat forever now. The old countdown was the right idea -- it just used the wrong unit (`config.timeout`, the staleness threshold) instead of `max_test_duration`.

The heartbeat thread is the natural place to enforce this. If it stops heartbeating after `max_test_duration`, the entry goes stale in the running ZSET, `reserve_lost` reclaims it, and the system self-heals through the existing lease mechanism. No new coordination needed.
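As a rough illustration of the idea (not the project's actual code), a capped heartbeat loop could look something like the sketch below. The `CappedHeartbeat` class, the `beat` callback, and the `interval` parameter are all hypothetical names; only `max_test_duration` comes from this issue. The key point is that the cap is measured from when the test started, and once it is exceeded the thread simply stops beating and lets the existing staleness check do the rest.

```python
import threading
import time


class CappedHeartbeat:
    """Hypothetical heartbeat loop that stops beating after max_test_duration."""

    def __init__(self, beat, interval, max_test_duration, clock=time.monotonic):
        self._beat = beat                # assumed callback that refreshes the lease
        self._interval = interval        # seconds between heartbeats (assumed)
        self._max = max_test_duration    # hard cap on total heartbeating time
        self._clock = clock
        self._stop = threading.Event()
        self._thread = None
        self.capped = False              # True once the cap ended the loop

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        started = self._clock()
        while not self._stop.is_set():
            if self._clock() - started >= self._max:
                # Stop refreshing; the entry is left to go stale and be reclaimed.
                self.capped = True
                return
            self._beat()
            # Sleep for one interval, waking early if stop() is called.
            self._stop.wait(self._interval)

    def stop(self):
        # Normal path: the test finished before hitting the cap.
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Note that the thread does nothing active when the cap is hit: it does not requeue or report anything, which is what keeps this free of new coordination.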

Right now we're relying on the application layer's own `max_test_duration` enforcement and on supervisor-side timeouts (`report_timeout` / `inactive_workers_timeout`) as safety nets, but having the heartbeat thread enforce the cap directly would be cleaner.


Add max_test_duration cap to heartbeat thread #395
