In #384 we removed the heartbeat countdown to fix the poisoning bug, but that means a stuck test will now heartbeat forever. The old countdown was the right idea -- it just counted down from the wrong value (config.timeout, the staleness threshold) instead of max_test_duration.
The heartbeat thread is the natural place to enforce this. If it stops heartbeating after max_test_duration, the entry goes stale in the running ZSET, reserve_lost reclaims it, and the system self-heals through the existing lease mechanism. No new coordination needed.
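A minimal sketch of what that could look like, in Python for illustration only. The names `run_heartbeat`, `send_heartbeat`, `stop`, and `interval` are hypothetical and not taken from the codebase; `send_heartbeat` stands in for whatever refreshes the entry in the running ZSET.

```python
import threading
import time
from typing import Callable


def run_heartbeat(
    send_heartbeat: Callable[[], None],  # hypothetical: refreshes the lease entry
    stop: threading.Event,               # hypothetical: set when the test finishes
    interval: float,                     # seconds between heartbeats (assumed)
    max_test_duration: float,            # total heartbeat budget, in seconds
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Heartbeat until the test finishes or max_test_duration elapses.

    Returns "finished" if `stop` was set, "deadline" if the budget ran out.
    """
    deadline = clock() + max_test_duration
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            # Budget exhausted: go quiet so the entry can go stale
            # and be reclaimed through the existing lease mechanism.
            return "deadline"
        send_heartbeat()
        # Sleep one interval, but never past the deadline;
        # wake early if the test finishes in the meantime.
        if stop.wait(timeout=min(interval, remaining)):
            return "finished"
```

The loop only ever stops sending heartbeats; it does not touch the queue itself, so reclaiming the entry stays with the existing stale-entry path.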
Right now we're relying on the application layer (max_test_duration enforced by the app itself) and supervisor-side timeouts (report_timeout / inactive_workers_timeout) as safety nets, but having the heartbeat thread own this directly would be cleaner.