
Fix distributed deadlock when wandb.init hangs on rank 0#249

Open
davidoj wants to merge 1 commit into main from fix-wandb-init-deadlock

Conversation

Contributor

davidoj commented Apr 29, 2026

Summary

If wandb.init failed, a bergson run would stall with uninformative logs. Two changes:

  • Bound wandb.init via WANDB__SERVICE_WAIT=30 / WANDB_INIT_TIMEOUT=60 defaults (user-overridable), and degrade to a no-op log_fn with a UserWarning if wandb.init fails — a broken wandb setup means "no logging" instead of "hang the run".
  • Add dist.barrier() right after the rank-0-only wandb_log_fn call in bergson/magic/cli.py worker(...), so a future hung wandb.init shows up as rank 0 stuck inside wandb rather than ranks 1..N stuck on the next Gloo collective (the dist.new_group inside fwd_state.save).

The second part might be excessively defensive.
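The first change boils down to a standard pattern: set timeout env vars before init, and return a no-op logger on failure. A minimal sketch follows; the helper name `make_wandb_log_fn`, the `log_fn` signature, and the `init_fn` parameter (which stands in for `wandb.init` so the pattern is testable without wandb installed) are all illustrative, not the PR's actual code:

```python
import os
import warnings


def make_wandb_log_fn(init_fn, **init_kwargs):
    """Return a metrics-logging callable, degrading to a no-op if init fails.

    `init_fn` stands in for wandb.init. The env defaults below bound how long
    wandb waits on its service process; setdefault never clobbers a value the
    user has already exported, so they remain user-overridable.
    """
    os.environ.setdefault("WANDB__SERVICE_WAIT", "30")
    os.environ.setdefault("WANDB_INIT_TIMEOUT", "60")
    try:
        run = init_fn(**init_kwargs)
    except Exception as exc:
        # Broken wandb setup => "no logging", not "hang the run".
        warnings.warn(f"wandb init failed ({exc!r}); metrics will not be logged.")
        return lambda metrics, step=None: None  # no-op log_fn
    return lambda metrics, step=None: run.log(metrics, step=step)
```

The key design choice is that the fallback preserves the `log_fn` call signature, so call sites need no `if wandb` branching.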

Test plan

  • pytest tests/test_log_fn.py -v — 4/4 pass, including new test_wandb_log_fn_falls_back_when_init_fails
  • pytest tests/test_magic.py — 3/3 pass (single-process path where the new barrier is a no-op)
  • 8-GPU failure injection on the pythia-14m smoke config: WANDB_API_KEY=bogus_test_key bergson magic configs/... — the UserWarning fires from rank 0 and the run completes cleanly
  • 8-GPU happy path on pythia-14m smoke config (real wandb, no env override) — recommend running once before merging to confirm the new barrier doesn't introduce timing weirdness

🤖 Generated with Claude Code, edited by David

If rank 0's wandb.init() hangs (dead wandb-service daemon, missing API
key, or unreachable wandb.ai), the other ranks race ahead into the next
collective — the Gloo new_group() inside dcp.async_save during
fwd_state.save(state0) — and the whole job deadlocks. The hang surfaces
as ranks 1..N stuck on a collective rather than rank 0 stuck in wandb,
which is maximally misleading.

Two changes:

* bergson/utils/logging.py: bound wandb.init via WANDB__SERVICE_WAIT=30
  and WANDB_INIT_TIMEOUT=60 (os.environ.setdefault, user-overridable),
  and catch wandb.init failures so we degrade to a no-op log_fn with a
  UserWarning instead of propagating. A broken wandb setup now means
  "no logging" instead of "hang the run".

* bergson/magic/cli.py: dist.barrier() right after the rank-0-only
  wandb_log_fn call. The barrier doesn't fix a hung wandb.init, but it
  ensures the hang shows up as rank 0 stuck inside wandb rather than
  ranks 1..N stuck on a Gloo collective.
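The barrier can be guarded so the same code path works single-process, where no process group exists. A sketch, assuming the helper name is invented (in the PR the call sits inline in worker(...)):

```python
import torch.distributed as dist


def barrier_after_rank0_init() -> None:
    """Synchronize all ranks after rank 0's wandb setup.

    No-op in single-process runs (no process group initialized). In
    distributed runs, ranks 1..N wait here, so if rank 0 hangs inside
    wandb.init the stall is visibly attributed to rank 0's wandb call,
    not to a later Gloo collective the other ranks reached first.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

This relies on dist.is_initialized() returning False before init_process_group is called, which is why the tests/test_magic.py single-process path is unaffected.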

Verified by failure injection (WANDB_API_KEY=bogus_test_key) on an
8-GPU magic run: wandb.init raises AuthenticationError, the warning
fires, and training proceeds to completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
davidoj force-pushed the fix-wandb-init-deadlock branch from f3f0eef to fdd2a74 on April 29, 2026 06:46
davidoj marked this pull request as ready for review April 29, 2026 23:46
davidoj requested a review from norabelrose April 29, 2026 23:46