
Fix distributed deadlock when wandb.init hangs on rank 0#249

Open
davidoj wants to merge 1 commit into main from fix-wandb-init-deadlock

Conversation

Contributor

davidoj commented Apr 29, 2026

Summary

If wandb.init failed, a bergson run would stall with uninformative logs. Two changes:

  • Bound wandb.init via WANDB__SERVICE_WAIT=30 / WANDB_INIT_TIMEOUT=60 defaults (user-overridable), and degrade to a no-op log_fn with a UserWarning if wandb.init fails — a broken wandb setup means "no logging" instead of "hang the run".
  • Add dist.barrier() right after the rank-0-only wandb_log_fn call in bergson/magic/cli.py worker(...), so a future hung wandb.init shows up as rank 0 stuck inside wandb rather than ranks 1..N stuck on the next Gloo collective (the dist.new_group inside fwd_state.save).

The second part might be excessively defensive.
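The first change boils down to a standard pattern: set timeout env vars before init, and return a no-op logger on failure. A minimal sketch follows; the helper name `make_wandb_log_fn`, the `log_fn` signature, and the `init_fn` parameter (which stands in for `wandb.init` so the pattern is testable without wandb installed) are all illustrative, not the PR's actual code:

```python
import os
import warnings


def make_wandb_log_fn(init_fn, **init_kwargs):
    """Return a metrics-logging callable, degrading to a no-op if init fails.

    `init_fn` stands in for wandb.init. The env defaults below bound how long
    wandb waits on its service process; setdefault never clobbers a value the
    user has already exported, so they remain user-overridable.
    """
    os.environ.setdefault("WANDB__SERVICE_WAIT", "30")
    os.environ.setdefault("WANDB_INIT_TIMEOUT", "60")
    try:
        run = init_fn(**init_kwargs)
    except Exception as exc:
        # Broken wandb setup => "no logging", not "hang the run".
        warnings.warn(f"wandb init failed ({exc!r}); metrics will not be logged.")
        return lambda metrics, step=None: None  # no-op log_fn
    return lambda metrics, step=None: run.log(metrics, step=step)
```

The key design choice is that the fallback preserves the `log_fn` call signature, so call sites need no `if wandb` branching.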

Test plan

  • pytest tests/test_log_fn.py -v — 4/4 pass, including new test_wandb_log_fn_falls_back_when_init_fails
  • pytest tests/test_magic.py — 3/3 pass (single-process path where the new barrier is a no-op)
  • 8-GPU failure injection on the pythia-14m smoke config: WANDB_API_KEY=bogus_test_key bergson magic configs/... — the UserWarning fires from rank 0 and the run completes cleanly
  • 8-GPU happy path on pythia-14m smoke config (real wandb, no env override) — recommend running once before merging to confirm the new barrier doesn't introduce timing weirdness

🤖 Generated with Claude Code, edited by David

If rank 0's wandb.init() hangs (dead wandb-service daemon, missing API
key, or unreachable wandb.ai), the other ranks race ahead into the next
collective — the Gloo new_group() inside dcp.async_save during
fwd_state.save(state0) — and the whole job deadlocks. The hang surfaces
as ranks 1..N stuck on a collective rather than rank 0 stuck in wandb,
which is maximally misleading.

Two changes:

* bergson/utils/logging.py: bound wandb.init via WANDB__SERVICE_WAIT=30
  and WANDB_INIT_TIMEOUT=60 (os.environ.setdefault, user-overridable),
  and catch wandb.init failures so we degrade to a no-op log_fn with a
  UserWarning instead of propagating. A broken wandb setup now means
  "no logging" instead of "hang the run".

* bergson/magic/cli.py: dist.barrier() right after the rank-0-only
  wandb_log_fn call. The barrier doesn't fix a hung wandb.init, but it
  ensures the hang shows up as rank 0 stuck inside wandb rather than
  ranks 1..N stuck on a Gloo collective.
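The barrier can be guarded so the same code path works single-process, where no process group exists. A sketch, assuming the helper name is invented (in the PR the call sits inline in worker(...)):

```python
import torch.distributed as dist


def barrier_after_rank0_init() -> None:
    """Synchronize all ranks after rank 0's wandb setup.

    No-op in single-process runs (no process group initialized). In
    distributed runs, ranks 1..N wait here, so if rank 0 hangs inside
    wandb.init the stall is visibly attributed to rank 0's wandb call,
    not to a later Gloo collective the other ranks reached first.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

This relies on dist.is_initialized() returning False before init_process_group is called, which is why the tests/test_magic.py single-process path is unaffected.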

Verified by failure injection (WANDB_API_KEY=bogus_test_key) on an
8-GPU magic run: wandb.init raises AuthenticationError, the warning
fires, and training proceeds to completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
davidoj force-pushed the fix-wandb-init-deadlock branch from f3f0eef to fdd2a74 on April 29, 2026 06:46
davidoj marked this pull request as ready for review April 29, 2026 23:46
davidoj requested a review from norabelrose April 29, 2026 23:46