compilAR is a compiler and runtime for straggler-aware AllReduce over GPU clusters, inspired by the paper Efficient AllReduce with Stragglers (Devraj et al.). It takes a schedule produced by the StragglAR algorithm as input and emits a complete, standalone CUDA + MPI + NCCL implementation of that schedule for any number of GPUs.
The core problem it solves is that standard AllReduce algorithms (ring, tree, recursive-halving-doubling) stall every healthy rank until the slowest GPU catches up. StragglAR lets the N-1 healthy ranks make progress among themselves while the straggler is still computing, then merges the straggler's contribution with a minimal number of additional communication rounds once it is ready. This yields measurable throughput improvements when one rank is consistently slower than the rest.
Given N GPUs with one designated straggler (rank N-1):

- Reduce-scatter phase (while the straggler is delayed): the N-1 healthy ranks run `ncclReduceScatter` among themselves over N-1 equal chunks of the buffer. After this, rank r holds the partial sum of chunk r across all healthy ranks.
- Straggler merge phase (schedule-driven): the schedule synthesizer produces a sequence of rounds, each containing a batch of pairwise exchanges. Three exchange types are used:
  - StragglerMatching: a healthy rank and the straggler both hold a partial sum of the same chunk. They swap into a scratch buffer, then both call `reduce_add` to finalize. After this, both hold the complete N-rank sum for that chunk.
  - OneWayMatching: a rank holding a fully-reduced chunk pushes it to a rank that does not. Plain copy, no reduction.
  - TwoWayMatching: two ranks each hold a fully-reduced chunk the other lacks. They swap simultaneously. Plain copy in each direction.

After the last round, every rank holds every chunk fully reduced.
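The semantics of the three exchange types can be sketched with a toy Python model, where each rank's buffer maps a chunk id to the set of source ranks already summed into it (the exchanges shown are illustrative, not a schedule produced by the synthesizer):

```python
# Toy model of the three exchange types. Each rank's "buffer" maps
# chunk id -> the set of source ranks already summed into that chunk.
N = 4
STRAGGLER = N - 1

def straggler_matching(buf, healthy, chunk):
    # Swap partial sums and reduce_add: the union of contributions
    # is the complete N-rank sum for that chunk.
    merged = buf[healthy][chunk] | buf[STRAGGLER][chunk]
    buf[healthy][chunk] = set(merged)
    buf[STRAGGLER][chunk] = set(merged)

def one_way_matching(buf, src, dst, chunk):
    # Plain copy of an already fully-reduced chunk; no reduction.
    buf[dst][chunk] = set(buf[src][chunk])

def two_way_matching(buf, a, b, chunk_a, chunk_b):
    # Simultaneous swap of two fully-reduced chunks.
    buf[b][chunk_a] = set(buf[a][chunk_a])
    buf[a][chunk_b] = set(buf[b][chunk_b])

# State after the reduce-scatter phase: healthy rank r holds chunk r
# summed over ranks {0, 1, 2}; the straggler holds only its own data.
buf = {r: {c: set() for c in range(N - 1)} for r in range(N)}
for r in range(N - 1):
    buf[r][r] = set(range(N - 1))
for c in range(N - 1):
    buf[STRAGGLER][c] = {STRAGGLER}

straggler_matching(buf, 0, 0)   # ranks 0 and 3 now hold chunk 0 complete
one_way_matching(buf, 0, 1, 0)  # rank 1 receives complete chunk 0
```

A full schedule is just a sequence of rounds of such exchanges, chosen so that every rank's map ends up with every chunk containing all N sources.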
compilAR.py takes a schedule file (from synthesizer_pow2.py or synthesizer_nonpow2.py) and generates a complete .cu source file. It uses allreduce_multinode.cu.template as its skeleton and substitutes three things:

- `NUM_RANKS`: number of GPUs, inferred from the schedule
- `kStragglerRank`: straggler rank, inferred from the schedule
- Body of `stragglar_allreduce_helper`: per-round NCCL group blocks, generated from the schedule matchings
Everything else in the template (MPI bootstrap, NCCL communicator init, reduce-scatter sub-communicator, benchmark loop, correctness check, cleanup) is agnostic of the number of GPUs and remains unchanged between schedules.
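The substitution step amounts to filling three holes in the template. A minimal sketch in Python, noting that the placeholder spellings (`@NUM_RANKS@` etc.) are hypothetical and may differ from compilAR's actual template tokens:

```python
def instantiate(template: str, num_ranks: int, straggler: int, helper_body: str) -> str:
    # Fill the three schedule-dependent holes in the CUDA template.
    # The placeholder names used here are hypothetical.
    return (template
            .replace("@NUM_RANKS@", str(num_ranks))
            .replace("@STRAGGLER_RANK@", str(straggler))
            .replace("@HELPER_BODY@", helper_body))

template = ("#define NUM_RANKS @NUM_RANKS@\n"
            "constexpr int kStragglerRank = @STRAGGLER_RANK@;\n"
            "void stragglar_allreduce_helper() {\n@HELPER_BODY@\n}\n")

src = instantiate(template, 8, 7,
                  "    // per-round ncclGroupStart()/ncclGroupEnd() blocks")
```

Everything schedule-independent stays verbatim in the template, so regenerating for a different N touches only these three substitutions.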
Python:
- Python 3.12+
- numpy
CUDA build (on our cluster):
- CUDA Toolkit 12.x
- NCCL 2.x
- OpenMPI or MPICH
- nvcc with an MPI-aware host compiler (`mpicxx`)
The most common use case is when N is a power of 2. Pre-generated schedules for N=2, 4, 8 are already in schedules/.

```sh
uv sync  # or: pip install numpy
cd stragglar/schedules
python synthesizer_pow2.py 8 > 8gpusched.txt
```
```sh
cd stragglar
python compilAR.py schedules/8gpusched.txt generated_8gpu.cu
```

On success:

```
Wrote generated_8gpu.cu (N=8, straggler=7)
```
```sh
nvcc -ccbin mpicxx -O3 -arch=sm_89 generated_8gpu.cu -lnccl -lmpi -o stragglar_8gpu
```

Replace sm_89 with your GPU's compute capability:

```sh
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```

There are two ways to launch the binary depending on whether you want to detect a real straggler or simulate one.
launch.sh combines the smoketester, the rank-to-GPU mapping, and the mpirun invocation in one step:
```sh
./stragglar/launch.sh 8 ./stragglar_8gpu 1073741824 stragglar 10 -1
```

Under the hood it:
- Runs the smoketester on all N GPUs to identify the physically slow one
- Builds a rank-to-GPU mapping so MPI rank N-1 binds to that GPU
- Exports the mapping and calls `mpirun` with `rank_wrapper.sh`, which sets `LOCAL_RANK` per process
When using launch.sh, pass -1 for sleep_ms since the physically slow GPU will lag on its own and no simulated delay is needed.
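The mapping step can be approximated in a few lines of Python. This sketch assumes the smoketester yields one elapsed time per GPU index (the timing format is an assumption; launch.sh's actual parsing may differ):

```python
def build_rank_to_gpu(times_ms):
    # times_ms[i] is GPU i's smoketest time (hypothetical format).
    # Bind MPI rank N-1 to the slowest GPU; the remaining GPUs keep
    # their natural order across ranks 0..N-2.
    slow = max(range(len(times_ms)), key=lambda i: times_ms[i])
    others = [g for g in range(len(times_ms)) if g != slow]
    return others + [slow]  # index = MPI rank, value = GPU id

# GPU 2 is the physically slow one, so rank N-1 binds to it.
mapping = build_rank_to_gpu([11.2, 10.9, 35.4, 11.0])
```

The invariant that matters is only the last entry: whichever GPU the smoketester flags as slow must land at rank N-1.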
For correctness validation or benchmarking without a real straggler:
```sh
# No straggler delay: purely for correctness checks
mpirun -n 8 ./stragglar_8gpu 1073741824 stragglar 10 -1

# With a 100 ms simulated delay injected on rank N-1
mpirun -n 8 ./stragglar_8gpu 1073741824 stragglar 10 100.0
```

In this mode, rank N-1 is always the straggler regardless of physical GPU placement, and the delay is injected via gpu_sleep_kernel.
Binary arguments: `<buffer_bytes> <algorithm> <num_iters> <sleep_ms>`

| Argument | Description |
|---|---|
| `buffer_bytes` | Total AllReduce buffer size in bytes (e.g. 1073741824 = 1 GiB) |
| `algorithm` | Must be `stragglar` |
| `num_iters` | Timed iterations; the first is discarded as warmup |
| `sleep_ms` | Milliseconds to delay rank N-1. `-1` skips reduce-scatter and runs the merge schedule only |
Output (rank 0):

```
algorithm,buffer_size_bytes,iteration,delay,runtime_ms,BW(GB/s)
stragglar,1073741824,1,100.000,12.345,82.345
```
Correctness: every element of every rank's output buffer must equal 6.0f. Failures print Rank X, idx Y, val Z.
The files in reference_code/ and allreduce_4GPU_rewrite.cu use a single-process model: one process manages all GPUs via ncclCommInitAll. This only works when all GPUs are on one machine and is kept for reference.
allreduce_multinode.cu (and all generated files) use a multi-process MPI model: one MPI rank per GPU. Each process initializes its NCCL communicator via ncclCommInitRank with a token distributed by MPI_Bcast. This scales to multi-node configurations.
Two NCCL communicators are maintained per process:
- `comm`: all N ranks; used during the straggler merge schedule
- `subComm`: ranks 0 through N-2 only; used for the reduce-scatter phase while the straggler is delayed, built via `MPI_Comm_split` to exclude rank N-1
Each MPI process binds to its GPU via the LOCAL_RANK environment variable (set by mpirun, torchrun, or SLURM). Without it, the process falls back to myRank % cudaDeviceCount. For accurate straggler behavior, the process assigned rank N-1 must be bound to the physically slow GPU. launch.sh handles this automatically by parsing the smoketester's output and constructing the correct mapping.
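The device-selection logic described above is small enough to sketch directly. This mirrors the described behavior (prefer `LOCAL_RANK`, else round-robin by global rank), not the template's exact code:

```python
import os

def pick_device(my_rank: int, device_count: int) -> int:
    # Prefer LOCAL_RANK when the launcher (mpirun, torchrun, SLURM)
    # provides it; otherwise fall back to round-robin by global rank.
    local = os.environ.get("LOCAL_RANK")
    return int(local) if local is not None else my_rank % device_count
```

Note that the fallback is only safe on a single node with homogeneous GPU numbering; across nodes, `LOCAL_RANK` (or launch.sh's explicit mapping) is required for the straggler to land on the intended device.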
The StragglAR algorithm and the schedule synthesizer in stragglar/schedules/ are the work of Devraj et al., Efficient AllReduce with Stragglers. This project is not the original algorithm — it is a compiler and launch harness built around their work. The algorithmic contribution (the round-matching formulation, optimality bounds, and synthesis procedure) is theirs; what is new here is the code-generation pipeline, the MPI+NCCL runtime template, and the straggler-aware launch integration.