compilAR is a compiler and runtime for straggler-aware AllReduce over GPU clusters, inspired by the paper Efficient AllReduce with Stragglers (Devraj et al.). It takes a schedule produced by the StragglAR algorithm as input and emits a complete, standalone CUDA + MPI + NCCL implementation of that schedule for any number of GPUs.
The core problem it solves is that standard AllReduce algorithms (ring, tree, recursive-halving-doubling) stall every healthy rank until the slowest GPU catches up. StragglAR lets the N-1 healthy ranks make progress among themselves while the straggler is still computing, then merges the straggler's contribution with a minimal number of additional communication rounds once it is ready. This yields measurable throughput improvements when one rank is consistently slower than the rest.
Given N GPUs with one designated straggler (rank N-1):

- Reduce-scatter phase (while the straggler is delayed): the N-1 healthy ranks run `ncclReduceScatter` among themselves over N-1 equal chunks of the buffer. After this, rank r holds the partial sum of chunk r across all healthy ranks.
- Straggler merge phase (schedule-driven): the schedule synthesizer produces a sequence of rounds, each containing a batch of pairwise exchanges. Three exchange types are used:
  - StragglerMatching: a healthy rank and the straggler both hold a partial sum of the same chunk. They swap into a scratch buffer, then both call `reduce_add` to finalize. After this, both hold the complete N-rank sum for that chunk.
  - OneWayMatching: a rank holding a fully-reduced chunk pushes it to a rank that does not. Plain copy, no reduction.
  - TwoWayMatching: two ranks each hold a fully-reduced chunk the other lacks. They swap simultaneously. Plain copy in each direction.

After the last round, every rank holds every chunk fully reduced.
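The semantics of the three exchange types can be sketched with a toy Python model, where each rank's buffer maps a chunk id to the set of source ranks already summed into it (the exchanges shown are illustrative, not a schedule produced by the synthesizer):

```python
# Toy model of the three exchange types. Each rank's "buffer" maps
# chunk id -> the set of source ranks already summed into that chunk.
N = 4
STRAGGLER = N - 1

def straggler_matching(buf, healthy, chunk):
    # Swap partial sums and reduce_add: the union of contributions
    # is the complete N-rank sum for that chunk.
    merged = buf[healthy][chunk] | buf[STRAGGLER][chunk]
    buf[healthy][chunk] = set(merged)
    buf[STRAGGLER][chunk] = set(merged)

def one_way_matching(buf, src, dst, chunk):
    # Plain copy of an already fully-reduced chunk; no reduction.
    buf[dst][chunk] = set(buf[src][chunk])

def two_way_matching(buf, a, b, chunk_a, chunk_b):
    # Simultaneous swap of two fully-reduced chunks.
    buf[b][chunk_a] = set(buf[a][chunk_a])
    buf[a][chunk_b] = set(buf[b][chunk_b])

# State after the reduce-scatter phase: healthy rank r holds chunk r
# summed over ranks {0, 1, 2}; the straggler holds only its own data.
buf = {r: {c: set() for c in range(N - 1)} for r in range(N)}
for r in range(N - 1):
    buf[r][r] = set(range(N - 1))
for c in range(N - 1):
    buf[STRAGGLER][c] = {STRAGGLER}

straggler_matching(buf, 0, 0)   # ranks 0 and 3 now hold chunk 0 complete
one_way_matching(buf, 0, 1, 0)  # rank 1 receives complete chunk 0
```

A full schedule is just a sequence of rounds of such exchanges, chosen so that every rank's map ends up with every chunk containing all N sources.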
compilAR.py takes a schedule file (from synthesizer_pow2.py or synthesizer_nonpow2.py) and generates a complete .cu source file. It uses allreduce_multinode.cu.template as its skeleton and substitutes three things:

- `NUM_RANKS`: number of GPUs, inferred from the schedule
- `kStragglerRank`: straggler rank, inferred from the schedule
- Body of `stragglar_allreduce_helper`: per-round NCCL group blocks, generated from the schedule matchings
Everything else in the template (MPI bootstrap, NCCL communicator init, reduce-scatter sub-communicator, benchmark loop, correctness check, cleanup) is agnostic of the number of GPUs and remains unchanged between schedules.
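The substitution step amounts to filling three holes in the template. A minimal sketch in Python, noting that the placeholder spellings (`@NUM_RANKS@` etc.) are hypothetical and may differ from compilAR's actual template tokens:

```python
def instantiate(template: str, num_ranks: int, straggler: int, helper_body: str) -> str:
    # Fill the three schedule-dependent holes in the CUDA template.
    # The placeholder names used here are hypothetical.
    return (template
            .replace("@NUM_RANKS@", str(num_ranks))
            .replace("@STRAGGLER_RANK@", str(straggler))
            .replace("@HELPER_BODY@", helper_body))

template = ("#define NUM_RANKS @NUM_RANKS@\n"
            "constexpr int kStragglerRank = @STRAGGLER_RANK@;\n"
            "void stragglar_allreduce_helper() {\n@HELPER_BODY@\n}\n")

src = instantiate(template, 8, 7,
                  "    // per-round ncclGroupStart()/ncclGroupEnd() blocks")
```

Everything schedule-independent stays verbatim in the template, so regenerating for a different N touches only these three substitutions.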
Python:
- Python 3.12+
- numpy
CUDA build (on our cluster):
- CUDA Toolkit 12.x
- NCCL 2.x
- OpenMPI or MPICH
- nvcc with an MPI-aware host compiler (`mpicxx`)
The most common use case is when N is a power of 2. Pre-generated schedules for N=2, 4, 8 are already in schedules/.

```sh
uv sync  # or: pip install numpy
cd stragglar/schedules
python synthesizer_pow2.py 8 > 8gpusched.txt
```
```sh
cd stragglar
python compilAR.py schedules/8gpusched.txt generated_8gpu.cu
```

On success:

```
Wrote generated_8gpu.cu (N=8, straggler=7)
```
```sh
nvcc -ccbin mpicxx -O3 -arch=sm_89 generated_8gpu.cu -lnccl -lmpi -o stragglar_8gpu
```

Replace sm_89 with your GPU's compute capability:

```sh
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```

There are two ways to launch the binary depending on whether you want to detect a real straggler or simulate one.
launch.sh combines the smoketester, the rank-to-GPU mapping, and the mpirun invocation in one step:
```sh
./stragglar/launch.sh 8 ./stragglar_8gpu 1073741824 stragglar 10 -1
```

Under the hood it:
- Runs the smoketester on all N GPUs to identify the physically slow one
- Builds a rank-to-GPU mapping so MPI rank N-1 binds to that GPU
- Exports the mapping and calls `mpirun` with `rank_wrapper.sh`, which sets `LOCAL_RANK` per process
When using launch.sh, pass -1 for sleep_ms since the physically slow GPU will lag on its own and no simulated delay is needed.
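The mapping step can be approximated in a few lines of Python. This sketch assumes the smoketester yields one elapsed time per GPU index (the timing format is an assumption; launch.sh's actual parsing may differ):

```python
def build_rank_to_gpu(times_ms):
    # times_ms[i] is GPU i's smoketest time (hypothetical format).
    # Bind MPI rank N-1 to the slowest GPU; the remaining GPUs keep
    # their natural order across ranks 0..N-2.
    slow = max(range(len(times_ms)), key=lambda i: times_ms[i])
    others = [g for g in range(len(times_ms)) if g != slow]
    return others + [slow]  # index = MPI rank, value = GPU id

# GPU 2 is the physically slow one, so rank N-1 binds to it.
mapping = build_rank_to_gpu([11.2, 10.9, 35.4, 11.0])
```

The invariant that matters is only the last entry: whichever GPU the smoketester flags as slow must land at rank N-1.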
For correctness validation or benchmarking without a real straggler:
```sh
# No straggler delay: purely for correctness checks
mpirun -n 8 ./stragglar_8gpu 1073741824 stragglar 10 -1

# With a 100 ms simulated delay injected on rank N-1
mpirun -n 8 ./stragglar_8gpu 1073741824 stragglar 10 100.0
```

In this mode, rank N-1 is always the straggler regardless of physical GPU placement, and the delay is injected via gpu_sleep_kernel.
Binary arguments: `<buffer_bytes> <algorithm> <num_iters> <sleep_ms>`

| Argument | Description |
|---|---|
| `buffer_bytes` | Total AllReduce buffer size in bytes (e.g. 1073741824 = 1 GiB) |
| `algorithm` | Must be `stragglar` |
| `num_iters` | Timed iterations; the first is discarded as warmup |
| `sleep_ms` | Milliseconds to delay rank N-1. `-1` skips reduce-scatter and runs the merge schedule only |
Output (rank 0):

```
algorithm,buffer_size_bytes,iteration,delay,runtime_ms,BW(GB/s)
stragglar,1073741824,1,100.000,12.345,82.345
```
Correctness: every element of every rank's output buffer must equal 6.0f. Failures print Rank X, idx Y, val Z.
The files in reference_code/ and allreduce_4GPU_rewrite.cu use a single-process model: one process manages all GPUs via ncclCommInitAll. This only works when all GPUs are on one machine and is kept for reference.
allreduce_multinode.cu (and all generated files) use a multi-process MPI model: one MPI rank per GPU. Each process initializes its NCCL communicator via ncclCommInitRank with a token distributed by MPI_Bcast. This scales to multi-node configurations.
Two NCCL communicators are maintained per process:
- `comm`: all N ranks; used during the straggler merge schedule
- `subComm`: ranks 0 through N-2 only; used for the reduce-scatter phase while the straggler is delayed, built via `MPI_Comm_split` to exclude rank N-1
Each MPI process binds to its GPU via the LOCAL_RANK environment variable (set by mpirun, torchrun, or SLURM). Without it, the process falls back to myRank % cudaDeviceCount. For accurate straggler behavior, the process assigned rank N-1 must be bound to the physically slow GPU. launch.sh handles this automatically by parsing the smoketester's output and constructing the correct mapping.
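The device-selection logic described above is small enough to sketch directly. This mirrors the described behavior (prefer `LOCAL_RANK`, else round-robin by global rank), not the template's exact code:

```python
import os

def pick_device(my_rank: int, device_count: int) -> int:
    # Prefer LOCAL_RANK when the launcher (mpirun, torchrun, SLURM)
    # provides it; otherwise fall back to round-robin by global rank.
    local = os.environ.get("LOCAL_RANK")
    return int(local) if local is not None else my_rank % device_count
```

Note that the fallback is only safe on a single node with homogeneous GPU numbering; across nodes, `LOCAL_RANK` (or launch.sh's explicit mapping) is required for the straggler to land on the intended device.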
The StragglAR algorithm and the schedule synthesizer in stragglar/schedules/ are the work of Devraj et al., Efficient AllReduce with Stragglers. This project is not the original algorithm — it is a compiler and launch harness built around their work. The algorithmic contribution (the round-matching formulation, optimality bounds, and synthesis procedure) is theirs; what is new here is the code-generation pipeline, the MPI+NCCL runtime template, and the straggler-aware launch integration.