Modeling AllReduce / AllGather Scaling in ASTRA-sim
June 2026
A simulation study of collective communication for
distributed ML: how AllReduce and AllGather latency
scales across a multi-hop 2-D torus versus a one-hop
switch fabric, and where each collective crosses from
latency-bound to bandwidth-bound as a
function of message size (1 KiB–1 GiB) and node count
(4–64).
Built on ASTRA-sim with
a reproducible Docker + Chakra + Python harness. The analytical
backend sweeps the full 220-point grid; the ns-3 packet-level
backend independently validates the regimes.
Headline: the torus's advantage over a switch is largest in the
latency-bound regime at scale (up to 14× at 64 NPUs / 1 KiB)
and collapses to ~1.1–1.5× in the bandwidth-bound regime at
1 GiB. Topology matters most exactly where hop count, not bytes,
dominates.
Code & data on GitHub · About this work
10 Jun 2026
The analytical results
came from an idealized link model — fast, great for sweeping a 220-point grid,
but it does not model packet-level congestion, PFC, or congestion control.
ASTRA-sim’s ns-3 backend does. The question for this final post: do the
regimes survive once real protocol overhead is in the loop?
09 Jun 2026
The previous post
fixed the node count and swept message size. Here I do the opposite — fix the
message size and sweep the node count — because that’s where the two regimes
diverge most, and where topology choice either pays off enormously or barely
matters.
08 Jun 2026
In the overview
I set up a sweep of AllReduce / AllGather across a switch and a 2-D torus,
holding per-link physics identical. Here’s the first result: every collective
lives in one of two regimes, and the message size decides which.
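
To illustrate why two regimes appear at all, here is a minimal sketch that assumes the textbook alpha-beta cost model for a ring AllReduce. It is not ASTRA-sim's link model and not the harness used in this study. The per-step latency `ALPHA_S` and link bandwidth `BANDWIDTH_BPS` are hypothetical placeholders, and the function names are illustrative only.

```python
# Alpha-beta sketch of a ring AllReduce (an assumption for illustration,
# not ASTRA-sim's model). A ring AllReduce over p nodes takes 2*(p-1) steps,
# and each step moves msg_bytes / p bytes over one link.

ALPHA_S = 1e-6          # hypothetical per-step latency, in seconds
BANDWIDTH_BPS = 50e9    # hypothetical link bandwidth, in bytes per second


def ring_allreduce_terms(msg_bytes, nodes, alpha_s=ALPHA_S, bw=BANDWIDTH_BPS):
    """Return (latency_term, bandwidth_term) in seconds."""
    steps = 2 * (nodes - 1)
    latency_term = steps * alpha_s
    bandwidth_term = steps * (msg_bytes / nodes) / bw
    return latency_term, bandwidth_term


def regime(msg_bytes, nodes, alpha_s=ALPHA_S, bw=BANDWIDTH_BPS):
    """Label the regime by whichever term is larger."""
    latency_term, bandwidth_term = ring_allreduce_terms(msg_bytes, nodes, alpha_s, bw)
    return "latency-bound" if latency_term > bandwidth_term else "bandwidth-bound"


def crossover_bytes(nodes, alpha_s=ALPHA_S, bw=BANDWIDTH_BPS):
    """Message size at which the two terms are equal: alpha * bandwidth * p."""
    return alpha_s * bw * nodes


if __name__ == "__main__":
    nodes = 16
    print(f"crossover at {crossover_bytes(nodes):.0f} bytes for {nodes} nodes")
    for msg in (1024, 1024**2, 1024**3):  # 1 KiB, 1 MiB, 1 GiB
        print(f"{msg:>12d} B -> {regime(msg, nodes)}")
```

Under this model the crossover size grows linearly with node count, so the same message can be bandwidth-bound on a small cluster and latency-bound on a large one. The actual crossover points in the sweep depend on the topology and on the collective algorithm ASTRA-sim selects, which this sketch does not capture.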
07 Jun 2026
When you train a large model across many accelerators, a surprising fraction of
wall-clock time is not spent doing math — it is spent in collective
communication: AllReduce to average gradients, AllGather to assemble
sharded tensors. How long those collectives take depends on three things that
interact in non-obvious ways: the message size, the number of nodes, and
the interconnect topology.