ASTRA-sim · Collective Scaling

Modeling AllReduce / AllGather Scaling in ASTRA-sim

A simulation study of collective communication for distributed ML: how AllReduce and AllGather latency scales across a multi-hop 2-D torus versus a one-hop switch fabric, and where each collective crosses from latency-bound to bandwidth-bound as a function of message size (1 KiB–1 GiB) and node count (4–64).

Built on ASTRA-sim with a reproducible Docker + Chakra + Python harness. The analytical backend sweeps the full 220-point grid; the ns-3 packet-level backend independently validates the regimes.

Headline: the torus's advantage over a switch is largest in the latency-bound regime at scale (up to 14× at 64 NPUs / 1 KiB) and collapses to ~1.1–1.5× in the bandwidth-bound regime at 1 GiB. Topology matters most exactly where hop count, not bytes, dominates.

Posts

Validating with the ns-3 Packet-Level Backend

The analytical results came from an idealized link model. It is fast and well suited to sweeping a 220-point grid, but it does not model packet-level congestion, PFC, or congestion control. ASTRA-sim’s ns-3 backend does. The question for this final post: do the regimes survive once packet-level protocol overhead is in the loop?

When Does Topology Matter? Scaling with Node Count

The previous post fixed the node count and swept message size. Here I do the opposite — fix the message size and sweep the node count — because that’s where the two regimes diverge most, and where topology choice either pays off enormously or barely matters.

Latency-Bound vs Bandwidth-Bound: The Two Regimes

In the overview I set up a sweep of AllReduce / AllGather across a switch and a 2-D torus, holding per-link physics identical. Here’s the first result: every collective lives in one of two regimes, and the message size decides which.
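
As a rough illustration of why message size picks the regime, here is a minimal sketch using the textbook alpha-beta cost model for a ring AllReduce. This is not ASTRA-sim's model or the configuration used in the study: the ring algorithm, the `Link` class, the function names, and the 1 µs / 100 GB/s link values are all assumptions chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical per-link parameters (assumed, not from the study)."""
    latency_s: float      # alpha: fixed cost per communication step, seconds
    bandwidth_Bps: float  # beta: sustained link bandwidth, bytes per second


def ring_allreduce_terms(msg_bytes, nodes, link):
    """Return (latency_term, bandwidth_term) in seconds for a ring AllReduce.

    A ring AllReduce runs 2 * (nodes - 1) steps (reduce-scatter followed by
    all-gather), and each step moves a chunk of msg_bytes / nodes.
    """
    steps = 2 * (nodes - 1)
    latency_term = steps * link.latency_s
    bandwidth_term = steps * (msg_bytes / nodes) / link.bandwidth_Bps
    return latency_term, bandwidth_term


def regime(msg_bytes, nodes, link):
    """Label the regime by whichever term of the cost model is larger."""
    latency_term, bandwidth_term = ring_allreduce_terms(msg_bytes, nodes, link)
    return "latency-bound" if latency_term >= bandwidth_term else "bandwidth-bound"


def crossover_bytes(nodes, link):
    """Message size where the two terms are equal: alpha * beta * nodes."""
    return link.latency_s * link.bandwidth_Bps * nodes


if __name__ == "__main__":
    link = Link(latency_s=1e-6, bandwidth_Bps=100e9)  # assumed values
    for size in (1024, 2**20, 2**30):                 # 1 KiB, 1 MiB, 1 GiB
        print(size, regime(size, nodes=64, link=link))
    print("crossover (bytes):", crossover_bytes(64, link))
```

In this simplified model the crossover size grows linearly with node count, so a message that is bandwidth-bound on a few nodes can be latency-bound on many. The simulated results depend on the actual topology and collective algorithm, which this sketch does not capture.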

Modeling Collective-Communication Scaling in ASTRA-sim

When you train a large model across many accelerators, a surprising fraction of wall-clock time is not spent doing math — it is spent in collective communication: AllReduce to average gradients, AllGather to assemble sharded tensors. How long those collectives take depends on three things that interact in non-obvious ways: the message size, the number of nodes, and the interconnect topology.