ASTRA-sim · Collective Scaling

Modeling Collective-Communication Scaling in ASTRA-sim

07 Jun 2026

When you train a large model across many accelerators, a surprising fraction of wall-clock time is not spent doing math — it is spent in collective communication: AllReduce to average gradients, AllGather to assemble sharded tensors. How long those collectives take depends on three things that interact in non-obvious ways: the message size, the number of nodes, and the interconnect topology.

This is the first in a short series modeling that interaction directly in ASTRA-sim — Georgia Tech and Intel’s distributed-ML network simulator. The question: as message size and node count vary, when is a collective latency-bound versus bandwidth-bound, and how does that boundary move between a multi-hop torus and a one-hop switch fabric?

TL;DR for the series

Why ASTRA-sim makes this clean

ASTRA-sim separates three concerns, which is exactly what lets us isolate topology:

Layer What it specifies How I varied it
Workload the collective + message size (Chakra execution trace) synthetic single-op AllReduce/AllGather traces, 1 KiB → 1 GiB
System the collective algorithm (ring, etc.) ring per dimension
Network the topology + per-link BW/latency switch (1 dim) vs 2-D torus (Ring×Ring)

Topologies, held to identical link physics (50 GB/s, 500 ns):

# switch (one hop, bandwidth shared across the collective)
topology:   [ Switch ]
npus_count: [ 16 ]
bandwidth:  [ 50.0 ]   # GB/s
latency:    [ 500.0 ]  # ns

# 2-D torus (a multi-hop ring mesh; here 4x4 = 16 NPUs)
topology:   [ Ring, Ring ]
npus_count: [ 4, 4 ]
bandwidth:  [ 50.0, 50.0 ]
latency:    [ 500.0, 500.0 ]

The stock workload generator only takes integer MB, which can’t reach the small-message latency-bound regime, so I wrote a bytes-based Chakra generator to sweep cleanly on a log scale from 1 KiB. The analytical runs use the congestion-unaware backend — the model built for multi-dimensional (hierarchical) topologies — so the same backend drives both fabrics. Collective latency is the max sys[i] finished cycle across ranks (the collective finishes when the slowest rank does).

Everything is reproducible from a single Docker image; the harness lives at github.com/kredd2506/Astro. The next post gets into the first result: the two regimes, and where torus separates from switch.

Tweet