ASTRA-sim · Collective Scaling

Validating with the ns-3 Packet-Level Backend

2026-06-10T00:00:00+00:00

The analytical results came from an idealized link model — fast, great for sweeping a 220-point grid, but it does not model packet-level congestion, PFC, or congestion control. ASTRA-sim’s ns-3 backend does. The question for this final post: do the regimes survive once real protocol overhead is in the loop?

The setup

I rebuilt the ns-3 backend (./ns3 configure --enable-mpi && ./ns3 build AstraSimNetwork) and ran a matched 8-node slice: a one-hop switch fabric vs a ring (a 1-D torus). Both are pinned to 400 Gbps / 500 ns links (400 Gbps = 50 GB/s, the same link speed as the analytical runs), so the only difference is fabric structure. The same Chakra workloads run through ns-3’s packet-level RDMA model across message sizes from 16 KiB to 16 MiB.

The regimes survive

ns-3 sits above the analytical model everywhere (left panel) — it pays for packet headers and the congestion-control ramp the idealized model ignores, and that overhead is relatively larger for small messages — but the shape is the same: a flat latency-bound floor, a bandwidth-bound ramp, and ring consistently below switch.

The clincher is the right panel: the ring-over-switch speedup shrinks from ~2× (latency-bound) toward ~1.1× (bandwidth-bound) in both backends, and the two curves track each other closely.

Size	switch (ns-3)	ring (ns-3)	ring-over-switch speedup (ns-3 / analytical)	regime
16 KiB	113.9 µs	57.3 µs	1.99× / 1.96×	latency-bound
16 MiB	784.6 µs	700.7 µs	1.12× / 1.05×	bandwidth-bound
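
The ns-3 speedup figures are the switch latency divided by the ring latency in each row. A quick check using the numbers from the table (the variable and function names are mine, not part of the harness):

```python
# ns-3 latencies copied from the table above, in microseconds.
ns3_latency_us = {
    "16 KiB": {"switch": 113.9, "ring": 57.3},
    "16 MiB": {"switch": 784.6, "ring": 700.7},
}


def ring_over_switch_speedup(row):
    """Speedup of ring over switch: switch time divided by ring time."""
    return row["switch"] / row["ring"]


speedups = {
    size: round(ring_over_switch_speedup(row), 2)
    for size, row in ns3_latency_us.items()
}

for size, speedup in speedups.items():
    print(f"{size}: {speedup:.2f}x")
```

The analytical speedups in the same column come from the analytical backend's own latencies, which are not listed in the table, so they are not recomputed here.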

A second, more detailed simulator independently reproduces the central finding — topology matters most when you’re latency-bound — which is exactly the confidence cross-backend validation is supposed to buy.

What I’d model next

3-D torus and fat-tree at 256–1024 NPUs, where diameter differences widen.
Algorithm × topology: halving-doubling and direct AllReduce, not just ring — the latency floor is set by the algorithm’s step count, so the regime map shifts.
ns-3 congestion at scale: where does PFC / back-pressure make the analytical model optimistic?

Code, configs, and all five figures: github.com/kredd2506/Astro.

When Does Topology Matter? Scaling with Node Count

2026-06-09T00:00:00+00:00

The previous post fixed the node count and swept message size. Here I do the opposite — fix the message size and sweep the node count — because that’s where the two regimes diverge most, and where topology choice either pays off enormously or barely matters.

Scaling with node count depends on the regime

This is the part that bites in practice.

Latency-bound (left, 4 KB): the switch’s latency grows almost linearly with N (24 → 57 → 121 → 251 → 509 µs from 4 to 64 NPUs) because ring-on-switch executes N−1 sequential hops. The torus grows far more slowly (steps scale with the longest dimension, ~√N).

Bandwidth-bound (right, 256 MB): both fabrics flatten out. Adding nodes barely changes time, because the per-node data volume of a ring collective is nearly independent of N, and the gap narrows to a constant factor.
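
To see why the bandwidth-bound curves flatten, here is a minimal sketch of the per-node traffic in a textbook ring AllReduce (reduce-scatter followed by all-gather). It assumes the standard chunking of one message/N chunk per step. ASTRA-sim's implementation may differ in detail, and the function name is mine:

```python
def ring_allreduce_bytes_per_node(message_bytes, n):
    """Bytes each node sends in a textbook ring AllReduce.

    Reduce-scatter and all-gather each take n - 1 steps, and every step
    moves one chunk of message_bytes / n.
    """
    return 2 * (n - 1) * message_bytes / n


message_bytes = 256 * 2**20  # 256 MiB; binary units are an assumption here

factors = {
    n: ring_allreduce_bytes_per_node(message_bytes, n) / message_bytes
    for n in (4, 8, 16, 32, 64)
}

for n, factor in factors.items():
    print(f"N={n:2d}: each node sends {factor:.3f} x the message size")
```

The factor 2(N−1)/N only moves from 1.5 at 4 nodes to just under 2 at 64, which is why adding nodes barely changes the bandwidth-bound time.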

One picture: when does topology matter?

The figure shows torus speedup over switch for AllReduce in every (message size, node count) cell. The story is a gradient:

Top-left (large messages): ~1.1–1.8×. Bandwidth-bound — topology is a second-order effect; you’re paying for bytes either way.
Bottom-right (small messages, many nodes): up to 14.2×. Latency-bound at scale — topology is everything, because hop count is what you’re paying for, and that’s exactly where torus and switch differ most (N−1 sequential switch hops vs ~2(√N−1) on the torus).
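
The two step counts quoted above can be tabulated directly. This is only a sketch of the arithmetic, assuming a square 2-D torus and taking the N−1 and ~2(√N−1) expressions at face value. It is not how ASTRA-sim computes anything, and the function names are hypothetical:

```python
import math


def switch_ring_steps(n):
    """Sequential steps quoted in the text for ring-on-switch: N - 1."""
    return n - 1


def torus_steps(n):
    """Approximate steps quoted for a square 2-D torus: 2 * (sqrt(N) - 1)."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch only handles square tori")
    return 2 * (side - 1)


step_table = {n: (switch_ring_steps(n), torus_steps(n)) for n in (4, 16, 64)}

for n, (switch, torus) in step_table.items():
    print(f"N={n:2d}: switch {switch:2d} steps, torus {torus:2d} steps, "
          f"ratio {switch / torus:.2f}")
```

The ratio of these counts grows with N, which is the direction of the gradient in the figure. It is not a substitute for the simulated speedups: the measured values come from the simulator, not from this arithmetic.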

The takeaway for system design: if your collectives are small and frequent (latency-bound — small gradients, frequent syncs, large clusters), fabric topology dominates and a low-diameter mesh pays off enormously. If they’re large and infrequent (bandwidth-bound — big tensors), you are buying raw link bandwidth and the topology choice matters far less.

The analytical backend is an idealized link model, though. The final post checks these regimes against ASTRA-sim’s ns-3 packet-level backend.

Latency-Bound vs Bandwidth-Bound: The Two Regimes

2026-06-08T00:00:00+00:00

In the overview I set up a sweep of AllReduce / AllGather across a switch and a 2-D torus, holding per-link physics identical. Here’s the first result: every collective lives in one of two regimes, and the message size decides which.

The two regimes, and where torus separates from switch

Read each curve left to right. On the left, latency is flat — doubling a tiny message barely moves it, because time is dominated by the fixed per-step link latency, not the payload. This is the latency-bound regime. On the right, every curve becomes a straight slope-1 line on log-log axes: latency is now proportional to bytes, i.e. bandwidth-bound.
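
One way to read the regime off a measured curve is the local slope on log-log axes: near 0 means latency-bound, near 1 means bandwidth-bound. The sketch below applies that to a toy curve (fixed per-step latency plus bytes ÷ bandwidth). The toy model, the 6-step count, and the 0.5 threshold are illustrative assumptions, not values from the sweep:

```python
import math


def toy_time(message_bytes, steps=6, latency_s=500e-9, bandwidth_Bps=50e9):
    """Toy model: fixed per-step latency plus serialization time."""
    return steps * latency_s + message_bytes / bandwidth_Bps


def local_slopes(sizes, times):
    """Slope of log(time) against log(size) between neighbouring points."""
    points = list(zip(sizes, times))
    return [
        (math.log(t1) - math.log(t0)) / (math.log(s1) - math.log(s0))
        for (s0, t0), (s1, t1) in zip(points, points[1:])
    ]


sizes = [2**e for e in range(10, 31, 2)]  # 1 KiB .. 1 GiB in 4x steps
times = [toy_time(s) for s in sizes]
slopes = local_slopes(sizes, times)

for size, slope in zip(sizes[1:], slopes):
    regime = "latency-bound" if slope < 0.5 else "bandwidth-bound"
    print(f"up to {size:>10d} B: slope {slope:.2f} ({regime})")
```

The same `local_slopes` helper works on real (size, latency) pairs from a sweep. Only the toy curve is made up.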

The two topologies sit on top of each other while latency-bound (same number of algorithm steps), then the torus pulls clearly below the switch as messages grow — its 2 links/node give more aggregate bandwidth than the switch’s shared fabric. At 16 NPUs, AllReduce crosses from latency- to bandwidth-bound around ~1 MB on the torus but only around ~4 MB on the switch: the switch’s higher latency floor keeps it latency-bound longer.

The same data as effective bandwidth

Plotting delivered bandwidth (bytes ÷ time) makes the transition tangible. Small messages waste the fabric — almost all the time is latency, so effective bandwidth is near zero. As messages grow, each curve climbs and saturates toward a topology-dependent roofline. The torus’s roofline is higher; the switch saturates lower. The knee of this curve is the latency→bandwidth crossover from the first figure.

The practical reading: there is a minimum message size below which you simply cannot use your fabric efficiently, and that threshold is higher on the switch. If your collectives are smaller than the knee, you’re paying for latency and buying more bandwidth won’t help.
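
A minimal sketch of the knee, using a toy latency-plus-bandwidth model rather than the simulator. It assumes the knee is the size where the latency term equals the bandwidth term, which is where effective bandwidth reaches half of the roofline. The step count is hypothetical, and in this toy the roofline is just the link bandwidth:

```python
def toy_time(message_bytes, steps, latency_s, bandwidth_Bps):
    """Toy model: steps x per-step latency, plus bytes / bandwidth."""
    return steps * latency_s + message_bytes / bandwidth_Bps


def effective_bandwidth(message_bytes, seconds):
    """Delivered bandwidth: bytes divided by time."""
    return message_bytes / seconds


STEPS = 6            # hypothetical step count, for illustration only
LATENCY_S = 500e-9   # 500 ns per step
LINK_BPS = 50e9      # 50 GB/s

# Size at which the latency term equals the bandwidth term.
knee_bytes = STEPS * LATENCY_S * LINK_BPS
bw_at_knee = effective_bandwidth(
    knee_bytes, toy_time(knee_bytes, STEPS, LATENCY_S, LINK_BPS)
)

for size in (1024, round(knee_bytes), 2**30):
    bw = effective_bandwidth(size, toy_time(size, STEPS, LATENCY_S, LINK_BPS))
    print(f"{size:>10d} B: {bw / 1e9:6.2f} GB/s")
```

In this model the knee scales with the step count, so a fabric that needs more sequential steps has a larger minimum efficient message size. The toy's absolute crossover values should not be expected to match the figures.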

Next: what happens as you scale the node count — where the two regimes diverge most, and a single picture of when topology actually matters.

Modeling Collective-Communication Scaling in ASTRA-sim

2026-06-07T00:00:00+00:00

When you train a large model across many accelerators, a surprising fraction of wall-clock time is not spent doing math — it is spent in collective communication: AllReduce to average gradients, AllGather to assemble sharded tensors. How long those collectives take depends on three things that interact in non-obvious ways: the message size, the number of nodes, and the interconnect topology.

This is the first in a short series modeling that interaction directly in ASTRA-sim — Georgia Tech and Intel’s distributed-ML network simulator. The question: as message size and node count vary, when is a collective latency-bound versus bandwidth-bound, and how does that boundary move between a multi-hop torus and a one-hop switch fabric?

TL;DR for the series

I swept AllReduce and AllGather over 5 node counts (4–64) × 11 message sizes (1 KiB–1 GiB) × 2 topologies on ASTRA-sim’s analytical backend (220 runs), holding per-link bandwidth (50 GB/s) and latency (500 ns) identical so the only variable is fabric structure.
Every curve shows the same two regimes: a flat latency-bound floor for small messages (time ≈ algorithm steps × per-link latency) and a slope-1 bandwidth-bound ramp for large messages (time ≈ bytes ÷ link bandwidth).
The torus’s advantage is a strong function of regime: up to 14× latency-bound at scale, collapsing to ~1.1–1.5× bandwidth-bound. (See the posts on the full size sweep and on scaling with nodes.)
An ns-3 packet-level run independently reproduces those regimes. (See the ns-3 validation post.)
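
For orientation, the 220-run grid can be enumerated in a few lines. This sketch assumes power-of-two node counts and 4× spacing between message sizes, which is one way to get 5 counts from 4 to 64 and 11 sizes from 1 KiB to 1 GiB. The topology labels are mine:

```python
from itertools import product

collectives = ["AllReduce", "AllGather"]
node_counts = [4, 8, 16, 32, 64]                   # 5 counts (assumed powers of two)
message_sizes = [2**e for e in range(10, 31, 2)]   # 1 KiB .. 1 GiB (assumed 4x spacing)
topologies = ["switch", "torus2d"]                 # labels are illustrative

runs = list(product(collectives, node_counts, message_sizes, topologies))

print(len(message_sizes), "sizes from", message_sizes[0], "to", message_sizes[-1], "bytes")
print(len(runs), "runs")
```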

Why ASTRA-sim makes this clean

ASTRA-sim separates three concerns, which is exactly what lets us isolate topology:

Layer	What it specifies	How I varied it
Workload	the collective + message size (Chakra execution trace)	synthetic single-op `AllReduce`/`AllGather` traces, 1 KiB → 1 GiB
System	the collective algorithm (ring, etc.)	ring per dimension
Network	the topology + per-link BW/latency	switch (1 dim) vs 2-D torus (Ring×Ring)

Topologies, held to identical link physics (50 GB/s, 500 ns):

# switch (one hop, bandwidth shared across the collective)
topology:   [ Switch ]
npus_count: [ 16 ]
bandwidth:  [ 50.0 ]   # GB/s
latency:    [ 500.0 ]  # ns

# 2-D torus (a multi-hop ring mesh; here 4x4 = 16 NPUs)
topology:   [ Ring, Ring ]
npus_count: [ 4, 4 ]
bandwidth:  [ 50.0, 50.0 ]
latency:    [ 500.0, 500.0 ]
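
When sweeping node counts, it helps to generate these files rather than edit them by hand. A minimal sketch that renders the same four keys shown above; the helper name and its defaults are mine, and it does nothing beyond string formatting:

```python
def network_yaml(dims, npus_per_dim, bandwidth_GBps=50.0, latency_ns=500.0):
    """Render a network config in the same shape as the two examples above."""
    if len(dims) != len(npus_per_dim):
        raise ValueError("need one NPU count per dimension")

    def flow(values):
        # YAML flow sequence, e.g. [ 4, 4 ]
        return "[ " + ", ".join(str(v) for v in values) + " ]"

    n_dims = len(dims)
    lines = [
        f"topology: {flow(dims)}",
        f"npus_count: {flow(npus_per_dim)}",
        f"bandwidth: {flow([bandwidth_GBps] * n_dims)}  # GB/s",
        f"latency: {flow([latency_ns] * n_dims)}  # ns",
    ]
    return "\n".join(lines) + "\n"


switch_cfg = network_yaml(["Switch"], [16])
torus_cfg = network_yaml(["Ring", "Ring"], [4, 4])
print(torus_cfg)
```

Using the same bandwidth and latency defaults for every dimension keeps the link physics identical across both fabrics, which is the point of the comparison.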

The stock workload generator only takes integer MB, which can’t reach the small-message latency-bound regime, so I wrote a bytes-based Chakra generator to sweep cleanly on a log scale from 1 KiB. The analytical runs use the congestion-unaware backend, the model built for multi-dimensional (hierarchical) topologies, so the same backend drives both fabrics. Collective latency is the maximum `sys[i] finished` cycle count across ranks (the collective finishes when the slowest rank does).
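
Extracting that maximum from a run's output can be sketched as below. The log line shape in the regular expression is an assumption about what the simulator prints, and the sample numbers are made up for illustration; adjust the pattern to match your build's actual output:

```python
import re

# Assumed line shape, e.g. "sys[3] finished, 57300 cycles".
FINISHED = re.compile(r"sys\[(\d+)\] finished, (\d+) cycles")


def collective_latency_cycles(log_text):
    """Return the largest per-rank finish time found in the log."""
    per_rank = {int(rank): int(cycles) for rank, cycles in FINISHED.findall(log_text)}
    if not per_rank:
        raise ValueError("no 'sys[i] finished' lines found")
    return max(per_rank.values())


# Made-up sample output, not from a real run.
sample_log = """\
sys[0] finished, 57120 cycles
sys[1] finished, 57300 cycles
sys[2] finished, 57250 cycles
"""

latency = collective_latency_cycles(sample_log)
print(latency)
```

Raising on an empty match is deliberate: a run that crashed before any rank finished should not silently report a latency of zero.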

Everything is reproducible from a single Docker image; the harness lives at github.com/kredd2506/Astro. The next post gets into the first result: the two regimes, and where torus separates from switch.