This site documents a simulation study of collective-communication scaling for distributed ML, built on ASTRA-sim. It models AllReduce and AllGather across torus and switch topologies on the analytical and ns-3 backends, comparing latency-bound vs bandwidth-bound behavior as a function of message size and node count.
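As a rough illustration of the latency-bound vs bandwidth-bound distinction, the sketch below uses the textbook alpha-beta cost model for a ring AllReduce. It is not ASTRA-sim's implementation or the study's method; the function names and the `alpha_s` (per-step latency) and `beta_s_per_byte` (inverse link bandwidth) values are hypothetical placeholders chosen only to show how the regime shifts with message size and node count.

```python
# Alpha-beta sketch of ring AllReduce cost (an assumption for illustration,
# not the simulator's model). All names and parameter values are hypothetical.

def ring_allreduce_time(msg_bytes, nodes, alpha_s, beta_s_per_byte):
    """Estimated time in seconds for a ring AllReduce of msg_bytes over nodes."""
    if nodes < 2:
        return 0.0  # a single node has nothing to communicate
    steps = 2 * (nodes - 1)  # reduce-scatter phase plus all-gather phase
    latency_term = steps * alpha_s
    # Each step moves one chunk of msg_bytes / nodes over a link.
    bandwidth_term = steps * (msg_bytes / nodes) * beta_s_per_byte
    return latency_term + bandwidth_term


def crossover_bytes(nodes, alpha_s, beta_s_per_byte):
    """Message size at which the latency and bandwidth terms are equal."""
    return nodes * alpha_s / beta_s_per_byte


def regime(msg_bytes, nodes, alpha_s, beta_s_per_byte):
    """Classify a message size as latency-bound or bandwidth-bound."""
    if msg_bytes < crossover_bytes(nodes, alpha_s, beta_s_per_byte):
        return "latency-bound"
    return "bandwidth-bound"


if __name__ == "__main__":
    ALPHA_S = 1e-6          # hypothetical 1 microsecond per-step latency
    BETA_S_PER_BYTE = 1e-9  # hypothetical 1 GB/s link (1 ns per byte)
    for nodes in (8, 64):
        for size in (1 << 10, 1 << 20, 1 << 30):
            t = ring_allreduce_time(size, nodes, ALPHA_S, BETA_S_PER_BYTE)
            label = regime(size, nodes, ALPHA_S, BETA_S_PER_BYTE)
            print(f"nodes={nodes:3d} bytes={size:11d} time={t:.6e}s {label}")
```

Under this model the crossover size grows linearly with node count, so a message that is bandwidth-bound on a small ring can become latency-bound on a larger one.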
The full harness — a Docker build, a bytes-based Chakra workload generator, the sweep runner, and the plotting code — is reproducible and open: github.com/kredd2506/Astro.
— Manish Reddy