The paper formalizes a 3×3 experimental design: three representative ML workloads evaluated across three hardware backends, yielding nine distinct cells, each with defined KPIs.
The matrix
| | TPU v6e | NVIDIA H200 | AMD MI300X |
|---|---|---|---|
| Task A: LLM Inference (Llama 3.1 70B) | TTFT, TPS, $/tok | TTFT, TPS, $/tok | TTFT, TPS, $/tok |
| Task B: Dense Training (ViT / ResNet-50) | TFLOPS, TTC | TFLOPS, TTC | TFLOPS, TTC |
| Task C: Sparse Training (DLRM v2) | SparseCore util, BW | BW, throughput | BW, throughput |
FP8 is an experimental variable on H200 and MI300X. Structured 2:4 sparsity is evaluated where hardware support is available.
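The $/tok KPI for Task A can be derived from measured throughput and instance pricing. A minimal sketch; the throughput and hourly price below are hypothetical placeholders, not measurements from the paper:

```python
def cost_per_million_tokens(tokens_per_second: float, hourly_price_usd: float) -> float:
    """Convert sustained throughput and an instance's hourly price into $/1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Hypothetical numbers: 1,200 tok/s sustained on a $10.00/hr instance.
print(round(cost_per_million_tokens(1200, 10.00), 3))  # $/1M tokens ≈ 2.315
```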
Four-step experimental protocol
- Environment and baseline standardization. Unified JAX + PJRT stack. Input dataset sizes exceed host memory so the benchmark actually measures data movement, not cache hits. Accuracy is validated against reference metrics; per-operation numerical tolerances are documented (XLA's `erf` lowering, for instance, can diverge by several ULPs depending on scalar vs. vectorized code paths).
- HLO capture and isolated micro-benchmarking. `XLA_FLAGS="--xla_dump_to=/tmp/experiment"` captures pre- and post-optimization graphs; `hlo-opt` isolates specific passes (e.g., `ReshapeMover`) to measure their impact across backends.
- End-to-end performance measurement. XProf traces capture TTFT, tokens/s, and time-to-convergence. Power is sampled at 1-second intervals via `nvidia-smi --query-gpu=power.draw`, `rocm-smi --showpower`, and the Google Cloud Monitoring API for TPU. Measurement stability requires a coefficient of variation <5% across trials (following Sada et al. 2025). Energy efficiency (tokens/s/W) is reported alongside throughput and latency for all 9 cells.
- Bottleneck attribution. XProf's Trace Viewer correlates accelerator idle gaps with host-side preprocessing. GFLOPS/s for critical kernels (e.g., matmul) are compared across architectures. Results feed a comparative TCO analysis.
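The <5% coefficient-of-variation stability gate from the measurement step is simple to check programmatically. A minimal sketch in plain Python; the sample throughput values are illustrative only:

```python
import statistics

def is_stable(samples: list[float], max_cv: float = 0.05) -> bool:
    """Accept a trial set only if stdev/mean (coefficient of variation) is below the gate."""
    cv = statistics.stdev(samples) / statistics.mean(samples)
    return cv < max_cv

# Illustrative tokens/s measurements across repeated trials.
trials = [1180.0, 1205.0, 1192.0, 1210.0, 1198.0]
print(is_stable(trials))  # True: these trials vary by well under 5%
```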
Compiler version sensitivity
All experiments pin a specific XLA commit hash. Individual commits can materially shift results—AMD shuffle promotion to DPP, collective permute barrier placement, cost-model corrections for integer GEMMs. Reproduction efforts should use the same commit or document deviations.
Statistical protocol
- 10 warm-up iterations (discarded).
- ≥30 timed iterations, extended until the 95% CI for the primary metric is within ±3% of the sample mean.
- Report median, mean, standard deviation, and 95% CI.
- Report both wall-clock and device-only time.
- Record XLA commit hash, PJRT plugin version, driver version, cloud instance type, HBM temperature at run start, co-tenancy warnings.
- Repeat across ≥2 independent cloud instances (hardware lottery control).
- Runs >3σ from the mean are flagged and excluded from primary statistics.
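The stopping rule and outlier flag above can be sketched directly. A minimal Python version using a normal approximation for the 95% CI (reasonable at n ≥ 30); function names are illustrative, not from the paper:

```python
import statistics

def ci_halfwidth_95(samples: list[float]) -> float:
    """Half-width of the 95% confidence interval for the mean (normal approximation)."""
    return 1.96 * statistics.stdev(samples) / len(samples) ** 0.5

def needs_more_iterations(samples: list[float]) -> bool:
    """Extend the run until the 95% CI is within ±3% of the sample mean."""
    return ci_halfwidth_95(samples) > 0.03 * statistics.mean(samples)

def flag_outliers(samples: list[float]) -> list[float]:
    """Runs more than 3 sigma from the sample mean are flagged for exclusion."""
    mu, sigma = statistics.mean(samples), statistics.stdev(samples)
    return [x for x in samples if abs(x - mu) > 3 * sigma]
```

In practice the timed loop would call `needs_more_iterations` after each batch of trials, then report statistics on the sample with flagged runs excluded.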
Three testable hypotheses
- H1 (Abstraction Overhead). StableHLO/PJRT overhead vs. native paths is <5% for compute-bound workloads (dense matmul) but potentially >15% for memory-bound workloads (sparse embedding lookups) where data movement is less amenable to fusion.
- H2 (Optimization Transferability). Target-independent optimizations—particularly fusion—transfer unevenly. Largest gains on TPU v6e (systolic dataflow benefits most from reduced memory round-trips); smaller on MI300X (5.3 TB/s HBM partially masks the memory wall). Confounder: XLA’s internal cost model has known inaccuracies for integer GEMMs (up to 576% prediction error at medium shapes).
- H3 (SPMD Scaling Efficiency). Shardy-partitioned SPMD scales sub-linearly across multi-node configs. Efficiency gap widest on NVIDIA (NVLink → InfiniBand transition); narrowest on TPU pods (ICI torus provides uniform bisection bandwidth).
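H3's sub-linear scaling claim is conventionally quantified as throughput at N devices relative to N times single-device throughput. A minimal sketch of that efficiency metric; the throughput numbers are hypothetical:

```python
def scaling_efficiency(throughput_1: float, throughput_n: float, n_devices: int) -> float:
    """Fraction of ideal linear scaling achieved at n_devices (1.0 = perfectly linear)."""
    return throughput_n / (n_devices * throughput_1)

# Hypothetical: 1,000 tok/s on one device, 6,800 tok/s across eight.
print(scaling_efficiency(1000.0, 6800.0, 8))  # 0.85
```

Under H3, this ratio would drop more sharply on NVIDIA multi-node configurations (crossing the NVLink/InfiniBand boundary) than on TPU pods.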
Limitations (deliberate)
- No Groq LPU, Cerebras WSE, or Intel Gaudi 3 (different paradigms).
- No fine-tuning, RL, or MoE workloads (different compilation challenges).
- JAX only as frontend (cross-framework confounds avoided).
- Point-in-time snapshot at a specific XLA commit.
Full methodology and reporting schema: the paper, Sections 8–9.