The 3×3 study spans three distinct accelerator paradigms. Each interacts with XLA’s fusion, buffer analysis, and partitioning strategies differently.
Google Cloud TPU v6e (Trillium) — systolic array
Google’s TPUs are specialized matrix processors built around systolic arrays: grids of multiply-accumulate (MAC) units that process matrix operations in a weight-stationary dataflow (a toy model follows the list below).
- 256×256 systolic array, 65,536 MAC ops per cycle (up from 128×128 on v5).
- Optimized for Transformers: ~2.8× performance and ~2.1× perf/watt over the prior generation.
- SparseCore: a dataflow processor dedicated to sparse ops such as recommendation-system embedding lookups. SparseCores operate in parallel with TensorCores, enabling pipelined dense/sparse computation.
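To make “weight-stationary” concrete, here is a toy NumPy model (our illustration, not TPU code): each cell of the array holds one weight for the entire computation while activations stream past and partial sums accumulate in place.

```python
import numpy as np

def weight_stationary_matmul(x, w):
    """Toy model of a weight-stationary systolic matmul.

    Cell (i, j) of the array holds w[i, j] for the whole computation;
    activations stream through and partial sums accumulate in place.
    This captures the dataflow of a TPU MXU, minus the pipelining and
    input skewing of a real systolic array.
    """
    rows, cols = w.shape                      # 256x256 on v6e
    acc = np.zeros((x.shape[0], cols), dtype=np.float32)
    for i in range(rows):
        # Every cell in row i fires one multiply-accumulate this step:
        # 65,536 MACs per cycle on the real 256x256 array.
        acc += np.outer(x[:, i], w[i, :])
    return acc

x = np.random.randn(8, 256).astype(np.float32)
w = np.random.randn(256, 256).astype(np.float32)
assert np.allclose(weight_stationary_matmul(x, w), x @ w, atol=1e-2)
```

Because the weights never move during the pass, memory traffic is dominated by streaming activations and results rather than by refetching the weight matrix.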
NVIDIA H200 (Hopper) — SIMT
NVIDIA’s integration leverages the maturity of the CUDA ecosystem. Two features stand out in the paper’s context:
- NVSHMEM (PGAS-style communication) integrated with XLA delivers up to 36% speedup over NCCL at sequence lengths up to 256K tokens.
- CUDA Graphs support in XLA amortizes kernel launch overhead by capturing operation sequences into a single graph executable, which is particularly valuable for models with many small, short-lived ops (see the sketch below).
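A rough JAX sketch of the workload shape that benefits, with a hedge: the XLA flag shown (`--xla_gpu_enable_command_buffer`, XLA's command-buffer mechanism for graph capture) has varied across releases, so treat it as illustrative and check your build's flag list.

```python
import os
# Illustrative only: XLA's CUDA-graph capture is controlled via XLA_FLAGS,
# but the exact flag has changed across releases. Verify against your build.
os.environ.setdefault("XLA_FLAGS", "--xla_gpu_enable_command_buffer=FUSION")

import jax
import jax.numpy as jnp

@jax.jit
def layer_stack(x, ws):
    # Fifty tiny matmuls: each lowers to its own GEMM kernel, so launch
    # overhead rivals the arithmetic. Graph capture replays the whole
    # sequence as one executable instead of ~50 individual launches.
    for w in ws:
        x = jnp.tanh(x @ w)
    return x

ws = [jnp.eye(64) * 0.1 for _ in range(50)]
x = jnp.ones((64, 64))
layer_stack(x, ws).block_until_ready()  # first call compiles and captures
```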
AMD Instinct MI300X — CDNA
AMD, a founding OpenXLA member, integrates via the ROCm stack:
- CDNA 3 with Matrix Core Technology; supports FP8 and MXFP4.
- 64-wide wavefronts (NVIDIA uses 32-wide warps); each compute unit has dedicated Matrix Cores for accelerated GEMM.
- RCCL (AMD’s fork of NCCL) runs over Infinity Fabric rather than NVLink.
- Intra-wavefront primitives: DPP and ds_swizzle instead of NVIDIA's warp shuffle. Recent XLA contributions promote generic `gpu.shuffle` operations to AMD-specific DPP register-to-register instructions for reduction kernels, eliminating shared-memory overhead (modeled below).
Meta has deployed MI300X for Llama 3 and Llama 4 inference.
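To see what the shuffle-to-DPP promotion buys, here is a small Python model (ours) of an intra-wavefront tree reduction done entirely in registers; each step's lane exchange stands in for a warp-shuffle on NVIDIA or a DPP lane permute on AMD, and no shared memory is touched.

```python
import numpy as np

def wavefront_tree_reduce(lanes):
    """Model an in-register reduction across one wavefront.

    `lanes` holds one value per lane (64 on CDNA, 32 on a CUDA warp).
    Each step, lane i adds the value from lane i + offset -- the job of
    shuffle-down on NVIDIA, or a DPP permute on AMD -- halving the
    active lanes until lane 0 holds the full sum.
    """
    vals = np.asarray(lanes, dtype=np.float64).copy()
    offset = len(vals) // 2            # lane count must be a power of two
    while offset > 0:
        vals[:offset] += vals[offset:2 * offset]
        offset //= 2
    return vals[0]                     # log2(64) = 6 exchange+add steps

assert wavefront_tree_reduce(range(64)) == sum(range(64))
```

The shared-memory alternative round-trips each partial sum through LDS with the attendant barriers; keeping the reduction in registers is exactly what the DPP promotion avoids.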
Specs at a glance
| Specification | TPU v6e (Trillium) | NVIDIA H200 | AMD MI300X |
|---|---|---|---|
| Peak BF16 Compute (per chip) | 918 TFLOPS | 989 TFLOPS | 1,307 TFLOPS |
| HBM Capacity | 32 GB | 141 GB | 192 GB |
| HBM Bandwidth | 1.6 TB/s | 4.8 TB/s | 5.3 TB/s |
| Interconnect | 2D Torus (ICI) | NVLink | Infinity Fabric |
| Pod / Node Scale | 256 chips | 8 GPUs (DGX) | 8 GPUs |
| Sparse Acceleration | SparseCore (2/chip) | Structured Sparsity | Sparse Matrix Ops |
The cross-paradigm comparison is the point: a single compiler adapting to three fundamentally different hardware philosophies. Hypothesis H2 predicts that the nominally target-independent passes (fusion, CSE) will not benefit these architectures equally.
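One way to probe H2 yourself is to dump the optimized (post-fusion) HLO that XLA produces on each backend and compare what the "same" passes did. A minimal JAX probe (the function and shapes are our own example):

```python
import jax
import jax.numpy as jnp

def mlp_block(x, w1, w2):
    # Elementwise ReLU sandwiched between two matmuls: a canonical
    # fusion candidate for XLA on any backend.
    return jnp.maximum(x @ w1, 0.0) @ w2

x  = jnp.ones((128, 512))
w1 = jnp.ones((512, 2048))
w2 = jnp.ones((2048, 512))

# Optimized HLO for whatever backend this process targets; run the same
# script on TPU, CUDA, and ROCm builds and diff the fusion decisions.
hlo = jax.jit(mlp_block).lower(x, w1, w2).compile().as_text()
print(hlo.count("fusion"), "fusion computations")
```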
See Section 5 of the paper and the extended TPU v5p vs. v6e comparison in Appendix A.