The 3×3 study spans three distinct accelerator paradigms. Each interacts with XLA’s fusion, buffer analysis, and partitioning strategies differently.
Google Cloud TPU v6e (Trillium) — systolic array
Google’s TPUs are specialized matrix processors built around systolic arrays: grids of multiply-accumulate (MAC) units that process matrix operations in a weight-stationary dataflow (a toy model follows the list below).
- 256×256 systolic array, 65,536 MAC ops per cycle (up from 128×128 on v5).
- Optimized for Transformers: ~2.8× performance and ~2.1× perf/watt over the prior generation.
- SparseCore: a dataflow processor dedicated to sparse ops such as recommendation-system embedding lookups. SparseCores operate in parallel with TensorCores, enabling pipelined dense/sparse computation.
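To make “weight-stationary” concrete, here is a toy NumPy model (our illustration, not TPU code): each cell of the array holds one weight for the entire computation while activations stream past and partial sums accumulate in place.

```python
import numpy as np

def weight_stationary_matmul(x, w):
    """Toy model of a weight-stationary systolic matmul.

    Cell (i, j) of the array holds w[i, j] for the whole computation;
    activations stream through and partial sums accumulate in place.
    This captures the dataflow of a TPU MXU, minus the pipelining and
    input skewing of a real systolic array.
    """
    rows, cols = w.shape                      # 256x256 on v6e
    acc = np.zeros((x.shape[0], cols), dtype=np.float32)
    for i in range(rows):
        # Every cell in row i fires one multiply-accumulate this step:
        # 65,536 MACs per cycle on the real 256x256 array.
        acc += np.outer(x[:, i], w[i, :])
    return acc

x = np.random.randn(8, 256).astype(np.float32)
w = np.random.randn(256, 256).astype(np.float32)
assert np.allclose(weight_stationary_matmul(x, w), x @ w, atol=1e-2)
```

Because the weights never move during the pass, memory traffic is dominated by streaming activations and results rather than by refetching the weight matrix.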
NVIDIA H200 (Hopper) — SIMT
NVIDIA’s integration leverages the maturity of the CUDA ecosystem. Two features stand out in the paper’s context:
- NVSHMEM (PGAS-style communication) integrated with XLA delivers up to 36% speedup over NCCL at sequence lengths up to 256K tokens.
- CUDA Graphs support in XLA amortizes kernel launch overhead by capturing operation sequences into a single graph executable, which is particularly valuable for models with many small, short-lived ops (see the sketch below).
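A rough JAX sketch of the workload shape that benefits, with a hedge: the XLA flag shown (`--xla_gpu_enable_command_buffer`, XLA's command-buffer mechanism for graph capture) has varied across releases, so treat it as illustrative and check your build's flag list.

```python
import os
# Illustrative only: XLA's CUDA-graph capture is controlled via XLA_FLAGS,
# but the exact flag has changed across releases. Verify against your build.
os.environ.setdefault("XLA_FLAGS", "--xla_gpu_enable_command_buffer=FUSION")

import jax
import jax.numpy as jnp

@jax.jit
def layer_stack(x, ws):
    # Fifty tiny matmuls: each lowers to its own GEMM kernel, so launch
    # overhead rivals the arithmetic. Graph capture replays the whole
    # sequence as one executable instead of ~50 individual launches.
    for w in ws:
        x = jnp.tanh(x @ w)
    return x

ws = [jnp.eye(64) * 0.1 for _ in range(50)]
x = jnp.ones((64, 64))
layer_stack(x, ws).block_until_ready()  # first call compiles and captures
```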
AMD Instinct MI300X — CDNA
AMD, a founding OpenXLA member, integrates via the ROCm stack:
- CDNA 3 with Matrix Core Technology; supports FP8 and MXFP4.
- 64-wide wavefronts (NVIDIA uses 32-wide warps); each compute unit has dedicated Matrix Cores for accelerated GEMM.
- RCCL (AMD’s fork of NCCL) runs over Infinity Fabric rather than NVLink.
- Intra-wavefront primitives: DPP and ds_swizzle instead of NVIDIA's warp shuffle. Recent XLA contributions promote generic `gpu.shuffle` operations to AMD-specific DPP register-to-register instructions for reduction kernels, eliminating shared-memory overhead (modeled below).
Meta has deployed MI300X for Llama 3 and Llama 4 inference.
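To see what the shuffle-to-DPP promotion buys, here is a small Python model (ours) of an intra-wavefront tree reduction done entirely in registers; each step's lane exchange stands in for a warp-shuffle on NVIDIA or a DPP lane permute on AMD, and no shared memory is touched.

```python
import numpy as np

def wavefront_tree_reduce(lanes):
    """Model an in-register reduction across one wavefront.

    `lanes` holds one value per lane (64 on CDNA, 32 on a CUDA warp).
    Each step, lane i adds the value from lane i + offset -- the job of
    shuffle-down on NVIDIA, or a DPP permute on AMD -- halving the
    active lanes until lane 0 holds the full sum.
    """
    vals = np.asarray(lanes, dtype=np.float64).copy()
    offset = len(vals) // 2            # lane count must be a power of two
    while offset > 0:
        vals[:offset] += vals[offset:2 * offset]
        offset //= 2
    return vals[0]                     # log2(64) = 6 exchange+add steps

assert wavefront_tree_reduce(range(64)) == sum(range(64))
```

The shared-memory alternative round-trips each partial sum through LDS with the attendant barriers; keeping the reduction in registers is exactly what the DPP promotion avoids.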
Specs at a glance
| Specification | TPU v6e (Trillium) | NVIDIA H200 | AMD MI300X |
|---|---|---|---|
| Peak BF16 Compute (per chip) | 918 TFLOPS | 989 TFLOPS | 1,307 TFLOPS |
| HBM Capacity | 32 GB | 141 GB | 192 GB |
| HBM Bandwidth | 1.6 TB/s | 4.8 TB/s | 5.3 TB/s |
| Interconnect | 2D Torus (ICI) | NVLink | Infinity Fabric |
| Pod / Node Scale | 256 chips | 8 GPUs (DGX) | 8 GPUs |
| Sparse Acceleration | SparseCore (2/chip) | Structured Sparsity | Sparse Matrix Ops |
The cross-paradigm comparison is the point: a single compiler adapting to three fundamentally different hardware philosophies. Hypothesis H2 predicts that the nominally target-independent passes (fusion, CSE) will not benefit these architectures equally.
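One way to probe H2 yourself is to dump the optimized (post-fusion) HLO that XLA produces on each backend and compare what the "same" passes did. A minimal JAX probe (the function and shapes are our own example):

```python
import jax
import jax.numpy as jnp

def mlp_block(x, w1, w2):
    # Elementwise ReLU sandwiched between two matmuls: a canonical
    # fusion candidate for XLA on any backend.
    return jnp.maximum(x @ w1, 0.0) @ w2

x  = jnp.ones((128, 512))
w1 = jnp.ones((512, 2048))
w2 = jnp.ones((2048, 512))

# Optimized HLO for whatever backend this process targets; run the same
# script on TPU, CUDA, and ROCm builds and diff the fusion decisions.
hlo = jax.jit(mlp_block).lower(x, w1, w2).compile().as_text()
print(hlo.count("fusion"), "fusion computations")
```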
See Section 5 of the paper and the extended TPU v5p vs. v6e comparison in Appendix A.