OpenXLA Benchmark

Three Accelerator Paradigms: TPU v6e, NVIDIA H200, and AMD MI300X

12 Apr 2026

The 3×3 study spans three distinct accelerator paradigms, each of which interacts differently with XLA's fusion, buffer analysis, and partitioning strategies.

Google Cloud TPU v6e (Trillium) — systolic array

Google’s TPUs are specialized matrix processors built around systolic arrays: grids of multiply-accumulators (MACs) that process matrix operations in a weight-stationary dataflow.
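The weight-stationary idea can be sketched in a few lines of plain Python. This is an illustrative toy, not TPU microarchitecture: each position in the grid plays the role of a MAC cell that holds one weight fixed while activations stream past and partial sums accumulate.

```python
# Toy weight-stationary matmul (illustrative sketch, not real TPU hardware):
# every weight W[t][j] stays "resident" in its cell; the loop over t stands
# in for cycles during which activations stream through the array.
def systolic_matmul(A, W):
    """Compute A @ W with explicit multiply-accumulate cells.

    A: m x k activations (list of lists), W: k x n stationary weights.
    """
    m, k = len(A), len(W)
    n = len(W[0])
    acc = [[0] * n for _ in range(m)]  # one accumulator per output element
    for i in range(m):
        for j in range(n):
            for t in range(k):  # t plays the role of the cycle index
                acc[i][j] += A[i][t] * W[t][j]  # one MAC per cell per cycle
    return acc
```

The point of the dataflow is that weights are loaded once and reused across the whole activation stream, which is why matmul-heavy workloads map so well onto the array.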

NVIDIA H200 (Hopper) — SIMT

NVIDIA’s integration leverages the maturity of the CUDA ecosystem. Two features stand out in the paper’s context:

AMD Instinct MI300X — CDNA

AMD, a founding OpenXLA member, integrates via the ROCm stack:

Meta has deployed MI300X for Llama 3 and Llama 4 inference.

Specs at a glance

| Specification | TPU v6e (Trillium) | NVIDIA H200 | AMD MI300X |
|---|---|---|---|
| Peak BF16 compute | 918 TFLOPS/chip | 989 TFLOPS | 1,307 TFLOPS |
| HBM capacity | 32 GB | 141 GB | 192 GB |
| HBM bandwidth | 1.6 TB/s | 4.8 TB/s | 5.3 TB/s |
| Interconnect | 2D torus (ICI) | NVLink | Infinity Fabric |
| Pod / node scale | 256 chips | 8 GPUs (DGX) | 8 GPUs |
| Sparse acceleration | SparseCore (2/chip) | Structured sparsity | Sparse matrix ops |
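One derived quantity worth pulling out of the table: dividing peak BF16 compute by HBM bandwidth gives the arithmetic intensity (FLOP per byte) a kernel needs before it becomes compute-bound rather than memory-bound. The numbers below come straight from the table; the roofline framing is our gloss, not a claim from the paper.

```python
# Roofline break-even points implied by the spec table:
# peak_flops / hbm_bandwidth = FLOP/byte needed to saturate compute.
chips = {
    "TPU v6e":  (918e12, 1.6e12),   # 918 TFLOPS, 1.6 TB/s
    "H200":     (989e12, 4.8e12),   # 989 TFLOPS, 4.8 TB/s
    "MI300X":   (1307e12, 5.3e12),  # 1,307 TFLOPS, 5.3 TB/s
}
for name, (flops, bw) in chips.items():
    print(f"{name}: ~{flops / bw:.0f} FLOP/byte to stay compute-bound")
```

The v6e's much lower bandwidth relative to its compute (~574 FLOP/byte, versus ~206 for H200 and ~247 for MI300X) means it leans hardest on the compiler to fuse operations and keep data on-chip.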

The cross-paradigm comparison is the point: a single compiler adapting to three fundamentally different hardware philosophies. Hypothesis H2 predicts that the target-independent passes (fusion, CSE) do not benefit the three architectures equally.
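What elementwise fusion buys can be shown in plain Python rather than HLO. This is a conceptual sketch of the transformation, not XLA's actual implementation: the unfused version materializes an intermediate buffer and makes two passes over memory, while the fused version does one pass with no intermediate.

```python
import math

# Unfused: two elementwise passes, each reading and writing a full array.
def unfused(xs):
    tmp = [x * 2.0 for x in xs]         # pass 1 materializes tmp in memory
    return [math.tanh(t) for t in tmp]  # pass 2 re-reads tmp

# Fused: one pass, no intermediate buffer -- conceptually what XLA's
# elementwise fusion does to the HLO graph.
def fused(xs):
    return [math.tanh(x * 2.0) for x in xs]
```

Both produce identical results; the difference is memory traffic, which is exactly why the payoff from the same pass can vary across chips with very different compute-to-bandwidth ratios.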

See the paper, Section 5, and the extended TPU v5p vs. v6e comparison in Appendix A.