OpenXLA Benchmark

The 3×3 Benchmarking Methodology and Testable Hypotheses

15 Apr 2026

The paper formalizes a 3×3 experimental design: three representative ML workloads evaluated across three hardware backends, yielding nine distinct cells, each with defined KPIs.

The matrix

| | TPU v6e | NVIDIA H200 | AMD MI300X |
|---|---|---|---|
| Task A: LLM Inference (Llama 3.1 70B) | TTFT, TPS, $/tok | TTFT, TPS, $/tok | TTFT, TPS, $/tok |
| Task B: Dense Training (ViT / ResNet-50) | TFLOPS, TTC | TFLOPS, TTC | TFLOPS, TTC |
| Task C: Sparse Training (DLRM v2) | SC util, BW | BW, throughput | BW, throughput |

(TTFT = time to first token; TPS = tokens per second; TTC = time to convergence; SC = SparseCore; BW = memory bandwidth.)

FP8 is an experimental variable on H200 and MI300X. Structured 2:4 sparsity is evaluated where hardware support is available.
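Where structured 2:4 sparsity is evaluated, the weight tensors must actually satisfy the pattern sparse tensor cores accelerate: at most two non-zeros in every contiguous group of four. A minimal sketch of a pattern check and a magnitude-based pruner (the function names and NumPy-based approach are illustrative, not from the paper):

```python
import numpy as np

def satisfies_2_4_sparsity(w: np.ndarray) -> bool:
    """True if every contiguous group of 4 elements along the last
    axis of `w` contains at most 2 non-zeros (the 2:4 pattern)."""
    if w.shape[-1] % 4 != 0:
        return False
    groups = w.reshape(-1, 4)
    return bool(np.all(np.count_nonzero(groups, axis=1) <= 2))

def prune_to_2_4(w: np.ndarray) -> np.ndarray:
    """Magnitude-prune to 2:4: keep the two largest-magnitude entries
    in each group of 4 and zero the rest."""
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)
```

A dense matrix fails the check; its pruned counterpart passes, which is the invariant a benchmark harness would assert before enabling the sparse path.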

Four-step experimental protocol

  1. Environment and baseline standardization. Unified JAX + PJRT stack. Input dataset sizes exceed host memory so the benchmark measures real data movement rather than cache hits. Accuracy is validated against reference metrics; per-operation numerical tolerances are documented (XLA’s erf lowering, for instance, can diverge by several ULPs depending on scalar vs. vectorized code paths).
  2. HLO capture and isolated micro-benchmarking. XLA_FLAGS="--xla_dump_to=/tmp/experiment" captures pre- and post-optimization graphs; hlo-opt isolates specific passes (e.g., ReshapeMover) to measure impact across backends.
  3. End-to-end performance measurement. XProf traces capture TTFT, tokens/s, time-to-convergence. Power is sampled at 1-second intervals via nvidia-smi --query-gpu=power.draw, rocm-smi --showpower, and the Google Cloud Monitoring API for TPU. Measurement stability requires coefficient of variation <5% across trials (following Sada et al. 2025). Energy efficiency (tokens/s/W) is reported alongside throughput and latency for all 9 cells.
  4. Bottleneck attribution. XProf’s Trace Viewer correlates accelerator gaps with host-side preprocessing. Sustained GFLOP/s for critical kernels (e.g., matmul) are compared across architectures. Results feed a comparative TCO analysis.
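The per-operation tolerances documented in step 1 are naturally expressed in ULPs. A minimal sketch of a float64 ULP-distance check (the helper names are illustrative; the sign-magnitude-to-ordinal trick is a standard technique):

```python
import math
import struct

def _ordinal(x: float) -> int:
    """Map a float64 to an integer such that adjacent representable
    floats differ by exactly 1 (sign-magnitude to linear ordinal)."""
    (u,) = struct.unpack("<Q", struct.pack("<d", x))
    return u if u < 1 << 63 else (1 << 63) - u

def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 values between a and b."""
    return abs(_ordinal(a) - _ordinal(b))

def within_ulps(actual: float, reference: float, max_ulps: int) -> bool:
    """Per-op tolerance check of the kind step 1 documents."""
    return ulp_distance(actual, reference) <= max_ulps
```

With this, a documented tolerance like “erf diverges by several ULPs between code paths” becomes a concrete assertion, e.g. `within_ulps(backend_erf, math.erf(x), 4)`.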

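Step 3’s acceptance criteria reduce to two small computations: the coefficient of variation across trials and tokens/s/W from the 1-second power samples. A sketch, assuming the function names and sample format (these helpers are illustrative, not part of the paper’s tooling):

```python
import statistics

def coefficient_of_variation(samples: list[float]) -> float:
    """CV = sample standard deviation / mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

def is_stable(samples: list[float], threshold: float = 0.05) -> bool:
    """Accept a trial set only if run-to-run variation is under the
    5% CV bound from step 3 (following Sada et al. 2025)."""
    return coefficient_of_variation(samples) < threshold

def tokens_per_second_per_watt(tokens_per_s: float,
                               power_samples_w: list[float]) -> float:
    """Energy efficiency: throughput divided by mean sampled power."""
    return tokens_per_s / statistics.mean(power_samples_w)
```

For example, throughputs of 99, 100, and 101 tokens/s give a CV of 1% and pass, while 50, 100, and 150 fail and would trigger a re-run.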
Compiler version sensitivity

All experiments pin a specific XLA commit hash. Individual commits can materially shift results: AMD shuffle promotion to DPP, collective permute barrier placement, and cost-model corrections for integer GEMMs have each moved numbers. Reproduction efforts should use the same commit or document any deviations.

Statistical protocol

Three testable hypotheses

Limitations (deliberate)

Full methodology and reporting schema: the paper, Sections 8–9.