The paper formalizes a 3×3 experimental design: three representative ML workloads evaluated across three hardware backends, yielding nine distinct cells, each with defined KPIs.
The matrix
| | TPU v6e | NVIDIA H200 | AMD MI300X |
|---|---|---|---|
| Task A: LLM Inference (Llama 3.1 70B) | TTFT, TPS, $/tok | TTFT, TPS, $/tok | TTFT, TPS, $/tok |
| Task B: Dense Training (ViT / ResNet-50) | TFLOPS, TTC | TFLOPS, TTC | TFLOPS, TTC |
| Task C: Sparse Training (DLRM v2) | SparseCore util, BW | BW, throughput | BW, throughput |
FP8 is an experimental variable on H200 and MI300X. Structured 2:4 sparsity is evaluated where hardware support is available.
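The $/tok KPI for Task A can be derived from measured throughput and instance pricing. A minimal sketch; the throughput and hourly price below are hypothetical placeholders, not measurements from the paper:

```python
def cost_per_million_tokens(tokens_per_second: float, hourly_price_usd: float) -> float:
    """Convert sustained throughput and an instance's hourly price into $/1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Hypothetical numbers: 1,200 tok/s sustained on a $10.00/hr instance.
print(round(cost_per_million_tokens(1200, 10.00), 3))  # $/1M tokens ≈ 2.315
```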
Four-step experimental protocol
- Environment and baseline standardization. Unified JAX + PJRT stack. Input dataset sizes exceed host memory so the benchmark actually measures data movement, not cache hits. Accuracy is validated against reference metrics; per-operation numerical tolerances are documented (XLA's `erf` lowering, for instance, can diverge by several ULPs depending on scalar vs. vectorized code paths).
- HLO capture and isolated micro-benchmarking. `XLA_FLAGS="--xla_dump_to=/tmp/experiment"` captures pre- and post-optimization graphs; `hlo-opt` isolates specific passes (e.g., `ReshapeMover`) to measure their impact across backends.
- End-to-end performance measurement. XProf traces capture TTFT, tokens/s, and time-to-convergence. Power is sampled at 1-second intervals via `nvidia-smi --query-gpu=power.draw`, `rocm-smi --showpower`, and the Google Cloud Monitoring API for TPU. Measurement stability requires a coefficient of variation <5% across trials (following Sada et al. 2025). Energy efficiency (tokens/s/W) is reported alongside throughput and latency for all 9 cells.
- Bottleneck attribution. XProf's Trace Viewer correlates accelerator idle gaps with host-side preprocessing. GFLOPS/s for critical kernels (e.g., matmul) are compared across architectures. Results feed a comparative TCO analysis.
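The <5% coefficient-of-variation stability gate from the measurement step is simple to check programmatically. A minimal sketch in plain Python; the sample throughput values are illustrative only:

```python
import statistics

def is_stable(samples: list[float], max_cv: float = 0.05) -> bool:
    """Accept a trial set only if stdev/mean (coefficient of variation) is below the gate."""
    cv = statistics.stdev(samples) / statistics.mean(samples)
    return cv < max_cv

# Illustrative tokens/s measurements across repeated trials.
trials = [1180.0, 1205.0, 1192.0, 1210.0, 1198.0]
print(is_stable(trials))  # True: these trials vary by well under 5%
```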
Compiler version sensitivity
All experiments pin a specific XLA commit hash. Individual commits can materially shift results—AMD shuffle promotion to DPP, collective permute barrier placement, cost-model corrections for integer GEMMs. Reproduction efforts should use the same commit or document deviations.
Statistical protocol
- 10 warm-up iterations (discarded).
- ≥30 timed iterations, extended until the 95% CI for the primary metric is within ±3% of the sample mean.
- Report median, mean, standard deviation, and 95% CI.
- Report both wall-clock and device-only time.
- Record XLA commit hash, PJRT plugin version, driver version, cloud instance type, HBM temperature at run start, co-tenancy warnings.
- Repeat across ≥2 independent cloud instances (hardware lottery control).
- Runs >3σ from the mean are flagged and excluded from primary statistics.
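The stopping rule and outlier flag above can be sketched directly. A minimal Python version using a normal approximation for the 95% CI (reasonable at n ≥ 30); function names are illustrative, not from the paper:

```python
import statistics

def ci_halfwidth_95(samples: list[float]) -> float:
    """Half-width of the 95% confidence interval for the mean (normal approximation)."""
    return 1.96 * statistics.stdev(samples) / len(samples) ** 0.5

def needs_more_iterations(samples: list[float]) -> bool:
    """Extend the run until the 95% CI is within ±3% of the sample mean."""
    return ci_halfwidth_95(samples) > 0.03 * statistics.mean(samples)

def flag_outliers(samples: list[float]) -> list[float]:
    """Runs more than 3 sigma from the sample mean are flagged for exclusion."""
    mu, sigma = statistics.mean(samples), statistics.stdev(samples)
    return [x for x in samples if abs(x - mu) > 3 * sigma]
```

In practice the timed loop would call `needs_more_iterations` after each batch of trials, then report statistics on the sample with flagged runs excluded.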
Three testable hypotheses
- H1 (Abstraction Overhead). StableHLO/PJRT overhead vs. native paths is <5% for compute-bound workloads (dense matmul) but potentially >15% for memory-bound workloads (sparse embedding lookups) where data movement is less amenable to fusion.
- H2 (Optimization Transferability). Target-independent optimizations—particularly fusion—transfer unevenly. Largest gains on TPU v6e (systolic dataflow benefits most from reduced memory round-trips); smaller on MI300X (5.3 TB/s HBM partially masks the memory wall). Confounder: XLA’s internal cost model has known inaccuracies for integer GEMMs (up to 576% prediction error at medium shapes).
- H3 (SPMD Scaling Efficiency). Shardy-partitioned SPMD scales sub-linearly across multi-node configs. Efficiency gap widest on NVIDIA (NVLink → InfiniBand transition); narrowest on TPU pods (ICI torus provides uniform bisection bandwidth).
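H3's sub-linear scaling claim is conventionally quantified as throughput at N devices relative to N times single-device throughput. A minimal sketch of that efficiency metric; the throughput numbers are hypothetical:

```python
def scaling_efficiency(throughput_1: float, throughput_n: float, n_devices: int) -> float:
    """Fraction of ideal linear scaling achieved at n_devices (1.0 = perfectly linear)."""
    return throughput_n / (n_devices * throughput_1)

# Hypothetical: 1,000 tok/s on one device, 6,800 tok/s across eight.
print(scaling_efficiency(1000.0, 6800.0, 8))  # 0.85
```

Under H3, this ratio would drop more sharply on NVIDIA multi-node configurations (crossing the NVLink/InfiniBand boundary) than on TPU pods.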
Limitations (deliberate)
- No Groq LPU, Cerebras WSE, or Intel Gaudi 3 (different paradigms).
- No fine-tuning, RL, or MoE workloads (different compilation challenges).
- JAX only as frontend (cross-framework confounds avoided).
- Point-in-time snapshot at a specific XLA commit.
Full methodology and reporting schema: the paper, Sections 8–9.