OpenXLA Benchmark

Diagnostic Tools and the Economics of Cross-Platform Deployment

17 Apr 2026

Two shorter threads in the paper: the diagnostic tooling that makes compiler-level benchmarking possible, and the economic pressures that make cross-platform benchmarking urgent.

XProf: end-to-end profiling

XProf works across JAX, TensorFlow, and PyTorch/XLA. Three components of the diagnostic stack see direct use in the paper's methodology:

hlo-opt: isolated pass measurement

The hlo-opt tool executes individual compiler passes independently of the full pipeline. This isolation is what makes the paper’s optimization-transferability measurement possible: you can run just AlgebraicSimplifier or HloRematerialization on a given input module and attribute a performance delta to that specific pass on that specific backend.
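The attribution logic can be sketched in plain Python. This is a toy model of the measurement, not XLA's actual API: `run_module`, `algebraic_simplify`, and the op-tuple "module" representation are all illustrative stand-ins, and the simulated cost model replaces real hardware execution.

```python
import time

def algebraic_simplify(module):
    """Toy 'pass': drop multiply-by-one ops from a list of ops.

    Illustrative stand-in for XLA's AlgebraicSimplifier, not its API.
    """
    return [op for op in module if op != ("mul_by_one",)]

def run_module(module):
    """Pretend execution: wall-clock cost proportional to op count."""
    t0 = time.perf_counter()
    for _ in range(len(module) * 10_000):
        pass  # simulated per-op work
    return time.perf_counter() - t0

def pass_delta(module, compiler_pass):
    """Attribute a runtime delta to one pass on one input module:
    time the module as-is, time it after the pass, report the difference."""
    baseline = run_module(module)
    optimized = run_module(compiler_pass(module))
    return baseline - optimized

module = [("add",), ("mul_by_one",), ("mul_by_one",), ("dot",)]
delta = pass_delta(module, algebraic_simplify)
print(f"pass removed {len(module) - len(algebraic_simplify(module))} ops, "
      f"saved {delta:.6f}s")
```

The key property being modeled is isolation: because only one pass runs between the two timings, the delta is attributable to that pass on that backend, rather than being folded into a full-pipeline result.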

The diagnostic toolkit

| Tool | Primary Use Case | Target Platform |
|---|---|---|
| `hlo-opt` | Pass development and IR conversion | CPU, GPU, TPU |
| `run_hlo_module` | Microbenchmarking HLO snippets | CPU, GPU, TPU |
| `xprof` | End-to-end execution profiling | GPU, TPU |
| `multihost_hlo_runner` | SPMD and multi-node benchmarking | Distributed |

Why cross-platform benchmarking is urgent now

SemiAnalysis projects that by 2030, inference will consume 75% of all AI compute. At that scale, platform economics become decisive for infrastructure planning.

Vendor benchmarks suggest large cost differentials:

| Metric | TPU v6e | NVIDIA H200 | Advantage |
|---|---|---|---|
| Cost per hour | ~$1.38 | ~$2.50+ | TPU (~45% cheaper) |
| Inference perf. / $ | 4× baseline | Baseline | TPU |
| Power efficiency | 60–65% less power | Baseline | TPU |
| Framework maturity | JAX (native) | CUDA (universal) | NVIDIA |
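The "45% cheaper" figure follows directly from the quoted hourly rates; a quick check, treating $2.50 as the H200 price (the lower bound of the "$2.50+" figure):

```python
tpu_v6e_hourly = 1.38  # approximate on-demand $/hour, per vendor figures
h200_hourly = 2.50     # approximate $/hour, lower bound of "$2.50+"

# TPU cost advantage as a percentage of the H200 rate.
savings_pct = (h200_hourly - tpu_v6e_hourly) / h200_hourly * 100
print(f"TPU v6e is ~{savings_pct:.0f}% cheaper per hour")  # ~45%
```

Since $2.50 is a lower bound on the H200 rate, 45% is the conservative end of the claimed differential.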

But these are vendor-controlled configurations and may not generalize. Independent validation—which the 3×3 methodology is designed to provide—is necessary to substantiate or qualify the claims.

Full details in the paper, Sections 6–7.