<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/" rel="alternate" type="text/html" /><updated>2026-04-18T04:51:45+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/feed.xml</id><title type="html">OpenXLA Benchmark</title><subtitle>Technical architecture and a proposed 3x3 benchmarking methodology for the OpenXLA compiler ecosystem across TPU v6e, NVIDIA H200, and AMD MI300X.
</subtitle><entry><title type="html">Diagnostic Tools and the Economics of Cross-Platform Deployment</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/17/diagnostic-tools-and-economics.html" rel="alternate" type="text/html" title="Diagnostic Tools and the Economics of Cross-Platform Deployment" /><published>2026-04-17T00:00:00+00:00</published><updated>2026-04-17T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/17/diagnostic-tools-and-economics</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/17/diagnostic-tools-and-economics.html"><![CDATA[<p>Two shorter threads in the paper: the diagnostic tooling that makes compiler-level benchmarking possible, and the economic pressures that make cross-platform benchmarking urgent.</p>

<h3 id="xprof-end-to-end-profiling">XProf: end-to-end profiling</h3>

<p>XProf works across JAX, TensorFlow, and PyTorch/XLA. Three components get direct use in the methodology:</p>

<ul>
  <li><strong>Trace Viewer</strong> — host/device execution timeline; identifies communication gaps and idle periods.</li>
  <li><strong>HLO Op Stats</strong> — highlights time-consuming operations; reports GFLOPS/s and rematerialization overhead.</li>
  <li><strong>Memory Profile Viewer</strong> — monitors HBM usage; surfaces peak heap consumption and potential stack exhaustion.</li>
</ul>

<h3 id="hlo-opt-isolated-pass-measurement">hlo-opt: isolated pass measurement</h3>

<p>The <code class="language-plaintext highlighter-rouge">hlo-opt</code> tool executes individual compiler passes independently of the full pipeline. This isolation is what makes the paper’s <strong>optimization-transferability measurement</strong> possible: you can run just <code class="language-plaintext highlighter-rouge">AlgebraicSimplifier</code> or <code class="language-plaintext highlighter-rouge">HloRematerialization</code> on a given input module and attribute a performance delta to that specific pass on that specific backend.</p>

<h3 id="the-diagnostic-toolkit">The diagnostic toolkit</h3>

<table>
  <thead>
    <tr>
      <th>Tool</th>
      <th>Primary Use Case</th>
      <th>Target Platform</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">hlo-opt</code></td>
      <td>Pass development and IR conversion</td>
      <td>CPU, GPU, TPU</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">run_hlo_module</code></td>
      <td>Microbenchmarking HLO snippets</td>
      <td>CPU, GPU, TPU</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">xprof</code></td>
      <td>End-to-end execution profiling</td>
      <td>GPU, TPU</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">multihost_hlo_runner</code></td>
      <td>SPMD and multi-node benchmarking</td>
      <td>Distributed</td>
    </tr>
  </tbody>
</table>

<h3 id="why-cross-platform-benchmarking-is-urgent-now">Why cross-platform benchmarking is urgent now</h3>

<p>SemiAnalysis projects that <strong>by 2030, inference will consume 75% of all AI compute</strong>. At that scale, platform economics become decisive for infrastructure planning.</p>

<p>Vendor benchmarks suggest large cost differentials:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>TPU v6e</th>
      <th>NVIDIA H200</th>
      <th>Advantage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cost per Hour</td>
      <td>~$1.38</td>
      <td>~$2.50+</td>
      <td>TPU (45% cheaper)</td>
    </tr>
    <tr>
      <td>Inference Perf. / $</td>
      <td>4× baseline</td>
      <td>Baseline</td>
      <td>TPU</td>
    </tr>
    <tr>
      <td>Power Efficiency</td>
      <td>60–65% less power</td>
      <td>Baseline</td>
      <td>TPU</td>
    </tr>
    <tr>
      <td>Framework Maturity</td>
      <td>JAX (native)</td>
      <td>CUDA (universal)</td>
      <td>NVIDIA</td>
    </tr>
  </tbody>
</table>

<p>But these are vendor-controlled configurations and may not generalize. Independent validation—which the 3×3 methodology is designed to provide—is necessary to substantiate or qualify the claims.</p>
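<p>To make the table concrete, here is a minimal sketch of how an hourly rate translates into cost per million tokens. Only the hourly rates come from the table above; the throughput figure is a hypothetical placeholder, not a measured number:</p>

```python
# Converts an hourly instance price into $/1M generated tokens at a sustained
# decode throughput. The 900 tokens/s value is a placeholder for illustration.
def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

tpu_v6e = cost_per_million_tokens(1.38, 900)   # ~$0.43 per 1M tokens
h200 = cost_per_million_tokens(2.50, 900)      # ~$0.77 per 1M tokens
```

<p>At equal throughput the ratio reduces to the price ratio (~1.8×); the point of the 3×3 measurements is precisely that throughput is <em>not</em> equal across platforms.</p>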

<p>Full details in <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Sections 6–7.</p>]]></content><author><name></name></author><category term="openxla" /><category term="xprof" /><category term="hlo-opt" /><category term="economics" /><summary type="html"><![CDATA[Two shorter threads in the paper: the diagnostic tooling that makes compiler-level benchmarking possible, and the economic pressures that make cross-platform benchmarking urgent.]]></summary></entry><entry><title type="html">The 3×3 Benchmarking Methodology and Testable Hypotheses</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/15/3x3-benchmarking-methodology.html" rel="alternate" type="text/html" title="The 3×3 Benchmarking Methodology and Testable Hypotheses" /><published>2026-04-15T00:00:00+00:00</published><updated>2026-04-15T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/15/3x3-benchmarking-methodology</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/15/3x3-benchmarking-methodology.html"><![CDATA[<p>The paper formalizes a <strong>3×3 experimental design</strong>: three representative ML workloads evaluated across three hardware backends, for nine distinct cells. Each cell has defined KPIs.</p>

<h3 id="the-matrix">The matrix</h3>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>TPU v6e</th>
      <th>NVIDIA H200</th>
      <th>AMD MI300X</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Task A: LLM Inference</strong> (Llama 3.1 70B)</td>
      <td>TTFT, TPS, $/tok</td>
      <td>TTFT, TPS, $/tok</td>
      <td>TTFT, TPS, $/tok</td>
    </tr>
    <tr>
      <td><strong>Task B: Dense Training</strong> (ViT / ResNet-50)</td>
      <td>TFLOPS, TTC</td>
      <td>TFLOPS, TTC</td>
      <td>TFLOPS, TTC</td>
    </tr>
    <tr>
      <td><strong>Task C: Sparse Training</strong> (DLRM v2)</td>
      <td>SC util, BW</td>
      <td>BW, throughput</td>
      <td>BW, throughput</td>
    </tr>
  </tbody>
</table>

<p>FP8 is an experimental variable on H200 and MI300X. Structured 2:4 sparsity is evaluated where hardware support is available.</p>

<h3 id="four-step-experimental-protocol">Four-step experimental protocol</h3>

<ol>
  <li><strong>Environment and baseline standardization.</strong> Unified JAX + PJRT stack. Input dataset sizes exceed host memory so the benchmark actually measures data movement, not cache hits. Accuracy is validated against reference metrics; per-operation numerical tolerances are documented (XLA’s <code class="language-plaintext highlighter-rouge">erf</code> lowering, for instance, can diverge by several ULPs depending on scalar vs. vectorized code paths).</li>
  <li><strong>HLO capture and isolated micro-benchmarking.</strong> <code class="language-plaintext highlighter-rouge">XLA_FLAGS="--xla_dump_to=/tmp/experiment"</code> captures pre- and post-optimization graphs; <code class="language-plaintext highlighter-rouge">hlo-opt</code> isolates specific passes (e.g., <code class="language-plaintext highlighter-rouge">ReshapeMover</code>) to measure impact across backends.</li>
  <li><strong>End-to-end performance measurement.</strong> XProf traces capture TTFT, tokens/s, time-to-convergence. Power is sampled at 1-second intervals via <code class="language-plaintext highlighter-rouge">nvidia-smi --query-gpu=power.draw</code>, <code class="language-plaintext highlighter-rouge">rocm-smi --showpower</code>, and the Google Cloud Monitoring API for TPU. Measurement stability requires coefficient of variation &lt;5% across trials (following Sada et al. 2025). <strong>Energy efficiency (tokens/s/W)</strong> is reported alongside throughput and latency for all 9 cells.</li>
  <li><strong>Bottleneck attribution.</strong> XProf’s Trace Viewer correlates accelerator gaps with host-side preprocessing. GFLOPS/s for critical kernels (<code class="language-plaintext highlighter-rouge">matmul</code>) are compared across architectures. Results feed a comparative TCO analysis.</li>
</ol>
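<p>The energy-efficiency metric in step 3 reduces to simple arithmetic once the power samples are in hand. A minimal sketch, assuming 1 Hz sampling (as produced by the <code>nvidia-smi</code>/<code>rocm-smi</code> queries above); the sample values are synthetic, not measurements:</p>

```python
# Tokens/s/W = average throughput divided by average power over the window.
# Samples are assumed 1 second apart, so window length = len(samples).
from statistics import mean

def tokens_per_second_per_watt(total_tokens, power_samples_w):
    duration_s = len(power_samples_w)          # 1 Hz sampling
    throughput = total_tokens / duration_s
    return throughput / mean(power_samples_w)

samples = [412.0, 418.5, 415.2, 420.1]         # synthetic watts over a 4 s window
eff = tokens_per_second_per_watt(total_tokens=3600, power_samples_w=samples)
# 900 tokens/s at ~416 W average, i.e. ~2.16 tokens/s/W
```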

<h3 id="compiler-version-sensitivity">Compiler version sensitivity</h3>

<p>All experiments pin a specific XLA commit hash. Individual commits can materially shift results—AMD shuffle promotion to DPP, collective permute barrier placement, cost-model corrections for integer GEMMs. Reproduction efforts should use the same commit or document deviations.</p>

<h3 id="statistical-protocol">Statistical protocol</h3>

<ul>
  <li>10 warm-up iterations (discarded).</li>
  <li>≥30 timed iterations, extended until the 95% CI for the primary metric is within ±3% of the sample mean.</li>
  <li>Report median, mean, standard deviation, and 95% CI.</li>
  <li>Report both wall-clock and device-only time.</li>
  <li>Record XLA commit hash, PJRT plugin version, driver version, cloud instance type, HBM temperature at run start, co-tenancy warnings.</li>
  <li>Repeat across ≥2 independent cloud instances (hardware lottery control).</li>
  <li>Runs &gt;3σ from the mean are flagged and excluded from primary statistics.</li>
</ul>
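<p>The CI stopping rule and the 3σ exclusion are straightforward to encode. A sketch using a normal approximation (z = 1.96) rather than a t-distribution, for brevity:</p>

```python
# Stopping rule: keep timing until the 95% CI half-width is within ±3% of the
# sample mean. Outlier rule: flag runs more than 3 standard deviations out.
from statistics import mean, stdev

def ci_halfwidth_ok(times, rel_tol=0.03):
    m = mean(times)
    half = 1.96 * stdev(times) / len(times) ** 0.5   # 95% CI half-width
    return half <= rel_tol * m

def flag_outliers(times):
    m, s = mean(times), stdev(times)
    return [t for t in times if abs(t - m) > 3 * s]  # excluded from primary stats
```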

<h3 id="three-testable-hypotheses">Three testable hypotheses</h3>

<ul>
  <li><strong>H1 (Abstraction Overhead).</strong> StableHLO/PJRT overhead vs. native paths is <strong>&lt;5% for compute-bound</strong> workloads (dense matmul) but potentially <strong>&gt;15% for memory-bound</strong> workloads (sparse embedding lookups) where data movement is less amenable to fusion.</li>
  <li><strong>H2 (Optimization Transferability).</strong> Target-independent optimizations—particularly fusion—transfer unevenly. Largest gains on TPU v6e (systolic dataflow benefits most from reduced memory round-trips); smaller on MI300X (5.3 TB/s HBM partially masks the memory wall). Confounder: XLA’s internal cost model has known inaccuracies for integer GEMMs (up to 576% prediction error at medium shapes).</li>
  <li><strong>H3 (SPMD Scaling Efficiency).</strong> Shardy-partitioned SPMD scales <strong>sub-linearly</strong> across multi-node configs. Efficiency gap widest on NVIDIA (NVLink → InfiniBand transition); narrowest on TPU pods (ICI torus provides uniform bisection bandwidth).</li>
</ul>

<h3 id="limitations-deliberate">Limitations (deliberate)</h3>

<ul>
  <li>No Groq LPU, Cerebras WSE, or Intel Gaudi 3 (different paradigms).</li>
  <li>No fine-tuning, RL, or MoE workloads (different compilation challenges).</li>
  <li>JAX only as frontend (cross-framework confounds avoided).</li>
  <li>Point-in-time snapshot at a specific XLA commit.</li>
</ul>

<p>Full methodology and reporting schema: <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Sections 8–9.</p>]]></content><author><name></name></author><category term="benchmarking" /><category term="methodology" /><category term="hypotheses" /><summary type="html"><![CDATA[The paper formalizes a 3×3 experimental design: three representative ML workloads evaluated across three hardware backends, for nine distinct cells. Each cell has defined KPIs.]]></summary></entry><entry><title type="html">Three Accelerator Paradigms: TPU v6e, NVIDIA H200, and AMD MI300X</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/12/hardware-tpu-nvidia-amd.html" rel="alternate" type="text/html" title="Three Accelerator Paradigms: TPU v6e, NVIDIA H200, and AMD MI300X" /><published>2026-04-12T00:00:00+00:00</published><updated>2026-04-12T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/12/hardware-tpu-nvidia-amd</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/12/hardware-tpu-nvidia-amd.html"><![CDATA[<p>The 3×3 study spans three distinct accelerator paradigms. Each interacts with XLA’s fusion, buffer analysis, and partitioning strategies differently.</p>

<h3 id="google-cloud-tpu-v6e-trillium--systolic-array">Google Cloud TPU v6e (Trillium) — systolic array</h3>

<p>Google’s TPUs are specialized matrix processors built around <strong>systolic arrays</strong>: grids of multiply-accumulators (MACs) that process matrix operations in a weight-stationary dataflow.</p>

<ul>
  <li><strong>256×256 systolic array</strong>, 65,536 MAC ops per cycle (up from 128×128 on v5).</li>
  <li>Optimized for Transformers: ~2.8× performance and ~2.1× perf/watt over the prior generation.</li>
  <li><strong>SparseCore</strong>: a dataflow processor dedicated to sparse ops such as recommendation-system embedding lookups. SparseCores operate in parallel with TensorCores, enabling pipelined dense/sparse computation.</li>
</ul>

<h3 id="nvidia-h200-hopper--simt">NVIDIA H200 (Hopper) — SIMT</h3>

<p>NVIDIA’s integration leverages the maturity of the CUDA ecosystem. Two features stand out in the paper’s context:</p>

<ul>
  <li><strong>NVSHMEM</strong> (PGAS-style communication) integrated with XLA delivers up to <strong>36% speedup over NCCL</strong> at sequence lengths up to 256K tokens.</li>
  <li><strong>CUDA Graphs</strong> support in XLA amortizes kernel launch overhead by capturing operation sequences into a single graph executable—particularly valuable for models with many small, short-lived ops.</li>
</ul>

<h3 id="amd-instinct-mi300x--cdna">AMD Instinct MI300X — CDNA</h3>

<p>AMD, a founding OpenXLA member, integrates via the ROCm stack:</p>

<ul>
  <li><strong>CDNA 3</strong> with Matrix Core Technology; supports FP8 and MXFP4.</li>
  <li><strong>64-wide wavefronts</strong> (NVIDIA uses 32-wide warps); each compute unit has dedicated Matrix Cores for accelerated GEMM.</li>
  <li><strong>RCCL</strong> (AMD’s fork of NCCL) runs over Infinity Fabric rather than NVLink.</li>
  <li>Intra-wavefront primitives: <strong>DPP</strong> and <strong>ds_swizzle</strong> instead of NVIDIA warp-shuffle. Recent XLA contributions promote generic <code class="language-plaintext highlighter-rouge">gpu.shuffle</code> operations to AMD-specific DPP register-to-register instructions for reduction kernels, eliminating shared-memory overhead.</li>
</ul>

<p>Meta has deployed MI300X for Llama 3 and Llama 4 inference.</p>

<h3 id="specs-at-a-glance">Specs at a glance</h3>

<table>
  <thead>
    <tr>
      <th>Specification</th>
      <th>TPU v6e (Trillium)</th>
      <th>NVIDIA H200</th>
      <th>AMD MI300X</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Peak BF16 Compute</td>
      <td>918 TFLOPS/chip</td>
      <td>989 TFLOPS</td>
      <td>1,307 TFLOPS</td>
    </tr>
    <tr>
      <td>HBM Capacity</td>
      <td>32 GB</td>
      <td>141 GB</td>
      <td>192 GB</td>
    </tr>
    <tr>
      <td>HBM Bandwidth</td>
      <td>1.6 TB/s</td>
      <td>4.8 TB/s</td>
      <td>5.3 TB/s</td>
    </tr>
    <tr>
      <td>Interconnect</td>
      <td>2D Torus (ICI)</td>
      <td>NVLink</td>
      <td>Infinity Fabric</td>
    </tr>
    <tr>
      <td>Pod / Node Scale</td>
      <td>256 chips</td>
      <td>8 GPUs (DGX)</td>
      <td>8 GPUs</td>
    </tr>
    <tr>
      <td>Sparse Acceleration</td>
      <td>SparseCore (2/chip)</td>
      <td>Structured Sparsity</td>
      <td>Sparse Matrix Ops</td>
    </tr>
  </tbody>
</table>
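<p>One way to read the compute and bandwidth rows together is the roofline "ridge point": the arithmetic intensity below which a kernel is memory-bound. A back-of-envelope sketch using only the table's numbers:</p>

```python
# Ridge point (FLOP/byte) = peak compute / memory bandwidth, from the
# specs table above. Kernels below this intensity are bandwidth-limited.
def ridge_point(peak_tflops, hbm_tb_per_s):
    return (peak_tflops * 1e12) / (hbm_tb_per_s * 1e12)   # FLOP per byte

tpu_v6e = ridge_point(918, 1.6)     # ~574 FLOP/byte
h200 = ridge_point(989, 4.8)        # ~206 FLOP/byte
mi300x = ridge_point(1307, 5.3)     # ~247 FLOP/byte
```

<p>The v6e's much higher ridge point is one intuition behind H2: fusion raises arithmetic intensity, and the platform with the highest ridge point stands to gain the most from it.</p>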

<p>The cross-paradigm comparison is the point: a single compiler adapting to three fundamentally different hardware philosophies. Hypothesis H2 predicts the target-independent passes (fusion, CSE) don’t help these architectures equally.</p>

<p>See <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Section 5, and the extended TPU v5p vs. v6e comparison in Appendix A.</p>]]></content><author><name></name></author><category term="hardware" /><category term="tpu" /><category term="nvidia" /><category term="amd" /><summary type="html"><![CDATA[The 3×3 study spans three distinct accelerator paradigms. Each interacts with XLA’s fusion, buffer analysis, and partitioning strategies differently.]]></summary></entry><entry><title type="html">PJRT: The Pluggable Just-in-Time Runtime</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/09/pjrt-pluggable-hardware-interface.html" rel="alternate" type="text/html" title="PJRT: The Pluggable Just-in-Time Runtime" /><published>2026-04-09T00:00:00+00:00</published><updated>2026-04-09T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/09/pjrt-pluggable-hardware-interface</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/09/pjrt-pluggable-hardware-interface.html"><![CDATA[<p>To deliver the “run anywhere” half of OpenXLA’s promise, the ecosystem ships <strong>PJRT</strong>—a hardware- and framework-independent interface for ML compilers and runtimes. PJRT simplifies new hardware integration by exposing a stable C API that abstracts device management, memory allocation, and executable execution.</p>

<h3 id="core-abstractions">Core abstractions</h3>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtClient</code></strong> — manages all communication and owns the devices and memory spaces for a given plugin.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtDevice</code></strong> — describes a single device, including its unique identifier and its location within a local or global grid.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtMemorySpace</code></strong> — distinguishes between unpinned and pinned memory associated with specific devices.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtBuffer</code></strong> — holds data on the device in a format optimized for the plugin (proprietary tensor formats, MLIR element attributes, etc.).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtCompiler</code></strong> — takes an input module (such as StableHLO) and returns a <code class="language-plaintext highlighter-rouge">PjRtLoadedExecutable</code>.</li>
</ul>
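<p>The relationships among these abstractions can be mocked in a few lines of Python. This is a toy illustration only; the real PJRT surface is a stable C API (with C++ classes such as <code>PjRtClient</code>), and every name below is a stand-in rather than the actual interface:</p>

```python
# Toy mock of the PJRT object model: a client owns devices, and buffers
# live on a specific device the client manages. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Device:
    device_id: int                      # unique per-client identifier

@dataclass
class Buffer:
    device: Device
    data: bytes                         # plugin-specific layout in reality

@dataclass
class Client:
    devices: list = field(default_factory=list)

    def buffer_from_host(self, data, device):
        assert device in self.devices   # a client only manages its own devices
        return Buffer(device=device, data=data)

client = Client(devices=[Device(0), Device(1)])
buf = client.buffer_from_host(b"\x00\x01", client.devices[0])
```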

<h3 id="why-this-matters-for-portability">Why this matters for portability</h3>

<p>The plugin architecture lets hardware vendors develop support <strong>independently</strong> from the main OpenXLA or framework repositories. Two concrete examples:</p>

<ul>
  <li><strong>Intel Extension for TensorFlow</strong> and the <strong>AMD ROCm plugin</strong> both use the PJRT C API to integrate seamlessly with JAX and PyTorch without modifying core framework code.</li>
</ul>

<p>From the benchmarking perspective, the PJRT layer is exactly what the paper’s <strong>portability tax</strong> measurement targets: for each hardware platform we compare the OpenXLA path (JAX → StableHLO → XLA → hardware) against the native path (vLLM/CUDA on NVIDIA, PyTorch/ROCm on AMD, direct HLO on TPU) to quantify the overhead this abstraction introduces.</p>

<p>Details in <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Section 4.</p>]]></content><author><name></name></author><category term="openxla" /><category term="pjrt" /><category term="runtime" /><summary type="html"><![CDATA[To deliver the “run anywhere” half of OpenXLA’s promise, the ecosystem ships PJRT—a hardware- and framework-independent interface for ML compilers and runtimes. PJRT simplifies new hardware integration by exposing a stable C API that abstracts device management, memory allocation, and executable execution.]]></summary></entry><entry><title type="html">The XLA Compiler Pipeline: Target-Independent Passes and LLVM Lowering</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/06/xla-compiler-pipeline.html" rel="alternate" type="text/html" title="The XLA Compiler Pipeline: Target-Independent Passes and LLVM Lowering" /><published>2026-04-06T00:00:00+00:00</published><updated>2026-04-06T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/06/xla-compiler-pipeline</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/06/xla-compiler-pipeline.html"><![CDATA[<p>The XLA compiler splits cleanly into <strong>target-independent analysis passes</strong> and <strong>target-specific code generation</strong>. This separation lets high-level optimizations benefit every backend while still exploiting the microarchitectural features of specific hardware.</p>

<h3 id="target-independent-optimization">Target-independent optimization</h3>

<p>During the initial compilation stage, XLA performs optimizations on the StableHLO graph that are hardware-agnostic:</p>

<ul>
  <li><strong>Common Subexpression Elimination (CSE).</strong> Redundant computations producing identical results are identified and consolidated.</li>
  <li><strong>Algebraic Simplification.</strong> Mathematically equivalent but computationally cheaper operation sequences are substituted.</li>
  <li><strong>Operation Fusion.</strong> Multiple subgraphs are combined into a single kernel, eliminating short-lived operations and intermediate buffers. For <code class="language-plaintext highlighter-rouge">S = reduce_sum(X + Y * Z)</code>, the multiplication, addition, and reduction fuse into one kernel that streams intermediate values through registers instead of writing them back to HBM.</li>
</ul>
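<p>The fusion example can be mimicked in plain Python: the unfused form materializes intermediate lists (stand-ins for HBM round-trip buffers), while the fused form streams each element through one accumulator (a stand-in for registers):</p>

```python
# Unfused: two intermediate buffers are written before the reduction runs.
def reduce_sum_unfused(x, y, z):
    prod = [yi * zi for yi, zi in zip(y, z)]        # intermediate buffer 1
    summed = [xi + pi for xi, pi in zip(x, prod)]   # intermediate buffer 2
    return sum(summed)

# Fused: multiply, add, and reduce in a single pass with no intermediates.
def reduce_sum_fused(x, y, z):
    acc = 0.0
    for xi, yi, zi in zip(x, y, z):
        acc += xi + yi * zi
    return acc
```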

<p>Beyond these passes, the compiler performs <strong>buffer analysis</strong> to allocate runtime memory and <strong>shape specialization</strong> to enable more aggressive constant propagation.</p>

<h3 id="target-specific-code-generation">Target-specific code generation</h3>

<p>After target-independent passes, the compiler converts StableHLO into an internal HLO dialect and dispatches to a hardware-specific backend. CPU and GPU backends use the <strong>LLVM framework</strong> for low-level IR generation. The backend pattern-matches operation combinations to optimized library calls (cuDNN, MKL) and determines optimal partitioning of computations into parallel streams.</p>

<h3 id="the-full-pipeline">The full pipeline</h3>

<table>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Dialect / Format</th>
      <th>Optimization Level</th>
      <th>Core Tools</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Frontend</td>
      <td>Python / NumPy (JAX)</td>
      <td>Framework-level</td>
      <td>JAX / TF / PyTorch</td>
    </tr>
    <tr>
      <td>Intermediate</td>
      <td>StableHLO / CHLO</td>
      <td>Target-independent</td>
      <td>MLIR / XLA Passes</td>
    </tr>
    <tr>
      <td>Lowering</td>
      <td>HLO Dialect</td>
      <td>Target-specific</td>
      <td>XLA Backend</td>
    </tr>
    <tr>
      <td>Backend</td>
      <td>LLVM IR / SPIR-V</td>
      <td>Microarchitectural</td>
      <td>LLVM / NVCC / ROCm</td>
    </tr>
    <tr>
      <td>Native</td>
      <td>PTX / Machine Code</td>
      <td>Hardware-specific</td>
      <td>LLVM / Driver</td>
    </tr>
  </tbody>
</table>

<p>The key insight for the benchmarking methodology: because fusion/CSE/algebraic simplification are identical across backends, we can measure whether each pass helps equally on systolic arrays vs. SIMT vs. CDNA. Hypothesis H2 predicts <em>uneven</em> transfer—largest gains on TPU v6e (memory-wall-sensitive systolic dataflow), smaller on MI300X (5.3 TB/s HBM partially masks the memory wall).</p>

<p>For details, see <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Section 3.</p>]]></content><author><name></name></author><category term="openxla" /><category term="xla" /><category term="compilers" /><category term="llvm" /><summary type="html"><![CDATA[The XLA compiler splits cleanly into target-independent analysis passes and target-specific code generation. This separation lets high-level optimizations benefit every backend while still exploiting the microarchitectural features of specific hardware.]]></summary></entry><entry><title type="html">The OpenXLA IR Hierarchy: StableHLO, CHLO, and VHLO</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/04/ir-hierarchy-stablehlo-chlo-vhlo.html" rel="alternate" type="text/html" title="The OpenXLA IR Hierarchy: StableHLO, CHLO, and VHLO" /><published>2026-04-04T00:00:00+00:00</published><updated>2026-04-04T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/04/ir-hierarchy-stablehlo-chlo-vhlo</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/04/ir-hierarchy-stablehlo-chlo-vhlo.html"><![CDATA[<p>OpenXLA’s effectiveness as a portability layer hinges on a multi-level dialect hierarchy within the MLIR framework. This post summarizes how the hierarchy is organized and why the design choices matter.</p>

<h3 id="stablehlo-the-portability-contract">StableHLO: the portability contract</h3>

<p>StableHLO is the bridge between frontend frameworks and backend compilers such as XLA and IREE. Any framework capable of producing StableHLO programs is compatible with any compiler that can consume them.</p>

<p>The specification defines approximately <strong>100 operations</strong> and commits to long-term stability: <strong>five years of backward compatibility</strong> and <strong>two years of forward compatibility</strong>. Stability is managed via <strong>VHLO</strong> (Versioned StableHLO), an add-only dialect that snapshots the StableHLO dialect at specific points in time, so consumers can deserialize and upgrade payloads from older versions of the stack.</p>

<h3 id="chlo-client-level-abstractions">CHLO: client-level abstractions</h3>

<p>The CHLO (Client HLO) dialect aligns with the API surface of <code class="language-plaintext highlighter-rouge">XlaBuilder</code> in C++. It models “syntactic sugar” and high-level mathematical compositions before they are materialized into lower-level dialects. This hierarchy enables hardware-independent program simplification and refines dynamically-shaped programs using concrete input arguments.</p>

<h3 id="quantization-semantics">Quantization semantics</h3>

<p>StableHLO’s types emphasize domain-specific tensor requirements. Quantized element types are governed by rigorous constraints:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">storage_type</code> (integer), <code class="language-plaintext highlighter-rouge">expressed_type</code> (floating-point), <code class="language-plaintext highlighter-rouge">scales</code> (floating-point constants).</li>
  <li>For per-axis quantization, <code class="language-plaintext highlighter-rouge">quantization_dimension &lt; rank(T)</code>.</li>
  <li>Constraints C1–C13 in the spec enforce that <code class="language-plaintext highlighter-rouge">storage_min</code>/<code class="language-plaintext highlighter-rouge">storage_max</code> fit within the storage type and that scales are strictly positive and finite.</li>
</ul>
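<p>These constraints are easy to sketch as a checker. The function below is illustrative only; it encodes just the three properties named in the bullets, not the spec's full C1–C13 semantics:</p>

```python
# Checks that a quantized element type's parameters are well-formed:
# storage range fits the storage type, scales are strictly positive and
# finite, and (for per-axis quantization) the axis index is within rank.
import math

INT_RANGES = {"i8": (-128, 127), "i16": (-32768, 32767)}   # example storage types

def check_quantized_type(storage_type, storage_min, storage_max,
                         scales, rank, quantization_dimension=None):
    lo, hi = INT_RANGES[storage_type]
    if not (lo <= storage_min < storage_max <= hi):          # fits storage type
        return False
    if not all(s > 0 and math.isfinite(s) for s in scales):  # positive, finite
        return False
    if quantization_dimension is not None:                   # per-axis case
        return 0 <= quantization_dimension < rank
    return True
```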

<p>Enforcing these semantics at the compiler level prevents numerical issues common in lower-precision data formats.</p>

<h3 id="shardy-unified-tensor-partitioning">Shardy: unified tensor partitioning</h3>

<p><a href="https://github.com/openxla/shardy">Shardy</a> is an MLIR-based partitioning system from the merged GSPMD and PartIR teams. It provides a unified partitioning dialect integrated into the StableHLO layer, delivering consistent SPMD partitioning semantics across all OpenXLA backends. Shardy is directly relevant to the paper’s H3 hypothesis on SPMD scaling efficiency.</p>

<h3 id="deprecation-note">Deprecation note</h3>

<p>The internal <strong>MHLO</strong> (Meta-HLO) dialect is being deprecated. Useful passes (canonicalization, folder patterns) are migrating into StableHLO itself, ensuring that all hardware-independent graph simplifications are available to every PJRT plugin regardless of target compiler IR.</p>

<p>See the full discussion in <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Section 2.</p>]]></content><author><name></name></author><category term="openxla" /><category term="mlir" /><category term="stablehlo" /><summary type="html"><![CDATA[OpenXLA’s effectiveness as a portability layer hinges on a multi-level dialect hierarchy within the MLIR framework. This post summarizes how the hierarchy is organized and why the design choices matter.]]></summary></entry><entry><title type="html">Paper Overview: The OpenXLA Ecosystem and a 3×3 Benchmarking Protocol</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/01/paper-overview.html" rel="alternate" type="text/html" title="Paper Overview: The OpenXLA Ecosystem and a 3×3 Benchmarking Protocol" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/01/paper-overview</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/01/paper-overview.html"><![CDATA[<p>The paper <em>“Technical Architecture and Systematic Benchmarking of the OpenXLA Ecosystem”</em> is a proposed research protocol: the architectural analysis and benchmarking methodology are complete, and experimental results are forthcoming pending hardware access.</p>

<h3 id="the-problem">The problem</h3>

<p>The proliferation of frontend frameworks (JAX, PyTorch, TensorFlow) combined with an increasingly heterogeneous hardware landscape (GPUs, TPUs, custom ASICs) has created a fragmentation problem that impedes portable and efficient model deployment. OpenXLA—developed jointly by Google, AMD, Intel, NVIDIA, and AWS—addresses this through a unified compiler ecosystem built on <strong>StableHLO</strong> as a portability layer and <strong>PJRT</strong> as a pluggable hardware interface.</p>

<h3 id="what-the-paper-delivers">What the paper delivers</h3>

<ol>
  <li>An architectural analysis of the OpenXLA compiler pipeline, informed by direct contributions to the OpenXLA codebase.</li>
  <li>A detailed characterization of three contemporary accelerator families—TPU v6e (Trillium), NVIDIA H200 (Hopper), and AMD MI300X—and their interaction with the XLA compiler.</li>
  <li>A comparative economic analysis of cross-platform deployment costs and energy efficiency.</li>
  <li>A formal 3×3 benchmarking methodology: three representative workloads (LLM inference, dense training, sparse embedding training) evaluated across all three backends.</li>
</ol>

<h3 id="five-ways-this-differs-from-existing-benchmarks">Five ways this differs from existing benchmarks</h3>

<ul>
  <li><strong>Benchmarks the compiler, not just the hardware.</strong> Uses <code class="language-plaintext highlighter-rouge">hlo-opt</code> ablation to isolate fusion, CSE, and algebraic simplification.</li>
  <li><strong>Measures the portability tax.</strong> Compares OpenXLA (JAX → StableHLO → XLA → hardware) against native paths (vLLM/CUDA, PyTorch/ROCm, direct HLO on TPU).</li>
  <li><strong>Quantifies optimization transferability.</strong> Measures whether the same passes help equally on systolic array (TPU), SIMT (NVIDIA), and CDNA (AMD).</li>
  <li><strong>Exposes compiler version sensitivity.</strong> All experiments pin a specific XLA commit hash.</li>
  <li><strong>Covers three accelerator paradigms</strong> rather than the two-platform designs typical of prior work.</li>
</ul>

<h3 id="read-the-paper">Read the paper</h3>

<ul>
  <li><a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">paper1.pdf</a></li>
  <li><a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.tex">paper1.tex</a> (LaTeX source)</li>
</ul>]]></content><author><name></name></author><category term="openxla" /><category term="overview" /><category term="benchmarking" /><summary type="html"><![CDATA[The paper “Technical Architecture and Systematic Benchmarking of the OpenXLA Ecosystem” is a proposed research protocol: the architectural analysis and benchmarking methodology are complete, and experimental results are forthcoming pending hardware access.]]></summary></entry></feed>