OpenXLA Benchmark

The XLA Compiler Pipeline: Target-Independent Passes and LLVM Lowering

06 Apr 2026

The XLA compiler splits cleanly into target-independent analysis passes and target-specific code generation. This separation lets high-level optimizations benefit every backend while still exploiting the microarchitectural features of specific hardware.

Target-independent optimization

During the initial compilation stage, XLA performs hardware-agnostic optimizations on the StableHLO graph, including operation fusion, common-subexpression elimination (CSE), and algebraic simplification.

Beyond these passes, the compiler performs buffer analysis to allocate runtime memory and shape specialization to enable more aggressive constant propagation.
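These passes can be observed directly from JAX via its ahead-of-time lowering API: `lower(...)` returns the StableHLO the frontend emits, and `compile()` yields the HLO left after XLA's pass pipeline. A minimal sketch (the function `f` and its deliberate redundancies are our own illustration):

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # Redundancies for XLA to remove: the second jnp.sin is a CSE
    # candidate, and "* 1.0" / "* 0.0" invite algebraic simplification.
    return jnp.sin(x) * 1.0 + jnp.sin(x) * 0.0

x = jnp.ones((4,), dtype=jnp.float32)
stablehlo = f.lower(x).as_text()            # pre-optimization StableHLO
optimized = f.lower(x).compile().as_text()  # HLO after the pass pipeline
```

Diffing the two texts on any backend typically shows the redundant arithmetic gone and the remaining operations collapsed into a fused computation.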

Target-specific code generation

After target-independent passes, the compiler converts StableHLO into an internal HLO dialect and dispatches to a hardware-specific backend. CPU and GPU backends use the LLVM framework for low-level IR generation. The backend pattern-matches operation combinations to optimized library calls (cuDNN, MKL) and determines optimal partitioning of computations into parallel streams.
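To inspect what a specific backend emits, XLA's dump flags write the intermediate HLO (and, on LLVM-based backends, the generated LLVM IR) to disk. A sketch for a JAX program; the dump directory and the script name `my_model.py` are arbitrary placeholders:

```shell
# Ask XLA to dump compiler intermediates as text.
export XLA_FLAGS="--xla_dump_to=/tmp/xla_dump --xla_dump_hlo_as_text"
python my_model.py   # hypothetical script; any JAX/TF program works
ls /tmp/xla_dump     # HLO snapshots per pass; .ll files hold LLVM IR
```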

The full pipeline

Stage         Dialect / Format        Optimization Level    Core Tools
Frontend      Python / NumPy (JAX)    Framework-level       JAX / TF / PyTorch
Intermediate  StableHLO / CHLO        Target-independent    MLIR / XLA Passes
Lowering      HLO Dialect             Target-specific       XLA Backend
Backend       LLVM IR / SPIR-V        Microarchitectural    LLVM / NVCC / ROCm
Native        PTX / Machine Code      Hardware-specific     LLVM / Driver

The key insight for the benchmarking methodology: because fusion, CSE, and algebraic simplification are identical across backends, we can measure whether each pass helps equally on systolic arrays, SIMT, and CDNA architectures. Hypothesis H2 predicts uneven transfer: the largest gains on TPU v6e, whose memory-wall-sensitive systolic dataflow benefits most from reduced memory traffic, and smaller gains on MI300X, whose 5.3 TB/s of HBM bandwidth partially masks the memory wall.
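A minimal timing harness for such cross-backend comparisons might look as follows (a sketch; `bench` and the iteration count are our own choices, and `block_until_ready` is required because JAX dispatches work asynchronously):

```python
import time
import jax
import jax.numpy as jnp

def bench(fn, x, iters=100):
    """Mean wall time per call, excluding the initial compile."""
    fn(x).block_until_ready()          # trigger compilation, warm up
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x).block_until_ready()      # wait for async dispatch to finish
    return (time.perf_counter() - t0) / iters

matmul = jax.jit(lambda x: x @ x)
t = bench(matmul, jnp.ones((256, 256), dtype=jnp.float32))
```

Running the same harness with a given pass disabled (e.g. via `XLA_FLAGS`) isolates that pass's contribution on each backend.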

For details, see the paper, Section 3.