The XLA compiler splits cleanly into target-independent analysis passes and target-specific code generation. This separation lets high-level optimizations benefit every backend while still exploiting the microarchitectural features of specific hardware.
Target-independent optimization
During the initial compilation stage, XLA performs optimizations on the StableHLO graph that are hardware-agnostic:
- Common Subexpression Elimination (CSE). Redundant computations producing identical results are identified and consolidated.
- Algebraic Simplification. Mathematically equivalent but computationally cheaper operation sequences are substituted.
- Operation Fusion. Multiple subgraphs are combined into a single kernel, eliminating short-lived operations and intermediate buffers. For `W = reduce_sum(X + Y * Z)`, the multiplication, addition, and reduction fuse into one kernel that streams intermediates through registers instead of writing them back to HBM.
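A minimal JAX sketch of the fused pattern above: under `jax.jit`, XLA compiles the multiply, add, and reduction into a single kernel, so the elementwise intermediates never materialize in HBM.

```python
import jax
import jax.numpy as jnp

@jax.jit
def fused(x, y, z):
    # Multiply, add, and reduce in one expression; XLA's fusion pass
    # combines these into a single kernel.
    return jnp.sum(x + y * z)

x = jnp.ones((1024,))
y = jnp.ones((1024,))
z = jnp.ones((1024,))
print(fused(x, y, z))  # 2048.0: sum of 1 + 1*1 over 1024 elements
```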
Beyond these passes, the compiler performs buffer analysis to allocate runtime memory and shape specialization to enable more aggressive constant propagation.
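Shape specialization is visible from the JAX side: each distinct concrete input shape yields a separate compiled executable, which is what lets the compiler propagate constants and sizes aggressively.

```python
import jax
import jax.numpy as jnp

@jax.jit
def double(x):
    return x * 2.0

# XLA specializes the executable to each concrete input shape,
# so a new shape triggers a separate trace and compilation.
a = double(jnp.ones((4,)))   # compiled for shape (4,)
b = double(jnp.ones((8,)))   # recompiled for shape (8,)
print(a.shape, b.shape)
```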
Target-specific code generation
After target-independent passes, the compiler converts StableHLO into an internal HLO dialect and dispatches to a hardware-specific backend. CPU and GPU backends use the LLVM framework for low-level IR generation. The backend pattern-matches operation combinations to optimized library calls (cuDNN, MKL) and determines optimal partitioning of computations into parallel streams.
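Both halves of the pipeline can be inspected through JAX's ahead-of-time APIs: `lower()` exposes the target-independent StableHLO module, and `compile()` exposes the HLO after the backend-specific passes have run.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x) + 1.0

x = jnp.ones((4,))

lowered = jax.jit(f).lower(x)
print(lowered.as_text())        # StableHLO module, before target-specific passes

compiled = lowered.compile()
print(compiled.as_text())       # optimized HLO for the backend that compiled it
```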
The full pipeline
| Stage | Dialect / Format | Optimization Level | Core Tools |
|---|---|---|---|
| Frontend | Python / NumPy (JAX) | Framework-level | JAX / TF / PyTorch |
| Intermediate | StableHLO / CHLO | Target-independent | MLIR / XLA Passes |
| Lowering | HLO Dialect | Target-specific | XLA Backend |
| Backend | LLVM IR / SPIR-V | Microarchitectural | LLVM / NVCC / ROCm |
| Native | PTX / Machine Code | Hardware-specific | LLVM / Driver |
The key insight for the benchmarking methodology is that because fusion, CSE, and algebraic simplification are identical across backends, we can measure whether each pass helps equally on systolic-array (TPU), SIMT (NVIDIA), and CDNA (AMD) hardware. Hypothesis H2 predicts uneven transfer: the largest gains on TPU v6e, whose systolic dataflow is most sensitive to the memory wall, and smaller gains on MI300X, whose 5.3 TB/s of HBM bandwidth partially masks it.
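One hedged sketch of such a per-pass measurement: XLA's `--xla_disable_hlo_passes` flag disables named passes, so timing the same jitted kernel with and without a pass isolates its contribution on a given backend. The pass name `fusion` and the harness below are illustrative assumptions, not the paper's harness; pass names vary by backend and XLA version.

```python
import os
# Assumption: "fusion" is the pass name on this backend/version of XLA.
# XLA_FLAGS must be set before JAX initializes the backend.
os.environ["XLA_FLAGS"] = "--xla_disable_hlo_passes=fusion"

import time
import jax
import jax.numpy as jnp

@jax.jit
def kernel(x, y, z):
    return jnp.sum(x + y * z)

x = y = z = jnp.ones((1 << 20,))
kernel(x, y, z).block_until_ready()  # compile and warm up

t0 = time.perf_counter()
for _ in range(100):
    kernel(x, y, z).block_until_ready()
mean_s = (time.perf_counter() - t0) / 100
print(f"mean runtime with fusion disabled: {mean_s:.6f} s")
```

Running the same script without the flag gives the fused baseline; the ratio of the two means is the pass's speedup on that backend.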
For details, see the paper, Section 3.