The XLA compiler splits cleanly into target-independent analysis passes and target-specific code generation. This separation lets high-level optimizations benefit every backend while still exploiting the microarchitectural features of specific hardware.
Target-independent optimization
During the initial compilation stage, XLA performs optimizations on the StableHLO graph that are hardware-agnostic:
- Common Subexpression Elimination (CSE). Redundant computations producing identical results are identified and consolidated.
- Algebraic Simplification. Mathematically equivalent but computationally cheaper operation sequences are substituted.
- Operation Fusion. Multiple subgraphs are combined into a single kernel, eliminating short-lived operations and intermediate buffers. For `W = reduce_sum(X + Y * Z)`, the multiplication, addition, and reduction fuse into one kernel that streams intermediates through registers instead of writing them back to HBM.
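A minimal JAX sketch of the fused pattern above: under `jax.jit`, XLA compiles the multiply, add, and reduction into a single kernel, so the elementwise intermediates never materialize in HBM.

```python
import jax
import jax.numpy as jnp

@jax.jit
def fused(x, y, z):
    # Multiply, add, and reduce in one expression; XLA's fusion pass
    # combines these into a single kernel.
    return jnp.sum(x + y * z)

x = jnp.ones((1024,))
y = jnp.ones((1024,))
z = jnp.ones((1024,))
print(fused(x, y, z))  # 2048.0: sum of 1 + 1*1 over 1024 elements
```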
Beyond these passes, the compiler performs buffer analysis to allocate runtime memory and shape specialization to enable more aggressive constant propagation.
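Shape specialization is visible from the JAX side: each distinct concrete input shape yields a separate compiled executable, which is what lets the compiler propagate constants and sizes aggressively.

```python
import jax
import jax.numpy as jnp

@jax.jit
def double(x):
    return x * 2.0

# XLA specializes the executable to each concrete input shape,
# so a new shape triggers a separate trace and compilation.
a = double(jnp.ones((4,)))   # compiled for shape (4,)
b = double(jnp.ones((8,)))   # recompiled for shape (8,)
print(a.shape, b.shape)
```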
Target-specific code generation
After target-independent passes, the compiler converts StableHLO into an internal HLO dialect and dispatches to a hardware-specific backend. CPU and GPU backends use the LLVM framework for low-level IR generation. The backend pattern-matches operation combinations to optimized library calls (cuDNN, MKL) and determines optimal partitioning of computations into parallel streams.
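Both halves of the pipeline can be inspected through JAX's ahead-of-time APIs: `lower()` exposes the target-independent StableHLO module, and `compile()` exposes the HLO after the backend-specific passes have run.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x) + 1.0

x = jnp.ones((4,))

lowered = jax.jit(f).lower(x)
print(lowered.as_text())        # StableHLO module, before target-specific passes

compiled = lowered.compile()
print(compiled.as_text())       # optimized HLO for the backend that compiled it
```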
The full pipeline
| Stage | Dialect / Format | Optimization Level | Core Tools |
|---|---|---|---|
| Frontend | Python / NumPy (JAX) | Framework-level | JAX / TF / PyTorch |
| Intermediate | StableHLO / CHLO | Target-independent | MLIR / XLA Passes |
| Lowering | HLO Dialect | Target-specific | XLA Backend |
| Backend | LLVM IR / SPIR-V | Microarchitectural | LLVM / NVCC / ROCm |
| Native | PTX / Machine Code | Hardware-specific | LLVM / Driver |
The key insight for the benchmarking methodology is that because fusion, CSE, and algebraic simplification are identical across backends, we can measure whether each pass helps equally on systolic-array (TPU), SIMT (NVIDIA), and CDNA (AMD) hardware. Hypothesis H2 predicts uneven transfer: the largest gains on TPU v6e, whose systolic dataflow is most sensitive to the memory wall, and smaller gains on MI300X, whose 5.3 TB/s of HBM bandwidth partially masks it.
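One hedged sketch of such a per-pass measurement: XLA's `--xla_disable_hlo_passes` flag disables named passes, so timing the same jitted kernel with and without a pass isolates its contribution on a given backend. The pass name `fusion` and the harness below are illustrative assumptions, not the paper's harness; pass names vary by backend and XLA version.

```python
import os
# Assumption: "fusion" is the pass name on this backend/version of XLA.
# XLA_FLAGS must be set before JAX initializes the backend.
os.environ["XLA_FLAGS"] = "--xla_disable_hlo_passes=fusion"

import time
import jax
import jax.numpy as jnp

@jax.jit
def kernel(x, y, z):
    return jnp.sum(x + y * z)

x = y = z = jnp.ones((1 << 20,))
kernel(x, y, z).block_until_ready()  # compile and warm up

t0 = time.perf_counter()
for _ in range(100):
    kernel(x, y, z).block_until_ready()
mean_s = (time.perf_counter() - t0) / 100
print(f"mean runtime with fusion disabled: {mean_s:.6f} s")
```

Running the same script without the flag gives the fused baseline; the ratio of the two means is the pass's speedup on that backend.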
For details, see the paper, Section 3.