The paper “Technical Architecture and Systematic Benchmarking of the OpenXLA Ecosystem” is a proposed research protocol: the architectural analysis and benchmarking methodology are complete, and experimental results are forthcoming pending hardware access.
The problem
The proliferation of frontend frameworks (JAX, PyTorch, TensorFlow) combined with an increasingly heterogeneous hardware landscape (GPUs, TPUs, custom ASICs) has created a fragmentation problem that impedes portable and efficient model deployment. OpenXLA—developed jointly by Google, AMD, Intel, NVIDIA, and AWS—addresses this through a unified compiler ecosystem built on StableHLO as a portability layer and PJRT as a pluggable hardware interface.
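The StableHLO portability layer described above can be observed directly from JAX: `jax.jit(...).lower(...)` emits a StableHLO module, which XLA then compiles for whichever PJRT backend is loaded. A minimal sketch (an illustration of the pipeline, not code from the paper):

```python
import jax
import jax.numpy as jnp

def predict(w, x):
    # A toy model: one dense layer with a tanh nonlinearity.
    return jnp.tanh(x @ w)

w = jnp.ones((4, 2))
x = jnp.ones((3, 4))

# Lower to StableHLO: this is the portable artifact that any
# PJRT backend (TPU, CUDA, ROCm, CPU) can consume.
lowered = jax.jit(predict).lower(w, x)
print(lowered.as_text())  # MLIR module in the stablehlo dialect
```

The same lowered artifact compiles unchanged on any available backend via `lowered.compile()`, which is exactly the decoupling OpenXLA is built around.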
What the paper delivers
- An architectural analysis of the OpenXLA compiler pipeline, informed by direct contributions to the OpenXLA codebase.
- A detailed characterization of three contemporary accelerator families—TPU v6e (Trillium), NVIDIA H200 (Hopper), and AMD MI300X—and their interaction with the XLA compiler.
- A comparative economic analysis of cross-platform deployment costs and energy efficiency.
- A formal 3×3 benchmarking methodology: three representative workloads (LLM inference, dense training, sparse embedding training) evaluated across all three backends.
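The 3×3 design can be written down as a simple experiment grid. The workload and backend labels below are illustrative placeholders, not necessarily the paper's exact identifiers:

```python
from itertools import product

# Illustrative names; the paper's exact labels may differ.
workloads = ["llm_inference", "dense_training", "sparse_embedding_training"]
backends = ["tpu_v6e", "nvidia_h200", "amd_mi300x"]

# Every (workload, backend) pair is one benchmarked cell of the matrix.
grid = [
    {"workload": w, "backend": b}
    for w, b in product(workloads, backends)
]
print(len(grid))  # 9 cells in the 3x3 matrix
```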
Five ways this differs from existing benchmarks
- Benchmarks the compiler, not just the hardware. Uses hlo-opt ablation to isolate the effects of fusion, common-subexpression elimination (CSE), and algebraic simplification.
- Measures the portability tax. Compares the OpenXLA path (JAX → StableHLO → XLA → hardware) against native paths (vLLM/CUDA, PyTorch/ROCm, direct HLO on TPU).
- Quantifies optimization transferability. Measures whether the same compiler passes deliver comparable gains on systolic-array (TPU), SIMT (NVIDIA), and CDNA (AMD) architectures.
- Exposes compiler version sensitivity. All experiments pin a specific XLA commit hash.
- Covers three accelerator paradigms rather than the two-platform designs typical of prior work.
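One way to probe the compiler rather than the hardware, sketched here as an idea rather than as the paper's hlo-opt harness, is to compare the portable program against the backend-optimized HLO that XLA produces for it:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Elementwise chain that XLA's fusion pass typically collapses
    # into a single fused computation.
    return jnp.sum(jnp.tanh(x) * 2.0 + 1.0)

x = jnp.ones((128, 128))
lowered = jax.jit(f).lower(x)

before = lowered.as_text()           # portable StableHLO, pre-optimization
after = lowered.compile().as_text()  # HLO after backend-specific optimization

# Diffing `before` against `after` (or disabling individual passes,
# e.g. via XLA's xla_disable_hlo_passes flag or hlo-opt) shows what
# each optimization contributed on a given backend.
print(len(before), len(after))
```

Because the pre-optimization module is identical across backends, any difference in the optimized HLO is attributable to the compiler, which is the quantity the ablation methodology targets.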
Read the paper
- paper1.pdf
- paper1.tex (LaTeX source)