OpenXLA Benchmark

About OpenXLA Benchmark

This site accompanies the paper “Technical Architecture and Systematic Benchmarking of the OpenXLA Ecosystem: A Cross-Platform Analysis of Modern ML Compilers” (DRAFT v2 – April 2026).

Status: This paper presents a proposed research protocol. The architectural analysis and benchmarking methodology are complete; experimental results are forthcoming pending hardware access. Feedback on the methodology, hypotheses, and experimental design is welcome.

Abstract

The fragmentation of machine learning (ML) frameworks and hardware backends presents a critical barrier to portable, cost-effective model deployment. OpenXLA addresses this challenge through a unified compiler ecosystem built on StableHLO as a portability layer and PJRT as a pluggable hardware interface.
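As a concrete illustration of the portability layer (our own sketch, not drawn from the paper), a JAX program can be lowered to a StableHLO module before any backend-specific compilation; the toy function and inputs here are hypothetical:

```python
import jax
import jax.numpy as jnp

# A hypothetical toy computation; any jittable function works.
def scaled_add(x, y):
    return 2.0 * x + y

x = jnp.ones((4,))
y = jnp.arange(4.0)

# jax.jit(...).lower(...) yields a hardware-independent StableHLO module;
# a PJRT plugin (CPU, GPU, or TPU) then compiles that same module.
stablehlo_text = jax.jit(scaled_add).lower(x, y).as_text()
print(stablehlo_text)
```

The printed MLIR text contains StableHLO dialect ops (e.g. stablehlo.multiply, stablehlo.add), which is the representation handed off across the PJRT boundary.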

This paper provides an architectural analysis of the OpenXLA compiler pipeline—from its intermediate representation hierarchy (CHLO, StableHLO, VHLO) through target-independent optimization passes to hardware-specific code generation via LLVM—informed by direct contributions to the OpenXLA codebase. We characterize three contemporary accelerator families: Google Cloud TPU v6e (Trillium), NVIDIA H200 (Hopper), and AMD Instinct MI300X, examining how each interacts with the XLA compiler’s fusion, buffer analysis, and partitioning strategies. We propose a systematic 3×3 benchmarking methodology—evaluating three representative workloads (LLM inference, dense model training, and sparse embedding training) across all three hardware backends—and formalize testable hypotheses regarding abstraction overhead, optimization transferability, and SPMD scaling efficiency.
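The 3×3 design described above can be sketched as a simple configuration matrix. The workload and backend names follow the paper; the helper structure itself is only an illustration:

```python
from itertools import product

# Workloads and backends as named in the paper's 3x3 methodology.
WORKLOADS = ("llm_inference", "dense_training", "sparse_embedding_training")
BACKENDS = ("tpu_v6e", "nvidia_h200", "amd_mi300x")

def benchmark_matrix():
    """Enumerate all nine (workload, backend) configurations."""
    return [
        {"workload": w, "backend": b}
        for w, b in product(WORKLOADS, BACKENDS)
    ]

configs = benchmark_matrix()
print(len(configs))  # 3 workloads x 3 backends = 9 configurations
```

Each of the nine configurations would be run under the same compiler settings, which is what makes cross-backend comparisons of fusion, buffer analysis, and partitioning behavior meaningful.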

Contributions

  1. An architectural analysis of the OpenXLA compiler pipeline, informed by direct contributions to the OpenXLA codebase.
  2. A detailed characterization of three contemporary accelerator families—TPU v6e, NVIDIA H200, and AMD MI300X—and their interaction with the XLA compiler.
  3. A comparative economic analysis of cross-platform deployment costs and energy efficiency.
  4. A formal 3×3 benchmarking methodology for systematic, reproducible evaluation of ML compiler performance across heterogeneous hardware.

How this work differs from existing benchmarks

Paper