Technical Architecture and Systematic Benchmarking of the OpenXLA Ecosystem
DRAFT v2 — April 2026
A cross-platform analysis of modern ML compilers: an architectural study of
the OpenXLA pipeline (CHLO → StableHLO → XLA → LLVM) paired
with a proposed 3×3 benchmarking protocol evaluating three workloads
(LLM inference, dense training, sparse embedding training) across TPU v6e,
NVIDIA H200, and AMD MI300X.
Status: Proposed research protocol. Architectural analysis
and methodology are complete; experimental results are forthcoming pending
hardware access.
Read the paper (PDF) · LaTeX source · About this work
17 Apr 2026
Two shorter threads in the paper: the diagnostic tooling that makes compiler-level benchmarking possible, and the economic pressures that make cross-platform benchmarking urgent.
15 Apr 2026
The paper formalizes a 3×3 experimental design: three representative ML workloads evaluated across three hardware backends, yielding nine distinct cells, each with its own defined KPIs.
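The grid itself can be sketched directly. The workload and backend names below come from the abstract above; the KPI names are placeholders for illustration only (the paper defines the actual per-cell metrics):

```python
from itertools import product

# Workloads and hardware backends named in the abstract.
WORKLOADS = ("llm_inference", "dense_training", "sparse_embedding_training")
BACKENDS = ("tpu_v6e", "nvidia_h200", "amd_mi300x")

# Placeholder KPI names -- the paper defines the real per-cell metrics.
KPIS = ("throughput", "step_time", "peak_memory")

def make_grid():
    """Enumerate the nine (workload, backend) cells with empty KPI slots."""
    return {
        (w, b): dict.fromkeys(KPIS)
        for w, b in product(WORKLOADS, BACKENDS)
    }

grid = make_grid()
print(len(grid))  # 9 cells: three workloads x three backends
```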
12 Apr 2026
The 3×3 study spans three distinct accelerator paradigms. Each interacts with XLA’s fusion, buffer analysis, and partitioning strategies differently.
09 Apr 2026
To deliver the “run anywhere” half of OpenXLA’s promise, the ecosystem ships PJRT—a hardware- and framework-independent interface for ML compilers and runtimes. PJRT simplifies new hardware integration by exposing a stable C API that abstracts device management, memory allocation, and executable loading and execution.
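The division of responsibility is easier to see in miniature. The sketch below is a hypothetical Python rendering of the concerns the PJRT C API covers (device management, memory allocation, executable execution); none of these class or method names are actual PJRT symbols:

```python
from typing import Protocol, Sequence

class Buffer(Protocol):
    """Opaque handle to device memory (hypothetical)."""

class Executable(Protocol):
    """Compiled artifact ready to run on a device (hypothetical)."""
    def execute(self, args: Sequence[Buffer]) -> Sequence[Buffer]: ...

class PluginClient(Protocol):
    """The concerns a PJRT-style plugin implements so that any framework
    can drive any backend through one stable surface."""
    def devices(self) -> Sequence[str]: ...                              # device management
    def buffer_from_host(self, data: bytes, device: str) -> Buffer: ...  # memory allocation
    def compile(self, stablehlo_text: str) -> Executable: ...            # executable execution

class ToyCpuClient:
    """In-process stand-in: bytes act as 'device buffers', and every
    compiled executable is the identity function."""
    def devices(self):
        return ["cpu:0"]
    def buffer_from_host(self, data, device):
        return data
    def compile(self, stablehlo_text):
        class _Identity:
            def execute(self, args):
                return list(args)
        return _Identity()

client = ToyCpuClient()
exe = client.compile("module { }")
out = exe.execute([client.buffer_from_host(b"\x2a", "cpu:0")])
print(out)  # [b'*']
```

A real plugin would back these methods with driver calls, but the framework-facing shape stays the same, which is the point of the abstraction.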
06 Apr 2026
The XLA compiler splits cleanly into target-independent analysis passes and target-specific code generation. This separation lets high-level optimizations benefit every backend while still exploiting the microarchitectural features of specific hardware.
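JAX exposes both halves of this pipeline, which makes the split easy to inspect on whatever backend is installed (CPU in this sketch): `lower()` stops at the target-independent StableHLO module, while `compile()` runs the backend-specific passes.

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return jnp.tanh(x) * 2.0 + 1.0

x = jnp.ones((8,), jnp.float32)

# Target-independent half: the StableHLO module handed to XLA,
# identical regardless of which backend will eventually run it.
lowered = f.lower(x)
print(lowered.as_text())

# Target-specific half: HLO after the attached backend's optimization
# pipeline (fusion and layout decisions differ per device).
compiled = lowered.compile()
print(compiled.as_text())
```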
04 Apr 2026
OpenXLA’s effectiveness as a portability layer hinges on a multi-level dialect hierarchy within the MLIR framework. This post summarizes how the hierarchy is organized and why the design choices matter.
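As a taste of the middle of that hierarchy, below is an illustrative StableHLO module (hand-written for this post, not tool-generated): frameworks lower their ops into this versioned dialect, and every backend consumes it from there.

```mlir
// Illustrative StableHLO: element-wise ops on a statically shaped tensor.
module @example {
  func.func @scale_add(%x: tensor<4xf32>, %y: tensor<4xf32>) -> tensor<4xf32> {
    %0 = stablehlo.multiply %x, %y : tensor<4xf32>
    %1 = stablehlo.add %0, %y : tensor<4xf32>
    return %1 : tensor<4xf32>
  }
}
```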