To deliver the “run anywhere” half of OpenXLA’s promise, the ecosystem ships PJRT—a hardware- and framework-independent interface for ML compilers and runtimes. PJRT simplifies new hardware integration by exposing a stable C API that abstracts device management, memory allocation, and executable execution.
Core abstractions
- `PjRtClient`: manages all communication with a plugin and owns its devices and memory spaces.
- `PjRtDevice`: describes a single device, including its unique identifier and its position within a local or global device grid.
- `PjRtMemorySpace`: distinguishes the kinds of memory (e.g. pinned vs. unpinned) associated with specific devices.
- `PjRtBuffer`: holds data on the device in a format optimized for the plugin (proprietary tensor formats, MLIR element attributes, etc.).
- `PjRtCompiler`: takes an input module (such as StableHLO) and returns a `PjRtLoadedExecutable`.
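To make the ownership relationships concrete, here is a minimal Python sketch of how these abstractions relate to one another. The class and method names are simplified stand-ins for the real PJRT C API types, not the actual interface; the data layouts and dispatch logic are placeholders.

```python
# Illustrative model of the PJRT object hierarchy (names are stand-ins,
# not the real C API): a client owns devices and memory spaces, buffers
# live in a memory space, and compilation yields a loaded executable.
from dataclasses import dataclass


@dataclass
class PjRtDevice:
    id: int                 # unique device identifier
    process_index: int = 0  # position within the local/global device grid


@dataclass
class PjRtMemorySpace:
    device: PjRtDevice
    kind: str               # e.g. "device" (unpinned) vs. "pinned_host"


@dataclass
class PjRtBuffer:
    memory_space: PjRtMemorySpace
    data: bytes             # a real plugin stores this in its own layout


@dataclass
class PjRtLoadedExecutable:
    name: str

    def execute(self, args):
        # Placeholder: a real plugin dispatches to hardware here.
        return args


class PjRtClient:
    """Entry point for a plugin; owns the devices and memory spaces."""

    def __init__(self, devices):
        self.devices = devices
        self.memory_spaces = [PjRtMemorySpace(d, "device") for d in devices]

    def buffer_from_host(self, data: bytes, device: PjRtDevice) -> PjRtBuffer:
        space = next(m for m in self.memory_spaces if m.device is device)
        return PjRtBuffer(space, data)

    def compile(self, stablehlo_module: str) -> PjRtLoadedExecutable:
        # A real plugin lowers the StableHLO module to device code here.
        return PjRtLoadedExecutable(name=stablehlo_module)
```

A framework drives the whole lifecycle through the client: create buffers on a device, compile a module, then execute it against those buffers.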
Why this matters for portability
The plugin architecture lets hardware vendors develop support independently from the main OpenXLA or framework repositories. Two concrete examples:
- The Intel Extension for TensorFlow uses the PJRT C API to integrate with JAX and PyTorch without modifying core framework code.
- The AMD ROCm plugin likewise plugs into both frameworks through the same C API.
From a benchmarking perspective, the PJRT layer is exactly what the paper's portability-tax measurement targets: for each hardware platform, we compare the OpenXLA path (JAX → StableHLO → XLA → hardware) against the native path (vLLM/CUDA on NVIDIA, PyTorch/ROCm on AMD, direct HLO on TPU) to quantify the overhead this abstraction introduces.
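The comparison above can be sketched as a simple timing harness. This is not the paper's actual methodology, just an illustration under assumptions: `run_native` and `run_openxla` are hypothetical callables that each execute one inference step on the respective stack, and we report the relative overhead of the OpenXLA path.

```python
import time
from statistics import median


def portability_tax(run_native, run_openxla, iters=50, warmup=5):
    """Relative overhead of the OpenXLA path over the native path.

    run_native / run_openxla are hypothetical callables wrapping one
    inference step on each stack (e.g. vLLM/CUDA vs. JAX -> XLA).
    Returns e.g. 0.10 for a 10% portability tax.
    """
    def bench(fn):
        for _ in range(warmup):
            fn()  # warm caches and trigger any JIT compilation
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - t0)
        return median(samples)  # median is robust to scheduling noise

    native = bench(run_native)
    openxla = bench(run_openxla)
    return (openxla - native) / native
```

Warmup iterations matter here: the first OpenXLA invocation includes compilation, which would otherwise be misattributed to steady-state overhead.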
Details in the paper, Section 4.