<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/" rel="alternate" type="text/html" /><updated>2026-04-18T04:51:45+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/feed.xml</id><title type="html">OpenXLA Benchmark</title><subtitle>Technical architecture and a proposed 3x3 benchmarking methodology for the OpenXLA compiler ecosystem across TPU v6e, NVIDIA H200, and AMD MI300X.
</subtitle><entry><title type="html">Diagnostic Tools and the Economics of Cross-Platform Deployment</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/17/diagnostic-tools-and-economics.html" rel="alternate" type="text/html" title="Diagnostic Tools and the Economics of Cross-Platform Deployment" /><published>2026-04-17T00:00:00+00:00</published><updated>2026-04-17T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/17/diagnostic-tools-and-economics</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/17/diagnostic-tools-and-economics.html"><![CDATA[<p>Two shorter threads in the paper: the diagnostic tooling that makes compiler-level benchmarking possible, and the economic pressures that make cross-platform benchmarking urgent.</p>

<h3 id="xprof-end-to-end-profiling">XProf: end-to-end profiling</h3>

<p>XProf works across JAX, TensorFlow, and PyTorch/XLA. Three components get direct use in the methodology:</p>

<ul>
  <li><strong>Trace Viewer</strong> — host/device execution timeline; identifies communication gaps and idle periods.</li>
  <li><strong>HLO Op Stats</strong> — highlights time-consuming operations; reports GFLOPS/s and rematerialization overhead.</li>
  <li><strong>Memory Profile Viewer</strong> — monitors HBM usage; surfaces peak heap consumption and potential stack exhaustion.</li>
</ul>

<h3 id="hlo-opt-isolated-pass-measurement">hlo-opt: isolated pass measurement</h3>

<p>The <code class="language-plaintext highlighter-rouge">hlo-opt</code> tool executes individual compiler passes independently of the full pipeline. This isolation is what makes the paper’s <strong>optimization-transferability measurement</strong> possible: you can run just <code class="language-plaintext highlighter-rouge">AlgebraicSimplifier</code> or <code class="language-plaintext highlighter-rouge">HloRematerialization</code> on a given input module and attribute a performance delta to that specific pass on that specific backend.</p>

<h3 id="the-diagnostic-toolkit">The diagnostic toolkit</h3>

<table>
  <thead>
    <tr>
      <th>Tool</th>
      <th>Primary Use Case</th>
      <th>Target Platform</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">hlo-opt</code></td>
      <td>Pass development and IR conversion</td>
      <td>CPU, GPU, TPU</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">run_hlo_module</code></td>
      <td>Microbenchmarking HLO snippets</td>
      <td>CPU, GPU, TPU</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">xprof</code></td>
      <td>End-to-end execution profiling</td>
      <td>GPU, TPU</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">multihost_hlo_runner</code></td>
      <td>SPMD and multi-node benchmarking</td>
      <td>Distributed</td>
    </tr>
  </tbody>
</table>

<h3 id="why-cross-platform-benchmarking-is-urgent-now">Why cross-platform benchmarking is urgent now</h3>

<p>SemiAnalysis projects that <strong>by 2030, inference will consume 75% of all AI compute</strong>. At that scale, platform economics become decisive for infrastructure planning.</p>

<p>Vendor benchmarks suggest large cost differentials:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>TPU v6e</th>
      <th>NVIDIA H200</th>
      <th>Advantage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cost per Hour</td>
      <td>~$1.38</td>
      <td>~$2.50+</td>
      <td>TPU (45% cheaper)</td>
    </tr>
    <tr>
      <td>Inference Perf. / $</td>
      <td>4× baseline</td>
      <td>Baseline</td>
      <td>TPU</td>
    </tr>
    <tr>
      <td>Power Efficiency</td>
      <td>60–65% less power</td>
      <td>Baseline</td>
      <td>TPU</td>
    </tr>
    <tr>
      <td>Framework Maturity</td>
      <td>JAX (native)</td>
      <td>CUDA (universal)</td>
      <td>NVIDIA</td>
    </tr>
  </tbody>
</table>

<p>But these are vendor-controlled configurations and may not generalize. Independent validation—which the 3×3 methodology is designed to provide—is necessary to substantiate or qualify the claims.</p>
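<p>To make the table concrete, here is a minimal sketch of how an hourly rate translates into cost per million tokens. Only the hourly rates come from the table above; the throughput figure is a hypothetical placeholder, not a measured number:</p>

```python
# Converts an hourly instance price into $/1M generated tokens at a sustained
# decode throughput. The 900 tokens/s value is a placeholder for illustration.
def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

tpu_v6e = cost_per_million_tokens(1.38, 900)   # ~$0.43 per 1M tokens
h200 = cost_per_million_tokens(2.50, 900)      # ~$0.77 per 1M tokens
```

<p>At equal throughput the ratio reduces to the price ratio (~1.8×); the point of the 3×3 measurements is precisely that throughput is <em>not</em> equal across platforms.</p>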

<p>Full details in <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Sections 6–7.</p>]]></content><author><name></name></author><category term="openxla" /><category term="xprof" /><category term="hlo-opt" /><category term="economics" /><summary type="html"><![CDATA[Two shorter threads in the paper: the diagnostic tooling that makes compiler-level benchmarking possible, and the economic pressures that make cross-platform benchmarking urgent.]]></summary></entry><entry><title type="html">The 3×3 Benchmarking Methodology and Testable Hypotheses</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/15/3x3-benchmarking-methodology.html" rel="alternate" type="text/html" title="The 3×3 Benchmarking Methodology and Testable Hypotheses" /><published>2026-04-15T00:00:00+00:00</published><updated>2026-04-15T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/15/3x3-benchmarking-methodology</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/15/3x3-benchmarking-methodology.html"><![CDATA[<p>The paper formalizes a <strong>3×3 experimental design</strong>: three representative ML workloads evaluated across three hardware backends, for nine distinct cells. Each cell has defined KPIs.</p>

<h3 id="the-matrix">The matrix</h3>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>TPU v6e</th>
      <th>NVIDIA H200</th>
      <th>AMD MI300X</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Task A: LLM Inference</strong> (Llama 3.1 70B)</td>
      <td>TTFT, TPS, $/tok</td>
      <td>TTFT, TPS, $/tok</td>
      <td>TTFT, TPS, $/tok</td>
    </tr>
    <tr>
      <td><strong>Task B: Dense Training</strong> (ViT / ResNet-50)</td>
      <td>TFLOPS, TTC</td>
      <td>TFLOPS, TTC</td>
      <td>TFLOPS, TTC</td>
    </tr>
    <tr>
      <td><strong>Task C: Sparse Training</strong> (DLRM v2)</td>
      <td>SC util, BW</td>
      <td>BW, throughput</td>
      <td>BW, throughput</td>
    </tr>
  </tbody>
</table>

<p>FP8 is an experimental variable on H200 and MI300X. Structured 2:4 sparsity is evaluated where hardware support is available.</p>

<h3 id="four-step-experimental-protocol">Four-step experimental protocol</h3>

<ol>
  <li><strong>Environment and baseline standardization.</strong> Unified JAX + PJRT stack. Input dataset sizes exceed host memory so the benchmark actually measures data movement, not cache hits. Accuracy is validated against reference metrics; per-operation numerical tolerances are documented (XLA’s <code class="language-plaintext highlighter-rouge">erf</code> lowering, for instance, can diverge by several ULPs depending on scalar vs. vectorized code paths).</li>
  <li><strong>HLO capture and isolated micro-benchmarking.</strong> <code class="language-plaintext highlighter-rouge">XLA_FLAGS="--xla_dump_to=/tmp/experiment"</code> captures pre- and post-optimization graphs; <code class="language-plaintext highlighter-rouge">hlo-opt</code> isolates specific passes (e.g., <code class="language-plaintext highlighter-rouge">ReshapeMover</code>) to measure impact across backends.</li>
  <li><strong>End-to-end performance measurement.</strong> XProf traces capture TTFT, tokens/s, time-to-convergence. Power is sampled at 1-second intervals via <code class="language-plaintext highlighter-rouge">nvidia-smi --query-gpu=power.draw</code>, <code class="language-plaintext highlighter-rouge">rocm-smi --showpower</code>, and the Google Cloud Monitoring API for TPU. Measurement stability requires coefficient of variation &lt;5% across trials (following Sada et al. 2025). <strong>Energy efficiency (tokens/s/W)</strong> is reported alongside throughput and latency for all 9 cells.</li>
  <li><strong>Bottleneck attribution.</strong> XProf’s Trace Viewer correlates accelerator gaps with host-side preprocessing. GFLOPS/s for critical kernels (<code class="language-plaintext highlighter-rouge">matmul</code>) are compared across architectures. Results feed a comparative TCO analysis.</li>
</ol>
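<p>The energy-efficiency metric in step 3 reduces to simple arithmetic once the power samples are in hand. A minimal sketch, assuming 1 Hz sampling (as produced by the <code>nvidia-smi</code>/<code>rocm-smi</code> queries above); the sample values are synthetic, not measurements:</p>

```python
# Tokens/s/W = average throughput divided by average power over the window.
# Samples are assumed 1 second apart, so window length = len(samples).
from statistics import mean

def tokens_per_second_per_watt(total_tokens, power_samples_w):
    duration_s = len(power_samples_w)          # 1 Hz sampling
    throughput = total_tokens / duration_s
    return throughput / mean(power_samples_w)

samples = [412.0, 418.5, 415.2, 420.1]         # synthetic watts over a 4 s window
eff = tokens_per_second_per_watt(total_tokens=3600, power_samples_w=samples)
# 900 tokens/s at ~416 W average, i.e. ~2.16 tokens/s/W
```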

<h3 id="compiler-version-sensitivity">Compiler version sensitivity</h3>

<p>All experiments pin a specific XLA commit hash. Individual commits can materially shift results—AMD shuffle promotion to DPP, collective permute barrier placement, cost-model corrections for integer GEMMs. Reproduction efforts should use the same commit or document deviations.</p>

<h3 id="statistical-protocol">Statistical protocol</h3>

<ul>
  <li>10 warm-up iterations (discarded).</li>
  <li>≥30 timed iterations, extended until the 95% CI for the primary metric is within ±3% of the sample mean.</li>
  <li>Report median, mean, standard deviation, and 95% CI.</li>
  <li>Report both wall-clock and device-only time.</li>
  <li>Record XLA commit hash, PJRT plugin version, driver version, cloud instance type, HBM temperature at run start, co-tenancy warnings.</li>
  <li>Repeat across ≥2 independent cloud instances (hardware lottery control).</li>
  <li>Runs &gt;3σ from the mean are flagged and excluded from primary statistics.</li>
</ul>
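<p>The CI stopping rule and the 3σ exclusion are straightforward to encode. A sketch using a normal approximation (z = 1.96) rather than a t-distribution, for brevity:</p>

```python
# Stopping rule: keep timing until the 95% CI half-width is within ±3% of the
# sample mean. Outlier rule: flag runs more than 3 standard deviations out.
from statistics import mean, stdev

def ci_halfwidth_ok(times, rel_tol=0.03):
    m = mean(times)
    half = 1.96 * stdev(times) / len(times) ** 0.5   # 95% CI half-width
    return half <= rel_tol * m

def flag_outliers(times):
    m, s = mean(times), stdev(times)
    return [t for t in times if abs(t - m) > 3 * s]  # excluded from primary stats
```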

<h3 id="three-testable-hypotheses">Three testable hypotheses</h3>

<ul>
  <li><strong>H1 (Abstraction Overhead).</strong> StableHLO/PJRT overhead vs. native paths is <strong>&lt;5% for compute-bound</strong> workloads (dense matmul) but potentially <strong>&gt;15% for memory-bound</strong> workloads (sparse embedding lookups) where data movement is less amenable to fusion.</li>
  <li><strong>H2 (Optimization Transferability).</strong> Target-independent optimizations—particularly fusion—transfer unevenly. Largest gains on TPU v6e (systolic dataflow benefits most from reduced memory round-trips); smaller on MI300X (5.3 TB/s HBM partially masks the memory wall). Confounder: XLA’s internal cost model has known inaccuracies for integer GEMMs (up to 576% prediction error at medium shapes).</li>
  <li><strong>H3 (SPMD Scaling Efficiency).</strong> Shardy-partitioned SPMD scales <strong>sub-linearly</strong> across multi-node configs. Efficiency gap widest on NVIDIA (NVLink → InfiniBand transition); narrowest on TPU pods (ICI torus provides uniform bisection bandwidth).</li>
</ul>

<h3 id="limitations-deliberate">Limitations (deliberate)</h3>

<ul>
  <li>No Groq LPU, Cerebras WSE, or Intel Gaudi 3 (different paradigms).</li>
  <li>No fine-tuning, RL, or MoE workloads (different compilation challenges).</li>
  <li>JAX only as frontend (cross-framework confounds avoided).</li>
  <li>Point-in-time snapshot at a specific XLA commit.</li>
</ul>

<p>Full methodology and reporting schema: <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Sections 8–9.</p>]]></content><author><name></name></author><category term="benchmarking" /><category term="methodology" /><category term="hypotheses" /><summary type="html"><![CDATA[The paper formalizes a 3×3 experimental design: three representative ML workloads evaluated across three hardware backends, for nine distinct cells. Each cell has defined KPIs.]]></summary></entry><entry><title type="html">Three Accelerator Paradigms: TPU v6e, NVIDIA H200, and AMD MI300X</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/12/hardware-tpu-nvidia-amd.html" rel="alternate" type="text/html" title="Three Accelerator Paradigms: TPU v6e, NVIDIA H200, and AMD MI300X" /><published>2026-04-12T00:00:00+00:00</published><updated>2026-04-12T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/12/hardware-tpu-nvidia-amd</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/12/hardware-tpu-nvidia-amd.html"><![CDATA[<p>The 3×3 study spans three distinct accelerator paradigms. Each interacts with XLA’s fusion, buffer analysis, and partitioning strategies differently.</p>

<h3 id="google-cloud-tpu-v6e-trillium--systolic-array">Google Cloud TPU v6e (Trillium) — systolic array</h3>

<p>Google’s TPUs are specialized matrix processors built around <strong>systolic arrays</strong>: grids of multiply-accumulators (MACs) that process matrix operations in a weight-stationary dataflow.</p>

<ul>
  <li><strong>256×256 systolic array</strong>, 65,536 MAC ops per cycle (up from 128×128 on v5).</li>
  <li>Optimized for Transformers: ~2.8× performance and ~2.1× perf/watt over the prior generation.</li>
  <li><strong>SparseCore</strong>: a dataflow processor dedicated to sparse ops such as recommendation-system embedding lookups. SparseCores operate in parallel with TensorCores, enabling pipelined dense/sparse computation.</li>
</ul>

<h3 id="nvidia-h200-hopper--simt">NVIDIA H200 (Hopper) — SIMT</h3>

<p>NVIDIA’s integration leverages the maturity of the CUDA ecosystem. Two features stand out in the paper’s context:</p>

<ul>
  <li><strong>NVSHMEM</strong> (PGAS-style communication) integrated with XLA delivers up to <strong>36% speedup over NCCL</strong> at sequence lengths up to 256K tokens.</li>
  <li><strong>CUDA Graphs</strong> support in XLA amortizes kernel launch overhead by capturing operation sequences into a single graph executable—particularly valuable for models with many small, short-lived ops.</li>
</ul>

<h3 id="amd-instinct-mi300x--cdna">AMD Instinct MI300X — CDNA</h3>

<p>AMD, a founding OpenXLA member, integrates via the ROCm stack:</p>

<ul>
  <li><strong>CDNA 3</strong> with Matrix Core Technology; supports FP8 and MXFP4.</li>
  <li><strong>64-wide wavefronts</strong> (NVIDIA uses 32-wide warps); each compute unit has dedicated Matrix Cores for accelerated GEMM.</li>
  <li><strong>RCCL</strong> (AMD’s fork of NCCL) runs over Infinity Fabric rather than NVLink.</li>
  <li>Intra-wavefront primitives: <strong>DPP</strong> and <strong>ds_swizzle</strong> instead of NVIDIA warp-shuffle. Recent XLA contributions promote generic <code class="language-plaintext highlighter-rouge">gpu.shuffle</code> operations to AMD-specific DPP register-to-register instructions for reduction kernels, eliminating shared-memory overhead.</li>
</ul>

<p>Meta has deployed MI300X for Llama 3 and Llama 4 inference.</p>

<h3 id="specs-at-a-glance">Specs at a glance</h3>

<table>
  <thead>
    <tr>
      <th>Specification</th>
      <th>TPU v6e (Trillium)</th>
      <th>NVIDIA H200</th>
      <th>AMD MI300X</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Peak BF16 Compute</td>
      <td>918 TFLOPS/chip</td>
      <td>989 TFLOPS</td>
      <td>1,307 TFLOPS</td>
    </tr>
    <tr>
      <td>HBM Capacity</td>
      <td>32 GB</td>
      <td>141 GB</td>
      <td>192 GB</td>
    </tr>
    <tr>
      <td>HBM Bandwidth</td>
      <td>1.6 TB/s</td>
      <td>4.8 TB/s</td>
      <td>5.3 TB/s</td>
    </tr>
    <tr>
      <td>Interconnect</td>
      <td>2D Torus (ICI)</td>
      <td>NVLink</td>
      <td>Infinity Fabric</td>
    </tr>
    <tr>
      <td>Pod / Node Scale</td>
      <td>256 chips</td>
      <td>8 GPUs (DGX)</td>
      <td>8 GPUs</td>
    </tr>
    <tr>
      <td>Sparse Acceleration</td>
      <td>SparseCore (2/chip)</td>
      <td>Structured Sparsity</td>
      <td>Sparse Matrix Ops</td>
    </tr>
  </tbody>
</table>
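<p>One way to read the compute and bandwidth rows together is the roofline "ridge point": the arithmetic intensity below which a kernel is memory-bound. A back-of-envelope sketch using only the table's numbers:</p>

```python
# Ridge point (FLOP/byte) = peak compute / memory bandwidth, from the
# specs table above. Kernels below this intensity are bandwidth-limited.
def ridge_point(peak_tflops, hbm_tb_per_s):
    return (peak_tflops * 1e12) / (hbm_tb_per_s * 1e12)   # FLOP per byte

tpu_v6e = ridge_point(918, 1.6)     # ~574 FLOP/byte
h200 = ridge_point(989, 4.8)        # ~206 FLOP/byte
mi300x = ridge_point(1307, 5.3)     # ~247 FLOP/byte
```

<p>The v6e's much higher ridge point is one intuition behind H2: fusion raises arithmetic intensity, and the platform with the highest ridge point stands to gain the most from it.</p>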

<p>The cross-paradigm comparison is the point: a single compiler adapting to three fundamentally different hardware philosophies. Hypothesis H2 predicts the target-independent passes (fusion, CSE) don’t help these architectures equally.</p>

<p>See <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Section 5, and the extended TPU v5p vs. v6e comparison in Appendix A.</p>]]></content><author><name></name></author><category term="hardware" /><category term="tpu" /><category term="nvidia" /><category term="amd" /><summary type="html"><![CDATA[The 3×3 study spans three distinct accelerator paradigms. Each interacts with XLA’s fusion, buffer analysis, and partitioning strategies differently.]]></summary></entry><entry><title type="html">PJRT: The Pluggable Just-in-Time Runtime</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/09/pjrt-pluggable-hardware-interface.html" rel="alternate" type="text/html" title="PJRT: The Pluggable Just-in-Time Runtime" /><published>2026-04-09T00:00:00+00:00</published><updated>2026-04-09T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/09/pjrt-pluggable-hardware-interface</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/09/pjrt-pluggable-hardware-interface.html"><![CDATA[<p>To deliver the “run anywhere” half of OpenXLA’s promise, the ecosystem ships <strong>PJRT</strong>—a hardware- and framework-independent interface for ML compilers and runtimes. PJRT simplifies new hardware integration by exposing a stable C API that abstracts device management, memory allocation, and executable execution.</p>

<h3 id="core-abstractions">Core abstractions</h3>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtClient</code></strong> — manages all communication and owns the devices and memory spaces for a given plugin.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtDevice</code></strong> — describes a single device, including its unique identifier and its location within a local or global grid.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtMemorySpace</code></strong> — distinguishes between unpinned and pinned memory associated with specific devices.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtBuffer</code></strong> — holds data on the device in a format optimized for the plugin (proprietary tensor formats, MLIR element attributes, etc.).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">PjRtCompiler</code></strong> — takes an input module (such as StableHLO) and returns a <code class="language-plaintext highlighter-rouge">PjRtLoadedExecutable</code>.</li>
</ul>
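<p>The relationships among these abstractions can be mocked in a few lines of Python. This is a toy illustration only; the real PJRT surface is a stable C API (with C++ classes such as <code>PjRtClient</code>), and every name below is a stand-in rather than the actual interface:</p>

```python
# Toy mock of the PJRT object model: a client owns devices, and buffers
# live on a specific device the client manages. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Device:
    device_id: int                      # unique per-client identifier

@dataclass
class Buffer:
    device: Device
    data: bytes                         # plugin-specific layout in reality

@dataclass
class Client:
    devices: list = field(default_factory=list)

    def buffer_from_host(self, data, device):
        assert device in self.devices   # a client only manages its own devices
        return Buffer(device=device, data=data)

client = Client(devices=[Device(0), Device(1)])
buf = client.buffer_from_host(b"\x00\x01", client.devices[0])
```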

<h3 id="why-this-matters-for-portability">Why this matters for portability</h3>

<p>The plugin architecture lets hardware vendors develop support <strong>independently</strong> from the main OpenXLA or framework repositories. Two concrete examples:</p>

<ul>
  <li><strong>Intel Extension for TensorFlow</strong> and the <strong>AMD ROCm plugin</strong> both use the PJRT C API to integrate seamlessly with JAX and PyTorch without modifying core framework code.</li>
</ul>

<p>From the benchmarking perspective, the PJRT layer is exactly what the paper’s <strong>portability tax</strong> measurement targets: for each hardware platform we compare the OpenXLA path (JAX → StableHLO → XLA → hardware) against the native path (vLLM/CUDA on NVIDIA, PyTorch/ROCm on AMD, direct HLO on TPU) to quantify the overhead this abstraction introduces.</p>

<p>Details in <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Section 4.</p>]]></content><author><name></name></author><category term="openxla" /><category term="pjrt" /><category term="runtime" /><summary type="html"><![CDATA[To deliver the “run anywhere” half of OpenXLA’s promise, the ecosystem ships PJRT—a hardware- and framework-independent interface for ML compilers and runtimes. PJRT simplifies new hardware integration by exposing a stable C API that abstracts device management, memory allocation, and executable execution.]]></summary></entry><entry><title type="html">The XLA Compiler Pipeline: Target-Independent Passes and LLVM Lowering</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/06/xla-compiler-pipeline.html" rel="alternate" type="text/html" title="The XLA Compiler Pipeline: Target-Independent Passes and LLVM Lowering" /><published>2026-04-06T00:00:00+00:00</published><updated>2026-04-06T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/06/xla-compiler-pipeline</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/06/xla-compiler-pipeline.html"><![CDATA[<p>The XLA compiler splits cleanly into <strong>target-independent analysis passes</strong> and <strong>target-specific code generation</strong>. This separation lets high-level optimizations benefit every backend while still exploiting the microarchitectural features of specific hardware.</p>

<h3 id="target-independent-optimization">Target-independent optimization</h3>

<p>During the initial compilation stage, XLA performs optimizations on the StableHLO graph that are hardware-agnostic:</p>

<ul>
  <li><strong>Common Subexpression Elimination (CSE).</strong> Redundant computations producing identical results are identified and consolidated.</li>
  <li><strong>Algebraic Simplification.</strong> Mathematically equivalent but computationally cheaper operation sequences are substituted.</li>
  <li><strong>Operation Fusion.</strong> Multiple subgraphs are combined into a single kernel, eliminating short-lived operations and intermediate buffers. For <code class="language-plaintext highlighter-rouge">S = reduce_sum(X + Y * Z)</code>, the multiplication, addition, and reduction fuse into one kernel that streams intermediate values through registers instead of writing them back to HBM.</li>
</ul>
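<p>The fusion example can be mimicked in plain Python: the unfused form materializes intermediate lists (stand-ins for HBM round-trip buffers), while the fused form streams each element through one accumulator (a stand-in for registers):</p>

```python
# Unfused: two intermediate buffers are written before the reduction runs.
def reduce_sum_unfused(x, y, z):
    prod = [yi * zi for yi, zi in zip(y, z)]        # intermediate buffer 1
    summed = [xi + pi for xi, pi in zip(x, prod)]   # intermediate buffer 2
    return sum(summed)

# Fused: multiply, add, and reduce in a single pass with no intermediates.
def reduce_sum_fused(x, y, z):
    acc = 0.0
    for xi, yi, zi in zip(x, y, z):
        acc += xi + yi * zi
    return acc
```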

<p>Beyond these passes, the compiler performs <strong>buffer analysis</strong> to allocate runtime memory and <strong>shape specialization</strong> to enable more aggressive constant propagation.</p>

<h3 id="target-specific-code-generation">Target-specific code generation</h3>

<p>After target-independent passes, the compiler converts StableHLO into an internal HLO dialect and dispatches to a hardware-specific backend. CPU and GPU backends use the <strong>LLVM framework</strong> for low-level IR generation. The backend pattern-matches operation combinations to optimized library calls (cuDNN, MKL) and determines optimal partitioning of computations into parallel streams.</p>

<h3 id="the-full-pipeline">The full pipeline</h3>

<table>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Dialect / Format</th>
      <th>Optimization Level</th>
      <th>Core Tools</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Frontend</td>
      <td>Python / NumPy (JAX)</td>
      <td>Framework-level</td>
      <td>JAX / TF / PyTorch</td>
    </tr>
    <tr>
      <td>Intermediate</td>
      <td>StableHLO / CHLO</td>
      <td>Target-independent</td>
      <td>MLIR / XLA Passes</td>
    </tr>
    <tr>
      <td>Lowering</td>
      <td>HLO Dialect</td>
      <td>Target-specific</td>
      <td>XLA Backend</td>
    </tr>
    <tr>
      <td>Backend</td>
      <td>LLVM IR / SPIR-V</td>
      <td>Microarchitectural</td>
      <td>LLVM / NVCC / ROCm</td>
    </tr>
    <tr>
      <td>Native</td>
      <td>PTX / Machine Code</td>
      <td>Hardware-specific</td>
      <td>LLVM / Driver</td>
    </tr>
  </tbody>
</table>

<p>The key insight for the benchmarking methodology: because fusion/CSE/algebraic simplification are identical across backends, we can measure whether each pass helps equally on systolic arrays vs. SIMT vs. CDNA. Hypothesis H2 predicts <em>uneven</em> transfer—largest gains on TPU v6e (memory-wall-sensitive systolic dataflow), smaller on MI300X (5.3 TB/s HBM partially masks the memory wall).</p>

<p>For details, see <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Section 3.</p>]]></content><author><name></name></author><category term="openxla" /><category term="xla" /><category term="compilers" /><category term="llvm" /><summary type="html"><![CDATA[The XLA compiler splits cleanly into target-independent analysis passes and target-specific code generation. This separation lets high-level optimizations benefit every backend while still exploiting the microarchitectural features of specific hardware.]]></summary></entry><entry><title type="html">The OpenXLA IR Hierarchy: StableHLO, CHLO, and VHLO</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/04/ir-hierarchy-stablehlo-chlo-vhlo.html" rel="alternate" type="text/html" title="The OpenXLA IR Hierarchy: StableHLO, CHLO, and VHLO" /><published>2026-04-04T00:00:00+00:00</published><updated>2026-04-04T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/04/ir-hierarchy-stablehlo-chlo-vhlo</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/04/ir-hierarchy-stablehlo-chlo-vhlo.html"><![CDATA[<p>OpenXLA’s effectiveness as a portability layer hinges on a multi-level dialect hierarchy within the MLIR framework. This post summarizes how the hierarchy is organized and why the design choices matter.</p>

<h3 id="stablehlo-the-portability-contract">StableHLO: the portability contract</h3>

<p>StableHLO is the bridge between frontend frameworks and backend compilers such as XLA and IREE. Any framework capable of producing StableHLO programs is compatible with any compiler that can consume them.</p>

<p>The specification defines approximately <strong>100 operations</strong> and commits to long-term stability: <strong>five years of backward compatibility</strong> and <strong>two years of forward compatibility</strong>. Stability is managed via <strong>VHLO</strong> (Versioned StableHLO), an add-only dialect that snapshots the StableHLO dialect at specific points in time, so consumers can deserialize and upgrade payloads from older versions of the stack.</p>

<h3 id="chlo-client-level-abstractions">CHLO: client-level abstractions</h3>

<p>The CHLO (Client HLO) dialect aligns with the API surface of <code class="language-plaintext highlighter-rouge">XlaBuilder</code> in C++. It models “syntactic sugar” and high-level mathematical compositions before they are materialized into lower-level dialects. This hierarchy enables hardware-independent program simplification and refines dynamically-shaped programs using concrete input arguments.</p>

<h3 id="quantization-semantics">Quantization semantics</h3>

<p>StableHLO’s types emphasize domain-specific tensor requirements. Quantized element types are governed by rigorous constraints:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">storage_type</code> (integer), <code class="language-plaintext highlighter-rouge">expressed_type</code> (floating-point), <code class="language-plaintext highlighter-rouge">scales</code> (floating-point constants).</li>
  <li>For per-axis quantization, <code class="language-plaintext highlighter-rouge">quantization_dimension &lt; rank(T)</code>.</li>
  <li>Constraints C1–C13 in the spec enforce that <code class="language-plaintext highlighter-rouge">storage_min</code>/<code class="language-plaintext highlighter-rouge">storage_max</code> fit within the storage type and that scales are strictly positive and finite.</li>
</ul>
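<p>These constraints are easy to sketch as a checker. The function below is illustrative only; it encodes just the three properties named in the bullets, not the spec's full C1–C13 semantics:</p>

```python
# Checks that a quantized element type's parameters are well-formed:
# storage range fits the storage type, scales are strictly positive and
# finite, and (for per-axis quantization) the axis index is within rank.
import math

INT_RANGES = {"i8": (-128, 127), "i16": (-32768, 32767)}   # example storage types

def check_quantized_type(storage_type, storage_min, storage_max,
                         scales, rank, quantization_dimension=None):
    lo, hi = INT_RANGES[storage_type]
    if not (lo <= storage_min < storage_max <= hi):          # fits storage type
        return False
    if not all(s > 0 and math.isfinite(s) for s in scales):  # positive, finite
        return False
    if quantization_dimension is not None:                   # per-axis case
        return 0 <= quantization_dimension < rank
    return True
```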

<p>Enforcing these semantics at the compiler level prevents numerical issues common in lower-precision data formats.</p>

<h3 id="shardy-unified-tensor-partitioning">Shardy: unified tensor partitioning</h3>

<p><a href="https://github.com/openxla/shardy">Shardy</a> is an MLIR-based partitioning system from the merged GSPMD and PartIR teams. It provides a unified partitioning dialect integrated into the StableHLO layer, delivering consistent SPMD partitioning semantics across all OpenXLA backends. Shardy is directly relevant to the paper’s H3 hypothesis on SPMD scaling efficiency.</p>

<h3 id="deprecation-note">Deprecation note</h3>

<p>The internal <strong>MHLO</strong> (Meta-HLO) dialect is being deprecated. Useful passes (canonicalization, folder patterns) are migrating into StableHLO itself, ensuring that all hardware-independent graph simplifications are available to every PJRT plugin regardless of target compiler IR.</p>

<p>See the full discussion in <a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">the paper</a>, Section 2.</p>]]></content><author><name></name></author><category term="openxla" /><category term="mlir" /><category term="stablehlo" /><summary type="html"><![CDATA[OpenXLA’s effectiveness as a portability layer hinges on a multi-level dialect hierarchy within the MLIR framework. This post summarizes how the hierarchy is organized and why the design choices matter.]]></summary></entry><entry><title type="html">Paper Overview: The OpenXLA Ecosystem and a 3×3 Benchmarking Protocol</title><link href="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/01/paper-overview.html" rel="alternate" type="text/html" title="Paper Overview: The OpenXLA Ecosystem and a 3×3 Benchmarking Protocol" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/01/paper-overview</id><content type="html" xml:base="https://kredd2506.github.io/OpenXLA-benchmark/OpenXLA-benchmark/2026/04/01/paper-overview.html"><![CDATA[<p>The paper <em>“Technical Architecture and Systematic Benchmarking of the OpenXLA Ecosystem”</em> is a proposed research protocol: the architectural analysis and benchmarking methodology are complete, and experimental results are forthcoming pending hardware access.</p>

<h3 id="the-problem">The problem</h3>

<p>The proliferation of frontend frameworks (JAX, PyTorch, TensorFlow) combined with an increasingly heterogeneous hardware landscape (GPUs, TPUs, custom ASICs) has created a fragmentation problem that impedes portable and efficient model deployment. OpenXLA—developed jointly by Google, AMD, Intel, NVIDIA, and AWS—addresses this through a unified compiler ecosystem built on <strong>StableHLO</strong> as a portability layer and <strong>PJRT</strong> as a pluggable hardware interface.</p>

<h3 id="what-the-paper-delivers">What the paper delivers</h3>

<ol>
  <li>An architectural analysis of the OpenXLA compiler pipeline, informed by direct contributions to the OpenXLA codebase.</li>
  <li>A detailed characterization of three contemporary accelerator families—TPU v6e (Trillium), NVIDIA H200 (Hopper), and AMD MI300X—and their interaction with the XLA compiler.</li>
  <li>A comparative economic analysis of cross-platform deployment costs and energy efficiency.</li>
  <li>A formal 3×3 benchmarking methodology: three representative workloads (LLM inference, dense training, sparse embedding training) evaluated across all three backends.</li>
</ol>

<h3 id="five-ways-this-differs-from-existing-benchmarks">Five ways this differs from existing benchmarks</h3>

<ul>
  <li><strong>Benchmarks the compiler, not just the hardware.</strong> Uses <code class="language-plaintext highlighter-rouge">hlo-opt</code> ablation to isolate fusion, CSE, and algebraic simplification.</li>
  <li><strong>Measures the portability tax.</strong> Compares OpenXLA (JAX → StableHLO → XLA → hardware) against native paths (vLLM/CUDA, PyTorch/ROCm, direct HLO on TPU).</li>
  <li><strong>Quantifies optimization transferability.</strong> Measures whether the same passes help equally on systolic array (TPU), SIMT (NVIDIA), and CDNA (AMD).</li>
  <li><strong>Exposes compiler version sensitivity.</strong> All experiments pin a specific XLA commit hash.</li>
  <li><strong>Covers three accelerator paradigms</strong> rather than the two-platform designs typical of prior work.</li>
</ul>

<h3 id="read-the-paper">Read the paper</h3>

<ul>
  <li><a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.pdf">paper1.pdf</a></li>
  <li><a href="https://kredd2506.github.io/OpenXLA-benchmark/paper1.tex">paper1.tex</a> (LaTeX source)</li>
</ul>]]></content><author><name></name></author><category term="openxla" /><category term="overview" /><category term="benchmarking" /><summary type="html"><![CDATA[The paper “Technical Architecture and Systematic Benchmarking of the OpenXLA Ecosystem” is a proposed research protocol: the architectural analysis and benchmarking methodology are complete, and experimental results are forthcoming pending hardware access.]]></summary></entry></feed>