[clang] [llvm] [AArch64] Add support for Qualcomm Oryon processor (PR #91022)

Wed May 15 21:53:39 PDT 2024

================
@@ -0,0 +1,1664 @@
+//=- AArch64SchedOryon.td - Nuvia Inc Oryon CPU 001 ---*- tablegen -*-=//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines the scheduling model for Nuvia Inc Oryon
+// family of processors.
+//
+//===----------------------------------------------------------------------===//
+
+//===----------------------------------------------------------------------===//
+// Pipeline Description.
+
+def OryonModel : SchedMachineModel {
+  let IssueWidth            =  14; // 14 micro-ops dispatched at a time. IXU=6, LSU=4, VXU=4
+  let MicroOpBufferSize     = 376; // 192 (48x4) entries in micro-op re-order buffer in VXU.
+                                   // 120 ((20+20)x3) entries in micro-op re-order buffer in IXU
+                                   // 64  (16+16)x2 re-order buffer in LSU
+                                   // total 376
+  let LoadLatency           =   4; // 4 cycle Load-to-use from L1D$
+                                   // LSU=5 NEON load
+  let MispredictPenalty     =  13; // 13 cycles for mispredicted branch.
+  // Determined via a mix of micro-arch details and experimentation.
+  let LoopMicroOpBufferSize =   0; // Do not have a LoopMicroOpBuffer
----------------
asb wrote:

Although it may not be microarchitecturally accurate, I wonder whether you've benchmarked setting LoopMicroOpBufferSize to a non-zero value. Unless a target overrides the decision, partial and runtime loop unrolling are only enabled when LoopMicroOpBufferSize is non-zero (and this is the only way the value is currently queried and used). AArch64 only overrides that decision for in-order scheduling models. If you look at the other in-tree models that set this value, you'll see it has become somewhat divorced from microarchitectural reality: a number of the AArch64 models set it based on instruction queue size, or note that they just copied the value from the A57 model. On the X86 side, it's set to 50-72 for the modern Intel models and even up to 512 for Zen (with a note that it should be higher, but the compile-time impact is too high).
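
To make the gating concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the in-tree code: `SchedModelSketch`, `UnrollPrefsSketch` and `defaultUnrollPrefs` are hypothetical stand-ins, the value 16 is purely illustrative, and the real hook has further conditions that are omitted here.

```cpp
#include <cassert>

// Hypothetical stand-in for the one scheduling-model field under discussion.
struct SchedModelSketch {
  unsigned LoopMicroOpBufferSize = 0;
};

// Hypothetical stand-in for the unrolling preferences a target can adjust.
struct UnrollPrefsSketch {
  bool Partial = false;          // partial unrolling allowed
  bool Runtime = false;          // runtime unrolling allowed
  unsigned PartialThreshold = 0; // size budget for a partially unrolled body
};

// Simplified model of the default behaviour: with a zero buffer size the
// preferences are left untouched, so partial and runtime unrolling stay off.
inline UnrollPrefsSketch defaultUnrollPrefs(const SchedModelSketch &SM) {
  UnrollPrefsSketch UP;
  if (SM.LoopMicroOpBufferSize == 0)
    return UP;
  UP.Partial = true;
  UP.Runtime = true;
  UP.PartialThreshold = SM.LoopMicroOpBufferSize;
  return UP;
}

// Usage: the value from this patch versus an illustrative non-zero value.
inline void exampleUsage() {
  SchedModelSketch AsInPatch; // LoopMicroOpBufferSize = 0
  assert(!defaultUnrollPrefs(AsInPatch).Partial);

  SchedModelSketch Tuned;
  Tuned.LoopMicroOpBufferSize = 16; // illustrative only, not a recommendation
  assert(defaultUnrollPrefs(Tuned).Runtime);
}
```

In this simplified view the field acts as an on/off switch plus a size budget rather than as a description of the hardware, which is why benchmarking a non-zero value may be worthwhile even without a loop buffer.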

https://github.com/llvm/llvm-project/pull/91022