[clang] [llvm] [AArch64] Add support for Qualcomm Oryon processor (PR #91022)
Alex Bradbury via llvm-commits
llvm-commits at lists.llvm.org
Thu May 23 06:44:44 PDT 2024
================
@@ -0,0 +1,1664 @@
+//=- AArch64SchedOryon.td - Nuvia Inc Oryon CPU 001 ---*- tablegen -*-=//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines the scheduling model for Nuvia Inc Oryon
+// family of processors.
+//
+//===----------------------------------------------------------------------===//
+
+//===----------------------------------------------------------------------===//
+// Pipeline Description.
+
+def OryonModel : SchedMachineModel {
+ let IssueWidth = 14; // 14 micro-ops dispatched at a time. IXU=6, LSU=4, VXU=4
+ let MicroOpBufferSize = 376; // 192 (48x4) entries in micro-op re-order buffer in VXU.
+ // 120 ((20+20)x3) entries in micro-op re-order buffer in IXU
+ // 64 (16+16)x2 re-order buffer in LSU
+ // total 376
+ let LoadLatency = 4; // 4 cycle Load-to-use from L1D$
+ // LSU=5 NEON load
+ let MispredictPenalty = 13; // 13 cycles for mispredicted branch.
+ // Determined via a mix of micro-arch details and experimentation.
+ let LoopMicroOpBufferSize = 0; // Do not have a LoopMicroOpBuffer
----------------
asb wrote:
Thanks Joel. FWIW there's been some development in this setting since I posted this. 54e52aa5ebe68de122a3fe6064e0abef97f6b8e0 reduced the LoopMicroOpBufferSize for Zen to 96 (Zen3) and 108 (Zen4) so they're no longer outliers with much larger values than anyone else.
I totally agree that in-order cores will, in general, be more sensitive to scheduling changes. I'll note, though, that because runtime and partial loop unrolling are disabled unless this field is set, I'd see the impact as slightly more than just scheduling. In some cases further optimisations trigger on the (partially) unrolled IR, which can lead to better code; for example, setting this reduced the dynamic instruction count on some SPEC benchmarks for a RISC-V out-of-order scheduling model. If you do experiment with the value, I'd be interested in your findings.
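To make "setting this" concrete: a model opts in simply by giving LoopMicroOpBufferSize a nonzero value, which the generic unrolling-preferences logic then treats as a rough cap on how many micro-ops a runtime/partially unrolled loop body may contain. A minimal, purely illustrative TableGen sketch (the model name and every value below are made up, not a recommendation for Oryon):

  // Hypothetical model fragment; all values are illustrative only.
  def ExampleOoOModel : SchedMachineModel {
    let IssueWidth            = 8;
    let MicroOpBufferSize     = 256;
    let LoopMicroOpBufferSize = 32;  // nonzero: runtime/partial unrolling is
                                     // considered, up to ~32 micro-ops per body
    let LoadLatency           = 4;
    let MispredictPenalty     = 12;
  }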
https://github.com/llvm/llvm-project/pull/91022