[llvm] [DAGCombiner] Fix check for extending loads (PR #112182)

Mon Oct 14 04:05:05 PDT 2024

https://github.com/LewisCrawford created https://github.com/llvm/llvm-project/pull/112182

Fix an inverted check for extending loads in
DAGCombiner::scalarizeExtractedVectorLoad: when the
result type has more bits than the loaded element
type, the load should count as an extending load
(ISD::EXTLOAD) rather than ISD::NON_EXTLOAD.

All backends apart from AArch64 ignore this
ExtTy argument to shouldReduceLoadWidth, so this
change currently only impacts AArch64.
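
The effect of swapping the two arms can be sketched outside of LLVM.
The snippet below is only an illustrative model, not LLVM code: the
enum LoadExtKind and the functions classifyOld / classifyFixed are
hypothetical stand-ins for ISD::LoadExtType and the ternary in
scalarizeExtractedVectorLoad, and plain bit widths stand in for
ResultVT.bitsGT(VecEltVT).

```cpp
#include <cassert>

// Hypothetical stand-in for ISD::LoadExtType; only the two values that
// appear in the patched expression are modelled here.
enum class LoadExtKind { NonExtLoad, ExtLoad };

// Models the expression before the patch: the arms are swapped, so a
// result wider than the element is reported as a non-extending load.
constexpr LoadExtKind classifyOld(unsigned ResultBits, unsigned EltBits) {
  return ResultBits > EltBits ? LoadExtKind::NonExtLoad
                              : LoadExtKind::ExtLoad;
}

// Models the expression after the patch: a result wider than the
// element is an extending load, anything else is not.
constexpr LoadExtKind classifyFixed(unsigned ResultBits, unsigned EltBits) {
  return ResultBits > EltBits ? LoadExtKind::ExtLoad
                              : LoadExtKind::NonExtLoad;
}

// Small self-check covering an equal-width case (for example an i32
// extracted from a vector of i32) and a widening case.
inline void demo() {
  assert(classifyOld(32, 32) == LoadExtKind::ExtLoad);      // old, inverted
  assert(classifyFixed(32, 32) == LoadExtKind::NonExtLoad); // fixed
  assert(classifyFixed(32, 8) == LoadExtKind::ExtLoad);     // widening
}
```

In this model, an equal-width extract such as an i32 taken from a
<4 x i32> load was classified as extending before the change and is
classified as non-extending after it.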

From 9907b9788ee875ff03d307f0185e9bb480669282 Mon Sep 17 00:00:00 2001
From: Lewis Crawford <lcrawford at nvidia.com>
Date: Mon, 14 Oct 2024 11:03:37 +0000
Subject: [PATCH] [DAGCombiner] Fix check for extending loads

Fix a check for extending loads in DAGCombiner,
where if the result type has more bits than the
loaded type it should count as an extending load.

All backends apart from AArch64 ignore this
ExtTy argument to shouldReduceLoadWidth, so this
change currently only impacts AArch64.
---
 llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp |  2 +-
 .../AArch64/aarch64-scalarize-vec-load-ext.ll | 29 +++++++++++++++++++
 2 files changed, 30 insertions(+), 1 deletion(-)
 create mode 100644 llvm/test/CodeGen/AArch64/aarch64-scalarize-vec-load-ext.ll

diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 810ca458bc8787..2d9025da5e2e85 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -22566,7 +22566,7 @@ SDValue DAGCombiner::scalarizeExtractedVectorLoad(SDNode *EVE, EVT InVecVT,
     return SDValue();
 
   ISD::LoadExtType ExtTy =
-      ResultVT.bitsGT(VecEltVT) ? ISD::NON_EXTLOAD : ISD::EXTLOAD;
+      ResultVT.bitsGT(VecEltVT) ? ISD::EXTLOAD : ISD::NON_EXTLOAD;
   if (!TLI.isOperationLegalOrCustom(ISD::LOAD, VecEltVT) ||
       !TLI.shouldReduceLoadWidth(OriginalLoad, ExtTy, VecEltVT))
     return SDValue();
diff --git a/llvm/test/CodeGen/AArch64/aarch64-scalarize-vec-load-ext.ll b/llvm/test/CodeGen/AArch64/aarch64-scalarize-vec-load-ext.ll
new file mode 100644
index 00000000000000..f34cad0b097f55
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/aarch64-scalarize-vec-load-ext.ll
@@ -0,0 +1,29 @@
+; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s
+
+; FIXME: Currently, we avoid narrowing this v4i32 load, in the
+; hopes of being able to fold the shift, despite it requiring stack
+; storage + loads. Ideally, we should narrow here and load the i32
+; directly from the variable offset e.g:
+;
+; add     x8, x0, x1, lsl #4
+; and     x9, x2, #0x3
+; ldr     w0, [x8, x9, lsl #2]
+;
+; The AArch64TargetLowering::shouldReduceLoadWidth heuristic should
+; probably be updated to choose load-narrowing instead of folding the
+; lsl in larger vector cases.
+;
+; CHECK-LABEL: narrow_load_v4_i32_single_ele_variable_idx:
+; CHECK: sub  sp, sp, #16
+; CHECK: ldr  q[[REG0:[0-9]+]], [x0, x1, lsl #4]
+; CHECK: bfi  x[[REG1:[0-9]+]], x2, #2, #2
+; CHECK: str  q[[REG0]], [sp]
+; CHECK: ldr  w0, [x[[REG1]]]
+; CHECK: add  sp, sp, #16
+define i32 @narrow_load_v4_i32_single_ele_variable_idx(ptr %ptr, i64 %off, i32 %ele) {
+entry:
+  %idx = getelementptr inbounds <4 x i32>, ptr %ptr, i64 %off
+  %x = load <4 x i32>, ptr %idx, align 8
+  %res = extractelement <4 x i32> %x, i32 %ele
+  ret i32 %res
+}