[PATCH] D154157: [LV] Cost model for out-of-loop reductions

Thu Aug 31 12:23:26 PDT 2023

anna added a comment.

In D154157#4628521 <https://reviews.llvm.org/D154157#4628521>, @fhahn wrote:

> In D154157#4628496 <https://reviews.llvm.org/D154157#4628496>, @anna wrote:
>
>> Gentle ping @fhahn. Anything I could do to move this forward? I have kept the option as off by default, since I would need to help to test this on supported targets upstream. It helps our use-case where loops are not vectorized when out-of-loop reductions are present in small trip count loops. 
>> There maybe some fallouts where correction in reductions cost modelling will be required.
>
> What are the regressions on ARM/RISCV? I think we should aim to enable this by default for all platforms, otherwise it is at the risk of not getting enabled. Also, those tests may show issues with the current implementation.

The regressions on ARM and RISCV are because we have an updated minimum trip count we expect (to offset the cost of the out-of-loop reductions), so the CHECK lines have been updated accordingly. RISCV looks okay with the minimum trip count being at least 4 or 8 (we round the MinimumTripCount up to VF, so it is more conservative).

However, in the case of ARM, there is no special cost for min/max reductions. They compute a pretty expensive cost under the assumption that we expand these reductions and extract the value. On some simple tests I ran, the assembly generated supports this cost for the `-mtriple=armv7-none-eabi`, but the code generated is different for another triple such as `-mtriple=thumbv8.1m.main-none-none-eabi`. I've added more details under the respective test.

> Do you have any numbers on the impact on X86?

I will have some numbers soon. The tests are still running. We see positive impact on small trip count loops on our workloads (in the range of 40% improvement) - these workloads are small trip count loops with fminimum/fmaximum reductions where we no longer vectorize.

================
Comment at: llvm/test/Transforms/LoopVectorize/ARM/tail-fold-multiple-icmps.ll:12
 ; CHECK:       for.body.preheader:
-; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 56
 ; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
----------------
This minimum trip count of 56 comes about because we compute the reduction cost as 140 (and there are two such reductions, with total being 280). 
This is a pretty high cost which is computed through https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L2372. 
This is correct for a triple with mattr such as:
```
llc < %s -mtriple=armv7-none-eabi -float-abi=hard -mattr=+neon -verify-machineinstrs | FileCheck %s

define i8 @test_umin_v8i8(<8 x i8> %x) {
; CHECK-LABEL: test_umin_v8i8:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    vpmin.u8 d16, d0, d0
; CHECK-NEXT:    vpmin.u8 d16, d16, d16
; CHECK-NEXT:    vpmin.u8 d16, d16, d16
; CHECK-NEXT:    vmov.u8 r0, d16[0]
; CHECK-NEXT:    bx lr
entry:
  %z = call i8 @llvm.vector.reduce.umin.v8i8(<8 x i8> %x) 
  ret i8 %z
}
```
However, I'm not sure if this is a correct cost for other triples. 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154157/new/

https://reviews.llvm.org/D154157