[PATCH] D149281: Don't disable loop unroll for vectorized loops on AMDGPU target

Tue May 23 12:26:50 PDT 2023

fhahn added a comment.

In D149281#4361600 <https://reviews.llvm.org/D149281#4361600>, @rampitec wrote:

> In D149281#4350286 <https://reviews.llvm.org/D149281#4350286>, @alex-t wrote:
>
>> Fairly speaking, the whole idea of loop vectorization for GPU seems nonsense to me. Although I am not an expert in loop optimizations.
>> GPU has no wide vector registers which may be used to process several scala values at one HW cycle and, by this, unroll the loop by the vector factor. Instead, each thread in a wavefront operates on its own separate value in a 32-bit wide lane for the divergent values and all threads operate on the same shared scalar value in case it is uniform.
>> If we have a completely uniform input program (no dependence on thread ID) we could not get any better benefit than from the usual unroll performed by the loop unroll pass.
>> So, IMO the LV is just a complicated and error-prone way to do loop unroll.
>> Once again, I may not understand some subtle matters as I have no large experience with the LV.
>
> Right, we do not have vector operations in the same sense as a SIMD machine, neither do we have vector registers in the same sense. Well, almost. We have vector (meaning wide, multicomponent) loads and stores and we have SOME v2 16-bit packed and v2 32-bit packed operations on some subtargets. Although this is not a full set of packed ALU as needed for vectorization and may not warrant a full loop vectorizer (SLP vectorizer is still in use). In any way there are no SIMD-style vector registers to control the interleave. Even though we are using register tuples for wide operations, these are not separate registers like for example XMMs on x86, and cut into the same register budget. So increasing 'number of vector registers' does not only mean much for our target but also inevitably leads to performance regressions.
>
> All in all loop vectorizer does not do much for AMDGPU and cannot be a pass driving unroll. It still does something as said above, so whenever to keep it on or off is a completely separate question, but certainly it is no an unroll driver here.

Interesting, thanks for elaborating. It seems like the role/benefit of LV here is a bit unclear and in the long term it would be good to disable it entirely if it can do more harm then good.

================
Comment at: llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll:1
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx90a -passes=loop-vectorize %s -S -o - | FileCheck %s
+
----------------
Can you precommit the test to show have only the functional diff in the actual patch here>

================
Comment at: llvm/test/CodeGen/AMDGPU/vectorize-unroll-metadata.ll:24
+  %counter = phi i32 [0, %entry], [%inc, %loop.inc]
+  br label %loop.body
+
----------------
is it possible to fold the whole loop body in a single block to simplify the test?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D149281/new/

https://reviews.llvm.org/D149281