[PATCH] D149281: Not disable loop unroll for vectorized loops on AMDGPU target

Mon May 15 18:36:56 PDT 2023

bcahoon added a comment.

Here are some observations about the performance regression that we see once the vectorizer disables unrolling.

In a specific case, the best performance occurs when the loop is unrolled by a factor of 8 (I haven't tried more than 8). It's a small loop, and after unrolling, the loads in the unrolled loop are all moved to the top of the loop body, which increases the time between each def and its uses.
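To make the shape of the unrolled loop concrete, here is a minimal source-level sketch. The actual loop is not shown in this review, so `sumScaled` and `sumScaledUnrolled8` are hypothetical stand-ins; in practice the compiler performs this transformation on IR and machine code, not on the source.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical small loop: one load and one use per iteration.
float sumScaled(const std::vector<float> &in, float scale) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < in.size(); ++i)
    acc += in[i] * scale;
  return acc;
}

// The same loop unrolled by 8, written the way the scheduled code tends to
// look: all eight loads are issued first, then the uses follow. Each load
// has more independent work between its def and its first use.
float sumScaledUnrolled8(const std::vector<float> &in, float scale) {
  float acc = 0.0f;
  const std::size_t n = in.size();
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    // Loads grouped at the top of the unrolled body.
    const float v0 = in[i + 0], v1 = in[i + 1], v2 = in[i + 2], v3 = in[i + 3];
    const float v4 = in[i + 4], v5 = in[i + 5], v6 = in[i + 6], v7 = in[i + 7];
    // Uses, in the same order as the original loop.
    acc += v0 * scale;
    acc += v1 * scale;
    acc += v2 * scale;
    acc += v3 * scale;
    acc += v4 * scale;
    acc += v5 * scale;
    acc += v6 * scale;
    acc += v7 * scale;
  }
  // Remainder iterations that do not fill a full group of 8.
  for (; i < n; ++i)
    acc += in[i] * scale;
  return acc;
}
```

Note that the eight loaded values are live at the same time, which is also why unrolling raises register pressure.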

AMDGPU enables run-time loop unrolling only in specific cases, because enabling it unconditionally is not profitable in many cases.

Increasing interleaving doesn't improve vectorization for this loop. The only effect is to perform unrolling, which would be done by the run-time loop unrolling pass if it weren't disabled. Since a target hook is used to control run-time unrolling, AMDGPU can be very specific about determining the unrolling profitability.

To regain the performance through interleaving, getNumberOfRegisters() needs to return at least 20, since the best performance is achieved by unrolling 8 iterations. In the past, AMDGPU reduced the value returned by getNumberOfRegisters() because of performance regressions caused by register pressure. The challenge with increasing getNumberOfRegisters() to enable interleaving is that it will introduce other regressions: more loops will be unrolled and register pressure will increase.
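The link between the register count and the interleave factor can be sketched with a simplified model. This is not the vectorizer's actual code. It assumes that the interleave count is roughly the number of registers left after loop-invariant values, divided by the peak number of live values per iteration, and rounded down to a power of two. The function names are hypothetical, and the register-usage numbers in the usage note are illustrative only, since the comment does not state the loop's real register usage.

```cpp
#include <algorithm>

// Largest power of two that is <= x (returns 0 for x == 0).
unsigned floorPowerOfTwo(unsigned x) {
  if (x == 0)
    return 0;
  unsigned p = 1;
  while (p <= x / 2)
    p *= 2;
  return p;
}

// Simplified, hypothetical model of a register-bounded interleave count.
//   numRegisters      - value reported by the target for the register class
//   loopInvariantRegs - registers tied up by loop-invariant values (assumed)
//   maxLocalUsers     - peak live values per loop iteration (assumed)
unsigned estimateInterleaveCount(unsigned numRegisters,
                                 unsigned loopInvariantRegs,
                                 unsigned maxLocalUsers) {
  if (maxLocalUsers == 0 || numRegisters <= loopInvariantRegs)
    return 1;
  const unsigned available = numRegisters - loopInvariantRegs;
  return std::max(1u, floorPowerOfTwo(available / maxLocalUsers));
}
```

With illustrative inputs of 4 loop-invariant registers and 2 live values per iteration, `estimateInterleaveCount(20, 4, 2)` gives 8 while `estimateInterleaveCount(19, 4, 2)` gives 4, so under this model a small change in the reported register count moves the interleave factor by a whole power of two for every loop, not just the one of interest.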

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D149281/new/

https://reviews.llvm.org/D149281