[PATCH] D48193: [LoopVectorizer] Use an interleave count of 1 when using a vector library call
Robert Lougher via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jun 14 13:14:50 PDT 2018
rob.lougher created this revision.
rob.lougher added reviewers: mkuper, hfinkel, mssimpso.
Herald added a subscriber: dmgreen.
Given the following test program:
#include <math.h>

void test(float *a, float *b, int n) {
  for (int i = 0; i < n; i++)
    b[i] = sinf(a[i]);
}
If we tell the compiler that a vector library is available and compile the program as follows:
$ clang -O2 --target=x86_64-unknown-linux -march=btver2 -mllvm -vector-library=SVML -S test.c
The loop will be vectorized with a vectorization factor of 8, and the call to sinf will be widened to a vector library call (__svml_sinf8):
.LBB0_6: # %vector.body
# =>This Inner Loop Header: Depth=1
vmovups (%r12,%r13,4), %ymm0
vmovups 32(%r12,%r13,4), %ymm1
vmovups 64(%r12,%r13,4), %ymm3
vmovups 96(%r12,%r13,4), %ymm2
vmovups %ymm1, (%rsp) # 32-byte Spill
vmovups %ymm3, 32(%rsp) # 32-byte Spill
vmovups %ymm2, 96(%rsp) # 32-byte Spill
callq __svml_sinf8
vmovups %ymm0, 64(%rsp) # 32-byte Spill
vmovups (%rsp), %ymm0 # 32-byte Reload
callq __svml_sinf8
vmovups %ymm0, (%rsp) # 32-byte Spill
vmovups 32(%rsp), %ymm0 # 32-byte Reload
callq __svml_sinf8
vmovups %ymm0, 32(%rsp) # 32-byte Spill
vmovups 96(%rsp), %ymm0 # 32-byte Reload
callq __svml_sinf8
vmovups 64(%rsp), %ymm1 # 32-byte Reload
vmovups (%rsp), %ymm3 # 32-byte Reload
vmovups 32(%rsp), %ymm2 # 32-byte Reload
vmovups %ymm1, (%r14,%r13,4)
vmovups %ymm3, 32(%r14,%r13,4)
vmovups %ymm2, 64(%r14,%r13,4)
vmovups %ymm0, 96(%r14,%r13,4)
addq $32, %r13
cmpq %r13, %rbx
jne .LBB0_6
However, the generated code is poor: it contains a large number of spills and reloads. The reason is that the loop vectorizer has chosen an interleave count (aka unroll factor) of 4.
In general, the heuristic tries to create parallel instances of the loop to expose ILP without causing spilling. It bases this on the number of registers used in the loop and the number of registers available. However, because of the way the instructions of the instances are interleaved, the vector call causes the registers holding the other instances' values to be spilled, which defeats the heuristic.
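The spill traffic in the listing above can be counted with a small model. This is only a sketch under stated assumptions: it models the schedule shown in the listing (all loads first, then the calls, then all stores), it assumes every vector register is clobbered by a call, and the names interleaved_call_cost and elements_per_iteration are hypothetical and do not come from the patch.

```c
#include <assert.h>

/* Spills and reloads in one iteration of the vector body. */
struct spill_cost {
    int spills;
    int reloads;
};

/* Model of the schedule in the listing: loads, then calls, then stores.
   Assumption: a call clobbers every vector register, so any value that is
   live across a call has to go through the stack. */
struct spill_cost interleaved_call_cost(int interleave_count)
{
    assert(interleave_count >= 1);
    int others = interleave_count - 1;
    struct spill_cost cost;
    /* Inputs of every instance except the first are spilled before the
       first call; results of every instance except the last are spilled
       after their own call. */
    cost.spills = others + others;
    /* Each spilled input is reloaded just before its call; each spilled
       result is reloaded just before the stores. */
    cost.reloads = others + others;
    return cost;
}

/* Elements processed per iteration of the vector body. */
int elements_per_iteration(int vectorization_factor, int interleave_count)
{
    return vectorization_factor * interleave_count;
}
```

With an interleave count of 4 the model gives 6 spills and 6 reloads, matching the listing, and 8 * 4 = 32 elements per iteration, matching the `addq $32, %r13`. With an interleave count of 1 both counts drop to zero.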
This patch changes the heuristic to use an interleave count of 1 when a call will be vectorized to a library call. The test above now generates:
.LBB0_6: # %vector.body
# =>This Inner Loop Header: Depth=1
vmovups (%r12,%r13,4), %ymm0
callq __svml_sinf8
vmovups %ymm0, (%r14,%r13,4)
addq $8, %r13
cmpq %r13, %rbx
jne .LBB0_6
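The shape of the heuristic change can be sketched roughly as follows. This is a simplified illustration, not the code in LoopVectorize.cpp: the function and parameter names are hypothetical, and the register-based estimate (free registers divided by registers used per instance, rounded down to a power of two and clamped) is an assumed stand-in for the real cost model.

```c
#include <assert.h>
#include <stdbool.h>

/* Largest power of two that is <= x (returns 1 for x == 0). */
static unsigned power_of_two_floor(unsigned x)
{
    unsigned p = 1;
    while (p <= x / 2)
        p *= 2;
    return p;
}

/* Hypothetical, simplified interleave-count selection. */
unsigned select_interleave_count(unsigned available_regs,
                                 unsigned loop_invariant_regs,
                                 unsigned max_local_regs,
                                 unsigned max_interleave,
                                 bool has_vector_lib_call)
{
    assert(max_local_regs >= 1 && max_interleave >= 1);

    /* The change described in this patch: a call that will be widened to a
       vector library call forces an interleave count of 1, because the other
       instances' values would be spilled around each call. */
    if (has_vector_lib_call)
        return 1;

    /* Assumed register-pressure estimate: how many parallel instances fit
       in the registers that loop invariants leave free. */
    unsigned free_regs = available_regs > loop_invariant_regs
                             ? available_regs - loop_invariant_regs
                             : 0;
    unsigned ic = power_of_two_floor(free_regs / max_local_regs);
    if (ic > max_interleave)
        ic = max_interleave;
    return ic;
}
```

For example, with the illustrative inputs (16, 0, 2, 4) the sketch picks 4 when no library call is present and 1 when one is. These inputs are made up for the example and are not the values the vectorizer computes for the test program.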
Repository:
rL LLVM
https://reviews.llvm.org/D48193
Files:
include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
lib/Transforms/Vectorize/LoopVectorize.cpp
test/Transforms/LoopVectorize/X86/interleaving-veclib-call.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D48193.151397.patch
Type: text/x-patch
Size: 10909 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20180614/beb5a78c/attachment-0001.bin>