[llvm-dev] [RFC] Make LoopVectorize Aware of SLP Operations
Florian Hahn via llvm-dev
llvm-dev at lists.llvm.org
Tue Feb 6 10:25:20 PST 2018
Hello,
We would like to propose making LoopVectorize aware of SLP operations,
to improve the generated code for loops operating on struct fields or
doing complex math.
At the moment, LoopVectorize uses interleaving to vectorize loops that
operate on values loaded/stored from consecutive addresses: wide vector
loads/stores are generated to combine the consecutive loads/stores, and
shufflevector is used to de-interleave/interleave the loaded/stored
values. However, we currently fail to detect cases where the same
operations are applied to all consecutive values, so there is no need
to interleave at all. To illustrate this, consider the following
example loop:
struct Test {
  int x;
  int y;
};

void add(struct Test *A, struct Test *B, struct Test *C) {
  for (unsigned i = 0; i < 1024; i++) {
    C[i].x = A[i].x + B[i].y;
    C[i].y = A[i].y + B[i].y;
  }
}
On X86, we do not vectorize this loop and on AArch64, we generate the
following code for the vector body:
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %8 = or i64 %index, 4
  %9 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %index, i32 0
  %10 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 %8, i32 0
  %11 = bitcast i32* %9 to <8 x i32>*
  %12 = bitcast i32* %10 to <8 x i32>*
  %wide.vec = load <8 x i32>, <8 x i32>* %11, align 4, !tbaa !2
  %wide.vec60 = load <8 x i32>, <8 x i32>* %12, align 4, !tbaa !2
  %13 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %index, i32 0
  %14 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 %8, i32 0
  %15 = bitcast i32* %13 to <8 x i32>*
  %16 = bitcast i32* %14 to <8 x i32>*
  %wide.vec64 = load <8 x i32>, <8 x i32>* %15, align 4, !tbaa !2
  %wide.vec65 = load <8 x i32>, <8 x i32>* %16, align 4, !tbaa !2
  %17 = add nsw <8 x i32> %wide.vec64, %wide.vec
  %18 = shufflevector <8 x i32> %17, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %19 = add nsw <8 x i32> %wide.vec65, %wide.vec60
  %20 = shufflevector <8 x i32> %19, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %21 = add nsw <8 x i32> %wide.vec64, %wide.vec
  %22 = shufflevector <8 x i32> %21, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %23 = add nsw <8 x i32> %wide.vec65, %wide.vec60
  %24 = shufflevector <8 x i32> %23, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %25 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %index, i32 1
  %26 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 %8, i32 1
  %27 = getelementptr i32, i32* %25, i64 -1
  %28 = bitcast i32* %27 to <8 x i32>*
  %29 = getelementptr i32, i32* %26, i64 -1
  %30 = bitcast i32* %29 to <8 x i32>*
  %interleaved.vec = shufflevector <4 x i32> %18, <4 x i32> %22, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  store <8 x i32> %interleaved.vec, <8 x i32>* %28, align 4, !tbaa !2
  %interleaved.vec70 = shufflevector <4 x i32> %20, <4 x i32> %24, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  store <8 x i32> %interleaved.vec70, <8 x i32>* %30, align 4, !tbaa !2
  %index.next = add i64 %index, 8
  %31 = icmp eq i64 %index.next, 1024
  br i1 %31, label %for.cond.cleanup, label %vector.body, !llvm.loop !6
Note the use of shufflevector to interleave and deinterleave the vector
elements. On AArch64, we emit additional uzp1, uzp2 and st2 instructions
for those.
In the case above, however, there is no need to de-interleave the
loaded/stored data; instead, we could keep the consecutive operands
together in a single vector register and apply the operations to
vectors with compound operands. As for real-world use cases, complex
number arithmetic, for example, could also benefit from not
interleaving.
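To make the intended transformation concrete, here is a small scalar C
model of what compound vectorization would compute for the example loop
(this is illustrative code, not compiler output; the fixed width of 8
lanes mirrors the <8 x i32> vectors above, and the flat-lane view
assumes the struct is two tightly packed ints). A's lanes are consumed
in their natural interleaved order; the only data movement needed is a
single shuffle duplicating B's .y lanes:

```c
#include <assert.h>
#include <string.h>

struct Test { int x, y; };

/* Scalar reference, as in the example loop above. */
static void add_scalar(struct Test *A, struct Test *B, struct Test *C, int n) {
  for (int i = 0; i < n; i++) {
    C[i].x = A[i].x + B[i].y;
    C[i].y = A[i].y + B[i].y;
  }
}

/* Compound-value model: view the structs as a flat lane array and
 * process 8 lanes (4 structs) per step. A is used exactly as loaded;
 * B needs one shuffle with mask <1,1,3,3,5,5,7,7> to duplicate its .y
 * lanes; then a single vector add and one wide store follow -- no
 * (de)interleaving shuffles. n is assumed to be a multiple of 4. */
static void add_compound(struct Test *A, struct Test *B, struct Test *C, int n) {
  int *a = (int *)A, *b = (int *)B, *c = (int *)C;
  for (int i = 0; i < 2 * n; i += 8) {
    int bshuf[8];
    for (int l = 0; l < 8; l++)
      bshuf[l] = b[i + (l | 1)];      /* duplicate odd (.y) lanes */
    for (int l = 0; l < 8; l++)
      c[i + l] = a[i + l] + bshuf[l]; /* one vector add, one wide store */
  }
}
```

Note that a vectorizer emitting this shape would trade six shuffles per
iteration (four to de-interleave, two to re-interleave) for one.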
In what follows, I propose an extension to LoopVectorize to make it
aware of SLP opportunities and detect operations on "compound values"
(for lack of a better term), as an extension to interleaved-access
handling. This idea is described in [1].
Structure
=========
To improve vectorization for loops operating on compound values, we
propose to extend LoopVectorize to
1. Detect loops containing SLP opportunities (operations on compound
values)
2. Extend the cost model to choose between interleaving or using
compound values
3. Add support for vectorizing compound operations to VPlan
Please note that the example in the previous section is intentionally
kept simple; one could argue it could be handled by doing loop
rerolling as a canonicalization step before LoopVectorize (currently
LoopReroll does not handle this case, but it could naturally be
extended to do so). However, the main advantage of making LoopVectorize
aware of SLP opportunities is that it allows us to deal with more
complex cases, like
* loops where some vectors need to be transformed, for example where
different operations are performed on different (groups of) lanes,
like A[i].x + B[i].x, A[i].y - B[i].y which could be transformed to
A[i].x + B[i].x, A[i].y + (-B[i].y), or where one compound group
needs to be reordered, like A[i].x + B[i].y, A[i].y + B[i].x
* loops where only parts are applicable to SLP-style vectorization
* loops with complex operations that can be mapped to specialized HW
instructions, like complex multiply and accumulate using FCMLA (Arm
v8.3-a)
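The first bullet can be sketched in the same illustrative scalar-model
style as before (again assuming a flat-lane view of two tightly packed
ints): a loop computing A[i].x + B[i].x and A[i].y - B[i].y becomes a
uniform add once B's odd (.y) lanes are negated:

```c
#include <assert.h>
#include <string.h>

struct Test { int x, y; };

/* Scalar reference for the mixed add/sub case from the list above. */
static void addsub_scalar(struct Test *A, struct Test *B, struct Test *C, int n) {
  for (int i = 0; i < n; i++) {
    C[i].x = A[i].x + B[i].x;
    C[i].y = A[i].y - B[i].y;
  }
}

/* Compound model: negate B's odd (.y) lanes, after which a single
 * uniform (vector) add covers all lanes -- the rewrite
 * A[i].y - B[i].y == A[i].y + (-B[i].y) mentioned above.
 * n is assumed to be a multiple of 4. */
static void addsub_compound(struct Test *A, struct Test *B, struct Test *C, int n) {
  int *a = (int *)A, *b = (int *)B, *c = (int *)C;
  for (int i = 0; i < 2 * n; i += 8) {
    int bneg[8];
    for (int l = 0; l < 8; l++)
      bneg[l] = (l & 1) ? -b[i + l] : b[i + l]; /* flip sign on .y lanes */
    for (int l = 0; l < 8; l++)
      c[i + l] = a[i + l] + bneg[l];            /* one uniform vector add */
  }
}
```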
We do not necessarily have to do 3. in LoopVectorize. If we detect SLP
opportunities, instead of choosing to interleave, we could decide to
just unroll the loop so we can make optimal use of the vector registers
with compound operations, and let the SLP vectorizer generate code. But
this would be problematic for cases where the right unroll factor is
not known at compile time (e.g. with Arm's scalable vectors). Another
advantage of an SLP-aware LoopVectorize is that we would benefit from
loop versioning, as with regular vectorization. Also, VPlan should make
it relatively straightforward to add codegen support.
Detect Loops Containing SLP Opportunities
------------------------------------------
This is the key part. LoopVectorize needs to be able to detect SLP
opportunities, in order to decide if interleaving or SLP style
vectorization is better. The basic idea to detect SLP opportunities is
to combine instructions that produce values that can be put in different
lanes of a single vector register to "compound values" and try to build
operation trees using them. Such trees can, for example, be constructed
using a bottom-up process rooted at interleaved stores [1].
As a first step, we propose to add an analysis capable of handling only
cases in which the same operation is applied to all lanes of a compound
value, and to incrementally improve the analysis to handle a mixture of
operations on compound values (e.g. addition on some lanes, subtraction
on others).
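A minimal model of that first-step check might look as follows (the
enum and struct here are invented stand-ins for IR instructions; the
real analysis would inspect the defining instruction of each lane of an
interleave group):

```c
#include <assert.h>

/* Toy model: an interleave group forms a compound value in the
 * first-step analysis only if every lane is produced by the same
 * opcode. Later refinements would accept known mixtures such as
 * add/sub. */
enum Opcode { OP_ADD, OP_SUB, OP_MUL };

struct Lane {
  enum Opcode op; /* opcode of the instruction producing this lane */
};

static int same_op_on_all_lanes(const struct Lane *lanes, int n) {
  for (int i = 1; i < n; i++)
    if (lanes[i].op != lanes[0].op)
      return 0;
  return 1;
}
```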
The SLP vectorizer in LLVM already implements a similar analysis. In the
long term, it would probably make sense to share the analysis between
LoopVectorize and the SLP vectorizer. But initially we think it would be
safer to start with a separate and initially more limited analysis in
LoopVectorize.
Extend the cost model to choose between interleaving and "compound
operations"
-----------------------------------------------------------------------
For load/store instructions which result in compound values, we choose
SLP-style vectorization over interleaving/scalarization if it is
beneficial. The cost of loading/storing an interleave group for SLP
vectorization is the same as for interleaving, but without the costs to
rearrange the values. For regular instructions operating on compound
values, we count the cost of the operation in lane 0 of the resulting
compound value, considering the vectorization factor, and ignore the
instructions in the other lanes, as we get them for "free". In case we
mix regular vectorization and SLP-style vectorization, we also have to
account for additional instructions to switch between data layouts.
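As a rough illustration of this accounting (the unit costs below are
made-up placeholder numbers, not values queried from the target cost
model), the two strategies differ by the shuffle term, and compound
arithmetic is charged only once per compound value:

```c
#include <assert.h>

/* Illustrative cost accounting; unit costs are placeholders. */
struct UnitCosts { int mem; int shuffle; int op; };

/* Interleaving pays for the wide memory ops, the (de)interleaving
 * shuffles, and one vector op per scalar operation. */
static int interleave_cost(struct UnitCosts c, int mem_ops, int shuffles,
                           int ops) {
  return mem_ops * c.mem + shuffles * c.shuffle + ops * c.op;
}

/* Compound vectorization pays for the same memory ops, no
 * (de)interleaving shuffles, and charges only the lane-0 instruction
 * of each compound value; the other lanes come for free. */
static int compound_cost(struct UnitCosts c, int mem_ops, int lane0_ops) {
  return mem_ops * c.mem + lane0_ops * c.op;
}
```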
One limitation here is that we commit to either interleaving or
compound vectorization when calculating the cost for the loads.
Depending on the other instructions in the loop, interleaving could be
beneficial overall, even though we could use compound vectorization for
some operations. Initially, we could consider SLP-style vectorization
only if it can be used for all instructions.
Add support for SLP-style vectorization to VPlan
------------------------------------------------
We introduce two new recipes, VPSLPMemoryRecipe and
VPSLPInstructionRecipe. Both generate code to produce a single compound
value, and we add those recipes for the instruction in lane 0 of the
compound value. We do not add any recipes for instructions in the other
lanes.
The SLP vectorizer in LLVM also implements code generation for similar
cases. In the long term, it might make sense to share the VPlan-based
infrastructure between LoopVectorize and the SLP vectorizer.
Alternative approaches
======================
We do not necessarily have to deal with this case in the vectorizer to
improve the generated code in simple cases. For example, we could add
patterns to instcombine to undo/remove the interleaving done by the
vectorizer for certain common patterns. But this seems quite fragile.
Another possible solution is to make LoopReroll aware of operations on
compound values and let it deal with de-interleaving the loop
iterations. This would require no changes to LoopVectorize, but would
only allow us to handle simple cases.
Also, as a follow-up, we want to build on the infrastructure described
in this RFC to enable vectorization using new HW instructions that
operate on compound values, like FCADD and FCMLA in Armv8.3-a.
[1] "Loop-Aware SLP in GCC" by Ira Rosen, Dorit Nuzman, Ayal Zaks.
Thanks for bearing with me! I am looking forward to any thoughts &
feedback!
Cheers,
Florian