[PATCH] Optimize scattered vector insert/extract pattern
Lawrence Hu
lawrence at codeaurora.org
Fri May 15 16:51:06 PDT 2015
We noticed a performance issue on AArch64 in our internal testcase after the following fix in SROA was brought in by DUI:
commit d4748bbd497b550a4e5db246c6708fcd6de542da
Author: Owen Anderson <resistor at mac.com>
Date: Thu Aug 7 21:07:35 2014 +0000
Fix a case in SROA where lifetime intrinsics could inhibit alloca promotion. In
this case, the code path dealing with vector promotion was missing the explicit
checks for lifetime intrinsics that were present on the corresponding integer
promotion path.
Here are more details on the investigation:
1) What SROA is doing.
SROA will replace large allocas with either integer SSA values or vector SSA values.
The alloca "short pix [32]" is rewritten as 4 vectors of type <8 x i16> (IR code supports vectors) to avoid the load/stores to the stack-allocated variable.
2) Our problem is the inability of the backend to combine scattered loads and stores with the insert and extract instructions to generate scalar/lane-based loads and stores in the presence of extension instructions. Example:
x = ld
y = insert x v1, 1
Generates:
ld1 { v0.b }[1], [x0]
When there is no extension/truncation of the loaded values we are fine: the backend generates the optimized code. But that is not the case for the following:
x = ld
y = ext x
z = insert y v1, 1
Generates:
ldrb w8, [x0]
ins v0.h[1], w8
However this is better code:
ld1 { v0.b }[1], [x0]
ushll v0.8h, v0.8b, #0
You can clearly see the advantage of the latter code when you have more than one insert instruction:
ldrb w8, [x0]
ldrb w9, [x1]
ins v0.h[1], w8
ins v0.h[5], w9
Better code would be:
ld1 { v0.b }[1], [x0]
ld1 { v0.b }[5], [x1]
ushll v0.8h, v0.8b, #0
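To make the equivalence behind the insert case concrete, here is a small scalar model in C++. It is only a sketch and not code from the patch: the type and function names are made up for illustration, only zero extension is modeled, and it assumes the wide vector being inserted into is itself the extension of a narrow vector (otherwise a single widening of the whole vector would not be valid).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the IR types <8 x i8> and <8 x i16>.
using V8x8  = std::array<std::uint8_t, 8>;
using V8x16 = std::array<std::uint16_t, 8>;

// Lane-wise zero extension: models the effect of `ushll v0.8h, v0.8b, #0`.
V8x16 zext_lanes(const V8x8 &v) {
    V8x16 out{};
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = v[i];
    return out;
}

// Slow shape: x = ld; y = ext x; z = insert y, wide, lane
// (one ldrb + ins pair per inserted element).
V8x16 insert_after_extend(V8x16 wide, const std::uint8_t *p, std::size_t lane) {
    std::uint16_t y = *p;  // scalar load followed by scalar zero extension
    wide[lane] = y;
    return wide;
}

// Fast shape, first half: insert the loaded byte directly into the narrow
// vector (one lane-based ld1 per element). The caller widens once afterwards.
V8x8 insert_narrow(V8x8 narrow, const std::uint8_t *p, std::size_t lane) {
    narrow[lane] = *p;
    return narrow;
}
```

With two inserts into lanes 1 and 5, extending first and inserting wide values gives the same vector as inserting narrow values and extending once at the end, which is why the extension can be hoisted past the inserts.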
The same is true for extract instructions:
umov w8, v0.b[1]
umov w9, v0.b[5]
strh w8, [x0]
strh w9, [x1]
Better code would be:
ushll v0.8h, v0.8b, #0
st1 { v0.h }[1], [x0]
st1 { v0.h }[5], [x1]
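The extract case can be modeled the same way. Again this is a hedged sketch with hypothetical names, not the patch itself: it only checks that extracting a byte lane, zero-extending it and storing a halfword writes the same value as widening the whole vector once and storing the corresponding halfword lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the IR types <8 x i8> and <8 x i16>.
using V8x8  = std::array<std::uint8_t, 8>;
using V8x16 = std::array<std::uint16_t, 8>;

// Lane-wise zero extension: models the effect of `ushll v0.8h, v0.8b, #0`.
V8x16 widen(const V8x8 &v) {
    V8x16 out{};
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = v[i];
    return out;
}

// Slow shape: extract the byte lane to a scalar (umov), then store it as a
// halfword (strh). One umov + strh pair per stored element.
void store_extract_then_extend(const V8x8 &v, std::size_t lane, std::uint16_t *dst) {
    std::uint16_t w = v[lane];  // extract followed by scalar zero extension
    *dst = w;
}

// Fast shape: the vector has already been widened once; store the halfword
// lane directly (one lane-based st1 per element).
void store_wide_lane(const V8x16 &wide, std::size_t lane, std::uint16_t *dst) {
    *dst = wide[lane];
}
```

The single widening is shared by every lane store, so its cost is amortized as more extracts from the same vector are combined.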
See the testcases for examples of the bad code and of the IR transformation needed to generate the optimized code.
Therefore, after SROA we need to detect these patterns in the IR and rewrite the IR so that the backend can generate the optimized instructions.
This should be done in a target-independent way. It could be done in InstCombine or in the SLP vectorizer; we chose the latter.
Even though it is SROA that generates the insert/extract instructions, I do not think we should fix it there.
REPOSITORY
rL LLVM
http://reviews.llvm.org/D9804
Files:
lib/Transforms/IPO/PassManagerBuilder.cpp
lib/Transforms/Vectorize/SLPVectorizer.cpp
test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D9804.25904.patch
Type: text/x-patch
Size: 32048 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150515/5d3c33e8/attachment.bin>