<div dir="ltr">This sounds like the right approach and the result is very promising. The VF tuning patch is already in but just not enabled by default, right?<div><br></div><div>David</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Nov 19, 2015 at 1:12 PM, Cong Hou <span dir="ltr"><<a href="mailto:congh@google.com" target="_blank">congh@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">After some attempt to implement reduce-add in LLVM, I found out a<br>
easier way to detect reduce-add without introducing new IR operations.<br>
The basic idea is annotating phi node instead of add (so that it is<br>
easier to handle other reduction operations). In PHINode class, we can<br>
add a flag indicating if the phi node is a reduction one (the flag can<br>
be set in loop vectorizer for vectorized phi nodes). Then when we<br>
build SDNode for instruction selection, we detect those reduction phi<br>
nodes and then annotate reduction operations. This requires an<br>
additional flag in SDNodeFlags. We can then check this flag when<br>
combining instructions to detect reduction operations.<br>
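As a rough sketch of how the pieces could fit together (the names
setIsReduction, setVectorReduction, hasVectorReduction and
combineToPSADBW below are made-up placeholders for illustration, not
necessarily the interfaces in the actual patch, and the exact
SelectionDAG signatures may differ):

  // Sketch only: flag names and helpers are hypothetical placeholders.

  // (1) Loop vectorizer: after widening a reduction phi, mark it.
  VecPhi->setIsReduction(true);            // hypothetical PHINode flag

  // (2) SelectionDAGBuilder: when emitting the vector add fed by a
  //     marked phi, carry the mark on the node's flags.
  SDNodeFlags Flags;
  Flags.setVectorReduction(true);          // hypothetical SDNodeFlags bit
  SDValue Sum = DAG.getNode(ISD::ADD, DL, VT, Acc, Partial, Flags);

  // (3) X86 DAG combine: only attempt the psadbw rewrite when the
  //     flag is present on the ADD node.
  if (N->getFlags().hasVectorReduction())
    return combineToPSADBW(N, DAG);        // hypothetical helper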

With this approach, I have managed to get LLVM to compile a SAD loop
into psadbw instructions.

Source code:

const int N = 1024;
unsigned char a[N], b[N];

int sad() {
  int s = 0;
  for (int i = 0; i < N; ++i) {
    int res = a[i] - b[i];
    s += (res > 0) ? res : -res;
  }
  return s;
}

Emitted instructions on X86:

# BB#0:                               # %entry
        pxor    %xmm0, %xmm0
        movq    $-1024, %rax          # imm = 0xFFFFFFFFFFFFFC00
        pxor    %xmm1, %xmm1
        .align  16, 0x90
.LBB0_1:                              # %vector.body
                                      # =>This Inner Loop Header: Depth=1
        movd    b+1024(%rax), %xmm2   # xmm2 = mem[0],zero,zero,zero
        movd    a+1024(%rax), %xmm3   # xmm3 = mem[0],zero,zero,zero
        psadbw  %xmm2, %xmm3
        paddd   %xmm3, %xmm0
        movd    b+1028(%rax), %xmm2   # xmm2 = mem[0],zero,zero,zero
        movd    a+1028(%rax), %xmm3   # xmm3 = mem[0],zero,zero,zero
        psadbw  %xmm2, %xmm3
        paddd   %xmm3, %xmm1
        addq    $8, %rax
        jne     .LBB0_1
# BB#2:                               # %middle.block
        paddd   %xmm0, %xmm1
        pshufd  $78, %xmm1, %xmm0     # xmm0 = xmm1[2,3,0,1]
        paddd   %xmm1, %xmm0
        pshufd  $229, %xmm0, %xmm1    # xmm1 = xmm0[1,1,2,3]
        paddd   %xmm0, %xmm1
        movd    %xmm1, %eax
        retq
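
As a standalone sanity check of why this rewrite is legal (illustrative
only, not produced by the patch), the following C++ models psadbw as
per-8-byte-group sums: the partial sums end up distributed across lanes
differently than a plain lane-wise vectorized add would leave them, but
the final reduced value always matches the scalar sad() above, which is
exactly the guarantee the reduction annotation provides:

// Standalone illustration (not from the patch): a psadbw-style reduction
// only preserves the total sum, not the per-lane layout, which is the
// freedom the reduction annotation grants.
#include <cstdio>
#include <cstdlib>

const int N = 1024;
unsigned char a[N], b[N];

// Scalar reference: the sad() loop above.
static int sad_scalar() {
  int s = 0;
  for (int i = 0; i < N; ++i) {
    int res = a[i] - b[i];
    s += (res > 0) ? res : -res;
  }
  return s;
}

// psadbw model: sum the absolute differences of each 8-byte group into a
// single partial sum, then add the partial sums into the accumulator.
static int sad_grouped() {
  int s = 0;
  for (int i = 0; i < N; i += 8) {
    int group = 0;
    for (int j = 0; j < 8; ++j) {
      int d = a[i + j] - b[i + j];
      group += (d > 0) ? d : -d;
    }
    s += group;  // reduction add: only the final sum has to match
  }
  return s;
}

int main() {
  for (int i = 0; i < N; ++i) {
    a[i] = rand() & 0xff;
    b[i] = rand() & 0xff;
  }
  printf("scalar = %d, grouped = %d\n", sad_scalar(), sad_grouped());
  return sad_scalar() == sad_grouped() ? 0 : 1;
}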

Note that due to the smaller VF we are using now (currently 4), we are
not getting the full benefit of psadbw. The patch in
http://reviews.llvm.org/D8943 enables us to use bigger VFs based on the
smallest type in a loop. The follow-up work is refining the cost model
so that bigger VFs get lower costs. For the example above the best
result comes from VF >= 16.
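
For experimentation, the code at larger VFs can be previewed by forcing
the vectorizer, e.g. passing -mllvm -force-vector-width=16 to clang
(assuming a compiler built with the patches above); this bypasses the
cost model, so it is only useful for inspecting the generated code, not
as a default.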

The draft of the patch is here: http://reviews.llvm.org/D14840

I will refine the patch later and submit it for review.

thanks,
Cong
<div class="HOEnZb"><div class="h5"><br>
<br>

On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh@google.com> wrote:
> On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
> <Asghar-ahmad.Shahid@amd.com> wrote:
>> Hi Cong,
>>
>>> -----Original Message-----
>>> From: Cong Hou [mailto:congh@google.com]
>>> Sent: Tuesday, November 17, 2015 12:47 AM
>>> To: Shahid, Asghar-ahmad
>>> Cc: David Li
>>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.
>>>
>>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad
>>> <Asghar-ahmad.Shahid@amd.com> wrote:
>>> > Hi Cong,
>>> >
>>> > We had proposed an intrinsic approach to do this. However, the
>>> > discussion reached a point where it was asked: "Why do we need
>>> > another approach if 'reduction add' can be pattern matched in
>>> > DAGCombine?" However, I feel that if we have a strong enough
>>> > rationale for introducing this instruction, it would be great. The
>>> > first link below has the complete discussion about the intrinsic
>>> > approach.
>>>
>>> Yes, I think introducing such a reduction add can let us do pattern
>>> recognition of either SAD or dot product (or more?) without
>>> introducing any additional intrinsics.
>> I agree. Another use case could be the POPCOUNT operation. Moreover,
>> as 'reduction add' is being adopted by more targets nowadays,
>> reflecting that in LLVM IR as an instruction is a good idea.
>> BTW, what is your idea of the syntax and semantics of this operation?
>
> We can introduce a reduce-add for vectors only, or make it general so
> that it can also accept scalar operands. Normally it is identical to a
> normal add, but during instruction selection we can do pattern
> recognition based on the extra information provided by this new
> operation. For vectors, this means the result of this operation only
> guarantees that the sum of all elements in the result is identical to
> the sum of all elements of its operands. This gives us enough freedom
> to do aggressive transformations, such as SAD or dot product.
>
> Do you think this is enough for us to get there?
>
>
> Cong
>
>>
>>> The main concern may be the cost model: we cannot guarantee that a
>>> SAD loop is unrolled 16 times on SSE to get the most benefit from
>>> SAD. After the patch http://reviews.llvm.org/D8943 landed, I am now
>>> working on improving the cost models of type conversions on X86. I
>>> believe that even without the SAD instruction we can still get
>>> better performance by unrolling a SAD loop 16 times. This seems
>>> tricky, but it works. What do you think?
>> I also think the same, as we can increase the bandwidth with proper
>> cost modeling.
>>
>> Regards,
>> Shahid
>>>
>>> Cong
>>>
>>> >
>>> > http://reviews.llvm.org/D10964
>>> > http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
>>> >
>>> > Regards,
>>> > Shahid
>>> >
>>> >
>>> >
>>> >> -----Original Message-----
>>> >> From: llvm-dev [mailto:llvm-dev-bounces@lists.llvm.org] On Behalf Of
>>> >> Cong Hou via llvm-dev
>>> >> Sent: Friday, November 13, 2015 5:47 AM
>>> >> To: llvm-dev@lists.llvm.org
>>> >> Cc: David Li
>>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction add instruction.
>>> >>
>>> >> Hi
>>> >>
>>> >> When a reduction instruction is vectorized in a loop, it will be
>>> >> turned into an instruction with vector operands of the same operation
>>> >> type. This new instruction has a special property that can give us
>>> >> more flexibility during instruction selection later: this operation
>>> >> is valid as long as the reduction of all elements of the result
>>> >> vector is identical to the reduction of all elements of its operands.
>>> >>
>>> >> One example that can benefit from this property is SAD (sum of
>>> >> absolute differences) pattern detection on SSE2, which provides a
>>> >> psadbw instruction whose description is shown below:
>>> >>
>>> >> '''
>>> >> psadbw: Compute the absolute differences of packed unsigned 8-bit
>>> >> integers in a and b, then horizontally sum each consecutive 8
>>> >> differences to produce two unsigned 16-bit integers, and pack these
>>> >> unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
>>> >> '''
>>> >>
>>> >> In LLVM IR, for a SAD loop we will have two v4i8 values as inputs
>>> >> and one v4i32 as output. However, psadbw will actually produce one
>>> >> i32 result for four pairs of 8-bit integers (an already reduced
>>> >> result), and the result is stored in the first element of the v4i32.
>>> >> If we properly zero out the other three elements of the v4i32, and we
>>> >> know that a reduction add is performed on this result, then we can
>>> >> safely use psadbw here for much better performance. This can be done
>>> >> during DAG combine. Another similar example is dot product. And I
>>> >> think there may be many other scenarios that can benefit from this
>>> >> property, such as eliminating redundant shuffles.
>>> >>
>>> >> The question is: how do we let the DAG combiner know that a vector
>>> >> operation is a reduction?
>>> >>
>>> >> Here I propose to introduce a "reduction add" instruction for
>>> >> vectors. This will be a new instruction with vector operands only.
>>> >> Normally it is treated as a normal ADD operation, but the selection
>>> >> DAG combiner can make use of this new operation to generate better
>>> >> instructions. This new instruction is generated when the loop
>>> >> vectorizer vectorizes a reduction add.
>>> >>
>>> >> I would like to hear more comments on this proposal or suggestions
>>> >> for better alternative implementations.
>>> >>
>>> >>
>>> >> thanks,
>>> >> Cong
>>> >> _______________________________________________
>>> >> LLVM Developers mailing list
>>> >> llvm-dev@lists.llvm.org
>>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev