<div dir="ltr">Hal is probably not questioning about the usefulness of reduction recognition and a way to represent it, but the clear semantics of the flag. You can probably draw some ideas from OMP SIMD reduction clause, or intel's SIMD pragma's reduction clause.<div><br></div><div><a href="https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40">https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40</a><br></div><div><br></div><div>David</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Nov 25, 2015 at 3:07 PM, Cong Hou <span dir="ltr"><<a href="mailto:congh@google.com" target="_blank">congh@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Nov 25, 2015 at 2:32 PM, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:<br>

> Hi Cong,<br>

><br>

> After reading the original RFC and this update, I'm still not entirely sure I understand the semantics of the flag you're proposing to add. Does it having something to do with the ordering of the reduction operations?<br>

<br>

</span>The flag is only useful for vectorized reduction for now. I'll give<br>

you an example how this flag is used.<br>

<br>

Given the following reduction loop:<br>

<br>

int a[N];<br>

<span class="">int s = 0;<br>

for (int i = 0; i < N; ++i)<br>

</span>  s += a[i];<br>

<br>

After it is vectorized, we have a reduction operation add whose<br>

operands and the result are vectors. Suppose it is represented as:<br>

<br>

[s0, s1, s2, s3] = [s0, s1, s2, s3] + [a0, a1, a2, a3].<br>

<br>

If we know this operation is a reduction one, then we could have some<br>

flexibility on how elements in the result are organized, as long as<br>

the sum of them stays the same (the precondition is that the only use<br>

of the reduction result is the reduction phi node, and this is usually<br>

true as long as a reduction loop can be vectorized). For example, if<br>

we let the result be [s0+s1, 0, s2+s3, 0] or [0, 0, s0+s1+s2+s3, 0],<br>

the reduction result won't change. This enable us to detect SAD or<br>

dot-product patterns and use SSE's psadbw and pmaddwd instructions.<br>

Please see my respond to your another email for more details.<br>

<br>

Thanks!<br>

<span class="HOEnZb"><font color="#888888"><br>

Cong<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

><br>

> Thanks again,<br>

> Hal<br>

><br>

> ----- Original Message -----<br>

>> From: "Cong Hou via llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

>> To: "llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

>> Cc: "David Li" <<a href="mailto:davidxl@google.com">davidxl@google.com</a>><br>

>> Sent: Thursday, November 19, 2015 3:12:10 PM<br>

>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.<br>

>><br>

>> After some attempt to implement reduce-add in LLVM, I found out a<br>

>> easier way to detect reduce-add without introducing new IR<br>

>> operations.<br>

>> The basic idea is annotating phi node instead of add (so that it is<br>

>> easier to handle other reduction operations). In PHINode class, we<br>

>> can<br>

>> add a flag indicating if the phi node is a reduction one (the flag<br>

>> can<br>

>> be set in loop vectorizer for vectorized phi nodes). Then when we<br>

>> build SDNode for instruction selection, we detect those reduction phi<br>

>> nodes and then annotate reduction operations. This requires an<br>

>> additional flag in SDNodeFlags. We can then check this flag when<br>

>> combining instructions to detect reduction operations.<br>

>><br>

>> In this approach, I have managed to let LLVM compile a SAD loop into<br>

>> psadbw instructions.<br>

>><br>

>> Source code:<br>

>><br>

>><br>

>> const int N = 1024;<br>

>> unsigned char a[N], b[N];<br>

>><br>

>> int sad() {<br>

>>   int s = 0;<br>

>>   for (int i = 0; i < N; ++i) {<br>

>>     int res = a[i] - b[i];<br>

>>     s += (res > 0) ? res : -res;<br>

>>   }<br>

>>   return s;<br>

>> }<br>

>><br>

>><br>

>> Emitted instructions on X86:<br>

>><br>

>><br>

>><br>

>> # BB#0:                                 # %entry<br>

>> pxor %xmm0, %xmm0<br>

>> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00<br>

>> pxor %xmm1, %xmm1<br>

>> .align 16, 0x90<br>

>> .LBB0_1:                                # %vector.body<br>

>>                                         # =>This Inner Loop Header:<br>

>>                                         Depth=1<br>

>> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero<br>

>> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero<br>

>> psadbw %xmm2, %xmm3<br>

>> paddd %xmm3, %xmm0<br>

>> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero<br>

>> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero<br>

>> psadbw %xmm2, %xmm3<br>

>> paddd %xmm3, %xmm1<br>

>> addq $8, %rax<br>

>> jne .LBB0_1<br>

>> # BB#2:                                 # %middle.block<br>

>> paddd %xmm0, %xmm1<br>

>> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]<br>

>> paddd %xmm1, %xmm0<br>

>> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]<br>

>> paddd %xmm0, %xmm1<br>

>> movd %xmm1, %eax<br>

>> retq<br>

>><br>

>><br>

>> Note that due to smaller VF we are using now (currently 4), we could<br>

>> not explore the most benefit of psadbw. The patch in<br>

>> <a href="http://reviews.llvm.org/D8943" rel="noreferrer" target="_blank">http://reviews.llvm.org/D8943</a> has enables us to use bigger VFs based<br>

>> on the smallest type in a loop. The follow-up work is refining the<br>

>> cost model to let bigger VFs have less cost. For the example above<br>

>> the<br>

>> best result is from VF >=16.<br>

>><br>

>> The draft of the patch is here: <a href="http://reviews.llvm.org/D14840" rel="noreferrer" target="_blank">http://reviews.llvm.org/D14840</a><br>

>><br>

>> I will refine the patch later and submit it for review.<br>

>><br>

>><br>

>> thanks,<br>

>> Cong<br>

>><br>

>><br>

>> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <<a href="mailto:congh@google.com">congh@google.com</a>> wrote:<br>

>> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad<br>

>> > <<a href="mailto:Asghar-ahmad.Shahid@amd.com">Asghar-ahmad.Shahid@amd.com</a>> wrote:<br>

>> >> Hi Cong,<br>

>> >><br>

>> >>> -----Original Message-----<br>

>> >>> From: Cong Hou [mailto:<a href="mailto:congh@google.com">congh@google.com</a>]<br>

>> >>> Sent: Tuesday, November 17, 2015 12:47 AM<br>

>> >>> To: Shahid, Asghar-ahmad<br>

>> >>> Cc: David Li<br>

>> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add<br>

>> >>> instruction.<br>

>> >>><br>

>> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad <Asghar-<br>

>> >>> <a href="mailto:ahmad.Shahid@amd.com">ahmad.Shahid@amd.com</a>> wrote:<br>

>> >>> > Hi Cong,<br>

>> >>> ><br>

>> >>> > We had proposed an intrinsic approach to do this. However the<br>

>> >>> > discussion reached to a point where it was asked that "Why do<br>

>> >>> > we need<br>

>> >>> > another approach if "reduction add" can be pattern matched in<br>

>> >>> DAGCombine?"<br>

>> >>> > However I feel if we have strong enough rationale for<br>

>> >>> > introduction of<br>

>> >>> > this instruction, it would be great. The 1st link below has the<br>

>> >>> > complete<br>

>> >>> discussion about the intrinsic approach.<br>

>> >>><br>

>> >>> Yes, I think introducing such a reduction add can let us do<br>

>> >>> pattern recognition<br>

>> >>> of either SAD or dot production (or more?) without introducing<br>

>> >>> any<br>

>> >>> additional intrinsics.<br>

>> >> I agree. Another use case could be POPCOUNT operation. Moreover,<br>

>> >> as 'reduction add'<br>

>> >> Is being adopted by more targets now a days, reflecting that in<br>

>> >> LLVM IR as an instruction<br>

>> >> Is a good idea.<br>

>> >> BTW, what is the idea of the syntax and semantic of this operation<br>

>> >> you have?<br>

>> ><br>

>> > We can introduce a reduce-add for vectors only, or make it general<br>

>> > so<br>

>> > that it could also accept scale operands. Normally it is identical<br>

>> > to<br>

>> > normal add, but during instruction selection we can do pattern<br>

>> > recognition based on more information provided by this new<br>

>> > operations.<br>

>> > For vectors, this means the result of this operation only guarantee<br>

>> > that the sum of all elements in the result is identical to the sum<br>

>> > of<br>

>> > all elements of its operands. This gives us enough freedom to do<br>

>> > aggressive transformations, such as SAD or dot-product.<br>

>> ><br>

>> > Do you think if this is enough for us to get there?<br>

>> ><br>

>> ><br>

>> > Cong<br>

>> ><br>

>> >><br>

>> >> The main concern may be cost<br>

>> >>> model: we could not guarantee that a SAD loop is unrolled 16<br>

>> >>> times on SSE to<br>

>> >>> make use the most benefit of SAD. After the patch<br>

>> >>> <a href="http://reviews.llvm.org/D8943" rel="noreferrer" target="_blank">http://reviews.llvm.org/D8943</a> is landed, I am now working on<br>

>> >>> improving cost<br>

>> >>> models of type conversions on X86. I believe even without SAD<br>

>> >>> instruction<br>

>> >>> we can still get better performance with unrolling a SAD loop 16<br>

>> >>> times. This<br>

>> >>> seems tricky but it works. What do you think?<br>

>> >> I also think same as we can increase the bandwidth with proper<br>

>> >> cost modeling.<br>

>> >><br>

>> >> Regards,<br>

>> >> Shahid<br>

>> >>><br>

>> >>> Cong<br>

>> >>><br>

>> >>> ><br>

>> >>> > <a href="http://reviews.llvm.org/D10964" rel="noreferrer" target="_blank">http://reviews.llvm.org/D10964</a><br>

>> >>> > <a href="http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html" rel="noreferrer" target="_blank">http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html</a><br>

>> >>> ><br>

>> >>> > Regards,<br>

>> >>> > Shahid<br>

>> >>> ><br>

>> >>> ><br>

>> >>> ><br>

>> >>> >> -----Original Message-----<br>

>> >>> >> From: llvm-dev [mailto:<a href="mailto:llvm-dev-bounces@lists.llvm.org">llvm-dev-bounces@lists.llvm.org</a>] On<br>

>> >>> >> Behalf Of<br>

>> >>> >> Cong Hou via llvm-dev<br>

>> >>> >> Sent: Friday, November 13, 2015 5:47 AM<br>

>> >>> >> To: <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

>> >>> >> Cc: David Li<br>

>> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction add<br>

>> >>> >> instruction.<br>

>> >>> >><br>

>> >>> >> Hi<br>

>> >>> >><br>

>> >>> >> When a reduction instruction is vectorized in a loop, it will<br>

>> >>> >> be<br>

>> >>> >> turned into an instruction with vector operands of the same<br>

>> >>> >> operation<br>

>> >>> >> type. This new instruction has a special property that can<br>

>> >>> >> give us<br>

>> >>> >> more flexibility during instruction selection later: this<br>

>> >>> >> operation<br>

>> >>> >> is valid as long as the reduction of all elements of the<br>

>> >>> >> result<br>

>> >>> >> vector is identical to the reduction of all elements of its<br>

>> >>> >> operands.<br>

>> >>> >><br>

>> >>> >> One example that can benefit this property is SAD (sum of<br>

>> >>> >> absolute<br>

>> >>> >> differences) pattern detection in SSE2, which provides a<br>

>> >>> >> psadbw<br>

>> >>> >> instruction whose description is shown below:<br>

>> >>> >><br>

>> >>> >> '''<br>

>> >>> >> psadbw: Compute the absolute differences of packed unsigned<br>

>> >>> >> 8-bit<br>

>> >>> >> integers in a and b, then horizontally sum each consecutive 8<br>

>> >>> >> differences to produce two unsigned 16-bit integers, and pack<br>

>> >>> >> these<br>

>> >>> >> unsigned 16-bit integers in the low 16 bits of 64-bit elements<br>

>> >>> >> in dst.<br>

>> >>> >> '''<br>

>> >>> >><br>

>> >>> >> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs<br>

>> >>> >> and one<br>

>> >>> >> v4i32 as output. However, psadbw will actually produce one i32<br>

>> >>> >> result<br>

>> >>> >> for four pairs of 8-bit integers (an already reduced result),<br>

>> >>> >> and the<br>

>> >>> >> result is stored in the first element in v4i32. If we properly<br>

>> >>> >> zero<br>

>> >>> >> out the other three elements in v4i32, and with the<br>

>> >>> >> information that<br>

>> >>> >> we have a reduction add that is performed on this result, then<br>

>> >>> >> we can<br>

>> >>> >> safely use psadbw here for much better performance. This can<br>

>> >>> >> be done<br>

>> >>> >> during DAG combine. Another similar example is dot product.<br>

>> >>> >> And I<br>

>> >>> >> think there may be many other scenarios that can benefit from<br>

>> >>> >> this<br>

>> >>> >> property like eliminating redundant shuffles.<br>

>> >>> >><br>

>> >>> >> The question is, how to let DAG combiner know that a vector<br>

>> >>> >> operation<br>

>> >>> >> is a reduction one?<br>

>> >>> >><br>

>> >>> >> Here I propose to introduce a "reduction add" instruction for<br>

>> >>> >> vectors.<br>

>> >>> >> This will be a new instruction with vector operands only.<br>

>> >>> >> Normally it<br>

>> >>> >> is treated as a normal ADD operation, but the selection DAG<br>

>> >>> >> combiner<br>

>> >>> >> can make use of this new operation to generate better<br>

>> >>> >> instructions.<br>

>> >>> >> This new instruction is generated when vectorizing reduction<br>

>> >>> >> add in<br>

>> >>> >> loop vectorizer.<br>

>> >>> >><br>

>> >>> >> I would like to hear more comments on this proposal or<br>

>> >>> >> suggestions of<br>

>> >>> >> better alternative implementations.<br>

>> >>> >><br>

>> >>> >><br>

>> >>> >> thanks,<br>

>> >>> >> Cong<br>

>> >>> >> _______________________________________________<br>

>> >>> >> LLVM Developers mailing list<br>

>> >>> >> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

>> >>> >> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

>> _______________________________________________<br>

>> LLVM Developers mailing list<br>

>> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

>> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

>><br>

><br>

> --<br>

> Hal Finkel<br>

> Assistant Computational Scientist<br>

> Leadership Computing Facility<br>

> Argonne National Laboratory<br>

</div></div></blockquote></div><br></div>