<div dir="ltr">Hal is probably not questioning about the usefulness of reduction recognition and a way to represent it, but the clear semantics of the flag. You can probably draw some ideas from OMP SIMD reduction clause, or intel's SIMD pragma's reduction clause.<div><br></div><div><a href="https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40">https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40</a><br></div><div><br></div><div>David</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Nov 25, 2015 at 3:07 PM, Cong Hou <span dir="ltr"><<a href="mailto:congh@google.com" target="_blank">congh@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Nov 25, 2015 at 2:32 PM, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:<br>
> Hi Cong,<br>
><br>
> After reading the original RFC and this update, I'm still not entirely sure I understand the semantics of the flag you're proposing to add. Does it having something to do with the ordering of the reduction operations?<br>
<br>
</span>The flag is only useful for vectorized reduction for now. I'll give<br>
you an example how this flag is used.<br>
<br>
Given the following reduction loop:<br>
<br>
int a[N];<br>
<span class="">int s = 0;<br>
for (int i = 0; i < N; ++i)<br>
</span> s += a[i];<br>
<br>
After it is vectorized, we have a reduction operation add whose<br>
operands and the result are vectors. Suppose it is represented as:<br>
<br>
[s0, s1, s2, s3] = [s0, s1, s2, s3] + [a0, a1, a2, a3].<br>
<br>
If we know this operation is a reduction one, then we could have some<br>
flexibility on how elements in the result are organized, as long as<br>
the sum of them stays the same (the precondition is that the only use<br>
of the reduction result is the reduction phi node, and this is usually<br>
true as long as a reduction loop can be vectorized). For example, if<br>
we let the result be [s0+s1, 0, s2+s3, 0] or [0, 0, s0+s1+s2+s3, 0],<br>
the reduction result won't change. This enable us to detect SAD or<br>
dot-product patterns and use SSE's psadbw and pmaddwd instructions.<br>
Please see my respond to your another email for more details.<br>
<br>
Thanks!<br>
<span class="HOEnZb"><font color="#888888"><br>
Cong<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
><br>
> Thanks again,<br>
> Hal<br>
><br>
> ----- Original Message -----<br>
>> From: "Cong Hou via llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>
>> To: "llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>
>> Cc: "David Li" <<a href="mailto:davidxl@google.com">davidxl@google.com</a>><br>
>> Sent: Thursday, November 19, 2015 3:12:10 PM<br>
>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.<br>
>><br>
>> After some attempt to implement reduce-add in LLVM, I found out a<br>
>> easier way to detect reduce-add without introducing new IR<br>
>> operations.<br>
>> The basic idea is annotating phi node instead of add (so that it is<br>
>> easier to handle other reduction operations). In PHINode class, we<br>
>> can<br>
>> add a flag indicating if the phi node is a reduction one (the flag<br>
>> can<br>
>> be set in loop vectorizer for vectorized phi nodes). Then when we<br>
>> build SDNode for instruction selection, we detect those reduction phi<br>
>> nodes and then annotate reduction operations. This requires an<br>
>> additional flag in SDNodeFlags. We can then check this flag when<br>
>> combining instructions to detect reduction operations.<br>
>><br>
>> In this approach, I have managed to let LLVM compile a SAD loop into<br>
>> psadbw instructions.<br>
>><br>
>> Source code:<br>
>><br>
>><br>
>> const int N = 1024;<br>
>> unsigned char a[N], b[N];<br>
>><br>
>> int sad() {<br>
>> int s = 0;<br>
>> for (int i = 0; i < N; ++i) {<br>
>> int res = a[i] - b[i];<br>
>> s += (res > 0) ? res : -res;<br>
>> }<br>
>> return s;<br>
>> }<br>
>><br>
>><br>
>> Emitted instructions on X86:<br>
>><br>
>><br>
>><br>
>> # BB#0: # %entry<br>
>> pxor %xmm0, %xmm0<br>
>> movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00<br>
>> pxor %xmm1, %xmm1<br>
>> .align 16, 0x90<br>
>> .LBB0_1: # %vector.body<br>
>> # =>This Inner Loop Header:<br>
>> Depth=1<br>
>> movd b+1024(%rax), %xmm2 # xmm2 = mem[0],zero,zero,zero<br>
>> movd a+1024(%rax), %xmm3 # xmm3 = mem[0],zero,zero,zero<br>
>> psadbw %xmm2, %xmm3<br>
>> paddd %xmm3, %xmm0<br>
>> movd b+1028(%rax), %xmm2 # xmm2 = mem[0],zero,zero,zero<br>
>> movd a+1028(%rax), %xmm3 # xmm3 = mem[0],zero,zero,zero<br>
>> psadbw %xmm2, %xmm3<br>
>> paddd %xmm3, %xmm1<br>
>> addq $8, %rax<br>
>> jne .LBB0_1<br>
>> # BB#2: # %middle.block<br>
>> paddd %xmm0, %xmm1<br>
>> pshufd $78, %xmm1, %xmm0 # xmm0 = xmm1[2,3,0,1]<br>
>> paddd %xmm1, %xmm0<br>
>> pshufd $229, %xmm0, %xmm1 # xmm1 = xmm0[1,1,2,3]<br>
>> paddd %xmm0, %xmm1<br>
>> movd %xmm1, %eax<br>
>> retq<br>
>><br>
>><br>
>> Note that due to smaller VF we are using now (currently 4), we could<br>
>> not explore the most benefit of psadbw. The patch in<br>
>> <a href="http://reviews.llvm.org/D8943" rel="noreferrer" target="_blank">http://reviews.llvm.org/D8943</a> has enables us to use bigger VFs based<br>
>> on the smallest type in a loop. The follow-up work is refining the<br>
>> cost model to let bigger VFs have less cost. For the example above<br>
>> the<br>
>> best result is from VF >=16.<br>
>><br>
>> The draft of the patch is here: <a href="http://reviews.llvm.org/D14840" rel="noreferrer" target="_blank">http://reviews.llvm.org/D14840</a><br>
>><br>
>> I will refine the patch later and submit it for review.<br>
>><br>
>><br>
>> thanks,<br>
>> Cong<br>
>><br>
>><br>
>> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <<a href="mailto:congh@google.com">congh@google.com</a>> wrote:<br>
>> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad<br>
>> > <<a href="mailto:Asghar-ahmad.Shahid@amd.com">Asghar-ahmad.Shahid@amd.com</a>> wrote:<br>
>> >> Hi Cong,<br>
>> >><br>
>> >>> -----Original Message-----<br>
>> >>> From: Cong Hou [mailto:<a href="mailto:congh@google.com">congh@google.com</a>]<br>
>> >>> Sent: Tuesday, November 17, 2015 12:47 AM<br>
>> >>> To: Shahid, Asghar-ahmad<br>
>> >>> Cc: David Li<br>
>> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add<br>
>> >>> instruction.<br>
>> >>><br>
>> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad <Asghar-<br>
>> >>> <a href="mailto:ahmad.Shahid@amd.com">ahmad.Shahid@amd.com</a>> wrote:<br>
>> >>> > Hi Cong,<br>
>> >>> ><br>
>> >>> > We had proposed an intrinsic approach to do this. However the<br>
>> >>> > discussion reached to a point where it was asked that "Why do<br>
>> >>> > we need<br>
>> >>> > another approach if "reduction add" can be pattern matched in<br>
>> >>> DAGCombine?"<br>
>> >>> > However I feel if we have strong enough rationale for<br>
>> >>> > introduction of<br>
>> >>> > this instruction, it would be great. The 1st link below has the<br>
>> >>> > complete<br>
>> >>> discussion about the intrinsic approach.<br>
>> >>><br>
>> >>> Yes, I think introducing such a reduction add can let us do<br>
>> >>> pattern recognition<br>
>> >>> of either SAD or dot production (or more?) without introducing<br>
>> >>> any<br>
>> >>> additional intrinsics.<br>
>> >> I agree. Another use case could be POPCOUNT operation. Moreover,<br>
>> >> as 'reduction add'<br>
>> >> Is being adopted by more targets now a days, reflecting that in<br>
>> >> LLVM IR as an instruction<br>
>> >> Is a good idea.<br>
>> >> BTW, what is the idea of the syntax and semantic of this operation<br>
>> >> you have?<br>
>> ><br>
>> > We can introduce a reduce-add for vectors only, or make it general<br>
>> > so<br>
>> > that it could also accept scale operands. Normally it is identical<br>
>> > to<br>
>> > normal add, but during instruction selection we can do pattern<br>
>> > recognition based on more information provided by this new<br>
>> > operations.<br>
>> > For vectors, this means the result of this operation only guarantee<br>
>> > that the sum of all elements in the result is identical to the sum<br>
>> > of<br>
>> > all elements of its operands. This gives us enough freedom to do<br>
>> > aggressive transformations, such as SAD or dot-product.<br>
>> ><br>
>> > Do you think if this is enough for us to get there?<br>
>> ><br>
>> ><br>
>> > Cong<br>
>> ><br>
>> >><br>
>> >> The main concern may be cost<br>
>> >>> model: we could not guarantee that a SAD loop is unrolled 16<br>
>> >>> times on SSE to<br>
>> >>> make use the most benefit of SAD. After the patch<br>
>> >>> <a href="http://reviews.llvm.org/D8943" rel="noreferrer" target="_blank">http://reviews.llvm.org/D8943</a> is landed, I am now working on<br>
>> >>> improving cost<br>
>> >>> models of type conversions on X86. I believe even without SAD<br>
>> >>> instruction<br>
>> >>> we can still get better performance with unrolling a SAD loop 16<br>
>> >>> times. This<br>
>> >>> seems tricky but it works. What do you think?<br>
>> >> I also think same as we can increase the bandwidth with proper<br>
>> >> cost modeling.<br>
>> >><br>
>> >> Regards,<br>
>> >> Shahid<br>
>> >>><br>
>> >>> Cong<br>
>> >>><br>
>> >>> ><br>
>> >>> > <a href="http://reviews.llvm.org/D10964" rel="noreferrer" target="_blank">http://reviews.llvm.org/D10964</a><br>
>> >>> > <a href="http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html" rel="noreferrer" target="_blank">http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html</a><br>
>> >>> ><br>
>> >>> > Regards,<br>
>> >>> > Shahid<br>
>> >>> ><br>
>> >>> ><br>
>> >>> ><br>
>> >>> >> -----Original Message-----<br>
>> >>> >> From: llvm-dev [mailto:<a href="mailto:llvm-dev-bounces@lists.llvm.org">llvm-dev-bounces@lists.llvm.org</a>] On<br>
>> >>> >> Behalf Of<br>
>> >>> >> Cong Hou via llvm-dev<br>
>> >>> >> Sent: Friday, November 13, 2015 5:47 AM<br>
>> >>> >> To: <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
>> >>> >> Cc: David Li<br>
>> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction add<br>
>> >>> >> instruction.<br>
>> >>> >><br>
>> >>> >> Hi<br>
>> >>> >><br>
>> >>> >> When a reduction instruction is vectorized in a loop, it will<br>
>> >>> >> be<br>
>> >>> >> turned into an instruction with vector operands of the same<br>
>> >>> >> operation<br>
>> >>> >> type. This new instruction has a special property that can<br>
>> >>> >> give us<br>
>> >>> >> more flexibility during instruction selection later: this<br>
>> >>> >> operation<br>
>> >>> >> is valid as long as the reduction of all elements of the<br>
>> >>> >> result<br>
>> >>> >> vector is identical to the reduction of all elements of its<br>
>> >>> >> operands.<br>
>> >>> >><br>
>> >>> >> One example that can benefit this property is SAD (sum of<br>
>> >>> >> absolute<br>
>> >>> >> differences) pattern detection in SSE2, which provides a<br>
>> >>> >> psadbw<br>
>> >>> >> instruction whose description is shown below:<br>
>> >>> >><br>
>> >>> >> '''<br>
>> >>> >> psadbw: Compute the absolute differences of packed unsigned<br>
>> >>> >> 8-bit<br>
>> >>> >> integers in a and b, then horizontally sum each consecutive 8<br>
>> >>> >> differences to produce two unsigned 16-bit integers, and pack<br>
>> >>> >> these<br>
>> >>> >> unsigned 16-bit integers in the low 16 bits of 64-bit elements<br>
>> >>> >> in dst.<br>
>> >>> >> '''<br>
>> >>> >><br>
>> >>> >> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs<br>
>> >>> >> and one<br>
>> >>> >> v4i32 as output. However, psadbw will actually produce one i32<br>
>> >>> >> result<br>
>> >>> >> for four pairs of 8-bit integers (an already reduced result),<br>
>> >>> >> and the<br>
>> >>> >> result is stored in the first element in v4i32. If we properly<br>
>> >>> >> zero<br>
>> >>> >> out the other three elements in v4i32, and with the<br>
>> >>> >> information that<br>
>> >>> >> we have a reduction add that is performed on this result, then<br>
>> >>> >> we can<br>
>> >>> >> safely use psadbw here for much better performance. This can<br>
>> >>> >> be done<br>
>> >>> >> during DAG combine. Another similar example is dot product.<br>
>> >>> >> And I<br>
>> >>> >> think there may be many other scenarios that can benefit from<br>
>> >>> >> this<br>
>> >>> >> property like eliminating redundant shuffles.<br>
>> >>> >><br>
>> >>> >> The question is, how to let DAG combiner know that a vector<br>
>> >>> >> operation<br>
>> >>> >> is a reduction one?<br>
>> >>> >><br>
>> >>> >> Here I propose to introduce a "reduction add" instruction for<br>
>> >>> >> vectors.<br>
>> >>> >> This will be a new instruction with vector operands only.<br>
>> >>> >> Normally it<br>
>> >>> >> is treated as a normal ADD operation, but the selection DAG<br>
>> >>> >> combiner<br>
>> >>> >> can make use of this new operation to generate better<br>
>> >>> >> instructions.<br>
>> >>> >> This new instruction is generated when vectorizing reduction<br>
>> >>> >> add in<br>
>> >>> >> loop vectorizer.<br>
>> >>> >><br>
>> >>> >> I would like to hear more comments on this proposal or<br>
>> >>> >> suggestions of<br>
>> >>> >> better alternative implementations.<br>
>> >>> >><br>
>> >>> >><br>
>> >>> >> thanks,<br>
>> >>> >> Cong<br>
>> >>> >> _______________________________________________<br>
>> >>> >> LLVM Developers mailing list<br>
>> >>> >> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
>> >>> >> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
>> _______________________________________________<br>
>> LLVM Developers mailing list<br>
>> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
>> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
>><br>
><br>
> --<br>
> Hal Finkel<br>
> Assistant Computational Scientist<br>
> Leadership Computing Facility<br>
> Argonne National Laboratory<br>
</div></div></blockquote></div><br></div>