<div dir="ltr">This sounds like the right approach and the result is very promising. The VF tuning patch is already in but just not enabled by default, right?<div><br></div><div>David</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Nov 19, 2015 at 1:12 PM, Cong Hou <span dir="ltr"><<a href="mailto:congh@google.com" target="_blank">congh@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">After some attempt to implement reduce-add in LLVM, I found out a<br>

easier way to detect reduce-add without introducing new IR operations.<br>

The basic idea is annotating phi node instead of add (so that it is<br>

easier to handle other reduction operations). In PHINode class, we can<br>

add a flag indicating if the phi node is a reduction one (the flag can<br>

be set in loop vectorizer for vectorized phi nodes). Then when we<br>

build SDNode for instruction selection, we detect those reduction phi<br>

nodes and then annotate reduction operations. This requires an<br>

additional flag in SDNodeFlags. We can then check this flag when<br>

combining instructions to detect reduction operations.<br>

<br>

In this approach, I have managed to let LLVM compile a SAD loop into<br>

psadbw instructions.<br>

<br>

Source code:<br>

<br>

<br>

const int N = 1024;<br>

unsigned char a[N], b[N];<br>

<br>

int sad() {<br>

  int s = 0;<br>

  for (int i = 0; i < N; ++i) {<br>

    int res = a[i] - b[i];<br>

    s += (res > 0) ? res : -res;<br>

  }<br>

  return s;<br>

}<br>

<br>

<br>

Emitted instructions on X86:<br>

<br>

<br>

<br>

# BB#0:                                 # %entry<br>

pxor %xmm0, %xmm0<br>

movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00<br>

pxor %xmm1, %xmm1<br>

.align 16, 0x90<br>

.LBB0_1:                                # %vector.body<br>

                                        # =>This Inner Loop Header: Depth=1<br>

movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero<br>

movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero<br>

psadbw %xmm2, %xmm3<br>

paddd %xmm3, %xmm0<br>

movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero<br>

movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero<br>

psadbw %xmm2, %xmm3<br>

paddd %xmm3, %xmm1<br>

addq $8, %rax<br>

jne .LBB0_1<br>

# BB#2:                                 # %middle.block<br>

paddd %xmm0, %xmm1<br>

pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]<br>

paddd %xmm1, %xmm0<br>

pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]<br>

paddd %xmm0, %xmm1<br>

movd %xmm1, %eax<br>

retq<br>

<br>

<br>

Note that due to smaller VF we are using now (currently 4), we could<br>

not explore the most benefit of psadbw. The patch in<br>

<a href="http://reviews.llvm.org/D8943" rel="noreferrer" target="_blank">http://reviews.llvm.org/D8943</a> has enables us to use bigger VFs based<br>

on the smallest type in a loop. The follow-up work is refining the<br>

cost model to let bigger VFs have less cost. For the example above the<br>

best result is from VF >=16.<br>

<br>

The draft of the patch is here: <a href="http://reviews.llvm.org/D14840" rel="noreferrer" target="_blank">http://reviews.llvm.org/D14840</a><br>

<br>

I will refine the patch later and submit it for review.<br>

<br>

<br>

thanks,<br>

Cong<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <<a href="mailto:congh@google.com">congh@google.com</a>> wrote:<br>

> On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad<br>

> <<a href="mailto:Asghar-ahmad.Shahid@amd.com">Asghar-ahmad.Shahid@amd.com</a>> wrote:<br>

>> Hi Cong,<br>

>><br>

>>> -----Original Message-----<br>

>>> From: Cong Hou [mailto:<a href="mailto:congh@google.com">congh@google.com</a>]<br>

>>> Sent: Tuesday, November 17, 2015 12:47 AM<br>

>>> To: Shahid, Asghar-ahmad<br>

>>> Cc: David Li<br>

>>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.<br>

>>><br>

>>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad <Asghar-<br>

>>> <a href="mailto:ahmad.Shahid@amd.com">ahmad.Shahid@amd.com</a>> wrote:<br>

>>> > Hi Cong,<br>

>>> ><br>

>>> > We had proposed an intrinsic approach to do this. However the<br>

>>> > discussion reached to a point where it was asked that "Why do we need<br>

>>> > another approach if "reduction add" can be pattern matched in<br>

>>> DAGCombine?"<br>

>>> > However I feel if we have strong enough rationale for introduction of<br>

>>> > this instruction, it would be great. The 1st link below has the complete<br>

>>> discussion about the intrinsic approach.<br>

>>><br>

>>> Yes, I think introducing such a reduction add can let us do pattern recognition<br>

>>> of either SAD or dot production (or more?) without introducing any<br>

>>> additional intrinsics.<br>

>> I agree. Another use case could be POPCOUNT operation. Moreover, as 'reduction add'<br>

>> Is being adopted by more targets now a days, reflecting that in LLVM IR as an instruction<br>

>> Is a good idea.<br>

>> BTW, what is the idea of the syntax and semantic of this operation you have?<br>

><br>

> We can introduce a reduce-add for vectors only, or make it general so<br>

> that it could also accept scale operands. Normally it is identical to<br>

> normal add, but during instruction selection we can do pattern<br>

> recognition based on more information provided by this new operations.<br>

> For vectors, this means the result of this operation only guarantee<br>

> that the sum of all elements in the result is identical to the sum of<br>

> all elements of its operands. This gives us enough freedom to do<br>

> aggressive transformations, such as SAD or dot-product.<br>

><br>

> Do you think if this is enough for us to get there?<br>

><br>

><br>

> Cong<br>

><br>

>><br>

>> The main concern may be cost<br>

>>> model: we could not guarantee that a SAD loop is unrolled 16 times on SSE to<br>

>>> make use the most benefit of SAD. After the patch<br>

>>> <a href="http://reviews.llvm.org/D8943" rel="noreferrer" target="_blank">http://reviews.llvm.org/D8943</a> is landed, I am now working on improving cost<br>

>>> models of type conversions on X86. I believe even without SAD instruction<br>

>>> we can still get better performance with unrolling a SAD loop 16 times. This<br>

>>> seems tricky but it works. What do you think?<br>

>> I also think same as we can increase the bandwidth with proper cost modeling.<br>

>><br>

>> Regards,<br>

>> Shahid<br>

>>><br>

>>> Cong<br>

>>><br>

>>> ><br>

>>> > <a href="http://reviews.llvm.org/D10964" rel="noreferrer" target="_blank">http://reviews.llvm.org/D10964</a><br>

>>> > <a href="http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html" rel="noreferrer" target="_blank">http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html</a><br>

>>> ><br>

>>> > Regards,<br>

>>> > Shahid<br>

>>> ><br>

>>> ><br>

>>> ><br>

>>> >> -----Original Message-----<br>

>>> >> From: llvm-dev [mailto:<a href="mailto:llvm-dev-bounces@lists.llvm.org">llvm-dev-bounces@lists.llvm.org</a>] On Behalf Of<br>

>>> >> Cong Hou via llvm-dev<br>

>>> >> Sent: Friday, November 13, 2015 5:47 AM<br>

>>> >> To: <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

>>> >> Cc: David Li<br>

>>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction add instruction.<br>

>>> >><br>

>>> >> Hi<br>

>>> >><br>

>>> >> When a reduction instruction is vectorized in a loop, it will be<br>

>>> >> turned into an instruction with vector operands of the same operation<br>

>>> >> type. This new instruction has a special property that can give us<br>

>>> >> more flexibility during instruction selection later: this operation<br>

>>> >> is valid as long as the reduction of all elements of the result<br>

>>> >> vector is identical to the reduction of all elements of its operands.<br>

>>> >><br>

>>> >> One example that can benefit this property is SAD (sum of absolute<br>

>>> >> differences) pattern detection in SSE2, which provides a psadbw<br>

>>> >> instruction whose description is shown below:<br>

>>> >><br>

>>> >> '''<br>

>>> >> psadbw: Compute the absolute differences of packed unsigned 8-bit<br>

>>> >> integers in a and b, then horizontally sum each consecutive 8<br>

>>> >> differences to produce two unsigned 16-bit integers, and pack these<br>

>>> >> unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.<br>

>>> >> '''<br>

>>> >><br>

>>> >> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and one<br>

>>> >> v4i32 as output. However, psadbw will actually produce one i32 result<br>

>>> >> for four pairs of 8-bit integers (an already reduced result), and the<br>

>>> >> result is stored in the first element in v4i32. If we properly zero<br>

>>> >> out the other three elements in v4i32, and with the information that<br>

>>> >> we have a reduction add that is performed on this result, then we can<br>

>>> >> safely use psadbw here for much better performance. This can be done<br>

>>> >> during DAG combine. Another similar example is dot product. And I<br>

>>> >> think there may be many other scenarios that can benefit from this<br>

>>> >> property like eliminating redundant shuffles.<br>

>>> >><br>

>>> >> The question is, how to let DAG combiner know that a vector operation<br>

>>> >> is a reduction one?<br>

>>> >><br>

>>> >> Here I propose to introduce a "reduction add" instruction for vectors.<br>

>>> >> This will be a new instruction with vector operands only. Normally it<br>

>>> >> is treated as a normal ADD operation, but the selection DAG combiner<br>

>>> >> can make use of this new operation to generate better instructions.<br>

>>> >> This new instruction is generated when vectorizing reduction add in<br>

>>> >> loop vectorizer.<br>

>>> >><br>

>>> >> I would like to hear more comments on this proposal or suggestions of<br>

>>> >> better alternative implementations.<br>

>>> >><br>

>>> >><br>

>>> >> thanks,<br>

>>> >> Cong<br>

>>> >> _______________________________________________<br>

>>> >> LLVM Developers mailing list<br>

>>> >> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

>>> >> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

</div></div></blockquote></div><br></div>