[LLVMdev] RFC: generation of PSAD instruction

Thu Feb 5 10:38:30 PST 2015

 [VJ] I tired looking for other patterns which are similar but couldn’t
find. Do let me know if you find any other pattern.

Perhaps matching a dot product (dpps/dppd) is similar enough?

I filed PR21975 for this a few weeks ago:
http://llvm.org/bugs/show_bug.cgi?id=21975

On Mon, Feb 2, 2015 at 10:25 PM, Muthyala, Vj <Vj.Muthyala at amd.com> wrote:

> Hi,
>
> Thank you for your feedback. Fews comments inlined.[VJ]
>
>
> -Vijender
>
> -----Original Message-----
> From: aschwaighofer at apple.com [mailto:aschwaighofer at apple.com]
> Sent: Monday, February 02, 2015 11:25 PM
> To: Muthyala, Vj; Hal Finkel
> Cc: LLVM Dev; James Molloy; Michael Zolotukhin; Nadav Rotem; suyog sarda;
> ahmed.bougacha at gmail.com
> Subject: RE: [LLVMdev] RFC: generation of PSAD instruction
>
> Hi Vijender,
>
> thanks for working on this! Comments below.
>
>
> On 02/02/15, "Muthyala, Vj"   wrote:
> > Hi all,
> >
> > I have done some work on implementing generation of SAD in
> LoopVectorizer. I am sharing two patches which should help to understand my
> idea. There are many places where I am not sure about the right way, which
> I want to put forward for discussion. Let me talk about each patch
> individually.
> >
> > LoopVectorizerSAD.patch
> >
> > This patch contains:
> >  -  Introduction of new ReductionKind "RK_IntegerSad”.
>
>
> Sounds good.
>
> >  -  Identify the sad pattern which is a sequence of SUB, ICMP, NEGATE,
> SELECT and ADD in isSadPattern() function .
>
> I suspect that there will be other patterns than SAD. We should probably
> look for at least one other similar reduction case and then generalize the
> logic such that it handles both cases. What I mean by this is that most of
> the code should not talk about isSADPattern, rather, the specific pattern
> should be hidden behind a generalized function, for example,
> isSpecialPattern().
>
> [VJ] I tired looking for other patterns which are similar but couldn’t
> find. Do let me know if you find any other pattern.
>
> But in principle this is fine.
>
> I think that we should probably only match an SAD pattern if the target
> supports it.
>
> >  -  Introduction of new list into the reduction Descriptor to store the
> SAD pattern instructions. This list is needed for cost computations and
> also needed for removal once the LLVM intrinsic is created.
> This should be a set. We should test this set when we vectorize
> instructions (no need to vectorize and then erase again) and for cost
> computation (no need to count and then subtract again).
> [VJ] Sure I will work on it.
>
> We also need to make sure that instructions in the pattern are not used
> other than by the pattern (I have not looked closely whether you did that).
> [VJ] yes I have already checked that in the code.
>
> >  -  Modification of the getWidestType function to return 8 if SAD
> pattern is present. This is because for SAD in x86, the minimum
> vectorization factor needed is 8. In the present implementation of cost
> model, the highest VF which is tested is WidestRegisterSize/WidestType. If
> WidesrRegisterSize is 128 and if there is any variable in the basic block
> which is i32 which is not actually related to vector data then also max VF
> will come to be 4 (128/32). So if I replace widestType with 8 then max VF
> will come to 16. My thought is that higher vectorization factors like 8 or
> 16 also needs to be given chance for computing its cost which makes sense
> for instructions like SAD. To make this happen we can divide the
> widestRegisterSize with SmallestType. I am not sure about the circumstances
> of this. We can also get target information and decide.
>
>
> In principle fine but this also needs some kind of abstraction:
> getWidestTypeForPattern or something the like. The pattern can then decide.
> For an SAD it will be the type of the inputs to the SAD. You don’t need
> target info for this.
>
> > -  For the cost calculations, the SAD intrinsic is not yet present in
> the basic block. So the relevant PHI node is passed to getInstructionCost()
> function. Inside this function, a dummy SAD intrinsic is passed and cost is
> computed.
> Keying off the PHI node is fine.
>
> > -  Right now the Basic Block created by LoopVectorizer is modified .This
> code works fine for known trip count. If the trip count is not known at
> compile time then we may need to have multiple versions of the loop body.
> I don't understand why the TripCount is important. The only thing you need
> is the vectorization factor VF. The loop vectorizer already generates a
> skeleton that will jump to a scalar copy of the loop if that is not the
> case.
>
> [VJ] - I think I need to explain in detail. You are right that it will
> jump to a scalar copy but the issue here is, if for instance VF comes out
> to be 16 then all instructions will be widened to 16. Now in this situation
> if the TripCount comes to be 15 then the whole code is executed in scalar
> copy. But the right way is to execute a body with VF 8 and then the
> remaining  iterations in scalar body ( or could be body of VF 4 followed by
> scalar body). That is why I am talking about multiple versions of the loop,
> but yes there could be overheads with branching instructions. This problem
> wont arise if the TripCount is known and considered in selecting VF.
>
> >
> > CostModelingSAD.patch
> >
> > This patch includes:
> > - Addition of new function  getSadInstrCost() in the targetTransformInfo.
> > - Addition of LLVM Intrinsic which takes anyvector type as arguments and
> returns I32.
>
> Why only i32?
>
> [VJ] - We want to make this LLVM intrinsic generic. Whatever the size of
> vectors could be and whatever the output of PSAD instruction is, it is
> meant to return only one value as the user is writing a reduction operation
> which returns only one value. If for instance PSAD returns 128 bits then we
> are going to merge this into single value in lowering intrinsic and return.
> BTW PSAD works only for char vectors, so the problem of overflowing also
> will not occur if it is i32.
>
> > - Addition of new SAD DAG node. This is also required for lowering the
> LLVM intrinsic.
> >
> > The work on lowering LLVM intrinsic to X86 is in progress.
> >
> > Please provide me with your inputs.
> >
> > Note: These patches may not work if you directly integrate with the
> code. These are provided only for explanation. Do let me know if you want a
> complete working patch.
> >
> > Thank you,
> > Vijender.
> >
> >
> > -----Original Message-----
> > From: Ahmed Bougacha
> > [mailto:ahmed.bougacha at gmail.com](javascript:main.compose()
> > Sent: Wednesday, January 28, 2015 11:39 PM
> > To: Hal Finkel
> > Cc: Muthyala, Vj; LLVM Dev
> > Subject: Re: [LLVMdev] RFC: generation of PSAD instruction
> >
> > On Wed, Jan 28, 2015 at 7:50 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> Hi Vijender,
> >>
> >> Thanks for posting this, there is wide support here for improving our
> support for reductions of various kinds, both in flavor and robustness.
> I've cc'd some others who have previously discussed this.
> >>
> >> James has advocated in the past for an intrinsic for horizontal
> reductions, which I support. We also need better support for vectorization
> cost modeling of combinations of IR-level instructions that have efficient
> target representations. Taken together, I think this addresses the use case
> you're highlighting here (and more).
> >
> > The second part (multi-instruction cost modeling) is on my todo list, to
> vectorize saturation code.  I created a thread recently, and indeed, people
> have mentioned SAD as another potential user.  I need to look into how the
> cost model interacts with the reduction variable identification (it
> doesn't?), and will come up with a patch soon!
> >
> > -Ahmed
> >
> >>
> >> -Hal
> >>
> >> ----- Original Message -----
> >>> From: "Vj Muthyala" <Vj.Muthyala at amd.com>
> >>> To: llvmdev at cs.uiuc.edu
> >>> Sent: Tuesday, January 27, 2015 10:14:44 PM
> >>> Subject: [LLVMdev] RFC: generation of PSAD instruction
> >>>
> >>> Hello,
> >>>
> >>> I was looking at the following test case which is very relevant in
> >>> imaging applications.
> >>>
> >>>
> >>>
> >>> int sad(unsigned char *pix1, unsigned char *pix2)
> >>>
> >>> {
> >>>
> >>> int sum = 0;
> >>>
> >>> for( int x = 0; x < 16; x++ )
> >>>
> >>> {
> >>>
> >>> sum += abs( pix1[x] - pix2[x] );
> >>>
> >>> }
> >>>
> >>> return sum;
> >>>
> >>> }
> >>>
> >>>
> >>>
> >>> The llvm IR generated after all the IR optimizations is
> >>>
> >>>
> >>>
> >>> …..
> >>>
> >>> %9 = bitcast i8* %8 to <4 x i8>*
> >>>
> >>> %wide.load.1 = load <4 x i8>* %9, align 1
> >>>
> >>> %10 = zext <4 x i8> %wide.load.1 to <4 x i32>
> >>>
> >>> %11 = getelementptr inbounds i8* %pix2, i64 4
> >>>
> >>> %12 = bitcast i8* %11 to <4 x i8>*
> >>>
> >>> %wide.load7.1 = load <4 x i8>* %12, align 1
> >>>
> >>> %13 = zext <4 x i8> %wide.load7.1 to <4 x i32>
> >>>
> >>> %14 = sub nsw <4 x i32> %10, %13
> >>>
> >>> %15 = icmp uge <4 x i8> %wide.load.1, %wide.load7.1
> >>>
> >>> %16 = sub nsw <4 x i32> zeroinitializer, %14
> >>>
> >>> %17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16
> >>>
> >>> %18 = add nsw <4 x i32> %17, %7
> >>>
> >>> %19 = getelementptr inbounds i8* %pix1, i64 8
> >>>
> >>> …..
> >>>
> >>>
> >>>
> >>> The test case is a perfect example for the generation of Sum of
> >>> Absolute Difference (SAD) instruction which is present in most of
> >>> the targets. The proposed generated IR is : ( directly x86 intrinsic
> >>> is called just to demonstrate the difference )
> >>>
> >>>
> >>>
> >>> …..
> >>>
> >>> %6 = bitcast i8* %5 to x86_mmx*
> >>>
> >>> %wide.load8.1 = load x86_mmx* %6, align 1
> >>>
> >>> %7 = getelementptr inbounds i8* %pix2, i64 8
> >>>
> >>> %8 = bitcast i8* %7 to x86_mmx*
> >>>
> >>> %wide.load79.1 = load x86_mmx* %8, align 1
> >>>
> >>> %9 = call x86_mmx @llvm.x86.mmx.psad.bw(x86_mmx %wide.load8.1,
> >>> x86_mmx %wide.load79.1)
> >>>
> >>> %10 = bitcast x86_mmx %9 to i64
> >>>
> >>> %11 = trunc i64 %10 to i32
> >>>
> >>> %12 = add i32 %11, %4
> >>>
> >>> …..
> >>>
> >>>
> >>>
> >>> This Loop optimization can reduce the loop to only one instruction
> >>> for every 4 or 8 or 16 iterations of the loop ( depending on the
> >>> target sad instruction).
> >>>
> >>>
> >>>
> >>> Proposed Solution
> >>>
> >>>
> >>>
> >>> The sad pattern shown above which is the abs pattern with reduction
> >>> variable ( sequence of sub, icomp, sub, select, add ) can be
> >>> replaced with an intrinsic call.
> >>>
> >>> For a non-unrolled loop we can do this in LoopVectorization where
> >>> all the infrastructure for identifying reduction variables is
> >>> already there.
> >>>
> >>> We just need to identify the sad pattern, remove those instructions
> >>> and call an intrinsic.
> >>>
> >>> This can be done in the loop body which Vectorizer creates
> >>> (vector.body) without disturbing the scalar body of the loop.
> >>>
> >>>
> >>>
> >>> For this to happen in LoopVectorizer, we also need to influence the
> >>> cost modeling of LV. For computing cost for different Vectorization
> >>> Factors, we should remove the cost of sad pattern and add the cost
> >>> of intrinsic.
> >>>
> >>> If selected vectorization factor comes to be equal to any operand
> >>> vector size of basic SAD instruction for that particular target (8
> >>> for x86 target) then we can go ahead and replace with intrinsic.
> >>>
> >>>
> >>>
> >>> This intrinsic call can be a call to llvm intrinsic which can be
> >>> lowered into different target intrinsic calls depending on the VF
> >>> selected.
> >>>
> >>>
> >>>
> >>> For unrolled loops, we can target in SLPVectorizer.
> >>>
> >>>
> >>>
> >>> The work in loopVectorizer using llvm intrinsic call is ready for
> >>> x86 codegen and can send the patch after this discussion.
> >>>
> >>> ( Attached contains complete IR with and without 8 byte SAD for the
> >>> above test case after optimizations )
> >>>
> >>>
> >>>
> >>> Please provide your thoughts.
> >>>
> >>>
> >>>
> >>> Thank you,
> >>>
> >>> Vijender
> >>>
> >>>
> >>> _______________________________________________
> >>> LLVM Developers mailing list
> >>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> >>>
> >>
> >> --
> >> Hal Finkel
> >> Assistant Computational Scientist
> >> Leadership Computing Facility
> >> Argonne National Laboratory
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150205/9ba7c853/attachment.html>