[LLVMdev] RFC: generation of PSAD instruction

Ahmed Bougacha ahmed.bougacha at gmail.com
Mon Feb 2 09:39:25 PST 2015


On Mon, Feb 2, 2015 at 4:48 AM, Muthyala, Vj <Vj.Muthyala at amd.com> wrote:

> Hi all,
>
> I have done some work on implementing generation of SAD in the
> LoopVectorizer. I am sharing two patches which should help explain my
> idea. There are several places where I am not sure of the right
> approach, and I want to put those forward for discussion. Let me talk
> about each patch individually.
>
> LoopVectorizerSAD.patch
>

I know this is a work-in-progress, but for formatting, please have a look
at the coding standards and try clang-format.


> This patch contains:
>   -  Introduction of a new ReductionKind, "RK_IntegerSad".
>   -  Identification of the SAD pattern, which is a sequence of SUB, ICMP,
> NEGATE, SELECT and ADD, in the isSadPattern() function (see the sketch
> after this list).
>   -  Introduction of a new list in the reduction descriptor to store the
> SAD pattern instructions. This list is needed for cost computation and
> also for removing the pattern once the LLVM intrinsic is created.
>   -  Modification of the getWidestType() function to return 8 if the SAD
> pattern is present. This is because for SAD on x86, the minimum
> vectorization factor needed is 8. In the present implementation of the
> cost model, the highest VF that is tested is
> WidestRegisterSize/WidestType. If WidestRegisterSize is 128 and the basic
> block contains any i32 variable that is not actually related to the
> vector data, then the max VF still comes out to 4 (128/32). If I replace
> WidestType with 8, the max VF becomes 16. My thought is that higher
> vectorization factors like 8 or 16 also need to be given a chance to have
> their cost computed, which makes sense for instructions like SAD. To make
> this happen we could divide WidestRegisterSize by the SmallestType, but I
> am not sure about the consequences of doing that. We could also query the
> target and decide.
>
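
For reference, the SUB/ICMP/NEGATE/SELECT/ADD chain described in the list
above could be matched roughly as follows. This is only a sketch, not code
from the patch: the helper name, the operand order, and the accepted
predicates are assumptions, and a real isSadPattern() would also have to
look through the zexts between the i8 loads and the i32 arithmetic.

  // Scalar shape of the reduction step being matched:
  //   %d   = sub  %a, %b
  //   %c   = icmp uge %a, %b
  //   %neg = sub  0, %d
  //   %abs = select %c, %d, %neg
  //   %sum = add  %abs, %rdx.phi
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/PatternMatch.h"

  using namespace llvm;
  using namespace llvm::PatternMatch;

  static bool looksLikeSadReductionStep(Instruction *Add, PHINode *RdxPhi) {
    Value *Abs, *Diff, *Neg, *A, *B;
    ICmpInst::Predicate Pred;

    // The reduction update: the absolute difference added into the PHI.
    if (!match(Add, m_Add(m_Value(Abs), m_Specific(RdxPhi))) &&
        !match(Add, m_Add(m_Specific(RdxPhi), m_Value(Abs))))
      return false;

    // The abs() idiom: select between the difference and its negation.
    if (!match(Abs, m_Select(m_ICmp(Pred, m_Value(A), m_Value(B)),
                             m_Value(Diff), m_Value(Neg))))
      return false;
    if (Pred != ICmpInst::ICMP_UGE && Pred != ICmpInst::ICMP_SGE)
      return false;

    // The difference and its negation; a full check would also verify that
    // %a/%b feed the sub (possibly through zext) as well as the compare.
    return match(Diff, m_Sub(m_Value(), m_Value())) &&
           match(Neg, m_Sub(m_Zero(), m_Specific(Diff)));
  }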

I haven't had a good look at the rest yet, but the getWidestType() change
looks dubious to me.  This should go in the cost model and TTI: determining
whether the vectorization is profitable is the cost model's job, and baking
target-specific information into the vectorizer isn't desirable either.

-Ahmed
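
A minimal sketch of the target-neutral alternative being suggested here,
assuming TTI-style queries for the widest register and the smallest
observed element type (the function name is made up):

  // With 128-bit registers and i8 loads this yields candidate VFs up to 16
  // without hard-coding the number 8 anywhere in the vectorizer.
  unsigned maxCandidateVF(unsigned WidestRegisterBits,
                          unsigned SmallestTypeBits) {
    return WidestRegisterBits / SmallestTypeBits; // e.g. 128 / 8 = 16
  }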

>  -  For the cost calculations, the SAD intrinsic is not yet present in
> the basic block, so the relevant PHI node is passed to the
> getInstructionCost() function. Inside this function, a dummy SAD
> intrinsic is passed and its cost is computed.
>  -  Right now the basic block created by the LoopVectorizer is modified.
> This code works fine for a known trip count. If the trip count is not
> known at compile time then we may need multiple versions of the loop
> body.
>
> CostModelingSAD.patch
>
> This patch includes:
>  - Addition of a new function, getSadInstrCost(), to TargetTransformInfo
> (see the cost-hook sketch below).
>  - Addition of an LLVM intrinsic which takes any vector type as its
> arguments and returns i32.
>  - Addition of a new SAD DAG node. This is also required for lowering the
> LLVM intrinsic.
>
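
To make the cost hook concrete, here is a minimal self-contained sketch. It
deliberately does not use the real TargetTransformInfo interface: the class
names, the signature of getSadInstrCost, and the cost values are all
assumptions.

  // Hypothetical stand-in for the proposed TTI hook; not the real TTI API.
  struct SadCostModel {
    // Cost of reducing VF lanes of ElemBits-wide absolute differences into
    // one i32 partial sum.  The generic fallback prices the scalarized
    // sub/icmp/neg/select/add sequence.
    virtual unsigned getSadInstrCost(unsigned ElemBits, unsigned VF) const {
      return 5 * VF;
    }
    virtual ~SadCostModel() = default;
  };

  // x86: psadbw consumes 8 (MMX) or 16 (SSE2) i8 lanes per operand in a
  // single instruction, so those VFs become dramatically cheaper.
  struct X86SadCostModel : SadCostModel {
    unsigned getSadInstrCost(unsigned ElemBits, unsigned VF) const override {
      if (ElemBits == 8 && (VF == 8 || VF == 16))
        return 1; // illustrative value, to be tuned against the real model
      return SadCostModel::getSadInstrCost(ElemBits, VF);
    }
  };

The vectorizer would then compare this cost against the cost of the
open-coded sub/icmp/neg/select/add sequence when choosing the VF.
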
> The work on lowering LLVM intrinsic to X86 is in progress.
>
> Please provide me with your inputs.
>
> Note: These patches may not work if you integrate them directly with the
> code; they are provided only for explanation. Do let me know if you want
> a complete working patch.
>
> Thank you,
> Vijender.
>
>
> -----Original Message-----
> From: Ahmed Bougacha [mailto:ahmed.bougacha at gmail.com]
> Sent: Wednesday, January 28, 2015 11:39 PM
> To: Hal Finkel
> Cc: Muthyala, Vj; LLVM Dev
> Subject: Re: [LLVMdev] RFC: generation of PSAD instruction
>
> On Wed, Jan 28, 2015 at 7:50 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > Hi Vijender,
> >
> > Thanks for posting this. There is wide support here for improving our
> > support for reductions of various kinds, both in flavor and robustness.
> > I've cc'd some others who have previously discussed this.
> >
> > James has advocated in the past for an intrinsic for horizontal
> > reductions, which I support. We also need better support for
> > vectorization cost modeling of combinations of IR-level instructions
> > that have efficient target representations. Taken together, I think this
> > addresses the use case you're highlighting here (and more).
>
> The second part (multi-instruction cost modeling) is on my todo list, to
> vectorize saturation code.  I created a thread recently, and indeed, people
> have mentioned SAD as another potential user.  I need to look into how the
> cost model interacts with the reduction variable identification (it
> doesn't?), and will come up with a patch soon!
>
> -Ahmed
>
> >
> >  -Hal
> >
> > ----- Original Message -----
> >> From: "Vj Muthyala" <Vj.Muthyala at amd.com>
> >> To: llvmdev at cs.uiuc.edu
> >> Sent: Tuesday, January 27, 2015 10:14:44 PM
> >> Subject: [LLVMdev] RFC: generation of PSAD instruction
> >>
> >> Hello,
> >>
> >> I was looking at the following test case which is very relevant in
> >> imaging applications.
> >>
> >> int sad(unsigned char *pix1, unsigned char *pix2)
> >> {
> >>   int sum = 0;
> >>   for( int x = 0; x < 16; x++ )
> >>   {
> >>     sum += abs( pix1[x] - pix2[x] );
> >>   }
> >>   return sum;
> >> }
> >>
> >> The LLVM IR generated after all the IR optimizations is:
> >>
> >> …..
> >> %9 = bitcast i8* %8 to <4 x i8>*
> >> %wide.load.1 = load <4 x i8>* %9, align 1
> >> %10 = zext <4 x i8> %wide.load.1 to <4 x i32>
> >> %11 = getelementptr inbounds i8* %pix2, i64 4
> >> %12 = bitcast i8* %11 to <4 x i8>*
> >> %wide.load7.1 = load <4 x i8>* %12, align 1
> >> %13 = zext <4 x i8> %wide.load7.1 to <4 x i32>
> >> %14 = sub nsw <4 x i32> %10, %13
> >> %15 = icmp uge <4 x i8> %wide.load.1, %wide.load7.1
> >> %16 = sub nsw <4 x i32> zeroinitializer, %14
> >> %17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16
> >> %18 = add nsw <4 x i32> %17, %7
> >> %19 = getelementptr inbounds i8* %pix1, i64 8
> >> …..
> >>
> >> The test case is a perfect candidate for the generation of the Sum of
> >> Absolute Differences (SAD) instruction, which is present in most
> >> targets. The proposed generated IR is (the x86 intrinsic is called
> >> directly here just to demonstrate the difference):
> >>
> >> …..
> >> %6 = bitcast i8* %5 to x86_mmx*
> >> %wide.load8.1 = load x86_mmx* %6, align 1
> >> %7 = getelementptr inbounds i8* %pix2, i64 8
> >> %8 = bitcast i8* %7 to x86_mmx*
> >> %wide.load79.1 = load x86_mmx* %8, align 1
> >> %9 = call x86_mmx @llvm.x86.mmx.psad.bw(x86_mmx %wide.load8.1,
> >> x86_mmx %wide.load79.1)
> >> %10 = bitcast x86_mmx %9 to i64
> >> %11 = trunc i64 %10 to i32
> >> %12 = add i32 %11, %4
> >> …..
> >>
> >> This loop optimization can reduce the loop body to a single instruction
> >> for every 4, 8 or 16 iterations of the loop (depending on the target's
> >> SAD instruction).
> >>
> >> Proposed Solution
> >>
> >> The SAD pattern shown above, which is the abs pattern on a reduction
> >> variable (a sequence of sub, icmp, sub, select, add), can be replaced
> >> with an intrinsic call.
> >>
> >> For a non-unrolled loop we can do this in the LoopVectorizer, where all
> >> the infrastructure for identifying reduction variables is already in
> >> place.
> >>
> >> We just need to identify the sad pattern, remove those instructions
> >> and call an intrinsic.
> >>
> >> This can be done in the loop body which the Vectorizer creates
> >> (vector.body) without disturbing the scalar body of the loop.
> >>
> >> For this to happen in the LoopVectorizer, we also need to influence its
> >> cost modeling: when computing the cost for different vectorization
> >> factors, we should remove the cost of the SAD pattern and add the cost
> >> of the intrinsic, as sketched below.
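
A sketch of that adjustment, with hypothetical names; the real
LoopVectorizer cost model is more involved than this:

  #include <cstdint>

  // Cost of the vector body at a candidate VF once the SAD pattern is
  // replaced by the intrinsic: drop the pattern's contribution and add the
  // cost TTI reports for the intrinsic at this VF.
  uint64_t adjustedLoopCost(uint64_t LoopCostAtVF, uint64_t SadPatternCostAtVF,
                            uint64_t SadIntrinsicCostAtVF) {
    return LoopCostAtVF - SadPatternCostAtVF + SadIntrinsicCostAtVF;
  }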
> >>
> >> If the selected vectorization factor comes out equal to an operand
> >> vector width of the target's SAD instruction (8 for the x86 target),
> >> then we can go ahead and replace the pattern with the intrinsic.
> >>
> >> This can be a call to a target-independent LLVM intrinsic, which is
> >> then lowered into different target intrinsic calls depending on the VF
> >> selected.
> >>
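
As an illustration of the IR-emission step, a sketch of how the vectorizer
might create such a call through IRBuilder. Intrinsic::sad does not exist
today; it stands in for the proposed target-independent intrinsic, and the
helper name is made up:

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Intrinsics.h"
  #include "llvm/IR/Module.h"

  using namespace llvm;

  // Emit a call to the hypothetical intrinsic, overloaded on the input
  // vector type and always returning i32, at the builder's insertion point.
  static Value *emitSadCall(IRBuilder<> &Builder, Value *VecA, Value *VecB) {
    Module *M = Builder.GetInsertBlock()->getParent()->getParent();
    Function *SadFn =
        Intrinsic::getDeclaration(M, Intrinsic::sad, VecA->getType());
    return Builder.CreateCall(SadFn, {VecA, VecB}, "sad");
  }
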
> >> For unrolled loops, we can target the same pattern in the
> >> SLPVectorizer.
> >>
> >> The LoopVectorizer work using the LLVM intrinsic call is ready for x86
> >> codegen; I can send the patch after this discussion.
> >>
> >> (The attachment contains the complete IR, with and without the 8-byte
> >> SAD, for the above test case after optimizations.)
> >>
> >> Please provide your thoughts.
> >>
> >> Thank you,
> >>
> >> Vijender
> >>
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>

