[LLVMdev] supporting SAD in loop vectorizer

Hal Finkel hfinkel at anl.gov
Tue Nov 11 07:00:23 PST 2014


----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "James Molloy" <james at jamesmolloy.co.uk>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tuesday, November 11, 2014 8:54:01 AM
> Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> 
> ----- Original Message -----
> > From: "James Molloy" <james at jamesmolloy.co.uk>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "Dibyendu Das" <Dibyendu.Das at amd.com>, llvmdev at cs.uiuc.edu
> > Sent: Tuesday, November 11, 2014 8:21:37 AM
> > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > 
> > 
> > If you'd like to contribute support for this, look at
> > isHorizontalBinOp and go from there. Feel free to ask questions if
> > you get stuck.
> > 
> > 
> > 
> > FWIW, I've looked at isHorizontalBinOp for inspiration for matching
> > AArch64 ADDV-and-friends (horizontal reduction operations), and
> > found it rather temperamental: it is prone to breaking depending
> > on the exact form of the IR. Given that we don't have a canonical
> > form for reductions, I think it is wrong that we expect targets to
> > undo quite complex patterns.
> > 
> > 
> > The reduction pattern is a log2(n) sequence of shuffles and binops,
> > which is really rather complex. This sort of thing should, IMHO,
> > be an intrinsic. I chatted with Arnold about this at the devmtg and
> > was going to send a patch to do exactly that in a week or so.
> 
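For readers unfamiliar with the shape of that pattern, here is a minimal scalar sketch in C of a log2(n) horizontal add. It is only an illustration: the names reduce_add and MAX_LANES are hypothetical, and the inner loop stands in for one shuffle (bringing the upper half of the live lanes down) followed by one element-wise add.

```c
#include <assert.h>

#define MAX_LANES 64  /* hypothetical upper bound for this sketch */

/* Horizontal add of n lanes (n a power of two) in log2(n) steps. */
static int reduce_add(const int *v, int n)
{
    int tmp[MAX_LANES];

    /* The halving scheme below only works for power-of-two widths. */
    assert(n > 0 && n <= MAX_LANES && (n & (n - 1)) == 0);
    for (int i = 0; i < n; i++)
        tmp[i] = v[i];

    /* Each outer iteration models one shuffle + binop pair:
       lanes [half, 2*half) are added onto lanes [0, half). */
    for (int half = n / 2; half >= 1; half /= 2)
        for (int i = 0; i < half; i++)
            tmp[i] += tmp[i + half];

    return tmp[0];  /* models the final extract of lane 0 */
}
```

With a <4 x i32> accumulator this takes two steps (half = 2, then half = 1) before the final extract, which is the sequence a target currently has to recognise and undo.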
> Sounds good. We should try hard to canonicalize into the intrinsic in
> InstCombine from the shuffles

Or maybe we should do this in CGP (CodeGenPrepare) -- would we want to do this if there is no actual target support?

 -Hal

> (in addition to emitting it directly
> from the vectorizer), but it is likely easier to do there than in
> the backend.
> 
>  -Hal
> 
> > 
> > 
> > Cheers,
> > 
> > 
> > James
> > 
> > 
> > On 11 November 2014 13:35, Hal Finkel < hfinkel at anl.gov > wrote:
> > 
> > 
> > ----- Original Message -----
> > > From: "Dibyendu Das" < Dibyendu.Das at amd.com >
> > > To: "Hal Finkel" < hfinkel at anl.gov >, "Renato Golin" <
> > > renato.golin at linaro.org >
> > > Cc: llvmdev at cs.uiuc.edu
> > > Sent: Tuesday, November 4, 2014 12:15:12 PM
> > > Subject: RE: [LLVMdev] supporting SAD in loop vectorizer
> > > 
> > > Here's the simple SAD code:
> > > ---------------------------------------------------
> > > #include <stdlib.h>
> > >
> > > extern int ly, lx;
> > >
> > > int sad_c( unsigned char *pix1, unsigned char *pix2 )
> > > {
> > >     int i_sum = 0;
> > >     for( int x = 0; x < lx; x++ )
> > >         i_sum += abs( pix1[x] - pix2[x] );
> > >     return i_sum;
> > > }
> > > -----------------------------------------------------
> > > 
> > > The loop vectorizer does vectorize the loop and then unrolls it
> > > twice. The main body of the loop ends up as shown below, where
> > > the icmp, neg, select pattern appears twice.
> > > Are we saying we pattern-match this to PSADBW in the target?
> > 
> > Yes.
> > 
> > > That seems
> > > to have some challenges
> > 
> > It does, but we already have code in the backend that matches other
> > horizontal operations (see isHorizontalBinOp and its callers in
> > lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be
> > significantly more complicated.
> > 
> > > including the fact that we would need a
> > > 4-way unroll to use all of a 128-bit PSADBW. Or am I
> > > missing something?
> > 
> > No, each unrolling will get its own, so you'll get a PSADBW from
> > each time the loop is unrolled. Each unrolling is vectorized in
> > terms of <4 x i32>, and that is the 128 bits you need.
> > 
> > If you'd like to contribute support for this, look at
> > isHorizontalBinOp and go from there. Feel free to ask questions if
> > you get stuck.
> > 
> > -Hal
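As a reference point for anyone attempting the matching, here is a scalar model in C of what the 128-bit form of PSADBW computes, assuming the usual SSE2 definition: each operand holds sixteen unsigned bytes, and the instruction produces two sums, one per 8-byte half, each placed in the low 16 bits of its 64-bit lane. The function name psadbw_model is hypothetical; this is a sketch of the semantics, not backend code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of 128-bit PSADBW (assumed SSE2 semantics):
   out[0] receives the sum of absolute differences of bytes 0..7,
   out[1] the sum for bytes 8..15. */
static void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                         uint64_t out[2])
{
    for (int half = 0; half < 2; half++) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; i++) {
            int d = (int)a[8 * half + i] - (int)b[8 * half + i];
            sum += (uint64_t)abs(d);
        }
        out[half] = sum;  /* at most 8 * 255 = 2040, so it fits in 16 bits */
    }
}
```

Any pattern matched to the instruction has to supply its inputs as byte vectors and then add the two partial sums to obtain a single scalar.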
> > 
> > 
> > 
> > > 
> > > > vector.body:                ; preds = %vector.body.preheader, %vector.body
> > > >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> > > >   %vec.phi = phi <4 x i32> [ %24, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
> > > >   %vec.phi9 = phi <4 x i32> [ %25, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
> > > >   %4 = getelementptr inbounds i8* %pix1, i64 %index
> > > >   %5 = bitcast i8* %4 to <4 x i8>*
> > > >   %wide.load = load <4 x i8>* %5, align 1
> > > >   %.sum19 = or i64 %index, 4
> > > >   %6 = getelementptr i8* %pix1, i64 %.sum19
> > > >   %7 = bitcast i8* %6 to <4 x i8>*
> > > >   %wide.load10 = load <4 x i8>* %7, align 1
> > > >   %8 = zext <4 x i8> %wide.load to <4 x i32>
> > > >   %9 = zext <4 x i8> %wide.load10 to <4 x i32>
> > > >   %10 = getelementptr inbounds i8* %pix2, i64 %index
> > > >   %11 = bitcast i8* %10 to <4 x i8>*
> > > >   %wide.load11 = load <4 x i8>* %11, align 1
> > > >   %.sum20 = or i64 %index, 4
> > > >   %12 = getelementptr i8* %pix2, i64 %.sum20
> > > >   %13 = bitcast i8* %12 to <4 x i8>*
> > > >   %wide.load12 = load <4 x i8>* %13, align 1
> > > >   %14 = zext <4 x i8> %wide.load11 to <4 x i32>
> > > >   %15 = zext <4 x i8> %wide.load12 to <4 x i32>
> > > >   %16 = sub nsw <4 x i32> %8, %14
> > > >   %17 = sub nsw <4 x i32> %9, %15
> > > >   %18 = icmp sgt <4 x i32> %16, <i32 -1, i32 -1, i32 -1, i32 -1>
> > > >   %19 = icmp sgt <4 x i32> %17, <i32 -1, i32 -1, i32 -1, i32 -1>
> > > >   %20 = sub <4 x i32> zeroinitializer, %16
> > > >   %21 = sub <4 x i32> zeroinitializer, %17
> > > >   %22 = select <4 x i1> %18, <4 x i32> %16, <4 x i32> %20
> > > >   %23 = select <4 x i1> %19, <4 x i32> %17, <4 x i32> %21
> > > >   %24 = add nsw <4 x i32> %22, %vec.phi
> > > >   %25 = add nsw <4 x i32> %23, %vec.phi9
> > > >   %index.next = add i64 %index, 8
> > > >   %26 = icmp eq i64 %index.next, %n.vec
> > > >   br i1 %26, label %middle.block.loopexit, label %vector.body, !llvm.loop !1
> > > -----------------------------------------------------
> > > 
> > > -----Original Message-----
> > > From: Hal Finkel [mailto: hfinkel at anl.gov ]
> > > Sent: Tuesday, November 04, 2014 9:54 PM
> > > To: Renato Golin
> > > Cc: llvmdev at cs.uiuc.edu ; Das, Dibyendu
> > > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > > 
> > > ----- Original Message -----
> > > > From: "Renato Golin" < renato.golin at linaro.org >
> > > > To: "Dibyendu Das" < Dibyendu.Das at amd.com >
> > > > Cc: llvmdev at cs.uiuc.edu
> > > > Sent: Tuesday, November 4, 2014 5:23:30 AM
> > > > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > > > 
> > > > On 4 November 2014 11:06, Das, Dibyendu < Dibyendu.Das at amd.com >
> > > > wrote:
> > > > > Is there any plan to support special idioms in the loop
> > > > > vectorizer like sum of absolute differences (SAD)? We see some
> > > > > useful cases where llvm is losing performance at -O3 due to
> > > > > SADs not being vectorized (hence PSADBWs not being generated).
> > > > 
> > > > It's been a while, but this could either be that the
> > > > legalisation phase is not recognising the reduction or that the
> > > > cost is not taking into account the lowered abs().
> > > > 
> > > > What does -debug-only=loop-vectorize say about it?
> > > 
> > > FWIW, I agree, this sounds like a cost-model problem. The
> > > loop-vectorizer should be able to vectorize the 'icmp; neg;
> > > select' pattern, and then the backend can pattern-match that,
> > > together with the reduction (which is a series of shuffles and
> > > extract_element), into the single instruction PSADBW -- we're
> > > quite likely missing the target code to do that.
> > > 
> > > -Hal
> > > 
> > > > 
> > > > cheers,
> > > > --renato
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > > > 
> > > 
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > > 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
