[PATCH] [SLPVectorization] Vectorize Reduction Chain feeding into a 'return' statement

suyog suyog.sarda at samsung.com
Fri Nov 14 04:27:39 PST 2014


Hi all,

I ran the performance test suite on X86 for 10 iterations; the output is below.

{F244234}

Please NOTE - BASELINE is WITH THE PROPOSED PATCH and CURRENT is WITHOUT the patch.
(I tried swapping the baseline and the current so that the current run would carry the
proposed patch; strangely, the report always shows the baseline as the patched run.)
No regression was observed. Test cases marked in RED are improvements, though they are
unrelated to the proposed patch.

This code will be reusable for future improvements in identifying consecutive memory
accesses within the same subtree. That will come in a separate patch, since it is
unrelated to the objective of this one.

I also checked the code generated for types smaller than 32 bits on AArch64.

Test case:

    #include <arm_neon.h>
    short hadd(short *a) {
      return ((a[0] + a[2]) + (a[1] + a[3]));
    }

IR after -O1 (without SLP):

    define i16 @hadd(i16* nocapture readonly %a) #0 {
    entry:
      %0 = load i16* %a, align 2, !tbaa !1
      %conv13 = zext i16 %0 to i32
      %arrayidx1 = getelementptr inbounds i16* %a, i64 2
      %1 = load i16* %arrayidx1, align 2, !tbaa !1
      %conv214 = zext i16 %1 to i32
      %arrayidx3 = getelementptr inbounds i16* %a, i64 1
      %2 = load i16* %arrayidx3, align 2, !tbaa !1
      %conv415 = zext i16 %2 to i32
      %arrayidx5 = getelementptr inbounds i16* %a, i64 3
      %3 = load i16* %arrayidx5, align 2, !tbaa !1
      %conv616 = zext i16 %3 to i32
      %add7 = add nuw nsw i32 %conv214, %conv13
      %add = add nuw nsw i32 %add7, %conv415
      %add8 = add nuw nsw i32 %add, %conv616
      %conv9 = trunc i32 %add8 to i16
      ret i16 %conv9
    }

Since extension and truncation operations are involved here, the current patch does not vectorize this case.
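
For comparison, the same reduction written over 32-bit ints involves no widening, so
no zext/trunc appears in the IR and the bail-out above should not trigger. A minimal
sketch (the name hadd32 is hypothetical, not part of the patch's tests):

    /* Hypothetical 32-bit variant of the test case: the adds already happen
       in i32, so Clang emits no zext/trunc and the reduction chain feeding
       the return should be eligible for SLP vectorization with the patch. */
    int hadd32(int *a) {
      return ((a[0] + a[2]) + (a[1] + a[3]));
    }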

If we remove the extension and truncation operations from the IR by hand:

    define i16 @hadd(i16* nocapture readonly %a) #0 {
    entry:
      %0 = load i16* %a, align 2, !tbaa !1
      %arrayidx1 = getelementptr inbounds i16* %a, i64 2
      %1 = load i16* %arrayidx1, align 2, !tbaa !1
      %arrayidx3 = getelementptr inbounds i16* %a, i64 1
      %2 = load i16* %arrayidx3, align 2, !tbaa !1
      %arrayidx5 = getelementptr inbounds i16* %a, i64 3
      %3 = load i16* %arrayidx5, align 2, !tbaa !1
      %add7 = add nuw nsw i16 %0, %1
      %add = add nuw nsw i16 %2, %3
      %add8 = add nuw nsw i16 %add, %add7
      ret i16 %add8
    }

LLVM vectorizes this with the proposed patch.

Assembly code for the 16-bit case with extension/truncation after running the SLP pass (no vectorization done in this case):
               
    ldrh  w8, [x0]
    ldrh  w9, [x0, #4]
    ldrh  w10, [x0, #2]
    ldrh  w11, [x0, #6]
    add   w8, w9, w8
    add   w8, w8, w10
    add   w0, w8, w11
    ret
       
Assembly code for the 16-bit case without extension/truncation after running the SLP pass (vectorization done in this case):
            
    ldrh  w8, [x0]
    ldrh  w9, [x0, #2]
    ldrh  w10, [x0, #4]
    ldrh  w11, [x0, #6]
    fmov  s0, w8
    fmov  s1, w10
    ins   v0.s[1], w9
    ins   v1.s[1], w11
    add   v0.2s, v0.2s, v1.2s
    fmov  w8, s0
    mov   w9, v0.s[1]
    add   w0, w9, w8
    ret

This looks like poor code for data types smaller than 32 bits: the single two-lane vector add is outweighed by the fmov/ins/mov instructions needed to move the values into and out of the vector registers.

However, the current patch doesn't vectorize types smaller than 32 bits anyway, since it gives up on vectorization whenever a truncation/extension is encountered.
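
For reference, the kind of guard described above could look roughly like the sketch
below. This is not the code from D6227, just a minimal illustration against the LLVM
C++ API; the helper name isCandidateReturnReduction is hypothetical:

    #include "llvm/IR/Instructions.h"
    using namespace llvm;

    // Hypothetical sketch (not the actual D6227 change): only treat a
    // return-feeding chain as a reduction candidate when the returned
    // value is a plain binary operation, and bail out when it comes from
    // a truncation/extension, as in the zext/add/trunc i16 example above.
    static bool isCandidateReturnReduction(ReturnInst *RI) {
      Value *RV = RI->getReturnValue();
      if (!RV)
        return false; // 'ret void' -- nothing to vectorize.
      if (isa<TruncInst>(RV) || isa<ZExtInst>(RV) || isa<SExtInst>(RV))
        return false; // Mixed-width chain: skip, as the current patch does.
      return isa<BinaryOperator>(RV);
    }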

Please help in reviewing this patch. 

Regards,
Suyog

http://reviews.llvm.org/D6227





