[LLVMbugs] [Bug 19296] New: Vectorizer adds costs of (expensive) dead operations

Mon Mar 31 14:58:04 PDT 2014

http://llvm.org/bugs/show_bug.cgi?id=19296

            Bug ID: 19296
           Summary: Vectorizer adds costs of (expensive) dead operations
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedbugs at nondot.org
          Reporter: hfinkel at anl.gov
                CC: llvmbugs at cs.uiuc.edu
    Classification: Unclassified

Created attachment 12313
  --> http://llvm.org/bugs/attachment.cgi?id=12313&action=edit
test case

The loop in the attached file, which is essentially this:
                int k = LEN/2;
                ...
                for (int i = 0; i < LEN/2; i++) {
                        a[i+k] = a[i] + b[i];
                }

will vectorize well on PowerPC with Alitvec instructions, but if VSX is
enabled, then it will not. This is a serious problem (VSX is a superset of
Altivec), here's what happens:

opt -mcpu=pwr7 -mattr=-vsx -loop-vectorize:

LV: Found an estimated cost of 0 for VF 1 For instruction:   %indvars.iv = phi
i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next, %fo
r.body3 ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %arrayidx =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0
, i32 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   %1 = load float*
%arrayidx, align 4, !tbaa !5
LV: Found an estimated cost of 0 for VF 1 For instruction:   %arrayidx5 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 
0, i32 3, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   %2 = load float*
%arrayidx5, align 4, !tbaa !5
LV: Found an estimated cost of 2 for VF 1 For instruction:   %add = fadd float
%1, %2
LV: Found an estimated cost of 1 for VF 1 For instruction:   %3 = add nsw i64
%indvars.iv, 16000
LV: Found an estimated cost of 0 for VF 1 For instruction:   %arrayidx8 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 
0, i32 0, i64 %3
LV: Found an estimated cost of 1 for VF 1 For instruction:   store float %add,
float* %arrayidx8, align 4, !tbaa !5
LV: Found an estimated cost of 1 for VF 1 For instruction:   %indvars.iv.next =
add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %exitcond = icmp
eq i64 %indvars.iv.next, 16000
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %exitcond,
label %for.end, label %for.body3
LV: Scalar loop costs: 8.

...

LV: Found an estimated cost of 0 for VF 4 For instruction:   %indvars.iv = phi
i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next, %for.body3 ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64
%indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   %1 = load float*
%arrayidx, align 4, !tbaa !5
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx5 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3, i64
%indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   %2 = load float*
%arrayidx5, align 4, !tbaa !5
LV: Found an estimated cost of 2 for VF 4 For instruction:   %add = fadd float
%1, %2
LV: Found an estimated cost of 8 for VF 4 For instruction:   %3 = add nsw i64
%indvars.iv, 16000
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx8 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64 %3
LV: Found an estimated cost of 4 for VF 4 For instruction:   store float %add,
float* %arrayidx8, align 4, !tbaa !5
LV: Found an estimated cost of 1 for VF 4 For instruction:   %indvars.iv.next =
add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond = icmp
eq i64 %indvars.iv.next, 16000
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond,
label %for.end, label %for.body3
LV: Vector loop of width 4 costs: 6.
LV: Selecting VF = : 4.

with VSX: opt -mcpu=pwr7 -mattr=+vsx -loop-vectorize:

[the costs for VF = 1 are the same, but the higher VF look like this:]
...
LV: Found an estimated cost of 0 for VF 4 For instruction:   %indvars.iv = phi
i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next, %for.body3 ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64
%indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   %1 = load float*
%arrayidx, align 4, !tbaa !5
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx5 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3, i64
%indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   %2 = load float*
%arrayidx5, align 4, !tbaa !5
LV: Found an estimated cost of 2 for VF 4 For instruction:   %add = fadd float
%1, %2
LV: Found an estimated cost of 108 for VF 4 For instruction:   %3 = add nsw i64
%indvars.iv, 16000
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx8 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64 %3
LV: Found an estimated cost of 4 for VF 4 For instruction:   store float %add,
float* %arrayidx8, align 4, !tbaa !5
LV: Found an estimated cost of 1 for VF 4 For instruction:   %indvars.iv.next =
add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond = icmp
eq i64 %indvars.iv.next, 16000
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond,
label %for.end, label %for.body3
LV: Vector loop of width 4 costs: 31.
LV: Selecting VF = : 1.

Note the key difference: without VSX we had:

LV: Found an estimated cost of 8 for VF 4 For instruction:   %3 = add nsw i64
%indvars.iv, 16000

with VSX we had:

LV: Found an estimated cost of 108 for VF 4 For instruction:   %3 = add nsw i64
%indvars.iv, 16000

Now the tricky part is this: Under VSX v2i64 is a legal type, and add/sub are
not legal (they're expanded), and this expansion is expensive. Thus, the cost
in the 100s is itself not unreasonable. The problem is that this cost should
not even contribute to the loop at all. Unfortunately, this may not be obvious
at the time. The output of the loop vectorizer for VF=4 looks like:

vector.body:                                      ; preds = %vector.body,
%vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64>
undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %1 = extractelement <4 x i64> %induction, i32 0
  %2 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %1
  %3 = insertelement <4 x float*> undef, float* %2, i32 0
  %4 = extractelement <4 x i64> %induction, i32 1
  %5 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %4
  %6 = insertelement <4 x float*> %3, float* %5, i32 1
  %7 = extractelement <4 x i64> %induction, i32 2
  %8 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %7
  %9 = insertelement <4 x float*> %6, float* %8, i32 2
  %10 = extractelement <4 x i64> %induction, i32 3
  %11 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %10
  %12 = insertelement <4 x float*> %9, float* %11, i32 3
  %13 = getelementptr float* %2, i32 0
  %14 = bitcast float* %13 to <4 x float>*
  %wide.load = load <4 x float>* %14, align 4
  %15 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %1
  %16 = insertelement <4 x float*> undef, float* %15, i32 0
  %17 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %4
  %18 = insertelement <4 x float*> %16, float* %17, i32 1
  %19 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %7
  %20 = insertelement <4 x float*> %18, float* %19, i32 2
  %21 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %10
  %22 = insertelement <4 x float*> %20, float* %21, i32 3
  %23 = getelementptr float* %15, i32 0
  %24 = bitcast float* %23 to <4 x float>*
  %wide.load1 = load <4 x float>* %24, align 4
  %25 = fadd <4 x float> %wide.load, %wide.load1
  %26 = add nsw <4 x i64> %induction, <i64 16000, i64 16000, i64 16000, i64
16000>
  %27 = extractelement <4 x i64> %26, i32 0
  %28 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %27
  %29 = insertelement <4 x float*> undef, float* %28, i32 0
  %30 = extractelement <4 x i64> %26, i32 1
  %31 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %30
  %32 = insertelement <4 x float*> %29, float* %31, i32 1
  %33 = extractelement <4 x i64> %26, i32 2
  %34 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %33
  %35 = insertelement <4 x float*> %32, float* %34, i32 2
  %36 = extractelement <4 x i64> %26, i32 3
  %37 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %36
  %38 = insertelement <4 x float*> %35, float* %37, i32 3
  %39 = getelementptr float* %28, i32 0
  %40 = bitcast float* %39 to <4 x float>*
  store <4 x float> %25, <4 x float>* %40, align 4
  %41 = add nuw nsw <4 x i64> %induction, <i64 1, i64 1, i64 1, i64 1>
  %42 = icmp eq <4 x i64> %41, <i64 16000, i64 16000, i64 16000, i64 16000>
  %index.next = add i64 %index, 4
  %43 = icmp eq i64 %index.next, 16000
  br i1 %43, label %middle.block, label %vector.body, !llvm.loop !5

and, as you can see, there is indeed a vector v4i64 add:

  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64>
undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %1 = extractelement <4 x i64> %induction, i32 0
...
  %26 = add nsw <4 x i64> %induction, <i64 16000, i64 16000, i64 16000, i64
16000>

but this is not *really* going to be a vector add... running the output through
instcombine yields:

vector.body:                                      ; preds = %vector.body,
%vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %1 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %index
  %2 = bitcast float* %1 to <4 x float>*
  %wide.load = load <4 x float>* %2, align 16
  %3 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %index
  %4 = bitcast float* %3 to <4 x float>*
  %wide.load1 = load <4 x float>* %4, align 16
  %5 = fadd <4 x float> %wide.load, %wide.load1
  %6 = add i64 %index, 16000
  %7 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %6
  %8 = bitcast float* %7 to <4 x float>*
  store <4 x float> %5, <4 x float>* %8, align 16
  %index.next = add i64 %index, 4
  %9 = icmp eq i64 %index.next, 16000
  br i1 %9, label %middle.block, label %vector.body, !llvm.loop !5

and all of the vector integer operations have gone away, as they should have. I
need these instructions, that may be "vectorized" but only their lane-0 values
are ever used, to contribute their scalar cost only.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20140331/6bdc1d72/attachment.html>