[LLVMbugs] [Bug 19296] New: Vectorizer adds costs of (expensive) dead operations
bugzilla-daemon at llvm.org
bugzilla-daemon at llvm.org
Mon Mar 31 14:58:04 PDT 2014
http://llvm.org/bugs/show_bug.cgi?id=19296
Bug ID: 19296
Summary: Vectorizer adds costs of (expensive) dead operations
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Loop Optimizer
Assignee: unassignedbugs at nondot.org
Reporter: hfinkel at anl.gov
CC: llvmbugs at cs.uiuc.edu
Classification: Unclassified
Created attachment 12313
--> http://llvm.org/bugs/attachment.cgi?id=12313&action=edit
test case
The loop in the attached file, which is essentially this:
int k = LEN/2;
...
for (int i = 0; i < LEN/2; i++) {
a[i+k] = a[i] + b[i];
}
will vectorize well on PowerPC with Alitvec instructions, but if VSX is
enabled, then it will not. This is a serious problem (VSX is a superset of
Altivec), here's what happens:
opt -mcpu=pwr7 -mattr=-vsx -loop-vectorize:
LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv = phi
i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next, %fo
r.body3 ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0
, i32 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction: %1 = load float*
%arrayidx, align 4, !tbaa !5
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx5 =
getelementptr inbounds %struct.GlobalData* @global_data, i64
0, i32 3, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction: %2 = load float*
%arrayidx5, align 4, !tbaa !5
LV: Found an estimated cost of 2 for VF 1 For instruction: %add = fadd float
%1, %2
LV: Found an estimated cost of 1 for VF 1 For instruction: %3 = add nsw i64
%indvars.iv, 16000
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx8 =
getelementptr inbounds %struct.GlobalData* @global_data, i64
0, i32 0, i64 %3
LV: Found an estimated cost of 1 for VF 1 For instruction: store float %add,
float* %arrayidx8, align 4, !tbaa !5
LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next =
add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %exitcond = icmp
eq i64 %indvars.iv.next, 16000
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %exitcond,
label %for.end, label %for.body3
LV: Scalar loop costs: 8.
...
LV: Found an estimated cost of 0 for VF 4 For instruction: %indvars.iv = phi
i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next, %for.body3 ]
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64
%indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction: %1 = load float*
%arrayidx, align 4, !tbaa !5
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx5 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3, i64
%indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction: %2 = load float*
%arrayidx5, align 4, !tbaa !5
LV: Found an estimated cost of 2 for VF 4 For instruction: %add = fadd float
%1, %2
LV: Found an estimated cost of 8 for VF 4 For instruction: %3 = add nsw i64
%indvars.iv, 16000
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx8 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64 %3
LV: Found an estimated cost of 4 for VF 4 For instruction: store float %add,
float* %arrayidx8, align 4, !tbaa !5
LV: Found an estimated cost of 1 for VF 4 For instruction: %indvars.iv.next =
add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction: %exitcond = icmp
eq i64 %indvars.iv.next, 16000
LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %exitcond,
label %for.end, label %for.body3
LV: Vector loop of width 4 costs: 6.
LV: Selecting VF = : 4.
with VSX: opt -mcpu=pwr7 -mattr=+vsx -loop-vectorize:
[the costs for VF = 1 are the same, but the higher VF look like this:]
...
LV: Found an estimated cost of 0 for VF 4 For instruction: %indvars.iv = phi
i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next, %for.body3 ]
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64
%indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction: %1 = load float*
%arrayidx, align 4, !tbaa !5
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx5 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3, i64
%indvars.iv
LV: Found an estimated cost of 4 for VF 4 For instruction: %2 = load float*
%arrayidx5, align 4, !tbaa !5
LV: Found an estimated cost of 2 for VF 4 For instruction: %add = fadd float
%1, %2
LV: Found an estimated cost of 108 for VF 4 For instruction: %3 = add nsw i64
%indvars.iv, 16000
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx8 =
getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64 %3
LV: Found an estimated cost of 4 for VF 4 For instruction: store float %add,
float* %arrayidx8, align 4, !tbaa !5
LV: Found an estimated cost of 1 for VF 4 For instruction: %indvars.iv.next =
add nuw nsw i64 %indvars.iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction: %exitcond = icmp
eq i64 %indvars.iv.next, 16000
LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %exitcond,
label %for.end, label %for.body3
LV: Vector loop of width 4 costs: 31.
LV: Selecting VF = : 1.
Note the key difference: without VSX we had:
LV: Found an estimated cost of 8 for VF 4 For instruction: %3 = add nsw i64
%indvars.iv, 16000
with VSX we had:
LV: Found an estimated cost of 108 for VF 4 For instruction: %3 = add nsw i64
%indvars.iv, 16000
Now the tricky part is this: Under VSX v2i64 is a legal type, and add/sub are
not legal (they're expanded), and this expansion is expensive. Thus, the cost
in the 100s is itself not unreasonable. The problem is that this cost should
not even contribute to the loop at all. Unfortunately, this may not be obvious
at the time. The output of the loop vectorizer for VF=4 looks like:
vector.body: ; preds = %vector.body,
%vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
%broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64>
undef, <4 x i32> zeroinitializer
%induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
%1 = extractelement <4 x i64> %induction, i32 0
%2 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %1
%3 = insertelement <4 x float*> undef, float* %2, i32 0
%4 = extractelement <4 x i64> %induction, i32 1
%5 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %4
%6 = insertelement <4 x float*> %3, float* %5, i32 1
%7 = extractelement <4 x i64> %induction, i32 2
%8 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %7
%9 = insertelement <4 x float*> %6, float* %8, i32 2
%10 = extractelement <4 x i64> %induction, i32 3
%11 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %10
%12 = insertelement <4 x float*> %9, float* %11, i32 3
%13 = getelementptr float* %2, i32 0
%14 = bitcast float* %13 to <4 x float>*
%wide.load = load <4 x float>* %14, align 4
%15 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %1
%16 = insertelement <4 x float*> undef, float* %15, i32 0
%17 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %4
%18 = insertelement <4 x float*> %16, float* %17, i32 1
%19 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %7
%20 = insertelement <4 x float*> %18, float* %19, i32 2
%21 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %10
%22 = insertelement <4 x float*> %20, float* %21, i32 3
%23 = getelementptr float* %15, i32 0
%24 = bitcast float* %23 to <4 x float>*
%wide.load1 = load <4 x float>* %24, align 4
%25 = fadd <4 x float> %wide.load, %wide.load1
%26 = add nsw <4 x i64> %induction, <i64 16000, i64 16000, i64 16000, i64
16000>
%27 = extractelement <4 x i64> %26, i32 0
%28 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %27
%29 = insertelement <4 x float*> undef, float* %28, i32 0
%30 = extractelement <4 x i64> %26, i32 1
%31 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %30
%32 = insertelement <4 x float*> %29, float* %31, i32 1
%33 = extractelement <4 x i64> %26, i32 2
%34 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %33
%35 = insertelement <4 x float*> %32, float* %34, i32 2
%36 = extractelement <4 x i64> %26, i32 3
%37 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %36
%38 = insertelement <4 x float*> %35, float* %37, i32 3
%39 = getelementptr float* %28, i32 0
%40 = bitcast float* %39 to <4 x float>*
store <4 x float> %25, <4 x float>* %40, align 4
%41 = add nuw nsw <4 x i64> %induction, <i64 1, i64 1, i64 1, i64 1>
%42 = icmp eq <4 x i64> %41, <i64 16000, i64 16000, i64 16000, i64 16000>
%index.next = add i64 %index, 4
%43 = icmp eq i64 %index.next, 16000
br i1 %43, label %middle.block, label %vector.body, !llvm.loop !5
and, as you can see, there is indeed a vector v4i64 add:
%broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64>
undef, <4 x i32> zeroinitializer
%induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
%1 = extractelement <4 x i64> %induction, i32 0
...
%26 = add nsw <4 x i64> %induction, <i64 16000, i64 16000, i64 16000, i64
16000>
but this is not *really* going to be a vector add... running the output through
instcombine yields:
vector.body: ; preds = %vector.body,
%vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%1 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %index
%2 = bitcast float* %1 to <4 x float>*
%wide.load = load <4 x float>* %2, align 16
%3 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 3,
i64 %index
%4 = bitcast float* %3 to <4 x float>*
%wide.load1 = load <4 x float>* %4, align 16
%5 = fadd <4 x float> %wide.load, %wide.load1
%6 = add i64 %index, 16000
%7 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0,
i64 %6
%8 = bitcast float* %7 to <4 x float>*
store <4 x float> %5, <4 x float>* %8, align 16
%index.next = add i64 %index, 4
%9 = icmp eq i64 %index.next, 16000
br i1 %9, label %middle.block, label %vector.body, !llvm.loop !5
and all of the vector integer operations have gone away, as they should have. I
need these instructions, that may be "vectorized" but only their lane-0 values
are ever used, to contribute their scalar cost only.
--
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20140331/6bdc1d72/attachment.html>
More information about the llvm-bugs
mailing list