[llvm-bugs] [Bug 28070] New: IndVarSimplify + InstCombine integer widening does not play nice with loop vectorizer
via llvm-bugs
llvm-bugs at lists.llvm.org
Thu Jun 9 18:35:05 PDT 2016
https://llvm.org/bugs/show_bug.cgi?id=28070
Bug ID: 28070
Summary: IndVarSimplify + InstCombine integer widening does not
play nice with loop vectorizer
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: normal
Priority: P
Component: Scalar Optimizations
Assignee: unassignedbugs at nondot.org
Reporter: mkuper at google.com
CC: davidxl at google.com, llvm-bugs at lists.llvm.org,
wmi at google.com
Classification: Unclassified
Consider a reduction loop that accumulates products of i32s into an i64:
long long foo() {
  long long x = 42;
#pragma nounroll
#pragma clang loop interleave_count(1)
  for (int i = 0; i < 1000; i++) {
    x += i * i;
  }
  return x;
}
For:
$ clang -c -S -o - -O3 -mavx2 --target=x86_64
We'd like to get:
.LBB0_1:
        vpmulld   %xmm2, %xmm2, %xmm3   # 4 x i32 multiply
        vpaddd    %xmm1, %xmm2, %xmm2   # step the vector IV
        vpmovzxdq %xmm3, %ymm3          # zero-extend to 4 x i64
        vpaddq    %ymm0, %ymm3, %ymm0   # accumulate
        addq      $-4, %rax
        jne       .LBB0_1
What we actually get is this, because AVX2 has no packed 64-bit multiply and
the i64 mul must be emulated:
.LBB0_1:
        vpsrlq   $32, %ymm3, %ymm4
        vpmuludq %ymm4, %ymm3, %ymm4
        vpmuludq %ymm3, %ymm3, %ymm5
        vpaddq   %ymm1, %ymm3, %ymm3
        vpsllq   $32, %ymm4, %ymm4
        vpaddq   %ymm4, %ymm5, %ymm5
        vpaddq   %ymm4, %ymm5, %ymm4
        vpblendd $170, %ymm2, %ymm4, %ymm4
        vpaddq   %ymm0, %ymm4, %ymm0
        addq     $-4, %rax
        jne      .LBB0_1
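(For reference, my reading of the sequence above: writing each 64-bit lane as
a = 2^32*hi + lo, we have a*a mod 2^64 = lo*lo + 2*((hi*lo) << 32), since the
hi*hi term overflows away. The two vpmuludq's compute hi*lo and lo*lo, the
vpsllq plus the two vpaddq's add the shifted cross term twice, and the
vpblendd then re-truncates to 32 bits by blending zeros into the high dword
of each lane, assuming %ymm2 holds zero here.)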
What happens is that IndVarSimplify promotes the induction variable from i32 to
i64:
for.body:
  %i.09 = phi i32 [ 0, %entry ], [ %inc, %for.body ]                      <==
  %x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]
  %mul = mul nsw i32 %i.09, %i.09
  %conv7 = zext i32 %mul to i64
  %add = add nsw i64 %conv7, %x.08
  %inc = add nsw i32 %i.09, 1                                             <==
  %cmp = icmp slt i32 %inc, 1000
  br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !1
becomes:
for.body:                                         ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]    <==
  %x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]
  %0 = trunc i64 %indvars.iv to i32                                       <==
  %mul = mul nsw i32 %0, %0
  %conv7 = zext i32 %mul to i64
  %add = add nsw i64 %conv7, %x.08
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1                       <==
  %exitcond = icmp ne i64 %indvars.iv.next, 1000
  br i1 %exitcond, label %for.body, label %for.cond.cleanup, !llvm.loop !1
And then InstCombine notices the trunc -> mul -> zext and widens the whole
computation to i64; this is sound because the low 32 bits of a product depend
only on the low 32 bits of its operands, and the "and" keeps just those bits:
for.body:                                         ; preds = %for.body, %entry
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]
  %mul = mul i64 %indvars.iv, %indvars.iv                                 <==
  %conv7 = and i64 %mul, 4294967295                                       <==
  %add = add nsw i64 %conv7, %x.08
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1000
  br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !1
This is unfortunate, because we end up with i64 vector multiplies that are
not legal on the target, and really messy codegen. Note that the issue is not
AVX2-specific; that's just the cleanest example. We get similar nonsense with
other feature sets.
Undoing this in codegen (by matching the "mul + and" back into a "trunc + mul +
zext") doesn't seem sufficient, since ideally we'd also like the vectorizer to
know what the real width is going to be. What we really want is to vectorize
this reduction by a factor of 8, like GCC does, and not by 4, and that would
require the cost model to know that we're reducing i32 values into an i64
result.
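For comparison, here is a hand-written sketch of the vector body we'd like
the vectorizer to emit for VF=8 (not actual compiler output; the value names
and the %init placeholder for the starting accumulator are made up): the
multiply stays i32-wide and only the accumulation happens in i64:

vector.body:
  %vec.ind = phi <8 x i32> [ <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %init, %vector.ph ], [ %acc, %vector.body ]
  %mul = mul nsw <8 x i32> %vec.ind, %vec.ind
  %mul.zext = zext <8 x i32> %mul to <8 x i64>
  %acc = add <8 x i64> %mul.zext, %vec.phi
  %vec.ind.next = add <8 x i32> %vec.ind, <i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8>
  ; (trip-count check and final reduction of %acc omitted)

On AVX2 this should lower to roughly one ymm vpmulld plus two
vpmovzxdq/vpaddq pairs per iteration.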