<html>
<head>
<base href="https://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - IndVarSimplify + InstCombine integer widening does not play nice with loop vectorizer"
href="https://llvm.org/bugs/show_bug.cgi?id=28070">28070</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>IndVarSimplify + InstCombine integer widening does not play nice with loop vectorizer
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>All
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Scalar Optimizations
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>mkuper@google.com
</td>
</tr>
<tr>
<th>CC</th>
<td>davidxl@google.com, llvm-bugs@lists.llvm.org, wmi@google.com
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>Consider a reduction loop that accumulates the products of i32 values into an i64:
long long foo() {
long long x = 42;
#pragma nounroll
#pragma clang loop interleave_count(1)
for (int i = 0; i < 1000; i++) {
x += i * i;
}
return x;
}
When compiled with:
$ clang -c -S -o - -O3 -mavx2 --target=x86_64
we'd like to get:
.LBB0_1:
vpmulld %xmm2, %xmm2, %xmm3
vpaddd %xmm1, %xmm2, %xmm2
vpmovzxdq %xmm3, %ymm3
vpaddq %ymm0, %ymm3, %ymm0
addq $-4, %rax
jne .LBB0_1
What we actually get is:
.LBB0_1:
vpsrlq $32, %ymm3, %ymm4
vpmuludq %ymm4, %ymm3, %ymm4
vpmuludq %ymm3, %ymm3, %ymm5
vpaddq %ymm1, %ymm3, %ymm3
vpsllq $32, %ymm4, %ymm4
vpaddq %ymm4, %ymm5, %ymm5
vpaddq %ymm4, %ymm5, %ymm4
vpblendd $170, %ymm2, %ymm4, %ymm4
vpaddq %ymm0, %ymm4, %ymm0
addq $-4, %rax
jne .LBB0_1
What happens is that IndVarSimplify promotes the induction variable from i32 to
i64:
for.body:
%i.09 = phi i32 [ 0, %entry ], [ %inc, %for.body ] <==
%x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]
%mul = mul nsw i32 %i.09, %i.09
%conv7 = zext i32 %mul to i64
%add = add nsw i64 %conv7, %x.08
%inc = add nsw i32 %i.09, 1 <==
%cmp = icmp slt i32 %inc, 1000
br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !1
becomes:
for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ] <==
%x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]
%0 = trunc i64 %indvars.iv to i32 <==
%mul = mul nsw i32 %0, %0
%conv7 = zext i32 %mul to i64
%add = add nsw i64 %conv7, %x.08
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 <==
%exitcond = icmp ne i64 %indvars.iv.next, 1000
br i1 %exitcond, label %for.body, label %for.cond.cleanup, !llvm.loop !1
InstCombine then notices the trunc -> mul -> zext chain and promotes the whole
computation to i64, replacing the zext with a mask:
for.body: ; preds = %for.body, %entry
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
%x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]
%mul = mul i64 %indvars.iv, %indvars.iv <==
%conv7 = and i64 %mul, 4294967295 <==
%add = add nsw i64 %conv7, %x.08
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 1000
br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !1
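To make the rewrite above concrete, here is a rough C sketch of the two forms
of the multiply. It is only an illustration: the function names are made up for
this example, unsigned types are used so that wraparound is well defined, and
it assumes a 32-bit unsigned int and a 64-bit unsigned long long (as on the
x86_64 target used here).

```c
/* Narrow form, as left by IndVarSimplify: trunc -> mul i32 -> zext. */
unsigned long long narrow_form(unsigned long long iv) {
  unsigned int t = (unsigned int)iv;   /* trunc i64 to i32 */
  unsigned int m = t * t;              /* mul i32, wraps modulo 2^32 */
  return (unsigned long long)m;        /* zext i32 to i64 */
}

/* Wide form, as produced by InstCombine: mul i64 -> and. */
unsigned long long wide_form(unsigned long long iv) {
  unsigned long long m = iv * iv;      /* mul i64 */
  return m & 4294967295ULL;            /* and i64 %mul, 4294967295 */
}

/* Hypothetical helper: returns 1 if both forms agree for all iv in [0, n). */
int forms_agree(unsigned long long n) {
  for (unsigned long long iv = 0; iv < n; iv++)
    if (narrow_form(iv) != wide_form(iv))
      return 0;
  return 1;
}
```

Both forms compute the same value, so the rewrite is correct; the problem is
only that the multiply is now a 64-bit operation.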
This is unfortunate because the vectorizer then emits i64 vector muls, which
are illegal on this target and have to be expanded, giving really messy
codegen. Note that the issue is not AVX2-specific; AVX2 is just the cleanest
example. We get similarly poor code with other feature sets.
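For reference, the messy sequence above looks like the usual expansion of a
64-bit multiply into 32x32->64 multiplies. Below is a scalar C sketch of one
lane, with comments mapping each step to the instruction it seems to correspond
to. The function names are hypothetical, a 64-bit unsigned long long is
assumed, and the reading of vpblendd as a blend against a zero register is my
interpretation of the listing, not something verified.

```c
/* Like vpmuludq: multiply the low 32 bits of each operand, 64-bit result. */
unsigned long long mul_lo32(unsigned long long a, unsigned long long b) {
  return (a & 4294967295ULL) * (b & 4294967295ULL);
}

/* x * x modulo 2^64, built only from 32x32->64 multiplies. */
unsigned long long square64_expanded(unsigned long long x) {
  unsigned long long hi = x >> 32;             /* vpsrlq $32 */
  unsigned long long cross = mul_lo32(hi, x);  /* vpmuludq: hi * lo */
  unsigned long long lo = mul_lo32(x, x);      /* vpmuludq: lo * lo */
  cross = cross << 32;                         /* vpsllq $32 */
  return lo + cross + cross;                   /* two vpaddq: cross term counted twice */
}

/* The final mask; presumably what vpblendd $170 against zero implements. */
unsigned long long masked_square(unsigned long long x) {
  return square64_expanded(x) & 4294967295ULL;
}
```

All of the shift and cross-term work is thrown away by the final mask, which is
why a plain 32-bit multiply followed by a zero-extend would be enough.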
Undoing this in codegen (by matching the "mul + and" back into a "trunc + mul +
zext") doesn't seem sufficient, because ideally the vectorizer should also know
what the real width is going to be. What we really want is to vectorize this
reduction by a factor of 8, as GCC does, rather than by 4, and that requires
the cost model to know that we're reducing i32 values into an i64 result.</pre>
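<pre>As a rough illustration of the shape being asked for, here is a scalar C
emulation of a factor-8 vectorization of the loop: eight 32-bit multiplies per
iteration, each widened into its own 64-bit partial sum. This is only a sketch;
the function names and the lane layout are assumptions made for this example,
not what the vectorizer or GCC actually emits.

```c
/* Scalar emulation of a hypothetical VF=8 version of foo().
   Assumes a 32-bit unsigned int and a 64-bit long long. */
long long foo_vf8(void) {
  long long acc[8] = {42, 0, 0, 0, 0, 0, 0, 0};  /* eight i64 partial sums */
  unsigned int iv[8];
  for (int l = 0; l < 8; l++)
    iv[l] = (unsigned int)l;                      /* lanes start at 0..7 */

  for (int i = 0; i < 1000; i += 8) {             /* 1000 is a multiple of 8 */
    for (int l = 0; l < 8; l++) {
      unsigned int prod = iv[l] * iv[l];          /* 32-bit multiply per lane */
      acc[l] += (long long)prod;                  /* zero-extend, then 64-bit add */
      iv[l] += 8;                                 /* step the induction vector */
    }
  }

  long long x = 0;
  for (int l = 0; l < 8; l++)
    x += acc[l];                                  /* final horizontal reduction */
  return x;
}

/* The original scalar loop from this report, for comparison. */
long long foo_scalar(void) {
  long long x = 42;
  for (int i = 0; i < 1000; i++)
    x += i * i;
  return x;
}
```

Only the accumulation needs 64-bit lanes here; the multiply and the induction
variable stay 32 bits wide, which is the information the cost model is
currently missing.</pre>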
</div>
</p>
</body>
</html>