<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - memcpy in loop with known-zero source not optimized well"
href="https://bugs.llvm.org/show_bug.cgi?id=32168">32168</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>memcpy in loop with known-zero source not optimized well
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Scalar Optimizations
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>arielb1@mail.tau.ac.il
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>IR that does a memcpy in a loop out of a memset(0) is optimized quite badly due
to a bad interaction between several passes.
For example, this IR:
%Biquad = type { double, double, double, double, double, double, double, double, double }

declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i32, i1)
declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i32, i1)

define void @initme([5 x %Biquad]*) {
entry-block:
  %temp = alloca %Biquad, align 8
  %temp_i8 = bitcast %Biquad* %temp to i8*
  call void @llvm.memset.p0i8.i64(i8* %temp_i8, i8 0, i64 72, i32 8, i1 false)
  %p0 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 0
  %pN = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 5
  br label %slice_loop_body

slice_loop_body:
  %p = phi %Biquad* [ %p0, %entry-block ], [ %p_next, %slice_loop_body ]
  %p_i8 = bitcast %Biquad* %p to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8, i8* %temp_i8, i64 72, i32 8, i1 false)
  %p_next = getelementptr inbounds %Biquad, %Biquad* %p, i64 1
  %cond = icmp eq %Biquad* %p_next, %pN
  br i1 %cond, label %exit, label %slice_loop_body

exit:
  ret void
}
which (as can be demonstrated using the pass sequence "-loop-unroll -simplifycfg -instcombine -memcpyopt -instcombine") is equivalent to a memset:
define void @initme([5 x %Biquad]*) {
entry-block:
  %1 = bitcast [5 x %Biquad]* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 360, i32 8, i1 false)
  ret void
}
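This can be reproduced with a plain opt invocation (a sketch: the file name
initme.ll is made up, and the flags are spelled for the legacy pass manager
in use on trunk at the time):

  opt -S -loop-unroll -simplifycfg -instcombine -memcpyopt -instcombine initme.ll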
However, if you use the standard -O2 optimization sequence, you get this
terrible result:
define void @initme([5 x %Biquad]*) local_unnamed_addr #1 {
entry-block:
  %p_next.1 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 2
  %p_i8.2 = bitcast %Biquad* %p_next.1 to i8*
  %1 = bitcast [5 x %Biquad]* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 144, i32 8, i1 false)
  call void @llvm.memset.p0i8.i64(i8* %p_i8.2, i8 0, i64 72, i32 8, i1 false)
  %p_next.2 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 3
  %p_i8.3 = bitcast %Biquad* %p_next.2 to i8*
  call void @llvm.memset.p0i8.i64(i8* %p_i8.3, i8 0, i64 72, i32 8, i1 false)
  %p_next.3 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 4
  %p_i8.4 = bitcast %Biquad* %p_next.3 to i8*
  call void @llvm.memset.p0i8.i64(i8* %p_i8.4, i8 0, i64 72, i32 8, i1 false)
  ret void
}
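(For completeness, the -O2 pipeline can be run the same way, again assuming
the hypothetical initme.ll: "opt -O2 -S initme.ll". Exact output may vary
slightly between revisions, but the shape of the problem stays the same.)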
If the array were larger (say, 100 members), the generated code would be
proportionally worse.
The reason this happens is:
1) There is no optimization that turns a memcpy whose source was
zero-initialized (by a memset) in a different basic block into a memset; in
particular, LoopIdiomRecognize can't optimize this case.
2) Loop unrolling happily unrolls this loop and generates a "chain" of GEPs,
where each getelementptr is based on the result of the previous one rather
than directly on %p0:

  %p_i8 = bitcast %Biquad* %p0 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8, i8* %temp_i8, i64 72, i32 8, i1 false)
  %p_next = getelementptr inbounds %Biquad, %Biquad* %p0, i64 1
  %p_i8.1 = bitcast %Biquad* %p_next to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8.1, i8* %temp_i8, i64 72, i32 8, i1 false)
  %p_next.1 = getelementptr inbounds %Biquad, %Biquad* %p_next, i64 1
  ...
3) MemCpyOpt does not handle GEP chains well
(<a href="https://github.com/llvm-mirror/llvm/blob/f33a6990794fc06d1e54c1cbecca0afa0a3d7d7a/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L429">https://github.com/llvm-mirror/llvm/blob/f33a6990794fc06d1e54c1cbecca0afa0a3d7d7a/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L429</a>),
so it only merges the first two memcpys. This is normally fine, because
InstCombine (which collapses GEP chains into direct offsets from the base
pointer) normally runs before MemCpyOpt, but here unrolling introduces new
chains that MemCpyOpt sees before InstCombine gets another chance to collapse
them, as sketched below.
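For illustration, here is roughly the form InstCombine would collapse the
chain into, with every GEP indexing directly off %p0; this is the shape
MemCpyOpt can merge. (A hand-written sketch, not actual pass output.)

  %p_i8 = bitcast %Biquad* %p0 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8, i8* %temp_i8, i64 72, i32 8, i1 false)
  %p_next = getelementptr inbounds %Biquad, %Biquad* %p0, i64 1
  %p_i8.1 = bitcast %Biquad* %p_next to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8.1, i8* %temp_i8, i64 72, i32 8, i1 false)
  %p_next.1 = getelementptr inbounds %Biquad, %Biquad* %p0, i64 2
  ...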
The terrible generated code causes seemingly random slowdowns in real
programs, e.g.
<a href="https://github.com/rust-lang/rust/issues/40267">https://github.com/rust-lang/rust/issues/40267</a>.</pre>
</div>
</p>
</body>
</html>