<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - [NPM] Slower arm_mult_q15 code from failing to simplify min/max pattern"
href="https://bugs.llvm.org/show_bug.cgi?id=48734">48734</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[NPM] Slower arm_mult_q15 code from failing to simplify min/max pattern
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Windows NT
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Interprocedural Optimizations
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>david.green@arm.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>This code does a saturating multiply of 16-bit fixed-point values.
<a href="https://godbolt.org/z/9Eo1rz">https://godbolt.org/z/9Eo1rz</a>
It is roughly 55% slower under the new pass manager (the regression is larger
with q7 data types). The code produced under the old pass manager contains a
min/max pattern and is nicely vectorized:
%11 = load i16, i16* %pSrcA.addr.010, align 2, !tbaa !3
%conv = sext i16 %11 to i32
%12 = load i16, i16* %pSrcB.addr.08, align 2, !tbaa !3
%conv2 = sext i16 %12 to i32
%mul = mul nsw i32 %conv2, %conv
%shr = ashr i32 %mul, 15
%13 = icmp slt i32 %shr, 32767
%spec.select.i = select i1 %13, i32 %shr, i32 32767
%conv3 = trunc i32 %spec.select.i to i16
store i16 %conv3, i16* %pDst.addr.09, align 2, !tbaa !3
The new pass manager instead produces a more expensive compare/select/trunc
combination:
%11 = load i16, i16* %pSrcA.addr.010, align 2, !tbaa !3
%conv = sext i16 %11 to i32
%12 = load i16, i16* %pSrcB.addr.08, align 2, !tbaa !3
%conv2 = sext i16 %12 to i32
%mul = mul nsw i32 %conv2, %conv
%13 = lshr i32 %mul, 15
%cmp4.i = icmp sgt i32 %mul, 1073741823
%14 = trunc i32 %13 to i16
%conv3 = select i1 %cmp4.i, i16 32767, i16 %14
store i16 %conv3, i16* %pDst.addr.09, align 2, !tbaa !3
It appears that the function is optimized differently before it gets inlined?
It might also be possible to fix this with a canonicalization fold:
<a href="https://alive2.llvm.org/ce/z/CwJcsD">https://alive2.llvm.org/ce/z/CwJcsD</a>
We are seeing a number of regressions in other suites that may be more
difficult to reproduce upstream, due to the nature of the benchmarks. We
will see what we can do.</pre>
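For reference, a minimal C sketch of the kernel in question (an assumed shape, modeled on CMSIS-DSP's arm_mult_q15; the function name is mine): multiply two Q15 values, rescale by an arithmetic shift of 15, and saturate to the signed 16-bit range. Only the positive direction needs clamping, since the sole overflowing product is -32768 * -32768 = +32768; the ternary below is the min idiom the old pass manager recognizes.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a Q15 saturating multiply (assumed shape of arm_mult_q15's
 * scalar body; name is hypothetical). */
static int16_t mult_q15_scalar(int16_t a, int16_t b)
{
    int32_t mul = (int32_t)a * (int32_t)b; /* fits in i32 */
    int32_t shr = mul >> 15;               /* arithmetic shift on common ABIs */
    /* Clamp the one overflowing case (+32768) back to Q15's max. */
    return (int16_t)(shr < 32767 ? shr : 32767);
}
```

For example, 0.5 * 0.5 in Q15 is mult_q15_scalar(16384, 16384) == 8192, while mult_q15_scalar(-32768, -32768) saturates to 32767.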
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>