<html>
<head>
<base href="http://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - vector truncation generates pretty terrible code without ssse3"
href="http://llvm.org/bugs/show_bug.cgi?id=15524">15524</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>vector truncation generates pretty terrible code without ssse3
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: X86
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>sroland@vmware.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvmbugs@cs.uiuc.edu
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>If pshufb (which requires ssse3) isn't available, common vector truncations
generate pretty terrible code: instead of using shuffles, they generally go
through per-element extracts and inserts.
E.g. this:
define i64 @trunc(<4 x i32> %inval) {
entry:
%0 = trunc <4 x i32> %inval to <4 x i16>
%1 = bitcast <4 x i16> %0 to i64
ret i64 %1
}
generates
pextrw $4, %xmm0, %ecx
pextrw $6, %xmm0, %eax
movlhps %xmm0, %xmm0 # xmm0 = xmm0[0,0]
pshuflw $8, %xmm0, %xmm0 # xmm0 = xmm0[0,2,0,0,4,5,6,7]
pinsrw $2, %ecx, %xmm0
pinsrw $3, %eax, %xmm0
movd %xmm0, %rax
ret
(and don't ask me what the "movlhps" is even doing there, as no one cares about
the upper 64 bits). If ssse3 is available, this works ok (a single pshufb
instruction).
However, there is really no need at all to go vector->scalar->vector; it can be
trivially done with 3 shuffles using only sse2:
pshuflw $8, %xmm0, %xmm0
pshufhw $8, %xmm0, %xmm0
pshufd $8, %xmm0, %xmm0
movd %xmm0, %rax
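For anyone who wants to check the three-shuffle sequence above without touching the backend, here is a minimal sketch using the standard SSE2 intrinsics on x86-64. The helper name trunc_v4i32_to_v4i16 is made up for illustration; each intrinsic is meant to map to one instruction of the sequence.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (not from the report): truncate four 32-bit lanes
 * to four 16-bit lanes and return them packed in the low 64 bits. */
static uint64_t trunc_v4i32_to_v4i16(__m128i v)
{
    /* pshuflw $8: low words become [w0, w2, w0, w0] */
    v = _mm_shufflelo_epi16(v, 8);
    /* pshufhw $8: high words become [w4, w6, w4, w4] */
    v = _mm_shufflehi_epi16(v, 8);
    /* pshufd $8: dwords become [d0, d2, d0, d0], so the low 64 bits
     * now hold w0, w2, w4, w6 (the low half of every input lane) */
    v = _mm_shuffle_epi32(v, 8);
    /* movd/movq to a general-purpose register (x86-64 only) */
    return (uint64_t)_mm_cvtsi128_si64(v);
}
```

The upper 64 bits of the register are left as garbage, which is fine here since only the low half is read back.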
Even worse (WAY worse) is the same with 16bit->8bit:
define i64 @trunc(<8 x i16> %inval) {
entry:
%0 = trunc <8 x i16> %inval to <8 x i8>
%1 = bitcast <8 x i8> %0 to i64
ret i64 %1
}
generates
pextrw $3, %xmm0, %ecx
shll $8, %ecx
pextrw $2, %xmm0, %eax
movzbl %al, %eax
orl %ecx, %eax
pextrw $1, %xmm0, %ecx
shll $8, %ecx
movd %xmm0, %edx
movzbl %dl, %edx
orl %ecx, %edx
movdqa %xmm0, %xmm1
pinsrw $0, %edx, %xmm1
pinsrw $1, %eax, %xmm1
pextrw $5, %xmm0, %eax
shll $8, %eax
pextrw $4, %xmm0, %ecx
movzbl %cl, %ecx
orl %eax, %ecx
pinsrw $2, %ecx, %xmm1
pextrw $7, %xmm0, %eax
shll $8, %eax
pextrw $6, %xmm0, %ecx
movzbl %cl, %ecx
orl %eax, %ecx
pinsrw $3, %ecx, %xmm1
movd %xmm1, %rax
ret
While we don't have byte shuffles here, they could be emulated with and/shift/or
followed by the same shuffle sequence as in the 32bit->16bit case above.
However, this is still too complicated, and an optimal version would just do
(obviously that's not real code, but you get the idea):
pand <8 x 0x00ff>, %xmm0
packuswb %xmm0, %xmm0 (second source can be anything)
movd %xmm0, %rax
(we can't use this trick for 32bit->16bit because the unsigned pack for that
width, packusdw, requires sse41)
That is probably at least an order of magnitude faster...
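The mask-and-pack idea above can be sketched with SSE2 intrinsics as follows (x86-64 assumed). The helper name trunc_v8i16_to_v8i8 is hypothetical, and this only illustrates the intended instruction sequence, not what the backend currently emits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (not from the report): truncate eight 16-bit lanes
 * to eight 8-bit lanes and return them packed in the low 64 bits. */
static uint64_t trunc_v8i16_to_v8i8(__m128i v)
{
    /* pand with <8 x 0x00ff>: clear the high byte of every lane so the
     * unsigned saturation done by packuswb can never change a value */
    const __m128i mask = _mm_set1_epi16(0x00ff);
    v = _mm_and_si128(v, mask);
    /* packuswb: the second operand only fills the upper 64 bits,
     * so it can be anything; reuse v */
    v = _mm_packus_epi16(v, v);
    /* movd/movq to a general-purpose register (x86-64 only) */
    return (uint64_t)_mm_cvtsi128_si64(v);
}
```

Without the pand, a lane such as 0xff80 would be treated as negative and saturate to 0 instead of truncating to 0x80.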
Granted, it's only a problem if there's no ssse3, but there are fairly recent
cpus that don't have it (e.g. AMD Barcelona).</pre>
</div>
</p>
</body>
</html>