[llvm-bugs] [Bug 39936] New: [X86] [BtVer2] 256-bit integer horizontal add idiom not fully expanded using PHADDD
via llvm-bugs
llvm-bugs at lists.llvm.org
Mon Dec 10 08:46:43 PST 2018
https://bugs.llvm.org/show_bug.cgi?id=39936
Bug ID: 39936
Summary: [X86] [BtVer2] 256-bit integer horizontal add idiom
not fully expanded using PHADDD
Product: libraries
Version: trunk
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: andrea.dibiagio at gmail.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, spatel+llvm at rotateright.com
This is a spin-off of bug 39921.
The following code performs an integer reduction using operator ADD. The type
is __v8si, so it could be implemented in three steps of horizontal adds.
```
#include <immintrin.h>  // provides the __v8si vector type

int foo(__v8si A) {
  // Step 1: add even lanes to odd lanes (8 lanes -> 4 partial sums).
  __v8si Lo = __builtin_shufflevector(A, A, 0, 2, 4, 6, -1, -1, -1, -1);
  __v8si Hi = __builtin_shufflevector(A, A, 1, 3, 5, 7, -1, -1, -1, -1);
  __v8si Step = Lo + Hi;
  // Step 2: 4 partial sums -> 2.
  Lo = __builtin_shufflevector(Step, Step, 0, 2, -1, -1, -1, -1, -1, -1);
  Hi = __builtin_shufflevector(Step, Step, 1, 3, -1, -1, -1, -1, -1, -1);
  Step = Lo + Hi;
  // Step 3: 2 partial sums -> 1, left in lane 0.
  Hi = __builtin_shufflevector(Step, Step, 1, -1, -1, -1, -1, -1, -1, -1);
  Step += Hi;
  return Step[0];
}
```
Instead, on BtVer2, we currently generate this:
vextractf128 $1, %ymm0, %xmm1
vshufps $136, %xmm1, %xmm0, %xmm2 # xmm2 = xmm0[0,2],xmm1[0,2]
vshufps $221, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[1,3],xmm1[1,3]
vpaddd %xmm0, %xmm2, %xmm0
vphaddd %xmm0, %xmm0, %xmm0
vphaddd %xmm0, %xmm0, %xmm0
vmovd %xmm0, %eax
retq
We could have generated this instead:
vextractf128 $1, %ymm0, %xmm1
vphaddd %xmm0, %xmm1, %xmm0
vphaddd %xmm0, %xmm0, %xmm0
vphaddd %xmm0, %xmm0, %xmm0
vmovd %xmm0, %eax
retq
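To make the three-step claim concrete, here is a minimal scalar sketch of what
the desired sequence computes. It is only a model for illustration: the names
V4, V8, phaddd and reduce_add are made up for this example, and the operand
order follows the shuffle masks in the source rather than any particular
instruction encoding.
```cpp
#include <array>
#include <cstdint>

using V4 = std::array<std::int32_t, 4>;
using V8 = std::array<std::int32_t, 8>;

// Scalar model of a 128-bit PHADDD: the low half of the result holds the
// pairwise sums of `a`, the high half holds the pairwise sums of `b`.
V4 phaddd(const V4 &a, const V4 &b) {
  return {a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}

// Three horizontal adds reduce eight lanes to a single sum in lane 0.
std::int32_t reduce_add(const V8 &v) {
  V4 lo = {v[0], v[1], v[2], v[3]};  // low 128 bits of the 256-bit input
  V4 hi = {v[4], v[5], v[6], v[7]};  // high 128 bits (the vextractf128 step)
  V4 s = phaddd(lo, hi);             // 8 lanes -> 4 partial sums
  s = phaddd(s, s);                  // 4 -> 2
  s = phaddd(s, s);                  // 2 -> 1
  return s[0];
}
```
Since the final result is a full reduction, swapping the two source operands
of the first horizontal add only permutes the partial sums and does not change
the returned value.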
It looks like our target-specific `combineAdd()` routine is unable to
match the following DAG due to the presence of extract_subvector nodes.
t6: v8i32 = vector_shuffle<0,2,4,6,u,u,u,u> t2, undef:v8i32
t33: v4i32 = extract_subvector t6, Constant:i64<0>
t7: v8i32 = vector_shuffle<1,3,5,7,u,u,u,u> t2, undef:v8i32
t34: v4i32 = extract_subvector t7, Constant:i64<0>
t35: v4i32 = add t33, t34
If we try to hoist the extract_subvector nodes instead of shrinking the binary
computation (and therefore the shuffle operands), we end up in an infinite
combine loop. That is because the DAGCombiner always attempts to sink an
extract_subvector of a binop into the operands of the binop itself.
It is worth mentioning that if we compile the following (equivalent) C++ code,
then we get the optimal PHADDD sequence:
```
#include <immintrin.h>  // provides the __v8si and __v4si vector types

int foo(__v8si A) {
  // Same reduction, but every intermediate value is a 128-bit vector.
  __v4si Lo = __builtin_shufflevector(A, A, 0, 2, 4, 6);
  __v4si Hi = __builtin_shufflevector(A, A, 1, 3, 5, 7);
  __v4si Step = Lo + Hi;
  Lo = __builtin_shufflevector(Step, Step, 0, 2, -1, -1);
  Hi = __builtin_shufflevector(Step, Step, 1, 3, -1, -1);
  Step = Lo + Hi;
  Hi = __builtin_shufflevector(Step, Step, 1, -1, -1, -1);
  Step += Hi;
  return Step[0];
}
```
vextractf128 $1, %ymm0, %xmm1
vphaddd %xmm1, %xmm0, %xmm0
vphaddd %xmm0, %xmm0, %xmm0
vphaddd %xmm0, %xmm0, %xmm0
vmovd %xmm0, %eax
retq
Note how the shuffle masks are already shrunk in the IR, so in practice we only
manipulate 128-bit values (apart from the input vector A).
Does this mean that we could do something better at the IR level, rather than
complicating the existing target-specific/target-independent DAG combine rules?
Do we have a demanded-elts kind of analysis at the IR level? If so, then we may
be able to realize that all those shufflevectors could be shrunk before we even
reach the code generator.
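As a rough illustration of what such a demanded-elements walk would compute
for the first example, here is a toy model. It is not an existing LLVM
interface: LaneSet and demandedThroughShuffle are hypothetical names, and only
single-source shuffles are handled.
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model, not LLVM's API. Bit i set means "lane i is demanded".
using LaneSet = std::uint32_t;

// Map the demanded result lanes of a single-source shuffle back to the
// source lanes they read. A mask entry of -1 is undef and demands nothing.
LaneSet demandedThroughShuffle(const std::vector<int> &mask,
                               LaneSet demandedResult) {
  LaneSet demandedSource = 0;
  for (std::size_t i = 0; i < mask.size(); ++i) {
    if (((demandedResult >> i) & 1u) && mask[i] >= 0)
      demandedSource |= LaneSet{1} << mask[i];
  }
  return demandedSource;
}
```
Walking backwards from `return Step[0]`, only lane 0 of the last add is
demanded, lanes {0,1} of the second add, and lanes {0,1,2,3} of the first add.
No intermediate value needs more than its low four lanes, which is the
information that would justify rewriting the v8i32 shuffles and adds as v4i32
before code generation.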