[PATCH] D74088: [x86] form broadcast of scalar memop even with >1 use
Sanjay Patel via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 5 13:48:56 PST 2020
spatel created this revision.
spatel added reviewers: craig.topper, RKSimon, andreadb.
Herald added subscribers: hiraditya, mcrosier.
Herald added a project: LLVM.
spatel marked an inline comment as done.
spatel added inline comments.
================
Comment at: llvm/test/CodeGen/X86/vector-reduce-fadd.ll:1109
define double @test_v16f64(double %a0, <16 x double> %a1) {
-; SSE-LABEL: test_v16f64:
-; SSE: # %bb.0:
-; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm8
-; SSE-NEXT: addsd %xmm1, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1,1]
-; SSE-NEXT: addsd %xmm1, %xmm0
-; SSE-NEXT: addsd %xmm2, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1,1]
-; SSE-NEXT: addsd %xmm2, %xmm0
-; SSE-NEXT: addsd %xmm3, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm3 = xmm3[1,1]
-; SSE-NEXT: addsd %xmm3, %xmm0
-; SSE-NEXT: addsd %xmm4, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm4 = xmm4[1,1]
-; SSE-NEXT: addsd %xmm4, %xmm0
-; SSE-NEXT: addsd %xmm5, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm5 = xmm5[1,1]
-; SSE-NEXT: addsd %xmm5, %xmm0
-; SSE-NEXT: addsd %xmm6, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm6 = xmm6[1,1]
-; SSE-NEXT: addsd %xmm6, %xmm0
-; SSE-NEXT: addsd %xmm7, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm7 = xmm7[1,1]
-; SSE-NEXT: addsd %xmm7, %xmm0
-; SSE-NEXT: addsd %xmm8, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm8 = xmm8[1,1]
-; SSE-NEXT: addsd %xmm8, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_v16f64:
+; SSE2: # %bb.0:
----------------
I haven't stepped through this or the next test diff to see why it changed. Does anyone know why SSE2 and SSE4.1 would differ here?
The unseen logic diff occurs because MayFoldLoad() is defined like this:
static bool MayFoldLoad(SDValue Op) {
  return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}
The test diffs here all look ok to me on screen/paper, but it's hard to know whether this will lead to universally better perf. For example, if a target implements a broadcast from memory as multiple uops, we would have to weigh the potential reduction in instruction count and register pressure against a possible increase in the number of uops. I don't know if we can make a truly informed decision about that at compile-time.
The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the larger example there without at least one other fix.
https://reviews.llvm.org/D74088
Files:
llvm/lib/Target/X86/X86ISelLowering.cpp
llvm/test/CodeGen/X86/avg.ll
llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll
llvm/test/CodeGen/X86/extract-concat.ll
llvm/test/CodeGen/X86/merge-consecutive-stores-nt.ll
llvm/test/CodeGen/X86/oddshuffles.ll
llvm/test/CodeGen/X86/pr34653.ll
llvm/test/CodeGen/X86/vec-strict-cmp-sub128.ll
llvm/test/CodeGen/X86/vector-reduce-fadd.ll
llvm/test/CodeGen/X86/vector-reduce-fmul.ll
llvm/test/CodeGen/X86/vector-shuffle-combining.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D74088.242739.patch
Type: text/x-patch
Size: 44520 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200205/d1c463fa/attachment-0001.bin>
More information about the llvm-commits mailing list