[PATCH] D74088: [x86] form broadcast of scalar memop even with >1 use
Sanjay Patel via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 5 13:48:56 PST 2020
spatel created this revision.
spatel added reviewers: craig.topper, RKSimon, andreadb.
Herald added subscribers: hiraditya, mcrosier.
Herald added a project: LLVM.
spatel marked an inline comment as done.
spatel added inline comments.
================
Comment at: llvm/test/CodeGen/X86/vector-reduce-fadd.ll:1109
define double @test_v16f64(double %a0, <16 x double> %a1) {
-; SSE-LABEL: test_v16f64:
-; SSE: # %bb.0:
-; SSE-NEXT: movapd {{[0-9]+}}(%rsp), %xmm8
-; SSE-NEXT: addsd %xmm1, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1,1]
-; SSE-NEXT: addsd %xmm1, %xmm0
-; SSE-NEXT: addsd %xmm2, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1,1]
-; SSE-NEXT: addsd %xmm2, %xmm0
-; SSE-NEXT: addsd %xmm3, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm3 = xmm3[1,1]
-; SSE-NEXT: addsd %xmm3, %xmm0
-; SSE-NEXT: addsd %xmm4, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm4 = xmm4[1,1]
-; SSE-NEXT: addsd %xmm4, %xmm0
-; SSE-NEXT: addsd %xmm5, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm5 = xmm5[1,1]
-; SSE-NEXT: addsd %xmm5, %xmm0
-; SSE-NEXT: addsd %xmm6, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm6 = xmm6[1,1]
-; SSE-NEXT: addsd %xmm6, %xmm0
-; SSE-NEXT: addsd %xmm7, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm7 = xmm7[1,1]
-; SSE-NEXT: addsd %xmm7, %xmm0
-; SSE-NEXT: addsd %xmm8, %xmm0
-; SSE-NEXT: unpckhpd {{.*#+}} xmm8 = xmm8[1,1]
-; SSE-NEXT: addsd %xmm8, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_v16f64:
+; SSE2: # %bb.0:
----------------
I haven't stepped through this or the next test diff to see why it changed. Does anyone know why SSE2 and SSE4.1 would differ here?
The unseen logic diff occurs because MayFoldLoad() is defined like this:
static bool MayFoldLoad(SDValue Op) {
  return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}
The test diffs here all look ok to me on screen/paper, but it's hard to know whether this will lead to universally better perf. For example, if a target implements a broadcast from memory as multiple uops, we would have to weigh the potential reduction in instruction count and register pressure against a possible increase in the number of uops. I don't know if we can make a truly informed decision about that at compile-time.
The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the larger example there without at least one other fix.
https://reviews.llvm.org/D74088
Files:
llvm/lib/Target/X86/X86ISelLowering.cpp
llvm/test/CodeGen/X86/avg.ll
llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll
llvm/test/CodeGen/X86/extract-concat.ll
llvm/test/CodeGen/X86/merge-consecutive-stores-nt.ll
llvm/test/CodeGen/X86/oddshuffles.ll
llvm/test/CodeGen/X86/pr34653.ll
llvm/test/CodeGen/X86/vec-strict-cmp-sub128.ll
llvm/test/CodeGen/X86/vector-reduce-fadd.ll
llvm/test/CodeGen/X86/vector-reduce-fmul.ll
llvm/test/CodeGen/X86/vector-shuffle-combining.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D74088.242739.patch
Type: text/x-patch
Size: 44520 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200205/d1c463fa/attachment-0001.bin>
More information about the llvm-commits mailing list