[PATCH] D59997: [x86] allow movmsk with 2-element reductions

Sanjay Patel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Mar 29 08:21:57 PDT 2019


spatel created this revision.
spatel added reviewers: andreadb, RKSimon, lebedev.ri, craig.topper.
Herald added subscribers: jdoerfert, hiraditya, mcrosier.
Herald added a project: LLVM.

One motivation for making this change is that the lack of using movmsk is likely the main source of perf difference between clang and gcc on the C-Ray benchmark as shown here:
https://www.phoronix.com/scan.php?page=article&item=gcc-clang-2019&num=5
...but this change alone isn't enough to solve that problem.

The 'all-of' examples show what is likely the worst case trade-off: we end up with an extra instruction (or 2 if we count the 'xor' register clearing). The 'any-of' examples look clearly better using movmsk because we've traded 2 vector instructions for 2 scalar instructions, and movmsk may have better timing than the generic 'movq'.

If we examine the llvm-mca output for these cases, it appears that even though the 'all-of' movmsk variant looks worse on paper, it would perform better on both Haswell and Jaguar.

  $ llvm-mca -mcpu=haswell no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      504
  Total uOps:        400
  
  Dispatch Width:    4
  uOps Per Cycle:    0.79
  IPC:               0.79
  Block RThroughput: 1.0



  $ llvm-mca -mcpu=haswell movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      358
  Total uOps:        600
  
  Dispatch Width:    4
  uOps Per Cycle:    1.68
  IPC:               1.68
  Block RThroughput: 1.5



  $ llvm-mca -mcpu=btver2 no_movmsk.s -timeline
  Iterations:        100
  Instructions:      400
  Total Cycles:      407
  Total uOps:        400
  
  Dispatch Width:    2
  uOps Per Cycle:    0.98
  IPC:               0.98
  Block RThroughput: 2.0



  $ llvm-mca -mcpu=btver2 movmsk.s -timeline
  Iterations:        100
  Instructions:      600
  Total Cycles:      311
  Total uOps:        600
  
  Dispatch Width:    2
  uOps Per Cycle:    1.93
  IPC:               1.93
  Block RThroughput: 3.0

Finally, there may be CPUs where movmsk is horribly slow (old AMD small cores?), but if that's true, then we're also almost certainly making the wrong transform already for reductions with >2 elements, so that should be fixed independently.


https://reviews.llvm.org/D59997

Files:
  llvm/lib/Target/X86/X86ISelLowering.cpp
  llvm/test/CodeGen/X86/vector-compare-all_of.ll
  llvm/test/CodeGen/X86/vector-compare-any_of.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D59997.192824.patch
Type: text/x-patch
Size: 6154 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20190329/0c4dbbe9/attachment.bin>


More information about the llvm-commits mailing list