[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)

Mon Aug 27 23:42:10 PDT 2018

pcordes accepted this revision.
pcordes added a comment.
This revision is now accepted and ready to land.

Looks like an improvement everywhere.  Further refinements are possible, but this is already good to land.

I still like my `pslldq` byte-shift + `psrlq` bit-shift to spread the port pressure between the shuffle and shift ports, especially when this is emitted specifically to feed other shifts.  I'd recommend that, unless this commonly gets generated as part of shuffle-heavy code that bottlenecks on port 5 on modern Intel.  (It's not rare for people to compile for baseline and run on Haswell+, unfortunately.)
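
To make the two sequences concrete, here is a rough scalar model in plain C (no intrinsics), assuming the goal is to zero-extend the low byte of the shift amount into the low 64 bits of the register, which is where the SSE2 shift-by-register instructions read their count.  The byte-shift count of 7 and the bit-shift count of 56 are my assumptions for illustration, and `xmm_model` and the `model_*` helpers are hypothetical names, not anything from the patch.

```c
#include <stdint.h>

/* Scalar model of a 128-bit XMM register as two 64-bit lanes. */
typedef struct { uint64_t lo, hi; } xmm_model;

/* pslldq $n: shift the whole 128-bit value left by n bytes.
   This model only handles 0 < n < 8. */
static xmm_model model_pslldq(xmm_model v, unsigned n) {
    unsigned bits = 8u * n;
    xmm_model r;
    r.hi = (v.hi << bits) | (v.lo >> (64u - bits));
    r.lo = v.lo << bits;
    return r;
}

/* psllq $n: shift each 64-bit lane left by n bits (n < 64). */
static xmm_model model_psllq(xmm_model v, unsigned n) {
    xmm_model r = { v.lo << n, v.hi << n };
    return r;
}

/* psrlq $n: logical right shift of each 64-bit lane by n bits (n < 64). */
static xmm_model model_psrlq(xmm_model v, unsigned n) {
    xmm_model r = { v.lo >> n, v.hi >> n };
    return r;
}

/* Byte-shift (shuffle port) followed by bit-shift (shift port). */
uint64_t count_byte_shift_then_bit_shift(xmm_model amt) {
    return model_psrlq(model_pslldq(amt, 7), 56).lo;
}

/* Two bit-shifts, both on the vector-shift port. */
uint64_t count_shift_shift(xmm_model amt) {
    return model_psrlq(model_psllq(amt, 56), 56).lo;
}
```

Both functions leave only the original low byte in the low lane.  The high lane differs between the two sequences, which does not matter if only the low 64 bits are consumed as a shift count.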

Before Skylake, vector bit-shifts only ran on a single port on Intel, so moving 1 uop to another port could be a big deal if that's the bottleneck in a loop.

shift+shift is better with tune=core2 or earlier, though.  (Slow shuffles on Merom and earlier, but even 2nd-gen Core2 Penryn has fast shuffles.)  I don't think we care enough about first-gen Core2 or Pentium M to tip the balance, though.  It's not disastrous on Core2 (1 extra uop), so I probably wouldn't even bother keeping the other code-gen option around unless we can select based on port pressure in a whole loop body.

A further refinement would be deciding when to hoist a mask load out of a loop so that the loop body can do this with a single PAND.
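
A rough scalar sketch of the PAND idea in plain C, assuming the operation being replaced is the zero-extension of the low shift-amount byte and that the mask keeps only that byte.  `xmm_model`, `model_pand` and `zext_counts` are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 128-bit XMM register as two 64-bit lanes. */
typedef struct { uint64_t lo, hi; } xmm_model;

/* pand: bitwise AND of two 128-bit values. */
static xmm_model model_pand(xmm_model a, xmm_model b) {
    xmm_model r = { a.lo & b.lo, a.hi & b.hi };
    return r;
}

/* Zero-extend the low byte of each shift amount.  The mask is set up
   once before the loop, so each iteration costs a single AND. */
void zext_counts(const xmm_model *amts, uint64_t *counts, size_t n) {
    const xmm_model mask = { 0xFFu, 0 };  /* hoisted out of the loop */
    for (size_t i = 0; i < n; ++i) {
        counts[i] = model_pand(amts[i], mask).lo;
    }
}
```

The trade-off is one extra register held live across the loop against one fewer uop per iteration, so it only pays off when the mask can actually stay in a register.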

Repository:
  rL LLVM

https://reviews.llvm.org/D51263