[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 27 23:42:10 PDT 2018
pcordes accepted this revision.
pcordes added a comment.
This revision is now accepted and ready to land.
Looks like an improvement everywhere; further refinements are possible, but this is already a win as it stands.
I still like my `pslldq` byte-shift + `psrlq` bit-shift to mix up the port pressure, especially when this is emitted specifically to feed other shifts. I'd recommend that, unless this commonly gets generated as part of shuffle-heavy code that bottlenecks on port 5 on modern Intel. (It's not rare for people to compile for baseline and run on Haswell+, unfortunately.)
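For readers following along, here is a rough sketch of what such a mixed sequence could look like, written with SSE2 intrinsics. It assumes the goal is to zero-extend the low byte of a vector into the low 64 bits, which is where `psllw`/`psrlw` read their shift count from. The function names and the specific counts (7 bytes, 56 bits) are my own illustration, not taken from the patch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */
#include <stdint.h>

/* Hypothetical helper: one byte shift (pslldq, a shuffle-port uop on Intel)
 * followed by one bit shift (psrlq, a shift-port uop). */
static inline __m128i zext_low_byte_mixed(__m128i v) {
    __m128i t = _mm_slli_si128(v, 7);  /* pslldq $7: byte 0 moves to byte 7 */
    return _mm_srli_epi64(t, 56);      /* psrlq $56: byte 7 moves to byte 0, zero-filled above */
}

/* Hypothetical helper: the same result from two bit shifts (psllq + psrlq),
 * which avoids the shuffle unit entirely. */
static inline __m128i zext_low_byte_shifts(__m128i v) {
    return _mm_srli_epi64(_mm_slli_epi64(v, 56), 56);
}

/* Read back the low 64 bits, the only part a vector shift count uses. */
static inline uint64_t low_qword(__m128i v) {
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, v);
    return out;
}
```

Both helpers leave the same value in the low qword; the difference is only in which execution ports the two uops compete for. The upper qword is not cleaned up by either version, which is harmless when the result is only used as a shift count.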
Before Skylake, vector bit-shifts only ran on a single port on Intel, so moving 1 uop to another port could be a big deal if that's the bottleneck in a loop.
shift+shift is better with tune=core2 or earlier, though. (Slow shuffles on Merom and earlier, but even 2nd-gen Core2 Penryn has fast shuffles.) I don't think we care enough about first-gen Core2 or Pentium M to tip the balance. It's not disastrous on Core2 (1 extra uop), so I probably wouldn't even bother keeping the other code-gen option around unless we can select based on port pressure in a whole loop body.
One further refinement would be deciding when to load a mask outside a loop, so that the masking inside the loop is a single PAND.
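To illustrate the shape of that refinement at the source level, here is a minimal sketch assuming a byte-wise left shift built from `psllw` plus a mask. The function `shl_bytes` is hypothetical, and the mask is computed in scalar code and broadcast purely for clarity; how the compiler would actually materialize or load it is a separate question.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: shift every byte of buf left by n (assumed 0..7).
 * SSE has no byte-granularity shift, so shift 16-bit lanes and mask. */
static void shl_bytes(uint8_t *buf, size_t len, unsigned n) {
    /* Loop-invariant setup: the shift count and the byte mask. */
    __m128i count = _mm_cvtsi32_si128((int)n);
    __m128i mask = _mm_set1_epi8((char)(uint8_t)(0xFFu << n));

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        v = _mm_sll_epi16(v, count);  /* psllw: bits leak from the low byte into the high byte */
        v = _mm_and_si128(v, mask);   /* pand: clear the low n bits of every byte */
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }
    /* Scalar tail for the last len % 16 bytes. */
    for (; i < len; ++i)
        buf[i] = (uint8_t)(buf[i] << n);
}
```

With the mask hoisted, the loop body is one shift and one PAND per vector; the open question raised above is when that hoisting is profitable.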
Repository:
rL LLVM
https://reviews.llvm.org/D51263