[PATCH] D38472: [X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 3 17:35:06 PDT 2017


pcordes added a comment.

Some CPUs have good `pblendw` throughput, it's not always a win to do 2 shifts.  (But I guess that's the same problem you mentioned in https://reviews.llvm.org/D38472?id=117394#inline-335636, that the scheduler model isn't close to figuring out when to use a variable-shuffle to reduce port pressure?)

I hope clang isn't going to start compiling `_mm_shuffle_epi8` into `psrlw $8, %xmm0` / `packuswb %xmm0,%xmm0` in cases where that's not a win, when the shuffle control constant lets it do that.

I guess it's a tricky tradeoff between aggressive optimization of intrinsics helping novices (or code tuned for a uarch that isn't the target) vs. defeating deliberate tuning choices.  I think it's good to have at least one compiler (clang) that does optimize aggressively, since we can always use gcc instead, or for comparison.



================
Comment at: test/CodeGen/X86/vector-trunc.ll:409
 ; SSE41-NEXT:    psrad $16, %xmm0
 ; SSE41-NEXT:    psrad $16, %xmm1
+; SSE41-NEXT:    packssdw %xmm0, %xmm1
----------------
RKSimon wrote:
> pcordes wrote:
> > If I'm understanding this function right, there's still a big missed optimization:
> > 
> > ```
> > psrad       $16, %xmm0                           # get the words we want aligned with the garbage in xmm1
> > pblendw  $alternating, %xmm1, %xmm0
> > pshufb     (fix the order),  %xmm0
> > ret
> > ```
> > 
> > But this patch isn't trying to fix that.  TODO: report this separately.
> Even better it should just combine to:
> ```
> psrad $16, %xmm0
> psrad $16, %xmm1
> packssdw %xmm1, %xmm0
> ```
> That should be handled when we enable shuffle combining to create PACKSS/PACKUS nodes (and not just lowering).
@RKSimon:

Yeah, that's usually better on Skylake, where immediate vector shifts have 0.5c throughput (running on ports 0 or 1) but pblendw and pshufb compete for port 5.  It would only be worse in a loop with lots of p01 pressure and very low p5 pressure.

On Haswell, pblendw and pshufb still compete for port 5, but shifts compete for port 0.  So depending on the surrounding code, it's worth considering both options and picking the one that leans on whichever port has lower demand.

On Ryzen, pblendw has 0.33c throughput (ports p013).  pshufb and packss run on p12 (0.5c throughput).
psrad runs on p2 only (1c throughput), so it's a potential throughput bottleneck in a loop that isn't doing much else on other ports.  Both sequences bottleneck on port 2 for their shift uops, so my one-shift sequence has twice the throughput of the two-shift version.

On Sandybridge: pblendw and pshufb/pack: p15.  psrad: p0.  So like Ryzen, we get 2x the throughput from my sequence (if used on its own).

On Nehalem, psrad, packss, pshufb, and pblendw all run on p05.


Repository:
  rL LLVM

https://reviews.llvm.org/D38472
