[PATCH] D14901: [X86][SSE] Improve i16 splatting shuffles

Mon Dec 21 13:07:54 PST 2015

RKSimon added a comment.

Sorry Quentin - I missed your follow-up email to the list - copied here:

> > Tested on Jaguar CPU:
> >
> > Throughput:
> >  Old 3op shuffle: 4cy
> >  New 2op shuffle: 2cy
> >  pshufb_rr        3cy
> >  pshufb_rm        3cy
>
> I am confused.
> When the code sequence is shorter, I was expecting this, but this number is not for the problem we were discussing, i.e., when the shufb is replaced by 2 shuf(w|hd, whatever), right?
> If it is, I am missing something because it should be 2 uops in both cases.

I'm confused too - I'm not certain what outstanding problem with my patch you think I should be addressing.

What it does is improve vXi16 shuffles so that more patterns can be performed in 2 uops instead of 3 uops. A side effect is that a later combine stage in PerformShuffleCombine (combineX86ShufflesRecursively) no longer merges these into a single PSHUFB, because its threshold for combining is 3 uops. The timing tests I did suggest that this threshold is probably about right, although I accept that more recent targets can perform PSHUFB faster.
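To make the 2-uop case concrete, here is a rough, portable C++ model of one way an i16 splat can be done with two fixed shuffles (a PSHUFLW or PSHUFHW followed by a PSHUFD). It is only an illustration: the helpers pshuflw, pshufhw, pshufd, splatWord and wouldMergeIntoPSHUFB are hypothetical names written for this sketch, not LLVM functions, and the sequence is not necessarily the exact lowering the patch emits. The threshold check just restates the "3 uops" rule described above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V8i16 = std::array<uint16_t, 8>;

// Model of PSHUFLW: permute the low four words using a 2-bit selector per
// lane taken from imm; the high four words pass through unchanged.
V8i16 pshuflw(const V8i16 &v, unsigned imm) {
  V8i16 r = v;
  for (unsigned i = 0; i < 4; ++i)
    r[i] = v[(imm >> (2 * i)) & 3];
  return r;
}

// Model of PSHUFHW: the same permutation applied to the high four words.
V8i16 pshufhw(const V8i16 &v, unsigned imm) {
  V8i16 r = v;
  for (unsigned i = 0; i < 4; ++i)
    r[4 + i] = v[4 + ((imm >> (2 * i)) & 3)];
  return r;
}

// Model of PSHUFD: permute the four 32-bit lanes (each lane is a word pair).
V8i16 pshufd(const V8i16 &v, unsigned imm) {
  V8i16 r{};
  for (unsigned i = 0; i < 4; ++i) {
    unsigned src = (imm >> (2 * i)) & 3;
    r[2 * i] = v[2 * src];
    r[2 * i + 1] = v[2 * src + 1];
  }
  return r;
}

// Splat word idx across all eight lanes using two shuffles.
V8i16 splatWord(const V8i16 &v, unsigned idx) {
  assert(idx < 8 && "word index out of range");
  // Replicate the 2-bit selector into all four immediate fields.
  unsigned imm = (idx & 3) * 0x55;
  if (idx < 4)
    return pshufd(pshuflw(v, imm), 0x00); // broadcast dword 0
  return pshufd(pshufhw(v, imm), 0xAA);   // broadcast dword 2
}

// Hypothetical restatement of the combine threshold: only chains of at
// least three shuffles are considered for merging into a single PSHUFB.
bool wouldMergeIntoPSHUFB(unsigned chainLength) {
  const unsigned kMinChainForPSHUFB = 3; // assumed from the description above
  return chainLength >= kMinChainForPSHUFB;
}
```

Under this model a splat built by splatWord is a chain of length 2, so wouldMergeIntoPSHUFB(2) is false and the pair of shuffles is left alone, whereas the old 3-shuffle sequence would have met the threshold.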

What am I missing?

Repository:
  rL LLVM

http://reviews.llvm.org/D14901