[PATCH] D14901: [X86][SSE] Improve i16 splatting shuffles
Simon Pilgrim via llvm-commits
llvm-commits at lists.llvm.org
Mon Dec 21 13:07:54 PST 2015
RKSimon added a comment.
Sorry Quentin - I missed your follow-up email to the list - copied here:
> > Tested on Jaguar CPU:
> > Throughput:
> > Old 3op shuffle: 4cy
> > New 2op shuffle: 2cy
> > pshufb_rr: 3cy
> > pshufb_rm: 3cy
> I am confused.
> When the code sequence is shorter, I was expecting this, but this number is not for the problem we were discussing, i.e., when the shufb is replaced by 2 shuf(w|hd, whatever), right?
> If it is, I am missing something because it should be 2 uops in both cases.
I'm confused too - I'm not certain what outstanding problem with my patch you think I should be addressing.
The patch improves vXi16 shuffles so that more patterns can be performed in 2 uops instead of 3. A side effect is that a later combine stage in PerformShuffleCombine (combineX86ShufflesRecursively) no longer merges these into a single PSHUFB, because its threshold for combining is 3 uops. The timing tests I ran suggest that this threshold is about right, although I accept that more recent targets can perform PSHUFB faster.
What am I missing?
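
For illustration only, here is a minimal sketch of how an i16 splat from the upper half can be done with two shuffles (a PSHUFHW followed by a PSHUFD). It is not taken from the patch: the shuffles are emulated in portable C++ on plain arrays, and the names V8i16, pshufhw, pshufd and splatHighWord are hypothetical helpers introduced for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 128-bit vector of eight i16 elements.
using V8i16 = std::array<std::uint16_t, 8>;

// Emulates PSHUFHW: words 0-3 pass through unchanged; high word i takes
// the high word selected by the i-th 2-bit field of imm.
V8i16 pshufhw(const V8i16 &v, unsigned imm) {
  V8i16 r = v;
  for (int i = 0; i < 4; ++i)
    r[4 + i] = v[4 + ((imm >> (2 * i)) & 3)];
  return r;
}

// Emulates PSHUFD: 32-bit lane i takes the source dword selected by the
// i-th 2-bit field of imm (each dword is a pair of adjacent words).
V8i16 pshufd(const V8i16 &v, unsigned imm) {
  V8i16 r{};
  for (int i = 0; i < 4; ++i) {
    unsigned src = (imm >> (2 * i)) & 3;
    r[2 * i] = v[2 * src];
    r[2 * i + 1] = v[2 * src + 1];
  }
  return r;
}

// Splat word `lane` (4..7) to all eight words using two shuffles.
V8i16 splatHighWord(const V8i16 &v, unsigned lane) {
  assert(lane >= 4 && lane < 8);
  unsigned w = lane - 4;
  // w * 0x55 repeats w in all four 2-bit fields, so every high word
  // becomes v[lane].
  V8i16 t = pshufhw(v, w * 0x55);
  // 0xAA selects dword 2 (words 4 and 5) for every 32-bit lane.
  return pshufd(t, 0xAA);
}
```

A 2-op sequence like this stays below the 3-op combine threshold mentioned above, so it would not be folded into a single PSHUFB.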