RKSimon added a comment.

Tested on Jaguar CPU:

Throughput:
Old 3op shuffle: 4cy
New 2op shuffle: 2cy
pshufb_rr: 3cy
pshufb_rm: 3cy

Latency:
Old 3op shuffle: 4cy
New 2op shuffle: 3cy
pshufb_rr: 3cy
pshufb_rm: 4cy

Repository:
  rL LLVM

http://reviews.llvm.org/D14901