[llvm] [X86][Codegen] Shuffle certain shifts on i8 vectors to create opportunity for vectorized shift instructions (PR #117980)

Sat Dec 14 22:04:57 PST 2024

huangjd wrote:

Cost/benefit analysis below, assuming a fully utilized pipeline. For example, `op mem, reg` never stalls on the memory load: the load uop is assumed to issue early enough that the arithmetic/logic uop can issue as soon as the register it depends on is available.

The v*i8 column is the original latency. The v*i16 and v*i32 columns are the latencies when the shift is widened to 16-bit and 32-bit elements, respectively.

![Screenshot from 2024-12-15 00-57-40](https://github.com/user-attachments/assets/58b5329d-def5-4949-966a-2f14ef351e72)
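For readers unfamiliar with what "widening" an i8 shift means, here is a minimal scalar sketch, not the code in this PR. It assumes both bytes in a 16-bit lane use the same shift amount `k` (0 to 7) and models a left shift only. The function names `shl_per_byte` and `shl_widened` are hypothetical and exist only for illustration. The idea is that x86 has 16-bit vector shifts (e.g. `psllw`) but no 8-bit ones, so two adjacent i8 elements can be shifted as one i16 element and then masked to clear the bits that crossed the byte boundary.

```cpp
#include <cstdint>

// Reference behavior: shift each of the two i8 elements independently.
constexpr std::uint16_t shl_per_byte(std::uint16_t pair, unsigned k) {
  std::uint8_t lo = static_cast<std::uint8_t>(pair & 0xFF);
  std::uint8_t hi = static_cast<std::uint8_t>(pair >> 8);
  lo = static_cast<std::uint8_t>(lo << k);
  hi = static_cast<std::uint8_t>(hi << k);
  return static_cast<std::uint16_t>((hi << 8) | lo);
}

// Widened form: one 16-bit shift, then a mask that clears the bits
// that leaked from the low byte into the high byte.
// Assumption: k is in [0, 7] and is the same for both bytes.
constexpr std::uint16_t shl_widened(std::uint16_t pair, unsigned k) {
  // Per-byte mask (0xFF << k) & 0xFF, replicated into both bytes.
  const std::uint16_t lane_mask =
      static_cast<std::uint16_t>(((0xFFu << k) & 0xFFu) * 0x0101u);
  return static_cast<std::uint16_t>(
      (static_cast<std::uint32_t>(pair) << k) & lane_mask);
}
```

The widened form trades one shift per i8 element for one shift plus one AND per wider element, which is the kind of trade-off the latency columns above compare.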

https://github.com/llvm/llvm-project/pull/117980