[llvm] [CVP] Implement type narrowing for LShr (PR #119577)

Mon Dec 16 08:12:57 PST 2024

adam-bzowski wrote:

> Generally, can you share what your actual motivating case here is?

Hi, thanks for the comments. The original motivation is as follows. Consider the simple code here:
https://godbolt.org/z/GzGsjbaG4
In principle, one would expect both sides to produce the same IR, since the only difference is that the division by 128 is replaced by a right shift by 7. But, as you can see, the division results in an lshr on i16, while the original lshr ends up on i32 with an unnecessary zext and trunc. This may not matter much for such simple code, but once the code is placed inside a loop, the vectorizer produces worse code, as you can see here:
https://godbolt.org/z/7EPnj9a8d
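
For readers who do not want to open the links, here is a minimal sketch of the kind of source pair being compared. It is not the exact Godbolt source: the function names and the choice of uint16_t are assumptions made for illustration. The helper at the end only checks that the two forms compute the same value for every 16-bit input, which is why one would expect identical IR.

```cpp
#include <cstdint>

// Hypothetical stand-in for the "division" side of the comparison.
std::uint16_t div_by_128(std::uint16_t x) {
    // x is promoted to int before the division; the result fits in 16 bits.
    return static_cast<std::uint16_t>(x / 128);
}

// Hypothetical stand-in for the "shift" side of the comparison.
std::uint16_t shift_by_7(std::uint16_t x) {
    // x is promoted to int before the shift; the result fits in 16 bits.
    return static_cast<std::uint16_t>(x >> 7);
}

// Exhaustively compares the two forms over all 16-bit inputs.
bool div_and_shift_agree() {
    for (std::uint32_t v = 0; v <= 0xFFFF; ++v) {
        const auto x = static_cast<std::uint16_t>(v);
        if (div_by_128(x) != shift_by_7(x)) {
            return false;
        }
    }
    return true;
}
```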

I was looking at this issue and realized that the essential difference between the two cases is how CVP treats udiv and lshr. Since lshr is a relatively cheap instruction, we do not want to narrow it too aggressively: we want to avoid the situation where the gain from narrowing is overshadowed by the inserted zexts and truncs. That's why I added the condition that the lshr must be followed by truncs. With this constraint, the resulting code after narrowing does seem to be slightly faster. Please let me know what you think!
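
To illustrate why narrowing is value-preserving in this pattern, here is a small sketch that models the two IR shapes in C++. It is not the code from the patch: the function names are made up, and the i16/i32 widths and the restriction to shift amounts below 16 are assumptions chosen to match the example above. The wide form mirrors zext, lshr on i32, trunc; the narrow form mirrors a single lshr on i16.

```cpp
#include <cstdint>

// Wide form: zext i16 -> i32, lshr on i32, trunc back to i16.
std::uint16_t shift_wide(std::uint16_t x, unsigned amount) {
    std::uint32_t wide = x;                      // zext
    std::uint32_t shifted = wide >> amount;      // lshr i32
    return static_cast<std::uint16_t>(shifted);  // trunc
}

// Narrow form: the shift stays within 16 bits of significant data.
std::uint16_t shift_narrow(std::uint16_t x, unsigned amount) {
    return static_cast<std::uint16_t>(x >> amount);
}

// Checks every 16-bit value and every shift amount below 16 (assumed range).
bool narrowing_is_lossless() {
    for (unsigned amount = 0; amount < 16; ++amount) {
        for (std::uint32_t v = 0; v <= 0xFFFF; ++v) {
            const auto x = static_cast<std::uint16_t>(v);
            if (shift_wide(x, amount) != shift_narrow(x, amount)) {
                return false;
            }
        }
    }
    return true;
}
```

Since the operand already fits in 16 bits, the upper bits of the wide value are zero and the trunc discards nothing, so the two forms agree. The open question is only whether the narrow form is cheaper once the surrounding zexts and truncs are accounted for.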

https://github.com/llvm/llvm-project/pull/119577