[llvm] [AMDGPU] Eliminate unnecessary packing in wider f16 vectors for sdwa/opsel-able instruction (PR #137137)

Wed Aug 6 23:54:06 PDT 2025

vg0204 wrote:

So, I have finally come to the conclusion that my patch should be moved into a separate new pass that runs immediately after si-peephole-sdwa, for the following reasons.

1. It cannot be treated as a peephole optimization because of the way it is implemented: it performs a rigorous set of condition checks (across use-def chains) to find scenarios where the transformation would be profitable.
2. Given the use cases and coverage of the optimization in my patch, the performance improvement outweighs the increased cost of adding a new pass to the pipeline.
3. It is certainly possible to break this implementation into a series of peephole optimization patterns (as suggested by @frederik-h), but I doubt that approach would handle all of the generic scenarios listed in my test case file, rather than just a few of them.

> 
> I do think solving the original problem - that is, less register-efficient lowerings of SDWA/OPSEL-able operations that're being run on a vector <4 x [i//f]16> or the like - should be done

This is another possible approach (suggested by @krzysz00): tackle the problem at its source. So, @arsenm, @jayfoad, @Pierre-vh, @frederik-h, and @krzysz00, which seems the better way to go?

https://github.com/llvm/llvm-project/pull/137137