[llvm] [AMDGPU] Eliminate unnecessary packing in wider f16 vectors for sdwa/opsel-able instruction (PR #137137)

Fri Sep 12 03:36:50 PDT 2025

vg0204 wrote:

> So, I have finally come to the conclusion that my patch should move into a separate new pass that runs immediately after si-peephole-sdwa, for the following reasons.
> 
> 1. It cannot be treated as a peephole optimization, because the implementation performs a rigorous set of condition checks (across use-def chains) to find scenarios where the transformation would be profitable.
> 2. The use cases and coverage of the optimization in my patch yield a performance improvement that outweighs the added cost of having a new pass in the pipeline.
> 3. It is certainly possible to break this implementation into a series of peephole optimization patterns (as suggested by @frederik-h), but I doubt that approach would handle all, rather than just a few, of the generic scenarios listed in my test case file.
> 
> > I do think solving the original problem - that is, less register-efficient lowerings of SDWA/OPSEL-able operations that're being run on a vector <4 x [i//f]16> or the like - should be done
> 
> This is another possible approach (suggested by @krzysz00): tackling the problem at its source. So, @arsenm @jayfoad @Pierre-vh @frederik-h, and @krzysz00, which seems the better way to go?

Now I am at a crossroads on how exactly to proceed, so if you could advise, that would be great!

https://github.com/llvm/llvm-project/pull/137137