[llvm] [AMDGPU] In promote-alloca, if index is dynamic, sandwich load with bitcasts to reduce number of extractelements as they have large expansion in the backend. (PR #171253)

Mon Dec 8 23:17:50 PST 2025

ruiling wrote:

> @ruiling do you know if the alignment check is strictly required here? As I understand, once alloca is promoted to vector, there is no need for alignment for extracting subvector.

Vector element extract/insert operates on whole elements. Take this example:
```
  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1
```
Since we don't know whether `%gep` will be aligned to 16 bytes, you cannot bitcast `%alloca` to `<4 x i128>` and translate the `<8 x i16>` load into a single `extractelement`. So you need the alignment check to lower it this way. For the unaligned case, I think @arsenm's point is that we should improve the register allocator or find other ways to simplify the IR. I think this change can only handle the aligned case properly. It would be good to have some tests for the unaligned case.
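
To make the alignment argument concrete, here is a small Python model of the byte arithmetic. It is only a sketch, not LLVM code: the helper names (`make_alloca`, `load_v8i16`, `extract_as_i128`) are hypothetical, and it assumes little-endian i16 elements. It treats the 64-byte alloca as four 16-byte lanes and shows that a 16-byte load matches a single lane only when the byte offset is a multiple of 16.

```python
import struct

NUM_ELEMS = 32    # [32 x i16]
LANE_BYTES = 16   # one i128 lane, same size as <8 x i16>


def make_alloca():
    # Assumption for illustration: fill the alloca with i16 values 0..31.
    return struct.pack("<32H", *range(NUM_ELEMS))


def load_v8i16(mem, byte_off):
    # Reference semantics: load <8 x i16> from an arbitrary byte offset.
    return struct.unpack_from("<8H", mem, byte_off)


def extract_as_i128(mem, byte_off):
    # Model of: bitcast to <4 x i128>, one extractelement, bitcast back.
    # Only expressible when the offset lands exactly on a lane boundary.
    if byte_off % LANE_BYTES != 0:
        return None
    lanes = [mem[i:i + LANE_BYTES] for i in range(0, len(mem), LANE_BYTES)]
    return struct.unpack("<8H", lanes[byte_off // LANE_BYTES])


mem = make_alloca()

# Aligned offset: the single-lane extract agrees with the plain load.
assert extract_as_i128(mem, 16) == load_v8i16(mem, 16)

# Unaligned offset: the load straddles lanes 0 and 1, so no single
# i128 element holds the result.
assert extract_as_i128(mem, 2) is None
```

In this model, an offset of 2 bytes reads elements 1 through 8, which span two lanes. That is the case the alignment check has to rule out.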

https://github.com/llvm/llvm-project/pull/171253