[llvm] [AMDGPU] In promote-alloca, if index is dynamic, sandwich load with bitcasts to reduce number of extractelements as they have large expansion in the backend. (PR #171253)

Tue Dec 9 07:13:06 PST 2025

ruiling wrote:

> > Vector element extract/insert needs to operate on a per-element basis, as in this example:
> > ```
> >   %alloca = alloca [32 x i16], align 16, addrspace(5)
> >   %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
> >   %load = load <8 x i16>, ptr addrspace(5) %gep, align 1
> > ```
> > Since we don't know whether `%gep` will be 16-byte aligned, you cannot bitcast `%alloca` to `<4 x i128>` and translate the `<8 x i16>` load into a single `extractelement`, so you need an alignment check to lower it this way. For the unaligned case, I think @arsenm's point is that we should optimize the register allocator or find other ways to simplify the IR. I think this change can only handle the aligned case properly. It would be good to have some tests for the unaligned case.
> 
> I believe GEPToVectorIndex only allows GEPs that map to a vector index. I will add an unaligned test case to make sure.

I think GEPToVectorIndex is used to compute the index into the vector type that the alloca was translated to. By "unaligned", I mean not aligned to the vector type you temporarily bitcast to. In the case below, the alloca was translated to `<32 x i16>`, but to turn the `load <8 x i16>` into the simplified form you bitcast it to `<4 x i128>`. "Unaligned" here means that the address of `%gep` is not 128-bit aligned, for example when `%idx` is not a multiple of 8.
```
 %alloca = alloca [32 x i16], align 16, addrspace(5)
 %gep = getelementptr i16, ptr addrspace(5) %alloca, i32 %idx
 %load = load <8 x i16>, ptr addrspace(5) %gep, align 2
```
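To make the aligned case concrete, here is a rough sketch of what I understand the bitcast sandwich would produce for the `<8 x i16>` load above. It assumes `%idx` is known to be a multiple of 8. The function name and value names are made up for illustration and are not taken from the patch.

```llvm
; Sketch only: assumes %idx is known to be a multiple of 8, so the
; <8 x i16> load covers exactly one i128 lane of the promoted vector.
; @sandwich_aligned, %vec, %wide, %lane and %elt are illustrative names.
define <8 x i16> @sandwich_aligned(<32 x i16> %vec, i32 %idx) {
  ; View the promoted <32 x i16> value as four 128-bit lanes.
  %wide = bitcast <32 x i16> %vec to <4 x i128>
  ; Convert the i16 element index to a lane index (8 x i16 per lane).
  %lane = lshr i32 %idx, 3
  ; One dynamic extractelement instead of eight.
  %elt = extractelement <4 x i128> %wide, i32 %lane
  ; Cast the lane back to the type the original load produced.
  %load = bitcast i128 %elt to <8 x i16>
  ret <8 x i16> %load
}
```

If `%idx` were 3, the eight `i16` values would straddle lanes 0 and 1, and no single `extractelement` on `<4 x i128>` could produce them.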
Hope I understand everything correctly.

https://github.com/llvm/llvm-project/pull/171253