[PATCH] D80364: [amdgpu] Teach load widening to handle non-DWORD aligned loads.
Michael Liao via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri May 29 20:42:28 PDT 2020
hliao added a comment.
In D80364#2063603 <https://reviews.llvm.org/D80364#2063603>, @arsenm wrote:
> I did some experiments locally and think this can stay in AMDGPUCodeGenPrepare, and doesn't need the split pass. Since you restrict this widening to the case where you're rebasing the load anyway, I don't think this will cause the same problems with the vectorizer the previous IR load widening had (and may help it even?)
>
> test3 should also come back, but should have the explicit align 4 added to the load. This could also use some loads of i8, and <2 x i8>. We could also extend this to handle wider, sub-dword aligned types but that's a separate patch.
Scalar load widening should run after LSV to avoid generating redundant loads. Cases like a sequence of consecutive loads of `i16` benefit from that ordering. Here are the details
for 4 loads of i16:
ld.i16 (ptr + 0)
ld.i16 (ptr + 2)
ld.i16 (ptr + 4)
ld.i16 (ptr + 6)
If we run scalar load widening before LSV, then after widening we have
ld.i16 (ptr + 0)
ld.i32 (ptr + 0)
ld.i16 (ptr + 4)
ld.i32 (ptr + 4)
After LSV, we have
ld.i16 (ptr + 0)
ld.i32x2 (ptr + 0)
ld.i16 (ptr + 4)
Those two i16 loads are redundant. If we run scalar load widening after LSV, we won't have that result.
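The ordering argument above can be sketched with a toy model. This is not the real AMDGPU widening pass or the Load/Store Vectorizer; it is a hypothetical simplification where a load is an `(offset, bytes)` pair, widening rebases sub-DWORD loads at non-DWORD offsets, and "LSV" greedily chains contiguous loads of the same element size:

```python
def widen(loads):
    """Toy widening: a sub-DWORD load at a non-DWORD offset is rebased
    to the enclosing DWORD and widened to 4 bytes (i32); DWORD-aligned
    loads are left alone. Purely illustrative."""
    out = []
    for off, size in loads:
        if size < 4 and off % 4 != 0:
            out.append((off - off % 4, 4))  # rebase + widen to i32
        else:
            out.append((off, size))
    return out

def vectorize(loads):
    """Toy LSV: chain loads of the same element size whose offsets are
    contiguous into one wide load. Ignores the real pass's alignment
    and chain-length legality checks."""
    runs = []  # (start_offset, total_bytes, element_bytes)
    for off, size in sorted(loads):
        for i, (start, total, elt) in enumerate(runs):
            if start + total == off and elt == size:
                runs[i] = (start, total + size, elt)
                break
        else:
            runs.append((off, size, size))
    return [(start, total) for start, total, _ in runs]

# Four consecutive i16 loads at ptr+0, +2, +4, +6:
i16_loads = [(0, 2), (2, 2), (4, 2), (6, 2)]

# Widening before LSV leaves redundant i16 loads behind:
print(vectorize(widen(i16_loads)))  # [(0, 2), (0, 8), (4, 2)]

# Widening after LSV yields a single wide load:
print(widen(vectorize(i16_loads)))  # [(0, 8)]
```

In the first ordering the `(0, 2)` and `(4, 2)` entries correspond to the redundant `ld.i16` loads in the example, fully covered by the `(0, 8)` chain (the `ld.i32x2`); in the second ordering LSV consumes all four i16 loads first, so widening has nothing left to duplicate.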
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D80364/new/
https://reviews.llvm.org/D80364