[PATCH] D80364: [amdgpu] Teach load widening to handle non-DWORD aligned loads.

Michael Liao via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu May 28 13:13:57 PDT 2020


hliao added a comment.

In D80364#2061176 <https://reviews.llvm.org/D80364#2061176>, @arsenm wrote:

> In D80364#2060318 <https://reviews.llvm.org/D80364#2060318>, @hliao wrote:
>
> > In D80364#2058794 <https://reviews.llvm.org/D80364#2058794>, @arsenm wrote:
> >
> > > I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?
> >
> > That's due to the misaligned load left after coalescing. This example is intentionally written to skip the LSV and to verify that the common widened load can be CSEd within the DAG. In practice, there won't be such input, as the 1st i16 load would be properly annotated with align 4.
>
>
> But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:
>
> after opt -instcombine:


But we only run `llc` in this test and skip all middle-end optimizations. That's why I said this test is impractical in the normal scenario, where the middle-end is always run. This test was added to address your previous concern about CSE of widened loads. If it's not required, I could remove it.
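For reference, a minimal sketch of the kind of llc-only test being discussed; the RUN line, function name, and comments are illustrative rather than the exact test in the patch, and the check lines are omitted:

  ; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s
  ; No opt invocation: the IR below reaches the backend exactly as written,
  ; so the first i16 load still carries only `align 2` and the
  ; LoadStoreVectorizer never gets a chance to merge the two loads.
  ; The backend must widen both i16 loads itself and CSE the resulting
  ; DWORD load in the DAG.

  declare i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()

  define amdgpu_kernel void @widen_and_cse(i32 addrspace(1)* %out) {
    %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
    %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
    %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
    %v1 = load i16, i16 addrspace(4)* %h1, align 2
    %e1 = zext i16 %v1 to i32
    %d2 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6
    %h2 = bitcast i8 addrspace(4)* %d2 to i16 addrspace(4)*
    %v2 = load i16, i16 addrspace(4)* %h2, align 2
    %e2 = zext i16 %v2 to i32
    %o = add nuw nsw i32 %e2, %e1
    store i32 %o, i32 addrspace(1)* %out, align 4
    ret void
  }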

> 
> 
>   define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 {
>     %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
>     %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
>     %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
>     %v1 = load i16, i16 addrspace(4)* %h1, align 4
>     %e1 = zext i16 %v1 to i32
>     %d2 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6
>     %h2 = bitcast i8 addrspace(4)* %d2 to i16 addrspace(4)*
>     %v2 = load i16, i16 addrspace(4)* %h2, align 2
>     %e2 = zext i16 %v2 to i32
>     %o = add nuw nsw i32 %e2, %e1
>     store i32 %o, i32 addrspace(1)* %out, align 4
>     ret void
>   }
>   
> 
> 
> Now the load's alignment is correctly inferred to be higher, so the vectorizer handles this as expected:
> 
> opt -instcombine -load-store-vectorizer -instsimplify gives:
> 
>   define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 {
>     %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
>     %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
>     %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
>     %1 = bitcast i16 addrspace(4)* %h1 to <2 x i16> addrspace(4)*
>     %2 = load <2 x i16>, <2 x i16> addrspace(4)* %1, align 4
>     %v11 = extractelement <2 x i16> %2, i32 0
>     %v22 = extractelement <2 x i16> %2, i32 1
>     %e1 = zext i16 %v11 to i32
>     %e2 = zext i16 %v22 to i32
>     %o = add nuw nsw i32 %e2, %e1
>     store i32 %o, i32 addrspace(1)* %out, align 4
>     ret void
>   }
>   
> 
> 
> So I think with the alignment patch, for any real-world example, this would do the right thing. Your example fails here because the IR wasn't pre-optimized. We shouldn't expect perfect optimization from the backend for arbitrary IR; we need to consider the entire pass pipeline.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80364/new/

https://reviews.llvm.org/D80364




