[PATCH] D94648: [amdgpu] Implement lower function LDS pass

Thu May 13 06:06:53 PDT 2021

JonChesterfield added a comment.

Just ran a bug to ground here. Replacing the inline asm with the donothing intrinsic is prettier, but it doesn't work as is. The flaw went unnoticed because the above was only tested at the IR and application level; it shows up quickly (if opaquely) at the executable unit test scale.

Given:

  target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
  target triple = "amdgcn-amd-amdhsa"

  @global_barrier_state = hidden addrspace(3) global i32 undef, align 4

  define i32 @rw() #0 {
  entry:
    %0 = atomicrmw add i32 addrspace(3)* @global_barrier_state, i32 1 acq_rel, align 4
    ret i32 %0
  }

  define amdgpu_kernel void @__device_start() {
  entry:
    %0 = call i32 @rw()
    ret void
  }

  attributes #0 = { noinline  }

This transform does exactly what it was intended to: the LDS variable is allocated at offset zero. However, the kernel metadata starts:

  	.amdhsa_kernel __device_start
  		.amdhsa_group_segment_fixed_size 0 ; should be 4, isn't
  	.end_amdhsa_kernel

If the inline asm is reintroduced, the reported size goes to 4. Similarly, if the test case that this was reduced from is modified to allocate 4 bytes more LDS than the metadata asks for, it works again.

I suspect something in hardware rounds the LDS allocation up to a boundary, so as long as the kernel looks like it uses some non-zero amount of LDS, the out-of-bounds read lands inside the allocated region.
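
To make that suspicion concrete, here is a minimal arithmetic sketch in Python. It is a toy model only, not a description of real hardware: the granularity value GRANULE and the helper names round_up and access_in_bounds are hypothetical and chosen purely for illustration. It shows why a requested size of 0 would leave the 4-byte access at offset 0 out of bounds, while any non-zero request would cover it once rounded up.

```python
# Toy model of the suspected behaviour; not a description of real hardware.
# GRANULE is a hypothetical allocation granularity chosen for illustration.
GRANULE = 128


def round_up(nbytes: int, granule: int = GRANULE) -> int:
    """Round nbytes up to the next multiple of granule (0 stays 0)."""
    return (nbytes + granule - 1) // granule * granule


def access_in_bounds(requested: int, offset: int, size: int,
                     granule: int = GRANULE) -> bool:
    """True if [offset, offset + size) fits in the rounded-up allocation."""
    return offset + size <= round_up(requested, granule)


if __name__ == "__main__":
    # The i32 from the example above lives at offset 0 and is 4 bytes wide.
    for requested in (0, 4):
        ok = access_in_bounds(requested, offset=0, size=4)
        print(f"group_segment_fixed_size={requested}: in bounds = {ok}")
```

Under this assumption the real granularity does not matter, as long as it is at least 4 bytes: only the zero-sized request fails to cover the variable.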

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D94648