[PATCH] D31804: [AMDGPU] zero extend workitem id

Mon Apr 10 11:22:45 PDT 2017

rampitec added a comment.

In https://reviews.llvm.org/D31804#722616, @arsenm wrote:

> In https://reviews.llvm.org/D31804#722593, @rampitec wrote:
>
> > In https://reviews.llvm.org/D31804#722567, @arsenm wrote:
> >
> > > Doesn't the library already annotate these with the range metadata? We should probably tighten those bounds in a pass when the required workgroup size is known on the IR metadata
> >
> >
> > Generally library cannot know the workgroup size, it is the attribute on a kernel. Then clang produces amdgpu_flat_work_group_size, which is processed here. Too bad it is flat. There is also OpenCL specific reqd_work_group_size attribute which is now flattened and translated into amdgpu_flat_work_group_size by clang. Technically it shall be possible to get a more precise range with processing OpenCL specific reqd_work_group_size, but practically we do not support flat sizes more than 256, and AssertZExt cannot give a better range representation than 'extend from byte' anyway. A computeKnownBits could do it better, but it needs to process a target opcode, when after lowering it is just a load.
> >
> > On a side note, there are other calls which can be simplified, like get_local_size(). I do not know how to do it though, because these are just loads yet in the library, they have neither intrinsics nor target opcodes.
>
>
> The library can use the hardware maximum (which I think it does already), and a pass that knows about the attribute can further reduce it. It can do better than extend from byte, it isn't limited to MVT types. Range metadata is already generically lowered to an arbitrary bitwidth to AssertZExt.
>
> Doing it here doesn't really change anything fundamentally, but fixing the range metadata will allow the IR passes the same benefit and also wouldn't require reimplementing the logic to turn the range into AssertZExt.

I do not see any range metadata, and I also do not think this is a right way to go to use HW maximum. A kernel attribute generally capable to limit it more. For example:

  __attribute__((reqd_work_group_size(128, 1, 1)))
  kernel void zext_grp_size_256(global uint *a) {
    a[0] = get_local_id(0) & 0xff;
  }

compiled to:

  ; Function Attrs: nounwind
  define amdgpu_kernel void @zext_grp_size_256(i32 addrspace(1)* nocapture %a) local_unnamed_addr #0 !kernel_arg_addr_space !2 !kernel_arg_access_qual !3 !kernel_arg_type !4 !kernel_arg_base_type !4 !kernel_arg_type_qual !5 !kernel_arg_name !6 !reqd_work_group_size !7 {
  entry:
    %call = tail call i64 @_Z12get_local_idj(i32 0) #2
    %0 = trunc i64 %call to i32
    %conv = and i32 %0, 255
    store i32 %conv, i32 addrspace(1)* %a, align 4, !tbaa !8
    ret void
  }

  ; Function Attrs: alwaysinline nounwind readnone
  define linkonce_odr protected i64 @_Z12get_local_idj(i32) local_unnamed_addr #1 {
    %2 = tail call i64 @__ockl_get_local_id(i32 %0) #2
    ret i64 %2
  }

  attributes #1 = { alwaysinline nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+fp64-fp16-denormals,-fp32-denormals" "unsafe-fp-math"="false" "use-soft-float"="false" }

  attributes #2 = { nounwind readnone }

BTW, I do not see how to use AssertZExt with an arbitrary bitwidth...

Repository:
  rL LLVM

https://reviews.llvm.org/D31804