[PATCH] D120129: [NVPTX] Enhance vectorization of ld.param & st.param

Thu Mar 17 15:18:28 PDT 2022

kovdan01 added a comment.

@tra Thanks for your comments! Updated the patch according the discussion about forcing alignment 16.

> I think we should be able to do that to all no-kernel functions if we're compiling without -fgpu-rdc. I think we do reduce visibility of non-kernels in that case, but it would be good to make sure.

Checked if we do reduce visibility in such cases, and looks like we do not. The following code:

  __device__ int foo(int a, int b, int c) {
    return (a + b) / c;
  }

Compiles to the following IR:

  ; ModuleID = 'device.cu'
  source_filename = "device.cu"
  target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Function Attrs: convergent mustprogress noinline nounwind optnone
  define dso_local noundef i32 @_Z3fooiii(i32 noundef %a, i32 noundef %b, i32 noundef %c) #0 {
  entry:
    %a.addr = alloca i32, align 4
    %b.addr = alloca i32, align 4
    %c.addr = alloca i32, align 4
    store i32 %a, i32* %a.addr, align 4
    store i32 %b, i32* %b.addr, align 4
    store i32 %c, i32* %c.addr, align 4
    %0 = load i32, i32* %a.addr, align 4
    %1 = load i32, i32* %b.addr, align 4
    %add = add nsw i32 %0, %1
    %2 = load i32, i32* %c.addr, align 4
    %div = sdiv i32 %add, %2
    ret i32 %div
  }

  attributes #0 = { convergent mustprogress noinline nounwind optnone "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="sm_35" "target-features"="+ptx75,+sm_35" }

  !llvm.module.flags = !{!0, !1, !2, !3}
  !llvm.ident = !{!4, !5}

  !0 = !{i32 2, !"SDK Version", [2 x i32] [i32 11, i32 5]}
  !1 = !{i32 1, !"wchar_size", i32 4}
  !2 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
  !3 = !{i32 7, !"frame-pointer", i32 2}
  !4 = !{!"clang version 15.0.0 (https://github.com/llvm/llvm-project.git 9879c555f21097aee15e73dd25bd89f652dba8ea)"}
  !5 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}

The compilation command that I use:

  clang++ --cuda-gpu-arch=sm_35 --cuda-path=/opt/cuda/ -S -emit-llvm -fno-gpu-rdc device.cu

Also, searching for `GPURelocatableDeviceCode` through LLVM codebase does not get results where this value is checked in context of reducing function visibility. I could implement that change, and IMHO that should be a separate patch.

Regarding the current patch – how is https://reviews.llvm.org/D118084 going? Can we merge this patch without waiting for your change about passing byval aggregates directly? As I already mentioned in my previous comment, I suppose the change is useful as far as compiler should work good with any type of IR maybe even generated by non-clang frontend.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120129/new/

https://reviews.llvm.org/D120129