[PATCH] D120129: [NVPTX] Enhance vectorization of ld.param & st.param
Daniil Kovalev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Mar 17 15:18:28 PDT 2022
kovdan01 added a comment.
@tra Thanks for your comments! Updated the patch according the discussion about forcing alignment 16.
> I think we should be able to do that to all no-kernel functions if we're compiling without -fgpu-rdc. I think we do reduce visibility of non-kernels in that case, but it would be good to make sure.
Checked if we do reduce visibility in such cases, and looks like we do not. The following code:
__device__ int foo(int a, int b, int c) {
return (a + b) / c;
}
Compiles to the following IR:
; ModuleID = 'device.cu'
source_filename = "device.cu"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; Function Attrs: convergent mustprogress noinline nounwind optnone
define dso_local noundef i32 @_Z3fooiii(i32 noundef %a, i32 noundef %b, i32 noundef %c) #0 {
entry:
%a.addr = alloca i32, align 4
%b.addr = alloca i32, align 4
%c.addr = alloca i32, align 4
store i32 %a, i32* %a.addr, align 4
store i32 %b, i32* %b.addr, align 4
store i32 %c, i32* %c.addr, align 4
%0 = load i32, i32* %a.addr, align 4
%1 = load i32, i32* %b.addr, align 4
%add = add nsw i32 %0, %1
%2 = load i32, i32* %c.addr, align 4
%div = sdiv i32 %add, %2
ret i32 %div
}
attributes #0 = { convergent mustprogress noinline nounwind optnone "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="sm_35" "target-features"="+ptx75,+sm_35" }
!llvm.module.flags = !{!0, !1, !2, !3}
!llvm.ident = !{!4, !5}
!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 11, i32 5]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
!3 = !{i32 7, !"frame-pointer", i32 2}
!4 = !{!"clang version 15.0.0 (https://github.com/llvm/llvm-project.git 9879c555f21097aee15e73dd25bd89f652dba8ea)"}
!5 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
The compilation command that I use:
clang++ --cuda-gpu-arch=sm_35 --cuda-path=/opt/cuda/ -S -emit-llvm -fno-gpu-rdc device.cu
Also, searching for `GPURelocatableDeviceCode` through LLVM codebase does not get results where this value is checked in context of reducing function visibility. I could implement that change, and IMHO that should be a separate patch.
Regarding the current patch – how is https://reviews.llvm.org/D118084 going? Can we merge this patch without waiting for your change about passing byval aggregates directly? As I already mentioned in my previous comment, I suppose the change is useful as far as compiler should work good with any type of IR maybe even generated by non-clang frontend.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D120129/new/
https://reviews.llvm.org/D120129
More information about the llvm-commits
mailing list