[llvm-dev] ROCm module from LLVM AMDGPU backend
Frank Winter via llvm-dev
llvm-dev at lists.llvm.org
Wed Apr 22 13:57:30 PDT 2020
Hi,
I'm trying to launch a GPU kernel which was compiled by the LLVM
AMDGPU backend. Currently I'm having no success with it and I was
hoping someone on this list might have an idea.
It seems that TensorFlow is doing a similar thing, so I was reading
the TensorFlow code on GitHub, and I believe the following setup is
pretty close in the vital parts:
1) Compile an LLVM IR module (see below) with the AMDGPU backend to a
'module.o' file, using this triple/CPU:
llvm::Triple TheTriple;
TheTriple.setArch (llvm::Triple::ArchType::amdgcn);
TheTriple.setVendor (llvm::Triple::VendorType::AMD);
TheTriple.setOS (llvm::Triple::OSType::AMDHSA);
std::string CPUStr("gfx906");
LLVM IR passes that I use:
TargetLibraryInfoWrapperPass
TargetMachine->addPassesToEmitFile with CGFT_ObjectFile
2) The LLVM linker generates a shared library, invoked through a 'system()' call:
ld.lld -shared module.o -o module.so
3) Reading this shared module back into a 'vector<uint8> shared'
4) Using HIP to load this module:
hipModule_t module;
ret = hipModuleLoadData( &module , shared.data() );
(this returns hipSuccess)
5) Trying to get a HIP function:
hipFunction_t kernel;
ret = hipModuleGetFunction(&kernel, module, "kernel" );
... and this fails with HIP error code 500!?
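Since hipModuleLoadData() receives only raw bytes, it may help to sanity-check the buffer from step 3 before handing it to HIP. The following is a minimal sketch using only the C++ standard library; the helper names readFile and looksLikeAmdgpuElf are hypothetical and not part of HIP or LLVM. It assumes a little-endian ELF image and checks only the ELF magic and the e_machine field against EM_AMDGPU (224). It does not prove that the module contains a launchable kernel.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read a whole file into a byte vector (step 3 above).
// Returns an empty vector if the file cannot be opened.
std::vector<std::uint8_t> readFile(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}

// True if the buffer starts with the ELF magic and its e_machine field
// (bytes 18 and 19 of the header, little-endian) equals EM_AMDGPU (224).
bool looksLikeAmdgpuElf(const std::vector<std::uint8_t> &buf) {
    constexpr std::uint16_t kEmAmdgpu = 224;
    if (buf.size() < 20) return false;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return false;
    const std::uint16_t machine =
        static_cast<std::uint16_t>(buf[18] | (buf[19] << 8));
    return machine == kEmAmdgpu;
}
```

In the setup above this would be called as looksLikeAmdgpuElf(readFile("module.so")) right before step 4.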
I believe the vital steps here concerning ROCm are similar
(identical?) to what's in TensorFlow, but I can't get it to work.
I have to admit that I did not build TensorFlow to see if the AMD GPU
bits actually work. I read the comments and some say that it
comes with some performance overhead. Performance isn't the point at
the moment - I'm working on a proof-of-concept.
My test machine has an 'AMD gfx906' card installed.
Digging deeper, the hipModule_t is a pointer to ihipModule_t and
printing out the values after loading the module gives
ihip->fileName =
ihip->hash = 3943538976062281088
ihip->kernargs.size() = 0
ihip->executable.handle = 42041072
It's not telling me much. I'm not sure what to do with the handle for the
executable.
Any ideas what could be tried next?
Frank
--------------------------------------------------------------
LLVM IR module
target datalayout =
"e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
define void @kernel(i1 %arg0, i32 %arg1, i32 %arg2, i32 %arg3, i1 %arg4,
i32* %arg5, i1* %arg6, float* %arg7, float* %arg8, float* %arg9) {
entrypoint:
%0 = sext i1 %arg4 to i32
%1 = xor i32 -1, %0
%2 = call i32 @llvm.amdgcn.workitem.id.x()
%3 = icmp sge i32 %2, %arg1
br i1 %3, label %L0, label %L1
L0: ; preds = %entrypoint
ret void
L1: ; preds = %entrypoint
%4 = trunc i32 %1 to i1
br i1 %4, label %L3, label %L4
L2: ; preds = %L6, %L5, %L4
%5 = phi i32 [ %7, %L4 ], [ %8, %L5 ], [ %2, %L6 ]
br i1 %arg0, label %L7, label %L8
L3: ; preds = %L1
br i1 %arg0, label %L5, label %L6
L4: ; preds = %L1
%6 = getelementptr i32, i32* %arg5, i32 %2
%7 = load i32, i32* %6
br label %L2
L5: ; preds = %L3
%8 = add nsw i32 %2, %arg2
br label %L2
L6: ; preds = %L3
br label %L2
L7: ; preds = %L2
%9 = icmp sgt i32 %5, %arg3
br i1 %9, label %L12, label %L13
L8: ; preds = %L2
%10 = getelementptr i1, i1* %arg6, i32 %5
%11 = load i1, i1* %10
%12 = sext i1 %11 to i32
%13 = xor i32 -1, %12
%14 = trunc i32 %13 to i1
br i1 %14, label %L10, label %L11
L9: ; preds = %L15, %L11
%15 = add nsw i32 0, %5
%16 = add nsw i32 0, %5
%17 = getelementptr float, float* %arg8, i32 %16
%18 = load float, float* %17
%19 = add nsw i32 0, %5
%20 = getelementptr float, float* %arg9, i32 %19
%21 = load float, float* %20
%22 = fmul float %18, %21
%23 = getelementptr float, float* %arg7, i32 %15
store float %22, float* %23
ret void
L10: ; preds = %L8
ret void
L11: ; preds = %L8
br label %L9
L12: ; preds = %L7
ret void
L13: ; preds = %L7
%24 = icmp slt i32 %5, %arg2
br i1 %24, label %L14, label %L15
L14: ; preds = %L13
ret void
L15: ; preds = %L13
br label %L9
}
; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.amdgcn.workitem.id.x() #0
attributes #0 = { nounwind readnone speculatable }
------------------------------------------------------------------------------
The following is the assembly output the AMDGPU backend generates:
output: .text
.amdgcn_target "amdgcn-amd-amdhsa--gfx906"
.globl kernel
.p2align 2
.type kernel,@function
kernel:
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
v_and_b32_e32 v4, 1, v4
v_cmp_eq_u32_e64 s[4:5], 1, v4
v_and_b32_e32 v0, 1, v0
v_and_b32_e32 v4, 0x3ff, v15
v_cmp_eq_u32_e32 vcc, 1, v0
v_cmp_lt_i32_e64 s[6:7], v4, v1
s_and_saveexec_b64 s[8:9], s[6:7]
s_cbranch_execz BB0_16
BB0_1:
s_and_saveexec_b64 s[6:7], s[4:5]
s_xor_b64 s[6:7], exec, s[6:7]
s_cbranch_execz BB0_3
BB0_2:
v_lshlrev_b32_e32 v0, 2, v4
v_add_co_u32_e64 v0, s[4:5], v5, v0
v_addc_co_u32_e64 v1, s[4:5], 0, v6, s[4:5]
flat_load_dword v0, v[0:1]
BB0_3:
s_or_saveexec_b64 s[4:5], s[6:7]
s_xor_b64 exec, exec, s[4:5]
s_cbranch_execz BB0_7
BB0_4:
s_xor_b64 s[6:7], vcc, -1
s_waitcnt vmcnt(0) lgkmcnt(0)
v_add_u32_e32 v0, v4, v2
s_and_saveexec_b64 s[10:11], s[6:7]
s_xor_b64 s[6:7], exec, s[10:11]
BB0_5:
v_mov_b32_e32 v0, v4
BB0_6:
s_or_b64 exec, exec, s[6:7]
BB0_7:
s_or_b64 exec, exec, s[4:5]
s_xor_b64 s[6:7], vcc, -1
s_mov_b64 s[4:5], 0
s_and_saveexec_b64 s[10:11], s[6:7]
s_xor_b64 s[6:7], exec, s[10:11]
s_cbranch_execz BB0_9
BB0_8:
s_waitcnt vmcnt(0) lgkmcnt(0)
v_ashrrev_i32_e32 v1, 31, v0
v_add_co_u32_e32 v4, vcc, v7, v0
v_addc_co_u32_e32 v5, vcc, v8, v1, vcc
flat_load_ubyte v1, v[4:5]
s_waitcnt vmcnt(0) lgkmcnt(0)
v_and_b32_e32 v1, 1, v1
v_cmp_eq_u32_e32 vcc, 1, v1
s_and_b64 s[4:5], vcc, exec
BB0_9:
s_or_saveexec_b64 s[6:7], s[6:7]
s_xor_b64 exec, exec, s[6:7]
s_cbranch_execz BB0_13
BB0_10:
s_waitcnt vmcnt(0) lgkmcnt(0)
v_cmp_le_i32_e32 vcc, v0, v3
s_mov_b64 s[12:13], s[4:5]
s_and_saveexec_b64 s[10:11], vcc
BB0_11:
v_cmp_ge_i32_e32 vcc, v0, v2
s_andn2_b64 s[12:13], s[4:5], exec
s_and_b64 s[14:15], vcc, exec
s_or_b64 s[12:13], s[12:13], s[14:15]
BB0_12:
s_or_b64 exec, exec, s[10:11]
s_andn2_b64 s[4:5], s[4:5], exec
s_and_b64 s[10:11], s[12:13], exec
s_or_b64 s[4:5], s[4:5], s[10:11]
BB0_13:
s_or_b64 exec, exec, s[6:7]
s_and_saveexec_b64 s[6:7], s[4:5]
s_cbranch_execz BB0_15
BB0_14:
s_waitcnt vmcnt(0) lgkmcnt(0)
v_ashrrev_i32_e32 v1, 31, v0
v_lshlrev_b64 v[0:1], 2, v[0:1]
v_add_co_u32_e32 v2, vcc, v11, v0
v_addc_co_u32_e32 v3, vcc, v12, v1, vcc
flat_load_dword v4, v[2:3]
v_add_co_u32_e32 v2, vcc, v13, v0
v_addc_co_u32_e32 v3, vcc, v14, v1, vcc
flat_load_dword v2, v[2:3]
v_add_co_u32_e32 v0, vcc, v9, v0
v_addc_co_u32_e32 v1, vcc, v10, v1, vcc
s_waitcnt vmcnt(0) lgkmcnt(0)
v_mul_f32_e32 v2, v4, v2
flat_store_dword v[0:1], v2
BB0_15:
s_or_b64 exec, exec, s[6:7]
BB0_16:
s_or_b64 exec, exec, s[8:9]
s_waitcnt vmcnt(0) lgkmcnt(0)
s_setpc_b64 s[30:31]
.Lfunc_end0:
.size kernel, .Lfunc_end0-kernel
.section ".note.GNU-stack"
.amdgpu_metadata
---
amdhsa.kernels: []
amdhsa.version:
- 1
- 0
...
.end_amdgpu_metadata
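The metadata above lists no kernels (amdhsa.kernels: []). The AMDGPU backend generates entries in this list for functions that use the amdgpu_kernel calling convention, which @kernel in the IR above does not, so the empty list may be related to the failing lookup of "kernel"; this is a hypothesis, not a confirmed diagnosis. The sketch below, using only the C++ standard library, flags the condition in emitted assembly text before linking. The function name hasEmptyKernelList is hypothetical, and the check is a plain text scan, not a YAML parser.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Returns true if the assembly text contains an "amdhsa.kernels:" line
// whose value, ignoring whitespace, is exactly "[]".
bool hasEmptyKernelList(const std::string &asmText) {
    const std::string key = "amdhsa.kernels:";
    std::istringstream lines(asmText);
    std::string line;
    while (std::getline(lines, line)) {
        const std::string::size_type pos = line.find(key);
        if (pos == std::string::npos) continue;
        // Collect everything after the key on this line, minus whitespace.
        std::string value;
        for (char c : line.substr(pos + key.size())) {
            if (!std::isspace(static_cast<unsigned char>(c))) value += c;
        }
        return value == "[]";
    }
    return false;  // no amdhsa.kernels line found at all
}
```

A true result for the assembly output of a module suggests that the backend did not record any function as a kernel entry point.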
-----------------------------------------------------------------------
rocminfo output:
Agents 1 and 2 are the host's Intel CPUs; Agents 3 to 6 each look like:
*******
Agent 3
*******
Name: gfx906
Marketing Name: Vega 20
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26273(0x66a1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1725
BDFID: 35328
Internal Node ID: 2
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32