[PATCH] D30810: Preserve vec3 type.
JinGu Kang via Phabricator via cfe-commits
cfe-commits@lists.llvm.org
Wed Mar 15 00:17:45 PDT 2017
jaykang10 added a comment.
> I believe the assumption is more practical: most part of upstream llvm targets only support vectors with even sized number of lanes. And in those cases you would have to expand to a 4x vector and leave the 4th element as undef anyway, so it was done in the front-end to get rid of it right away. Probably GPU targets do some special tricks here during legalization.
I compiled the IR below, which current clang generates for vec3, with llc for the amdgcn target.
LLVM IR (vec3 --> vec4)
define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
%castToVec4 = bitcast <3 x float>* %b to <4 x float>*
%loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
%storetmp = bitcast <3 x float>* %a to <4 x float>*
store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16
ret void
}
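For reference, source along the following lines produces that IR when compiled with optimization enabled; this is only a sketch using clang's ext_vector_type (the typedef name and function body are mine, not the exact test case from this review):
typedef float float3 __attribute__((ext_vector_type(3)));
/* Copy one vec3 through pointers. Current clang widens both the load and the
   store to <4 x float>, which is what the IR above shows after optimization. */
void foo(float3 *a, float3 *b) {
  *a = *b;
}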
SelectionDAG after legalization.
Legalized selection DAG: BB#0 'foo:entry'
SelectionDAG has 43 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
t9: i64 = add t2, Constant:i64<40>
t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
t4: i64 = add t2, Constant:i64<36>
t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
t12: ch = TokenFactor t6:1, t10:1
t32: v4i32 = BUILD_VECTOR t22, t25, t27, t29
t21: v4f32 = bitcast t32
t18: v4i32 = bitcast t21
t22: i32,ch = load<LD4[%castToVec4](align=16)> t12, t10, undef:i32
t24: i32 = add t10, Constant:i32<4>
t25: i32,ch = load<LD4[%castToVec4+4]> t12, t24, undef:i32
t26: i32 = add t24, Constant:i32<4>
t27: i32,ch = load<LD4[%castToVec4+8](align=8)> t12, t26, undef:i32
t28: i32 = add t26, Constant:i32<4>
t29: i32,ch = load<LD4[%castToVec4+12]> t12, t28, undef:i32
t31: ch = TokenFactor t22:1, t25:1, t27:1, t29:1
t35: i32 = extract_vector_elt t18, Constant:i32<0>
t36: ch = store<ST4[%storetmp](align=16)> t31, t35, t6, undef:i32
t38: i32 = extract_vector_elt t18, Constant:i32<1>
t39: i32 = add t6, Constant:i32<4>
t40: ch = store<ST4[%storetmp+4]> t31, t38, t39, undef:i32
t42: i32 = extract_vector_elt t18, Constant:i32<2>
t44: i32 = add t6, Constant:i32<8>
t45: ch = store<ST4[%storetmp+8](align=8)> t31, t42, t44, undef:i32
t47: i32 = extract_vector_elt t18, Constant:i32<3>
t49: i32 = add t6, Constant:i32<12>
t50: ch = store<ST4[%storetmp+12]> t31, t47, t49, undef:i32
t51: ch = TokenFactor t36, t40, t45, t50
t17: ch = ENDPGM t51
As you can see, the SelectionDAG still has 4 loads and 4 stores.
Assembly output
.section .AMDGPU.config
.long 47176
.long 11272257
.long 47180
.long 132
.long 47200
.long 0
.long 4
.long 0
.long 8
.long 0
.text
.globl foo
.p2align 8
.type foo,@function
foo: ; @foo
; BB#0: ; %entry
s_load_dword s2, s[0:1], 0x9
s_load_dword s0, s[0:1], 0xa
s_mov_b32 s4, SCRATCH_RSRC_DWORD0
s_mov_b32 s5, SCRATCH_RSRC_DWORD1
s_mov_b32 s6, -1
s_mov_b32 s8, s3
s_mov_b32 s7, 0xe8f000
s_waitcnt lgkmcnt(0)
v_mov_b32_e32 v0, s0
buffer_load_dword v2, v0, s[4:7], s8 offen
buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
v_mov_b32_e32 v1, s2
s_waitcnt vmcnt(0)
buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
buffer_store_dword v2, v1, s[4:7], s8 offen
s_endpgm
.Lfunc_end0:
.size foo, .Lfunc_end0-foo
.section .AMDGPU.csdata
As you can see, the assembly also has 4 loads and 4 stores. I think the GPU target does not handle it specially.
> My guess here is that targets that do care are looking through the vector shuffle and customizing to whatever seems the best to them. If you wanna change the default behavior you need some data to show that your model solves a real issue and actually brings benefits; do you have any real examples on where that happens, or why GPU targets haven't yet tried to change this? Maybe other custom front-ends based on clang do? Finding the historical reason (if any) should be a good start point.
I did "git blame" and I read the commit's message. You can also see it with "c58dcdc8facb646d88675bb6fbcb5c787166c4be". It is same with clang code's comment. I also wonder how the vec3 --> vec4 load/store has not caused problems. As Akira's example, if struct type has float3 type as one of fields and it has 'packed' attribute, it overwrites next field. The vec3 load/store generates more instructions like stores and extract_vectors like below SelectionDAG.
LLVM IR for vec3
define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
entry:
%0 = load <3 x float>, <3 x float>* %b, align 16
store <3 x float> %0, <3 x float>* %a, align 16
ret void
}
SelectionDAG after type legalization for amdgcn (other targets have a similar SelectionDAG after type legalization because the type legalizer widens the vector store)
Type-legalized selection DAG: BB#0 'foo:entry'
SelectionDAG has 24 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
t4: i64 = add t2, Constant:i64<36>
t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
t9: i64 = add t2, Constant:i64<40>
t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
t12: ch = TokenFactor t6:1, t10:1
t21: v4i32,ch = load<LD16[%b]> t12, t10, undef:i32
t22: v2i64 = bitcast t21
t24: i64 = extract_vector_elt t22, Constant:i32<0>
t25: ch = store<ST8[%a](align=16)> t21:1, t24, t6, undef:i32
t29: i32 = extract_vector_elt t21, Constant:i32<2>
t27: i32 = add t6, Constant:i32<8>
t30: ch = store<ST4[%a+8](align=8)> t21:1, t29, t27, undef:i32
t33: ch = TokenFactor t25, t30
t17: ch = ENDPGM t33
As you can see, the type legalizer handles the vec3 load/store properly: it does not write the 4th element. The vec3 load/store generates more instructions, but it has correct behavior. I am not 100% sure whether the vec3 --> vec4 load/store is correct or not, because no one has complained about it. But if the vec3 --> vec4 load/store is correct, LLVM's type legalizer, or somewhere else in LLVM's codegen, could follow the same approach to generate optimal code. As a result, I think it would be good for clang to have both behaviors, and I would like to stick to the option "-fpreserve-vec3" to switch between them easily.
https://reviews.llvm.org/D30810