[PATCH] D30810: Preserve vec3 type.

JinGu Kang via Phabricator via cfe-commits <cfe-commits@lists.llvm.org>
Wed Mar 15 00:17:45 PDT 2017


jaykang10 added a comment.



> I believe the assumption is more practical: most part of upstream llvm targets only support vectors with even sized number of lanes. And in those cases you would have to expand to a 4x vector and leave the 4th element as undef anyway, so it was done in the front-end to get rid of it right away. Probably GPU targets do some special tricks here during legalization.

I compiled the code below, which current clang generates for vec3, using llc with the amdgcn target.

LLVM IR (vec3 --> vec4)

  define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
  entry:
    %castToVec4 = bitcast <3 x float>* %b to <4 x float>*
    %loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
    %storetmp = bitcast <3 x float>* %a to <4 x float>*
    store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16
    ret void
  }

SelectionDAG after legalization.

  Legalized selection DAG: BB#0 'foo:entry'
  SelectionDAG has 43 nodes:
    t0: ch = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
      t9: i64 = add t2, Constant:i64<40>
    t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
      t4: i64 = add t2, Constant:i64<36>
    t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
    t12: ch = TokenFactor t6:1, t10:1
        t32: v4i32 = BUILD_VECTOR t22, t25, t27, t29
      t21: v4f32 = bitcast t32
    t18: v4i32 = bitcast t21
    t22: i32,ch = load<LD4[%castToVec4](align=16)> t12, t10, undef:i32
    t24: i32 = add t10, Constant:i32<4>
    t25: i32,ch = load<LD4[%castToVec4+4]> t12, t24, undef:i32
    t26: i32 = add t24, Constant:i32<4>
    t27: i32,ch = load<LD4[%castToVec4+8](align=8)> t12, t26, undef:i32
      t28: i32 = add t26, Constant:i32<4>
    t29: i32,ch = load<LD4[%castToVec4+12]> t12, t28, undef:i32
    t31: ch = TokenFactor t22:1, t25:1, t27:1, t29:1
          t35: i32 = extract_vector_elt t18, Constant:i32<0>
        t36: ch = store<ST4[%storetmp](align=16)> t31, t35, t6, undef:i32
          t38: i32 = extract_vector_elt t18, Constant:i32<1>
          t39: i32 = add t6, Constant:i32<4>
        t40: ch = store<ST4[%storetmp+4]> t31, t38, t39, undef:i32
          t42: i32 = extract_vector_elt t18, Constant:i32<2>
          t44: i32 = add t6, Constant:i32<8>
        t45: ch = store<ST4[%storetmp+8](align=8)> t31, t42, t44, undef:i32
          t47: i32 = extract_vector_elt t18, Constant:i32<3>
          t49: i32 = add t6, Constant:i32<12>
        t50: ch = store<ST4[%storetmp+12]> t31, t47, t49, undef:i32
      t51: ch = TokenFactor t36, t40, t45, t50
    t17: ch = ENDPGM t51

As you can see, the SelectionDAG still has 4 loads and 4 stores.

Assembly output

  	.section	.AMDGPU.config
  	.long	47176
  	.long	11272257
  	.long	47180
  	.long	132
  	.long	47200
  	.long	0
  	.long	4
  	.long	0
  	.long	8
  	.long	0
  	.text
  	.globl	foo
  	.p2align	8
  	.type	foo,@function
  foo:                                    ; @foo
  ; BB#0:                                 ; %entry
  	s_load_dword s2, s[0:1], 0x9
  	s_load_dword s0, s[0:1], 0xa
  	s_mov_b32 s4, SCRATCH_RSRC_DWORD0
  	s_mov_b32 s5, SCRATCH_RSRC_DWORD1
  	s_mov_b32 s6, -1
  	s_mov_b32 s8, s3
  	s_mov_b32 s7, 0xe8f000
  	s_waitcnt lgkmcnt(0)
  	v_mov_b32_e32 v0, s0
  	buffer_load_dword v2, v0, s[4:7], s8 offen
  	buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
  	buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
  	buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
  	v_mov_b32_e32 v1, s2
  	s_waitcnt vmcnt(0)
  	buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
  	buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
  	buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
  	buffer_store_dword v2, v1, s[4:7], s8 offen
  	s_endpgm
  .Lfunc_end0:
  	.size	foo, .Lfunc_end0-foo
  
  	.section	.AMDGPU.csdata

As you can see, the assembly also has 4 loads and 4 stores. I do not think the GPU target handles it specially.

> My guess here is that targets that do care are looking through the vector shuffle and customizing to whatever seems the best to them. If you wanna change the default behavior you need some data to show that your model solves a real issue and actually brings benefits; do you have any real examples on where that happens, or why GPU targets haven't yet tried to change this? Maybe other custom front-ends based on clang do? Finding the historical reason (if any) should be a good start point.

I ran "git blame" and read the commit message; you can also see it at commit "c58dcdc8facb646d88675bb6fbcb5c787166c4be". It says the same thing as the comment in clang's code. I also wonder how the vec3 --> vec4 load/store has not caused problems. As in Akira's example, if a struct type has a float3 as one of its fields and the struct has the 'packed' attribute, the vec4 store overwrites the next field. The vec3 load/store generates more instructions, such as stores and extract_vector_elts, as in the SelectionDAG below.

LLVM IR for vec3

  define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
  entry:
    %0 = load <3 x float>, <3 x float>* %b, align 16
    store <3 x float> %0, <3 x float>* %a, align 16
    ret void
  }

SelectionDAG after type legalization for amdgcn (other targets have a similar SelectionDAG after type legalization, because the type legalizer widens the vector store)

  Type-legalized selection DAG: BB#0 'foo:entry'
  SelectionDAG has 24 nodes:
    t0: ch = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
      t4: i64 = add t2, Constant:i64<36>
    t6: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t4, undef:i64
      t9: i64 = add t2, Constant:i64<40>
    t10: i32,ch = load<LD4[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t9, undef:i64
      t12: ch = TokenFactor t6:1, t10:1
    t21: v4i32,ch = load<LD16[%b]> t12, t10, undef:i32
            t22: v2i64 = bitcast t21
          t24: i64 = extract_vector_elt t22, Constant:i32<0>
        t25: ch = store<ST8[%a](align=16)> t21:1, t24, t6, undef:i32
          t29: i32 = extract_vector_elt t21, Constant:i32<2>
          t27: i32 = add t6, Constant:i32<8>
        t30: ch = store<ST4[%a+8](align=8)> t21:1, t29, t27, undef:i32
      t33: ch = TokenFactor t25, t30
    t17: ch = ENDPGM t33

As you can see, the type legalizer handles the vec3 load/store properly: it does not write the 4th element. The vec3 load/store generates more instructions, but it has correct behavior. I am not 100% sure whether the vec3 --> vec4 load/store is correct, because no one has complained about it. But if the vec3 --> vec4 load/store is correct, llvm's type legalizer, or somewhere else in llvm's codegen, could follow that approach too to generate optimal code. As a result, I think it would be good for clang to have both behaviors, and I would like to stick with the option '-fpreserve-vec3' to switch between them easily.


https://reviews.llvm.org/D30810




