[PATCH] D30416: [BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently

Matt Arsenault via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 24 19:36:50 PDT 2017


arsenm added inline comments.


================
Comment at: include/llvm/Target/TargetLowering.h:1908
+  virtual bool isNarrowingExpensive(EVT /*VT1*/, EVT /*VT2*/) const {
+    return true;
+  }
----------------
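For context: the hook defaults to true, so shrinking is pessimistically disabled unless a target opts in. A minimal sketch of how the pass would consult it; the call-site shape is an assumption based on this declaration, not code from the patch:

  // Before shrinking a wide load/store, ask the target whether the
  // narrower access would be expensive; if so, leave the access wide.
  if (TLI->isNarrowingExpensive(WideVT, NarrowVT))
    return false;
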
wmi wrote:
> arsenm wrote:
> > efriedma wrote:
> > > wmi wrote:
> > > > efriedma wrote:
> > > > > I'm not sure I see the point of this hook.  Every in-tree target has cheap i8 load/store and aligned i16 load/store operations.  And we have existing hooks to check support for misaligned operations.
> > > > > 
> > > > > If there's some case I'm not thinking of, please add an example to the comment.
> > > > It is because of an amdgpu testcase, like the one below:
> > > > 
> > > > define void @s_sext_in_reg_i1_i16(i16 addrspace(1)* %out, i32 addrspace(2)* %ptr) #0 {
> > > >   %ld = load i32, i32 addrspace(2)* %ptr
> > > >   %in = trunc i32 %ld to i16
> > > >   %shl = shl i16 %in, 15
> > > >   %sext = ashr i16 %shl, 15
> > > >   store i16 %sext, i16 addrspace(1)* %out
> > > >   ret void
> > > > }
> > > > 
> > > > code with the patch:
> > > > 	s_load_dwordx2 s[4:5], s[0:1], 0x9
> > > > 	s_load_dwordx2 s[0:1], s[0:1], 0xb
> > > > 	s_mov_b32 s7, 0xf000
> > > > 	s_mov_b32 s6, -1
> > > > 	s_mov_b32 s2, s6
> > > > 	s_mov_b32 s3, s7
> > > > 	s_waitcnt lgkmcnt(0)
> > > > 	buffer_load_ushort v0, off, s[0:3], 0
> > > > 	s_waitcnt vmcnt(0)
> > > > 	v_bfe_i32 v0, v0, 0, 1
> > > > 	buffer_store_short v0, off, s[4:7], 0
> > > > 	s_endpgm
> > > > 
> > > > code without the patch:
> > > > 	s_load_dwordx2 s[4:5], s[0:1], 0x9
> > > > 	s_load_dwordx2 s[0:1], s[0:1], 0xb
> > > > 	s_mov_b32 s7, 0xf000
> > > > 	s_mov_b32 s6, -1
> > > > 	s_waitcnt lgkmcnt(0)
> > > > 	s_load_dword s0, s[0:1], 0x0
> > > > 	s_waitcnt lgkmcnt(0)
> > > > 	s_bfe_i32 s0, s0, 0x10000
> > > > 	v_mov_b32_e32 v0, s0
> > > > 	buffer_store_short v0, off, s[4:7], 0
> > > > 	s_endpgm
> > > > 
> > > > amdgpu codegen chooses buffer_load_ushort instead of s_load_dword and generates a longer code sequence. I know almost nothing about amdgpu, so I simply added the hook and focused on the architectures I am more familiar with until the patch is in better shape and stable.
> > > > 
> > > Huh, GPU targets are weird like that.  I would still rather turn it off for amdgpu, as opposed to leaving it off by default.
> > 32-bit loads should not be reduced to a shorter width. Using a buffer_load_ushort is definitely worse than using s_load_dword. There is a target hook that is supposed to avoid reducing load widths like this.
> Matt, thanks for the explanation.
> 
> I guess the hook is isNarrowingProfitable. However, the hook I need is a little different: I need to know whether narrowing is expensive enough to matter. On x86, isNarrowingProfitable reports that i32 --> i16 is not profitable, and perhaps slightly harmful, but it is not very harmful, so the benefit of narrowing may still outweigh the cost.
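To illustrate the distinction wmi is drawing, a hypothetical target could answer the two questions differently (both bodies below are illustrative assumptions, not in-tree code):

  // "Not profitable": i32 --> i16 narrowing is not a win by itself.
  bool MyTargetLowering::isNarrowingProfitable(EVT VT1, EVT VT2) const {
    return !(VT1 == MVT::i32 && VT2 == MVT::i16);
  }

  // ... but it is also not expensive, so the benefit of shrinking a
  // bitfield access can still outweigh the small cost.
  bool MyTargetLowering::isNarrowingExpensive(EVT VT1, EVT VT2) const {
    return false;
  }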
The hook I was thinking of was shouldReduceLoadWidth. When it can be used, s_load_dword goes through a different cache with much faster access than the buffer instructions.
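For reference, shouldReduceLoadWidth is declared in TargetLowering.h roughly as follows, and an AMDGPU-flavored override might look like the sketch below (the override body is illustrative only; the exact condition and the use of AMDGPUAS::CONSTANT_ADDRESS are assumptions, not the in-tree implementation):

  virtual bool shouldReduceLoadWidth(SDNode *Load, ISD::LoadExtType ExtTy,
                                     EVT NewVT) const {
    return true;
  }

  // Sketch: keep constant-address-space loads at 32 bits so they can stay
  // on the scalar path (s_load_dword) instead of becoming buffer loads.
  bool AMDGPUTargetLowering::shouldReduceLoadWidth(SDNode *N,
                                                   ISD::LoadExtType ExtTy,
                                                   EVT NewVT) const {
    const LoadSDNode *Load = cast<LoadSDNode>(N);
    if (Load->getAddressSpace() == AMDGPUAS::CONSTANT_ADDRESS &&
        NewVT.getStoreSizeInBits() < 32)
      return false;
    return true;
  }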


Repository:
  rL LLVM

https://reviews.llvm.org/D30416




