[PATCH] D30416: [BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently

Wei Mi via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 28 16:11:21 PDT 2017


wmi added inline comments.


================
Comment at: include/llvm/Target/TargetLowering.h:1908
+  virtual bool isNarrowingExpensive(EVT /*VT1*/, EVT /*VT2*/) const {
+    return true;
+  }
----------------
arsenm wrote:
> wmi wrote:
> > arsenm wrote:
> > > efriedma wrote:
> > > > wmi wrote:
> > > > > efriedma wrote:
> > > > > > I'm not sure I see the point of this hook.  Every in-tree target has cheap i8 load/store and aligned i16 load/store operations.  And we have existing hooks to check support for misaligned operations.
> > > > > > 
> > > > > > If there's some case I'm not thinking of, please add an example to the comment.
> > > > > It is because of some test cases for amdgpu, like the one below:
> > > > > 
> > > > > define void @s_sext_in_reg_i1_i16(i16 addrspace(1)* %out, i32 addrspace(2)* %ptr) #0 {
> > > > >   %ld = load i32, i32 addrspace(2)* %ptr
> > > > >   %in = trunc i32 %ld to i16
> > > > >   %shl = shl i16 %in, 15
> > > > >   %sext = ashr i16 %shl, 15
> > > > >   store i16 %sext, i16 addrspace(1)* %out
> > > > >   ret void
> > > > > }
> > > > > 
> > > > > code with the patch:
> > > > > 	s_load_dwordx2 s[4:5], s[0:1], 0x9
> > > > > 	s_load_dwordx2 s[0:1], s[0:1], 0xb
> > > > > 	s_mov_b32 s7, 0xf000
> > > > > 	s_mov_b32 s6, -1
> > > > > 	s_mov_b32 s2, s6
> > > > > 	s_mov_b32 s3, s7
> > > > > 	s_waitcnt lgkmcnt(0)
> > > > > 	buffer_load_ushort v0, off, s[0:3], 0
> > > > > 	s_waitcnt vmcnt(0)
> > > > > 	v_bfe_i32 v0, v0, 0, 1
> > > > > 	buffer_store_short v0, off, s[4:7], 0
> > > > > 	s_endpgm
> > > > > 
> > > > > code without the patch:
> > > > > 	s_load_dwordx2 s[4:5], s[0:1], 0x9
> > > > > 	s_load_dwordx2 s[0:1], s[0:1], 0xb
> > > > > 	s_mov_b32 s7, 0xf000
> > > > > 	s_mov_b32 s6, -1
> > > > > 	s_waitcnt lgkmcnt(0)
> > > > > 	s_load_dword s0, s[0:1], 0x0
> > > > > 	s_waitcnt lgkmcnt(0)
> > > > > 	s_bfe_i32 s0, s0, 0x10000
> > > > > 	v_mov_b32_e32 v0, s0
> > > > > 	buffer_store_short v0, off, s[4:7], 0
> > > > > 	s_endpgm
> > > > > 
> > > > > amdgpu codegen chooses to use buffer_load_ushort instead of s_load_dword and generates a longer code sequence. I know almost nothing about amdgpu, so I simply added the hook and am focusing only on the architectures I am more familiar with until the patch is in better shape and stable.
> > > > > 
> > > > Huh, GPU targets are weird like that.  I would still rather turn it off for amdgpu, as opposed to leaving it off by default.
> > > 32-bit loads should not be reduced to a shorter width. Using a buffer_load_ushort is definitely worse than using s_load_dword. There is a target hook that is supposed to avoid reducing load widths like this.
> > Matt, thanks for the explanation.
> > 
> > I guess the hook is isNarrowingProfitable. However, the hook I need is a little different: I need to know whether narrowing is expensive enough. isNarrowingProfitable on x86 says i32 --> i16 is not profitable and maybe slightly harmful, but not so harmful that the benefit of narrowing cannot outweigh the cost.
> The hook I was thinking of was shouldReduceLoadWidth. s_load_dword uses a different cache with much faster access than the buffer instruction, if it can be used.
shouldReduceLoadWidth is a hook in TargetLowering, but I need a hook in TargetLoweringBase so that it can be used from an LLVM IR pass.

I cannot change shouldReduceLoadWidth to be a hook in TargetLoweringBase because of the way x86 uses it, so I copied the logic in AMDGPUTargetLowering::shouldReduceLoadWidth to AMDGPUTargetLowering::isNarrowingExpensive. I can have shouldReduceLoadWidth call isNarrowingExpensive in an NFC change. Is that OK?
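A minimal sketch of that NFC delegation (just a sketch, assuming the EVT-based hook signature from this patch; the policy below is a placeholder, not AMDGPU's actual heuristic):

%%%
// Hypothetical sketch: keep the "is this narrowing harmful" policy in one
// place and let the DAG-level hook delegate to the new IR-level hook.
bool AMDGPUTargetLowering::isNarrowingExpensive(EVT OldVT, EVT NewVT) const {
  // Placeholder policy: shrinking a >=32-bit scalar access below 32 bits
  // risks replacing s_load_dword with the slower buffer path.
  return OldVT.getStoreSizeInBits() >= 32 && NewVT.getStoreSizeInBits() < 32;
}

bool AMDGPUTargetLowering::shouldReduceLoadWidth(SDNode *N,
                                                 ISD::LoadExtType ExtTy,
                                                 EVT NewVT) const {
  EVT OldVT = cast<LoadSDNode>(N)->getMemoryVT();
  return !isNarrowingExpensive(OldVT, NewVT);
}
%%%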


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:417-430
+  // However the pattern matching below is more complex than that because
+  // of the commutation and some matching preference we expect.
+  // for case: or(and(or(and(LI, Cst_1), MaskedVal_1), Cst_2), MaskedVal_2)
+  // and if MaskedVal_2 happens to be another and operator, we hope we can
+  // match or(and(LI, Cst_1), MaskedVal_1) to Val instead of MaskedVal.
+  bool OrAndPattern =
+      match(Val, m_c_Or(m_And(m_Load(LI), m_ConstantInt(Cst)),
----------------
wmi wrote:
> chandlerc wrote:
> > Even after reading your comment I'm not sure I understand what is driving the complexity of this match.
> > 
> > Can you explain (maybe just here in the review) what patterns you're trying to handle that are motivating this?
> > 
> > I'm wondering whether there is any canonicalization that can be leveraged (or added if not already there) to reduce the complexity of the pattern here. Or if we really have to handle this complexity, what the best way to write it and/or comment it so that readers understand the result.
> I'll borrow the template from your comment to explain: store(or(and(LargeVal, MaskConstant), SmallVal), address)
> The case is:
> 
> store(or_1(and_1(or_2(and_2(load, -65281), Val1), -256), and_3(Val2, 7)))
> 
> The two operands of "or_1" are "and_1" and "and_3", but the matcher doesn't know which of those two subtrees contains the LargeVal. I want or_2 to be matched to LargeVal. This is a common pattern after bitfield load/store coalescing.
> 
> But I realize, while explaining this to you, that I can split the complex pattern match above into two, which may be simpler:
> %%%
> bool OrAndPattern = match(Val, m_c_Or(m_And(m_Value(LargeVal), m_ConstantInt(Cst)),
>                                       m_Value(SmallVal)));
> if (match(SmallVal, m_c_And(m_c_Or(m_And(m_Value(), m_Value()), m_Value()),
>                             m_Value())))
>   std::swap(SmallVal, LargeVal);
> %%%
I find I still have to keep the original complex pattern. Now I remember where the real difficulty is:

For a case like store(or_1(and_1(or_2(and_2(load, -65281), Val1), -256), and_3(Val2, 7))), I want to match LargeVal to or_2(and_2(load, ...)).

But I cannot use match(Val, m_c_Or(m_And(m_c_Or(m_And(...))))) because there is no way to capture the intermediate results of the match; for example, I cannot bind LargeVal to the inner m_c_Or. So I have to split the match into multiple matches, and that is where the complexity comes from.
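Roughly, the split matching looks like this (a simplified sketch only, reusing the matcher forms from the patch rather than the exact code in MemAccessShrinking.cpp; assume it lives in a helper that takes Value *Val and returns bool):

%%%
Value *LargeVal = nullptr, *SmallVal = nullptr;
ConstantInt *Cst = nullptr;
LoadInst *LI = nullptr;

// Step 1: match the outer or(and(X, Cst_2), MaskedVal_2), with commutation.
if (!match(Val, m_c_Or(m_And(m_Value(LargeVal), m_ConstantInt(Cst)),
                       m_Value(SmallVal))))
  return false;

// Step 2: a second match on the bound operand recovers the inner
// or(and(load, Cst_1), MaskedVal_1).  This intermediate binding is what a
// single nested match cannot expose; if the load is not under LargeVal, the
// roles of the two or_1 operands have to be reconsidered.
ConstantInt *Cst1 = nullptr;
Value *MaskedVal1 = nullptr;
bool LoadUnderLargeVal =
    match(LargeVal, m_c_Or(m_And(m_Load(LI), m_ConstantInt(Cst1)),
                           m_Value(MaskedVal1)));
%%%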


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:467-470
+    // If MR.Width is not the length of legal type, we cannot
+    // store MaskedVal directly.
+    if (MR.Width != 8 && MR.Width != 16 && MR.Width != 32)
+      MR.ShrinkWithMaskedVal = false;
----------------
wmi wrote:
> chandlerc wrote:
> > Should this be testing against the `DataLayout` rather than hard coded `8`, `16`, and `32`? What if 64 bits is legal and that's the width of the MR?
> That is better. Will fix it.
For x86-64, DataLayout works fine. However, for other architectures, like ARM, the datalayout is:

target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"

DataLayout::LegalIntWidths only contains the widths of native integer types, given by "n32", so 32 is the only legal integer width for ARM.

For x86-64, the datalayout is:
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

Because of "n8:16:32:64", 8, 16, 32, and 64 are all legal integer widths.
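For reference, the DataLayout-based check would look something like this (a sketch, assuming the pass has a DL reference handy):

%%%
// Query the target's native integer widths instead of hard-coding 8/16/32.
// DataLayout::isLegalInteger() consults the "n..." entries, so on ARM
// ("n32") only 32 passes, while on x86-64 ("n8:16:32:64") 8, 16, 32 and 64
// all pass.
if (!DL.isLegalInteger(MR.Width))
  MR.ShrinkWithMaskedVal = false;
%%%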


Repository:
  rL LLVM

https://reviews.llvm.org/D30416




