[PATCH] D30416: [BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently

Chandler Carruth via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed May 10 17:41:47 PDT 2017


chandlerc added a comment.

Trying to provide answers to the open questions here...



================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:104
+  static char ID; // Pass identification, replacement for typeid
+  MemAccessShrinkingPass(const TargetMachine *TM = nullptr)
+      : FunctionPass(ID), TM(TM) {
----------------
wmi wrote:
> chandlerc wrote:
> > Still should omit the `= nullptr` here since this is an internal type.
> I cannot omit it because in INITIALIZE_TM_PASS_END, callDefaultCtor<passName> requires the parameter to have a default value.
Ah, right, I forgot that about the pass initialization. Sorry for the noise!


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:218-220
+  unsigned MaskMidZeros = !MaskLeadOnes
+                              ? Mask.countLeadingZeros()
+                              : Mask.ashr(MaskTrailOnes).countTrailingZeros();
----------------
wmi wrote:
> chandlerc wrote:
> > I'm having trouble understanding the logic here in the case where there are leading ones. Here is my reasoning, but maybe I've gotten something wrong here:
> > 
> > Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?
> > 
> > Put differently, arithmetic shift is required to not change the most significant bit, so doing an arithmetic shift right based on how many ones are trailing, seems like it will never change the count of trailing zeros.
> > 
> > If this is correct, then this is a bug and you should add some test cases that will hit this bug.
> > 
> > But regardless of whether my understanding is correct or there is a bug here, I think this can be written in a more obvious way:
> > 
> >   unsigned MaskMidZeros = BitSize - (MaskLeadingOnes + MaskTrailingOnes);
> > 
> > And then directly testing whether they are all zero:
> > 
> >   if (Mask == APInt::getBitsSet(BitSize, MaskLeadingOnes,
> >                                 MaskLeadingOnes + MaskMidZeros)) {
> > Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?
> 
> I think shifting right removes trailing ones? And after the shift (Mask.ashr(MaskTrailOnes)), the middle zeros are at the least significant bits, so they are trailing zeros, right?
> 
> But like you said, I should rule out the all-zero/all-one cases separately so the logic becomes clearer.
Ah, ok, this makes sense to me now. I had confused myself thinking about it. Anyways, the simpler formulation will avoid any future reader confusion.
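For concreteness, here is a sketch of the simpler formulation, using plain `uint64_t` in place of `APInt` (names are illustrative, not the patch's actual code); the all-zero and all-one masks are ruled out up front, as discussed:

```cpp
#include <cassert>
#include <cstdint>

// Count ones from the most / least significant end of a 64-bit value.
static unsigned countLeadingOnes(uint64_t V) {
  unsigned N = 0;
  for (int I = 63; I >= 0 && ((V >> I) & 1); --I)
    ++N;
  return N;
}

static unsigned countTrailingOnes(uint64_t V) {
  unsigned N = 0;
  for (int I = 0; I < 64 && ((V >> I) & 1); ++I)
    ++N;
  return N;
}

// Returns true if Mask is leading ones, one contiguous run of middle
// zeros, then trailing ones (the shape the shrinking looks for).
static bool isOnesZerosOnes(uint64_t Mask, unsigned &MidZeros) {
  const unsigned BitSize = 64;
  if (Mask == 0 || Mask == ~0ULL)
    return false; // handled separately by the caller
  unsigned LeadOnes = countLeadingOnes(Mask);
  unsigned TrailOnes = countTrailingOnes(Mask);
  if (LeadOnes + TrailOnes == 0)
    return false; // no run of ones touches either end
  MidZeros = BitSize - (LeadOnes + TrailOnes);
  // Rebuild the candidate band of zeros at bits
  // [TrailOnes, TrailOnes + MidZeros) and compare, mirroring the
  // APInt::getBitsSet-based check suggested above.
  uint64_t ZeroBand = ((1ULL << MidZeros) - 1) << TrailOnes;
  return Mask == ~ZeroBand;
}
```

With this shape, `MidZeros` is exactly the width of the bitfield being stored through the mask.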


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:370-371
+  MemoryAccess *DefiningAccess = MSSAWalker->getClobberingMemoryAccess(&To);
+  if (FromAccess != DefiningAccess &&
+      MSSA->dominates(FromAccess, DefiningAccess))
+    return true;
----------------
wmi wrote:
> chandlerc wrote:
> > Maybe this is just a strange API on MemorySSA, but typically I wouldn't expect a lack of dominance to indicate that no access between two points exists.
> > 
> > How does MemorySSA model a pattern that looks like:
> > 
> >   From  x 
> >    \   /
> >     \ /
> >      A
> >      |
> >      |
> >      To
> > 
> > Where `A` is a defining access, is between `From` and `To`, but I wouldn't expect `From` to dominate `A` because there is another predecessor `x`.
> The case will not happen because we ensure `From` dominates `To` before calling the function. You are right, it is better to add an assertion at the entry of the function to prevent misuse of the API.
Ok, while that makes sense, it still seems counter-intuitive in terms of how to use MemorySSA based on my limited understanding.

I would have expected essentially walking up the defs from the use until either a clobber is found or the 'from' is found. One has to come first, and whichever is first dictates whether there is a clobber. Essentially, I would expect to use the *SSA* properties to answer these questions rather than the *dominance* or control flow properties. But I'm happy if folks more deeply familiar with MemorySSA can explain better why this is the right way to use it, as I'm still new to this infrastructure in LLVM.
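The expected walk can be sketched with a toy model (this is not the real MemorySSA API; the struct and names are invented): starting from the access `To` depends on, follow defining accesses upward, and whichever of `From` or a clobber is reached first answers the query.

```cpp
#include <cassert>

// Toy stand-in for a MemorySSA access: each access points at the access
// that defines the memory state it reads, forming a chain up to
// live-on-entry (nullptr here).
struct Access {
  Access *Defining = nullptr; // next access up the def chain
  bool Clobbers = false;      // does this access clobber the location?
};

// Returns true if the walk reaches From before any clobbering access,
// i.e. nothing between From and To clobbers the location.
bool reachesFromBeforeClobber(Access *From, Access *ToDef) {
  for (Access *A = ToDef; A; A = A->Defining) {
    if (A == From)
      return true;  // From comes first: no intervening clobber
    if (A->Clobbers)
      return false; // a clobber comes first
  }
  return false; // walked off live-on-entry without seeing From
}
```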


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:596-606
+    StoreInst *NewSI = cast<StoreInst>(SI.clone());
+    NewSI->setOperand(0, SInfo.LargeVal);
+    NewSI->setOperand(1, Ptr);
+    Builder.Insert(NewSI);
+    DEBUG(dbgs() << "MemShrink: Insert" << *NewSI << " before" << SI << "\n");
+    // MemorySSA update for the new store.
+    MemoryDef *OldMemAcc = cast<MemoryDef>(MSSA->getMemoryAccess(&SI));
----------------
wmi wrote:
> chandlerc wrote:
> > It feels like all of this could be factored into an 'insertStore' method? In particular, the clone doesn't seem to buy you much as you rewrite most parts of the store anyways.
> > 
> > This could handle all of the MemorySSA updating, logging, etc.
> I use clone here just to duplicate the subclass data like volatile and ordered.
I still think it will be cleaner to directly construct the store.

Also, I wouldn't expect this pass to be valid for either volatile or ordered stores...


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:742-745
+/// Check if there is no instruction between \p From and \p To which may
+/// clobber the MemoryLocation \p Mem. However, if there are clobbers and
+/// all the clobber instructions between \p From and \p To are in the same
+/// block as \p To, We will set \p AllClobberInToBlock to true.
----------------
wmi wrote:
> chandlerc wrote:
> > There is no comment about the cost of this routine.
> > 
> > It looks *really* expensive. It appears to walk all transitive predecessors of the block containing `To`. So worst case, every basic block in the function. I see this called in several places from inside of for-loops. Is this really a reasonable approach?
> > 
> > Why aren't we just walking the def-use chain from MemorySSA to figure this kind of thing out in a much lower time complexity bound? Like, shouldn't we just be able to walk up defs until we either see a clobber or `From`?
> That is because the instruction `To` here may not be a memory access instruction (it is probably an And or Trunc instruction indicating that only some bits of the input are demanded), so we cannot get a MemoryAccess for it. Note that hasClobberBetween is overloaded and there are two versions; the other version, which walks the MSSA def-use chain, is the one used in several for-loops as you saw. This higher-cost version is not used in a loop. Besides, we only check the MSSA DefList in each BB, so the worst-case complexity is the number of memory access instructions in the function, which is usually much less than the number of instructions in the function.
Given how different the two routines are, I would suggest giving them separate names. As this exchange shows, it wasn't obvious that they were different.

I'm happy to look at the use cases, but this still feels much too expensive to me. In terms of big-O, the fact that it is only memory accesses doesn't really seem to help much. Quadratic in the number of memory accesses is still probably not something we can realistically do.

I want to think more about the algorithm once I see exactly where this is being called.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:946
+                                              bool ReplaceAllUses) {
+  if (!MultiUsesSeen) {
+    // If the Inst has multiple uses and the current shrinking cannot replace
----------------
wmi wrote:
> chandlerc wrote:
> > It would be much more clear for this to be a parameter rather than an implicit parameter via class member. For example, multiple uses *of what*?
> MultiUsesSeen is not changed for every instruction. It records whether a previous instruction on the chain was found to have multiple uses as we walk the chain bottom-up.
> 
> r1 = ...;
> r2 = r1 + r3;
> r4 = r2 + r5;
> 
> 
> If `r2` has multiple uses, neither `r2 = r1 + r3` nor `r1 = ...` can be removed after the shrinking.
Ok, that explanation makes sense, but you'll need to find a way to make this clear from the code itself. =] At the very least, not using a member, but probably with some more helpful variable names, function names, structure or comments.
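One way to make this clear, sketched with invented names on a toy chain structure, is to thread the flag through the walk as a parameter instead of a member: once any instruction on the chain has more than one use, nothing at or above it can be deleted after the shrinking, exactly as in the r1/r2 example above.

```cpp
#include <cassert>

// Toy stand-in for an instruction on the bottom-up chain being walked.
struct ChainNode {
  ChainNode *Op = nullptr; // the single chain predecessor, if any
  unsigned NumUses = 1;
};

// Walk the chain bottom-up, counting how many chain instructions would
// become removable after shrinking. The multi-use state is an explicit
// parameter, not hidden class state.
unsigned countRemovable(ChainNode *N, bool MultiUsesSeen = false) {
  unsigned Removable = 0;
  for (; N; N = N->Op) {
    MultiUsesSeen |= N->NumUses > 1;
    if (!MultiUsesSeen)
      ++Removable; // still on the single-use prefix of the chain
  }
  return Removable;
}
```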


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:967-997
+/// When the input instruction \p IN is and(Val, Cst) or trunc, it indicates
+/// only a portion of its input value has been used. We will walk through the
+/// Def-Use chain, track the range of value which will be used, remember the
+/// operations contributing to the used value range, and skip operations which
+/// changes value range that is not to be used, until a load is found.
+///
+/// So we start from and or trunc operations, then try to find a sequence of
----------------
wmi wrote:
> chandlerc wrote:
> > Rather than re-implementing all of this logic, can you re-use the existing demanded bits facilities in LLVM?
> > 
> > For example, I think you can use the `DemandedBits` analysis, walk all loads in the function, and then narrow them based on the demanded bits it has computed. Because of how `DemandedBits` works, it is both efficient and very powerful. It can handle many more patterns.
> > 
> > Thinking about this, I suspect you'll want to do two passes essentially. First, narrow all the *stores* that you can. This will likely be iterative. Once that finishes, it seems like you'll be able to then do a single walk over the loads with a fresh `DemandedBits` analysis and narrow all of those left. You'll probably want to narrow the stores first because that may make bits stop being demanded. But I don't see any way for the reverse to be true, so there should be a good sequencing.
> > 
> > To make the analysis invalidation stuff easier, you may actually need this to actually be two passes so that the store pass can invalidate the `DemandedBits` analysis, and the load pass can recompute it fresh.
> > 
> > Does that make sense?
> > 
> > If so, I would suggest getting just the store shrinking in this patch, and add the load shrinking in a follow-up patch. I'm happy for them to be implemented in a single file as they are very similar and its good for people to realize they likely want *both* passes.
> I considered the demanded bits facilities before, but I found they only simplify the code a little. Finding the demanded bits inside the load is only a small part of the work; most of the complexity comes from figuring out which ops in the sequence on the Def-Use chain change the demanded bits. For example, if we see shifts, we may clear some demanded bits in less significant positions to zero because we shift right then shift left; because the shifts change the demanded bits, we must include them in the shrunk code sequence. Similarly, if we see an Or(And(Or(And(...)))) pattern, we want to know that the bits changed by Or(And(...)) are disjoint from the demanded bits; only when that is true can we omit the Or(And(...)) pattern from the final shrunk code sequence. Another reason is that demanded bits analysis may not be very cheap. For memory shrinking, only a few patterns like and/trunc are common enough to be useful, so we don't actually need a fully general demanded bits analysis over every instruction.
I don't really understand the argument here...

I would expect the demanded bits facilities to already handle things like shifts changing which bits are demanded, does it not? If not, it seems like we should extend that general facility rather than building an isolated system over here.

Regarding the cost, I'm much more worried about the algorithmic cost of this and the fact that it seems relatively sensitive to things that we don't reliably canonicalize (using trunc or and instructions to remove demanded bits).

Generally speaking, working *backwards* from the and or trunc is going to be much more expensive than working *forwards*.

But even if the existing demanded bits analysis is too expensive, we still shouldn't develop a new cheap one locally here. We should either add a parameter to decrease its cost or add a new general-purpose "fast" demanded bits analysis and then use that here.
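To illustrate the kind of transfer functions involved, here is a sketch of how a demanded-bits style analysis pushes a demand mask backwards through a few ops (plain `uint64_t` masks; the real DemandedBits analysis works on `APInt` and covers many more opcodes):

```cpp
#include <cassert>
#include <cstdint>

// Demanded bits of X given the demanded bits of (X << N): result bit I
// comes from operand bit I - N, so the demand mask shifts right.
uint64_t demandedThroughShl(uint64_t Demanded, unsigned N) {
  return Demanded >> N;
}

// Demanded bits of X given the demanded bits of (X lshr N): result bit I
// comes from operand bit I + N, so the demand mask shifts left.
uint64_t demandedThroughLshr(uint64_t Demanded, unsigned N) {
  return Demanded << N;
}

// Demanded bits of X given the demanded bits of (X & C): bits cleared by
// C are no longer demanded from X.
uint64_t demandedThroughAnd(uint64_t Demanded, uint64_t C) {
  return Demanded & C;
}
```

Chaining these handles patterns like shift-right-then-shift-left without any pattern matching, which is part of why the general facility scales to more input shapes than a local walker.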


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:1183-1184
+  while (true) {
+    if (!tryShrink(Fn) && !removeDeadInsts())
+      break;
+    MadeChange = true;
----------------
wmi wrote:
> chandlerc wrote:
> > Do you want to run `tryShrink` again just because you removed dead instructions?
> > 
> > If so, do you want to remove dead instructions on each iteration instead of just once `tryShrink` doesn't make a change?
> If a dead instruction is removed, another iteration is taken and tryShrink runs again.
> 
> I think there is no difference between running removeDeadInsts only when tryShrink makes no change and running removeDeadInsts every time after tryShrink makes a change.
I guess I'm trying to ask: why will the removal of dead instructions cause shrinking to become more effective? Most of the algorithms here don't seem likely to remove entire classes of uses, and so I'm not sure why this iteration is valuable at all.

But if it is valuable, that is, if removing dead instructions exposes more shrinking opportunities, I would expect that removing dead instructions *earlier* (IE, on each iteration) to cause this to converge faster.
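The interleaved version can be sketched like this; `TryShrink` and `RemoveDeadInsts` stand in for the pass's methods (signatures invented for the sketch):

```cpp
#include <cassert>
#include <functional>

// Iterate to a fixed point, cleaning up dead instructions on *every*
// iteration rather than only once shrinking stops making progress.
bool runToFixedPoint(const std::function<bool()> &TryShrink,
                     const std::function<bool()> &RemoveDeadInsts) {
  bool MadeChange = false;
  bool LocalChange;
  do {
    LocalChange = TryShrink();
    // Unconditional call (no short-circuit): the next shrinking round
    // then sees the reduced use lists immediately.
    LocalChange |= RemoveDeadInsts();
    MadeChange |= LocalChange;
  } while (LocalChange);
  return MadeChange;
}
```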


Repository:
  rL LLVM

https://reviews.llvm.org/D30416




