[PATCH] D30416: [BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently

Wei Mi via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed May 10 10:43:56 PDT 2017


wmi added a comment.

Chandler, thanks for the comments. They are very helpful, and I will address them in the next revision. Below I have replied only to the comments about which I had questions or concerns.



================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:104
+  static char ID; // Pass identification, replacement for typeid
+  MemAccessShrinkingPass(const TargetMachine *TM = nullptr)
+      : FunctionPass(ID), TM(TM) {
----------------
chandlerc wrote:
> Still should omit the `= nullptr` here since this is an internal type.
I cannot omit it: in INITIALIZE_TM_PASS_END, callDefaultCtor<PassName> requires the parameter to have a default value.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:218-220
+  unsigned MaskMidZeros = !MaskLeadOnes
+                              ? Mask.countLeadingZeros()
+                              : Mask.ashr(MaskTrailOnes).countTrailingZeros();
----------------
chandlerc wrote:
> I'm having trouble understanding the logic here in the case where there are leading ones. Here is my reasoning, but maybe I've gotten something wrong here:
> 
> Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?
> 
> Put differently, arithmetic shift is required to not change the most significant bit, so doing an arithmetic shift right based on how many ones are trailing, seems like it will never change the count of trailing zeros.
> 
> If this is correct, then this is a bug and you should add some test cases that will hit this bug.
> 
> But regardless of whether my understanding is correct or there is a bug here, I think this can be written in a more obvious way:
> 
>   unsigned MaskMidZeros = BitSize - (MaskLeadingOnes + MaskTrailingOnes);
> 
> And then directly testing whether they are all zero:
> 
>   if (Mask == APInt::getBitsSet(BitSize, MaskLeadingOnes,
>                                 MaskLeadingOnes + MaskMidZeros)) {
> Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?

I think shifting right removes *trailing* ones? And after the shift (Mask.ashr(MaskTrailOnes)), the middle zeros sit at the least significant bits, so they are trailing zeros, right?

But as you said, I should rule out the all-zero/all-one cases separately so the logic becomes clearer.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:370-371
+  MemoryAccess *DefiningAccess = MSSAWalker->getClobberingMemoryAccess(&To);
+  if (FromAccess != DefiningAccess &&
+      MSSA->dominates(FromAccess, DefiningAccess))
+    return true;
----------------
chandlerc wrote:
> Maybe this is just a strange API on MemorySSA, but typically I wouldn't expect a lack of dominance to indicate that no access between two points exists.
> 
> How does MemorySSA model a pattern that looks like:
> 
>   From  x 
>    \   /
>     \ /
>      A
>      |
>      |
>      To
> 
> Where `A` is a defining access, is between `From` and `To`, but I wouldn't expect `From` to dominate `A` because there is another predecessor `x`.
That case cannot happen because we ensure `From` dominates `To` before calling the function. You are right that it is better to add an assertion at the entry of the function to prevent misuse of the API.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:596-606
+    StoreInst *NewSI = cast<StoreInst>(SI.clone());
+    NewSI->setOperand(0, SInfo.LargeVal);
+    NewSI->setOperand(1, Ptr);
+    Builder.Insert(NewSI);
+    DEBUG(dbgs() << "MemShrink: Insert" << *NewSI << " before" << SI << "\n");
+    // MemorySSA update for the new store.
+    MemoryDef *OldMemAcc = cast<MemoryDef>(MSSA->getMemoryAccess(&SI));
----------------
chandlerc wrote:
> It feels like all of this could be factored into an 'insertStore' method? In particular, the clone doesn't seem to buy you much as you rewrite most parts of the store anyways.
> 
> This could handle all of the MemorySSA updating, logging, etc.
I use clone here just to duplicate the subclass data, such as the volatile and atomic-ordering flags.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:742-745
+/// Check if there is no instruction between \p From and \p To which may
+/// clobber the MemoryLocation \p Mem. However, if there are clobbers and
+/// all the clobber instructions between \p From and \p To are in the same
+/// block as \p To, We will set \p AllClobberInToBlock to true.
----------------
chandlerc wrote:
> There is no comment about the cost of this routine.
> 
> It looks *really* expensive. It appears to walk all transitive predecessors of the block containing `To`. So worst case, every basic block in the function. I see this called in several places from inside of for-loops. Is this really a reasonable approach?
> 
> Why aren't we just walking the def-use chain from MemorySSA to figure this kind of thing out in a much lower time complexity bound? Like, shouldn't we just be able to walk up defs until we either see a clobber or `From`?
That is because the instruction `To` here may not be a memory access instruction (it is probably an And or Trunc instruction indicating that only some bits of the input are demanded), so we cannot get a MemoryAccess for it. Note that hasClobberBetween is overloaded and there are two versions. The other version, which walks the MSSA def-use chain, is the one used in the several for-loops you saw; this higher-cost version is not called inside a loop. Besides, we only check the MSSA def list in each BB, so the worst-case complexity is bounded by the number of memory access instructions in the function, which is usually much smaller than the total number of instructions.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:946
+                                              bool ReplaceAllUses) {
+  if (!MultiUsesSeen) {
+    // If the Inst has multiple uses and the current shrinking cannot replace
----------------
chandlerc wrote:
> It would be much more clear for this to be a parameter rather than an implicit parameter via class member. For example, multiple uses *of what*?
MultiUsesSeen is not changed for every instruction. It records whether a previous instruction on the chain was found to have multiple uses as we walk the chain bottom-up.

r1 = ...;
r2 = r1 + r3;
r4 = r2 + r5;


If `r2` has multiple uses, then neither `r2 = r1 + r3` nor `r1 = ...` can be removed after the shrinking.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:967-997
+/// When the input instruction \p IN is and(Val, Cst) or trunc, it indicates
+/// only a portion of its input value has been used. We will walk through the
+/// Def-Use chain, track the range of value which will be used, remember the
+/// operations contributing to the used value range, and skip operations which
+/// changes value range that is not to be used, until a load is found.
+///
+/// So we start from and or trunc operations, then try to find a sequence of
----------------
chandlerc wrote:
> Rather than re-implementing all of this logic, can you re-use the existing demanded bits facilities in LLVM?
> 
> For example, I think you can use the `DemandedBits` analysis, walk all loads in the function, and then narrow them based on the demanded bits it has computed. Because of how `DemandedBits` works, it is both efficient and very powerful. It can handle many more patterns.
> 
> Thinking about this, I suspect you'll want to do two passes essentially. First, narrow all the *stores* that you can. This will likely be iterative. Once that finishes, it seems like you'll be able to then do a single walk over the loads with a fresh `DemandedBits` analysis and narrow all of those left. You'll probably want to narrow the stores first because that may make bits stop being demanded. But I don't see any way for the reverse to be true, so there should be a good sequencing.
> 
> To make the analysis invalidation stuff easier, you may actually need this to actually be two passes so that the store pass can invalidate the `DemandedBits` analysis, and the load pass can recompute it fresh.
> 
> Does that make sense?
> 
> If so, I would suggest getting just the store shrinking in this patch, and add the load shrinking in a follow-up patch. I'm happy for them to be implemented in a single file as they are very similar and its good for people to realize they likely want *both* passes.
I considered the demanded bits facilities before, but I found they would only simplify the code a little. Finding the demanded bits of the load is only a small part of the work; most of the complexity comes from figuring out which ops in the sequence on the Def-Use chain change the demanded bits. For example, if we see shifts, we may clear some demanded bits in less significant positions to zeros because we shift right and then shift left; since that changes the demanded bits, we must include the shifts in the shrunk code sequence. Similarly, if we see an Or(And(Or(And(...)))) pattern, we want to know that the bits changed by Or(And(...)) are disjoint from the demanded bits; only when that is true can we omit the Or(And(...)) pattern from the final shrunk code sequence. Another reason is that demanded bits analysis may not be very cheap. For memory shrinking, only a few patterns, like and/trunc, are commonly useful, so we don't actually need a fully general demanded bits analysis of every instruction.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:1183-1184
+  while (true) {
+    if (!tryShrink(Fn) && !removeDeadInsts())
+      break;
+    MadeChange = true;
----------------
chandlerc wrote:
> Do you want to run `tryShrink` again just because you removed dead instructions?
> 
> If so, do you want to remove dead instructions on each iteration instead of just once `tryShrink` doesn't make a change?
If a dead instruction is removed, another iteration is taken and tryShrink runs again.

I think there is no difference between running removeDeadInsts only when tryShrink makes no change and running removeDeadInsts every time after tryShrink makes a change.


Repository:
  rL LLVM

https://reviews.llvm.org/D30416




