[PATCH] D30416: [BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently

Wei Mi via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed May 10 22:37:44 PDT 2017


wmi added inline comments.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:596-606
+    StoreInst *NewSI = cast<StoreInst>(SI.clone());
+    NewSI->setOperand(0, SInfo.LargeVal);
+    NewSI->setOperand(1, Ptr);
+    Builder.Insert(NewSI);
+    DEBUG(dbgs() << "MemShrink: Insert" << *NewSI << " before" << SI << "\n");
+    // MemorySSA update for the new store.
+    MemoryDef *OldMemAcc = cast<MemoryDef>(MSSA->getMemoryAccess(&SI));
----------------
chandlerc wrote:
> wmi wrote:
> > chandlerc wrote:
> > > It feels like all of this could be factored into an 'insertStore' method? In particular, the clone doesn't seem to buy you much as you rewrite most parts of the store anyways.
> > > 
> > > This could handle all of the MemorySSA updating, logging, etc.
> > I use clone here just to duplicate the subclass data like volatile and ordered.
> I still think it will be cleaner to directly construct the store.
> 
> Also, I wouldn't expect this pass to be valid for either volatile or ordered stores...
Ok, I will change it to use IRBuilder.
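
For reference, a minimal sketch of what the IRBuilder-based construction could look like (SInfo.LargeVal, Ptr, Builder, and SI are names from the snippet above; the helper name, the DEBUG_TYPE string, and the flag handling are my assumptions, on the premise that volatile/ordered stores are rejected earlier in the pass):

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/Support/Debug.h"
  #define DEBUG_TYPE "memaccessshrink" // assumed; reuse the pass's own macro
  using namespace llvm;

  // Hypothetical helper: build the replacement store directly instead of
  // cloning the original and rewriting its operands. Since volatile/ordered
  // stores were rejected earlier, alignment is the only subclass data that
  // still needs to be carried over.
  static StoreInst *insertShrunkStore(IRBuilder<> &Builder, Value *LargeVal,
                                      Value *Ptr, StoreInst &SI) {
    StoreInst *NewSI = Builder.CreateStore(LargeVal, Ptr);
    NewSI->setAlignment(SI.getAlignment());
    DEBUG(dbgs() << "MemShrink: Insert" << *NewSI << " before" << SI << "\n");
    return NewSI;
  }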


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:967-997
+/// When the input instruction \p IN is an and(Val, Cst) or a trunc, it
+/// indicates that only a portion of its input value is used. We walk the
+/// Def-Use chain, track the range of bits that will be used, remember the
+/// operations contributing to the used range, and skip operations that only
+/// change bits outside the used range, until a load is found.
+///
+/// So we start from the and or trunc operations, then try to find a sequence of
----------------
chandlerc wrote:
> wmi wrote:
> > chandlerc wrote:
> > > Rather than re-implementing all of this logic, can you re-use the existing demanded bits facilities in LLVM?
> > > 
> > > For example, I think you can use the `DemandedBits` analysis, walk all loads in the function, and then narrow them based on the demanded bits it has computed. Because of how `DemandedBits` works, it is both efficient and very powerful. It can handle many more patterns.
> > > 
> > > Thinking about this, I suspect you'll want to do two passes essentially. First, narrow all the *stores* that you can. This will likely be iterative. Once that finishes, it seems like you'll be able to then do a single walk over the loads with a fresh `DemandedBits` analysis and narrow all of those left. You'll probably want to narrow the stores first because that may make bits stop being demanded. But I don't see any way for the reverse to be true, so there should be a good sequencing.
> > > 
> > > To make the analysis invalidation stuff easier, you may actually need this to actually be two passes so that the store pass can invalidate the `DemandedBits` analysis, and the load pass can recompute it fresh.
> > > 
> > > Does that make sense?
> > > 
> > > If so, I would suggest getting just the store shrinking in this patch, and add the load shrinking in a follow-up patch. I'm happy for them to be implemented in a single file as they are very similar and its good for people to realize they likely want *both* passes.
> > I considered the demanded bits facilities before, but I found they would only simplify the code a little. Finding the demanded bits of the load is only a small part of the work. Most of the complexity comes from figuring out which ops in the sequence on the Def-Use chain change the demanded bits. For example, when we see shifts, we may clear some demanded bits in less significant positions to zero because we shift right and then shift left; because the shifts change the demanded bits, we must include them in the shrunk code sequence. Likewise, when we see an Or(And(Or(And(...)))) pattern, we want to know that the bits changed by the Or(And(...)) are disjoint from the demanded bits; only when that is true can we omit the Or(And(...)) pattern from the final shrunk code sequence. Another reason is that demanded bits analysis may not be very cheap. For memory shrinking, only a few patterns like and/trunc are commonly useful, so we don't actually need a fully general demanded bits analysis over every instruction.
> I don't really understand the argument here...
> 
> I would expect the demanded bits facilities to already handle things like shifts changing which bits are demanded, does it not? If not, it seems like we should extend that general facility rather than building an isolated system over here.
> 
> Regarding the cost, I'm much more worried about the algorithmic cost of this and the fact that it seems relatively sensitive to things that we don't reliably canonicalize (using trunc or and instructions to remove demanded bits).
> 
> Generally speaking, working *backwards* from the and or trunc is going to be much more expensive than working *forwards*.
> 
> But even if the existing demanded bits analysis is too expensive, we still shouldn't develop a new cheap one locally here. We should either add a parameter to decrease its cost or add a new general-purpose "fast" demanded bits analysis and then use that here.
> I would expect the demanded bits facilities to already handle things like shifts changing which bits are demanded, does it not?

Yes, the demanded bits facilities can adjust which bits are demanded when there is a shift. However, which bits are demanded is not the only thing I need to know. I also need to know whether an operation in the Def-Use chain effectively changes the value of the demanded bits. If an operation changes the value of the demanded bits, it must be shrunk together with the load; if it only changes bits other than the demanded ones, it can be omitted. That is what most of the pattern matching in the load shrinking does, and that part of the work I think cannot be fulfilled by a demanded bits analysis.
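
To make the disjointness argument concrete, here is a toy C++ model (my own example, not from the patch). The and/or pair in f1 writes only bits [0,16), while the return value demands bits [16,32); because the two ranges are disjoint, the and/or can be omitted and the access narrowed, as in f2. If the masks overlapped the demanded range, the and/or would have to be shrunk together with the load instead.

  #include <cstdint>

  // f1 mimics the unshrunk read-modify sequence; f2 is the narrowed form.
  uint16_t f1(const uint64_t *P, uint64_t NewLow) {
    uint64_t V = (*P & ~0xFFFFull) | (NewLow & 0xFFFFull); // touches [0,16)
    return static_cast<uint16_t>(V >> 16);                 // demands [16,32)
  }

  uint16_t f2(const uint64_t *P) {
    return static_cast<uint16_t>(*P >> 16); // same result, no and/or needed
  }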


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:1183-1184
+  while (true) {
+    if (!tryShrink(Fn) && !removeDeadInsts())
+      break;
+    MadeChange = true;
----------------
chandlerc wrote:
> wmi wrote:
> > chandlerc wrote:
> > > Do you want to run `tryShrink` again just because you removed dead instructions?
> > > 
> > > If so, do you want to remove dead instructions on each iteration instead of just once `tryShrink` doesn't make a change?
> > If a dead instruction is removed, another iteration will be taken and tryShrink will run again.
> > 
> > I think it makes no difference between `running removeDeadInsts only when tryShrink makes no change` and `running removeDeadInsts every time after tryShrink makes a change`.
> I guess I'm trying to ask: why will the removal of dead instructions cause shrinking to become more effective? Most of the algorithms here don't seem likely to remove entire classes of uses, and so I'm not sure why this iteration is valuable at all.
> 
> But if it is valuable, that is, if removing dead instructions exposes more shrinking opportunities, I would expect that removing dead instructions *earlier* (IE, on each iteration) to cause this to converge faster.

> why will the removal of dead instructions cause shrinking to become more effective?

After removing some dead instructions, some defs with multiple uses become single-use defs, which increases the benefit of doing more shrinking.

> if removing dead instructions exposes more shrinking opportunities, I would expect that removing dead instructions *earlier* (IE, on each iteration) to cause this to converge faster.

OK, then I will run removeDeadInsts after tryShrink in every iteration.
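
A sketch of the revised driver loop (tryShrink, removeDeadInsts, Fn, and MadeChange are names from the snippet above; the exact shape is an assumption):

  bool MadeChange = false;
  while (true) {
    // Shrink, then clean up dead instructions in the same iteration so that
    // multi-use defs become single-use before the next tryShrink run.
    bool LocalChange = tryShrink(Fn);
    LocalChange |= removeDeadInsts();
    if (!LocalChange)
      break;
    MadeChange = true;
  }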


Repository:
  rL LLVM

https://reviews.llvm.org/D30416
