[PATCH] D30416: [BitfieldShrinking] Shrink Bitfields load/store when the bitfields are legal to access independently

Chandler Carruth via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue May 9 17:33:28 PDT 2017


chandlerc requested changes to this revision.
chandlerc added a comment.
This revision now requires changes to proceed.

Somewhat focused on the store side. Near the bottom is a high-level comment about the load shrinking approach.



================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:417-430
+  // However the pattern matching below is more complex than that because
+  // of the commutation and some matching preference we expect.
+  // for case: or(and(or(and(LI, Cst_1), MaskedVal_1), Cst_2), MaskedVal_2)
+  // and if MaskedVal_2 happens to be another and operator, we hope we can
+  // match or(and(LI, Cst_1), MaskedVal_1) to Val instead of MaskedVal.
+  bool OrAndPattern =
+      match(Val, m_c_Or(m_And(m_Load(LI), m_ConstantInt(Cst)),
----------------
wmi wrote:
> wmi wrote:
> > chandlerc wrote:
> > > Even after reading your comment I'm not sure I understand what is driving the complexity of this match.
> > > 
> > > Can you explain (maybe just here in the review) what patterns you're trying to handle that are motivating this?
> > > 
> > > I'm wondering whether there is any canonicalization that can be leveraged (or added if not already there) to reduce the complexity of the pattern here. Or if we really have to handle this complexity, what the best way to write it and/or comment it so that readers understand the result.
> > I'll borrow the template from your comments to explain: store(or(and(LargeVal, MaskConstant), SmallVal), address)
> > The case is:
> > 
> > store(or_1(and_1(or_2(and_2(load, -65281), Val1), -256), and_3(Val2, 7)))
> > 
> > The two operands of "or_1" are "and_1" and "and_3", but the matcher doesn't know which of the and_1 or and_3 subtrees contains the LargeVal. I want or_2 to be matched as the LargeVal. This is a common pattern after bitfield load/store coalescing.
> > 
> > But while explaining this to you, I realized that I can split the complex pattern matching above into two, which may be simpler:
> > %%% bool OrAndPattern = match(Val, m_c_Or(m_And(m_Value(LargeVal), m_ConstantInt(Cst)), m_Value(SmallVal)));
> > if (match(SmallVal, m_c_And(m_c_Or(m_And(m_Value(), m_Value()), m_Value()), m_Value())))
> >   std::swap(SmallVal, LargeVal);
> > %%%
> I find I still have to keep the original complex pattern. Now I remember where the real difficulty is:
> 
> For a case like store(or_1(and_1(or_2(and_2(load, -65281), Val1), -256), and_3(Val2, 7))), I want to match LargeVal to or_2(and_2(load, ...)).
> 
> But I cannot use match(Val, m_c_Or(m_And(m_c_Or(m_And(...))))) because I have no way to capture the intermediate results of the match; for example, I cannot bind LargeVal to the inner m_c_Or. So I have to split the match into multiple steps. That is where the complexity comes from.
It would be really nice if LLVM would canonicalize in one way or the other so you didn't have to handle so many variations. Asking folks about whether we can/should do anything like that.

But I think the bigger question is, why would only two layers be enough? I feel like there is something more general here that will make explaining everything else much simpler.

Are you looking for a load specifically? Or are you just looking for one side of an `or` which has a "narrow" (after masking) `and`?

If the former, maybe just search for the load?

If the latter, maybe you should just be capturing the two sides of the `or` and, rather than looking *explicitly* for an `and`, compute whether the non-zero bits of one side or the other are "narrow"?
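To illustrate the second option, here's a rough sketch (not from the patch; I'm assuming the `KnownBits`-returning `computeKnownBits` overload and the pass's `DL` member):

  // Capture both sides of the 'or' and decide which one is "narrow" from its
  // known bits, instead of requiring an explicit 'and'.
  Value *LHS, *RHS;
  if (match(Val, m_Or(m_Value(LHS), m_Value(RHS)))) {
    KnownBits LK = computeKnownBits(LHS, *DL);
    KnownBits RK = computeKnownBits(RHS, *DL);
    // Width of the range that may contain non-zero bits on each side.
    unsigned LWidth = LK.getBitWidth() - LK.Zero.countLeadingOnes();
    unsigned RWidth = RK.getBitWidth() - RK.Zero.countLeadingOnes();
    Value *SmallVal = LWidth <= RWidth ? LHS : RHS;
    Value *LargeVal = LWidth <= RWidth ? RHS : LHS;
    // ... shrink the store of LargeVal using SmallVal's narrow range ...
  }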


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:104
+  static char ID; // Pass identification, replacement for typeid
+  MemAccessShrinkingPass(const TargetMachine *TM = nullptr)
+      : FunctionPass(ID), TM(TM) {
----------------
Still should omit the `= nullptr` here since this is an internal type.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:154-157
+  BitRange analyzeOrAndPattern(Value &SmallVal, ConstantInt &Cst,
+                               unsigned BitSize, bool &DoReplacement);
+  BitRange analyzeBOpPattern(Value &Val, ConstantInt &Cst, unsigned BitSize);
+  bool extendBitRange(BitRange &BR, unsigned BitSize, unsigned Align);
----------------
These seem like helper functions that don't actually need to be part of the class at all. Maybe make them static free functions?

(This may be true for some of the other routines in this file.)


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:205-207
+/// \p DoReplacement will contain the result of whether the requirements are
+/// satisfied. We will also return BitRange describing the maximum range of
+/// LargeVal that may be modified.
----------------
It seems much more idiomatic to return the bool indicating whether a valid BitRange was computable, and if true, set up the values in an output parameter.

Or even better, you could return an `Optional<BitRange>`, and return `None` when the requirements aren't satisfied.
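For example, a minimal sketch of the `Optional` shape I mean (the `BitRange` fields follow the patch; the body is illustrative only):

  static Optional<BitRange> analyzeOrAndPattern(const APInt &Mask,
                                                unsigned BitSize) {
    unsigned LeadingOnes = Mask.countLeadingOnes();
    unsigned TrailingOnes = Mask.countTrailingOnes();
    if (LeadingOnes == BitSize)
      return None; // Requirements not satisfied (e.g. an all-ones mask).
    BitRange BR;
    BR.Shift = TrailingOnes;
    BR.Width = BitSize - (LeadingOnes + TrailingOnes);
    return BR;
  }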


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:209
+BitRange MemAccessShrinkingPass::analyzeOrAndPattern(Value &SmallVal,
+                                                     ConstantInt &Cst,
+                                                     unsigned BitSize,
----------------
I know other passes use the variable name `Cst` but I'd suggest using just `C` for generic constants or some more descriptive term like you use elsewhere like `Mask`.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:212-215
+  // Cst is the mask. Analyze the pattern of mask after sext it to uint64_t. We
+  // will handle patterns like either 0..01..1 or 1..10..01..1
+  APInt Mask = Cst.getValue().sextOrTrunc(BitSize);
+  assert(Mask.getBitWidth() == BitSize && "Unexpected bitwidth of Mask");
----------------
Do you really need to extend or truncate here? Surely the type system has already caused the constant to be of the size you want? If so, I'd just assert that here. Maybe you could directly pass a `const APInt &` in as the parameter, letting you call the parameter `Mask` as suggested above?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:216-217
+  assert(Mask.getBitWidth() == BitSize && "Unexpected bitwidth of Mask");
+  unsigned MaskLeadOnes = Mask.countLeadingOnes();
+  unsigned MaskTrailOnes = Mask.countTrailingOnes();
+  unsigned MaskMidZeros = !MaskLeadOnes
----------------
I would call these `MaskLeadingOnes` and `MaskTrailingOnes`.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:218-220
+  unsigned MaskMidZeros = !MaskLeadOnes
+                              ? Mask.countLeadingZeros()
+                              : Mask.ashr(MaskTrailOnes).countTrailingZeros();
----------------
I'm having trouble understanding the logic here in the case where there are leading ones. Here is my reasoning, but maybe I've gotten something wrong here:

Shifting right will remove leading ones, but you're shifting right the number of *trailing* ones... Shouldn't that be *leading ones*? And won't the result of a shift *right* be to place the middle zero sequence at the least significant bit, meaning you would want to count the *leading* zeros?

Put differently, arithmetic shift is required to not change the most significant bit, so doing an arithmetic shift right based on how many ones are trailing, seems like it will never change the count of trailing zeros.

If this is correct, then this is a bug and you should add some test cases that will hit this bug.

But regardless of whether my understanding is correct or there is a bug here, I think this can be written in a more obvious way:

  unsigned MaskMidZeros = BitSize - (MaskLeadingOnes + MaskTrailingOnes);

And then directly testing whether they are all zero:

  if (Mask == ~APInt::getBitsSet(BitSize, MaskTrailingOnes,
                                 MaskTrailingOnes + MaskMidZeros)) {


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:223-225
+  if (MaskLeadOnes == BitSize) {
+    MaskMidZeros = 0;
+    DoReplacement = false;
----------------
Why would we see an all ones mask? Shouldn't have that been eliminated earlier? It seems like we could just bail in this case.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:246
+  // otherwise we need to set ShrinkWithMaskedVal to false and expand BR.
+  if ((KnownZero & BitMask) != BitMask) {
+    DoReplacement = false;
----------------
The idiomatic way to test this with APInt is `BitMask.isSubsetOf(KnownZero)`.

Also, it would be good to use early-exit here. It *sounds* like you are testing whether it is valid to do anything, but that isn't clear when you have set up members of `BR` here before returning.
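Roughly (sketch only, reusing the surrounding names; what the "can't do the replacement" result looks like depends on how you restructure the return value):

  // Early exit: if any bit of BitMask may be non-zero, the masked-value
  // replacement is not valid.
  if (!BitMask.isSubsetOf(KnownZero))
    return ...;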


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:260
+
+/// Analyze pattern or/xor/and(load P, \p Cst). The result of the pattern
+/// is a partially changed load P. The func will return the the range
----------------
When you have alternates, the pattern notation is a bit confusing. I'd just say something like `Analyze <bop>(load P, \p Cst) where <bop> is either 'or', 'xor', or 'and'.`.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:261
+/// Analyze pattern or/xor/and(load P, \p Cst). The result of the pattern
+/// is a partially changed load P. The func will return the the range
+/// showing where the load result are changed.
----------------
This isn't really about whether the original value is loaded or not, right? It is just bounding the changed bits?

I'd explain it that way. You'll mention the load when you use it.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:263-264
+/// showing where the load result are changed.
+BitRange MemAccessShrinkingPass::analyzeBOpPattern(Value &Val, ConstantInt &Cst,
+                                                   unsigned BitSize) {
+  APInt Mask = Cst.getValue().sextOrTrunc(BitSize);
----------------
Maybe a better name of this function would be: `computeBopChangedBitRange`?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:265
+                                                   unsigned BitSize) {
+  APInt Mask = Cst.getValue().sextOrTrunc(BitSize);
+  BinaryOperator *BOP = cast<BinaryOperator>(&Val);
----------------
Same comment above about just asserting the correct bitsize and passing the APInt Mask in directly.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:266
+  APInt Mask = Cst.getValue().sextOrTrunc(BitSize);
+  BinaryOperator *BOP = cast<BinaryOperator>(&Val);
+  if (BOP->getOpcode() == Instruction::And)
----------------
Why not pass the argument as a BinaryOperator?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:267-268
+  BinaryOperator *BOP = cast<BinaryOperator>(&Val);
+  if (BOP->getOpcode() == Instruction::And)
+    Mask = ~Mask;
+
----------------
Might be nice to add a comment explaining the logic here.

Something like:

  Both 'or' and 'xor' operations only mutate when the operand has a one bit.
  But 'and' only mutates when the operand has a zero bit, so invert the
  constant when the instruction is an and so that all the (potentially)
  changed bits are ones in the operand.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:284
+  Type *NewTy = Type::getIntNTy(*Context, NewBR.Width);
+  Type *StoreTy = Type::getIntNTy(*Context, PowerOf2Ceil(BitSize));
+
----------------
Why the `PowerOf2Ceil` here? Will the actual store used have that applied? If the actual store has that applied, why don't we want to consider that as `BitSize` so we're free to use that larger size for the narrow new type?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:286-289
+  // Check if we can find a new Shift for the Width of NewBR, so that
+  // NewBR forms a new range covering the old modified range without
+  // worsening alignment.
+  auto coverOldRange = [&](BitRange &NewBR, BitRange &BR) -> bool {
----------------
As the comment explains, this lambda is actually computing the Shift. But the name seems to indicate it is just a predicate testing whether the old range is covered by the new one.

Also, why does the old `BR` need to be passed in as an argument, isn't that something that can be captured? I actually like passing `NewBR` in here to show that it is what is *changing* between calls to this routine. But it seems awkward to set up `NewBR` before this lambda (which would allow the lambda to implicitly capture it) and then call it with a parameter name that shadows that, thereby avoiding the capture. I'd consider whether you want to sink `NewBR` down or otherwise handle it more cleanly in the loop.

Nit pick: we typically name lambdas like variables with `FooBar` rather than like functions with `fooBar`.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:290
+  auto coverOldRange = [&](BitRange &NewBR, BitRange &BR) -> bool {
+    unsigned MAlign = MinAlign(Align, DL->getABITypeAlignment(NewTy));
+    int Shift = BR.Shift - BR.Shift % (MAlign * 8);
----------------
What about platforms with support for unaligned loads? Probably best to just leave a FIXME rather than adding more to this patch, but it seems nice to mention that technique.

As an example, on x86, if you have a bitfield that looks like:

  struct S {
    unsigned long long a : 48;
    unsigned long long b : 48;
    unsigned long long c : 32;
  };

It seems likely to be substantially better to do a single 8-byte load and mask off the high 2 bytes when accessing `b` than to do two nicely aligned 8-byte loads and all the bit math to recombine things.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:329
+/// mainly based on BR and the endian of the target.
+static unsigned computeBitOffset(BitRange &BR, unsigned BitSize,
+                                 const DataLayout &DL) {
----------------
Pass `BR` by value? (or make it a const reference, but it seems small)


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:331-335
+  unsigned BitOffset;
+  if (DL.isBigEndian())
+    BitOffset = BitSize - BR.Shift - BR.Width;
+  else
+    BitOffset = BR.Shift;
----------------
Generally we prefer early returns to state. That would make this:

  if (...)
    return ...;

  return ...;
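
Concretely, for this function that would just be:

  if (DL.isBigEndian())
    return BitSize - BR.Shift - BR.Width;
  return BR.Shift;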



================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:345-346
+
+/// Check whether V1 and V2 has the same ptr value by looking through bitcast.
+static bool hasSamePtr(Value *V1, Value *V2) {
+  Value *NV1 = nullptr;
----------------
You can replace uses of this with: `V1->stripPointerCasts() == V2->stripPointerCasts()`. This will be more powerful as well.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:370-371
+  MemoryAccess *DefiningAccess = MSSAWalker->getClobberingMemoryAccess(&To);
+  if (FromAccess != DefiningAccess &&
+      MSSA->dominates(FromAccess, DefiningAccess))
+    return true;
----------------
Maybe this is just a strange API on MemorySSA, but typically I wouldn't expect a lack of dominance to indicate that no access between two points exists.

How does MemorySSA model a pattern that looks like:

  From  x 
   \   /
    \ /
     A
     |
     |
     To

Where `A` is a defining access, is between `From` and `To`, but I wouldn't expect `From` to dominate `A` because there is another predecessor `x`.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:393
+/// address \p Ptr, offset \p StOffset and the new type \p NewTy.
+Value *MemAccessShrinkingPass::createNewPtr(Value *Ptr, unsigned StOffset,
+                                            Type *NewTy, IRBuilder<> &Builder) {
----------------
`StOffset` seems an odd name if this is used to create new pointers for loads as well as stores.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:402
+    NewPtr =
+        Builder.CreateGEP(Type::getInt8Ty(*Context), NewPtr, Idx, "uglygep");
+  }
----------------
No need to call it `uglygep`. If you want a clue as to the types, maybe `rawgep` or `bytegep`.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:407-408
+
+/// Check if it is legal to shrink the store if the input LargeVal is a
+/// LoadInst.
+bool MemAccessShrinkingPass::isLegalToShrinkStore(LoadInst &LI, StoreInst &SI) {
----------------
This comment mentions `LargeVal` but that isn't an argument?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:422
+
+/// we try to handle is when a smaller value is "inserted" into some bits
+/// of a larger value. This typically looks something like:
----------------
This reads like a fragment, were there supposed to be more comments before this line?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:557-558
+
+/// Try to extend the existing BitRange \BR to legal integer width.
+bool MemAccessShrinkingPass::findBRWithLegalType(StoreInst &SI, BitRange &BR) {
+  Type *StoreTy = SI.getOperand(0)->getType();
----------------
Keeping this close to `extendBitRange` would make it a lot easier to read. Also, why have two functions at all? It appears this is the only thing calling `extendBitRange`. (I'm OK if there is a reason, just curious what it is.)


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:562-565
+  // Check whether the store is of illegal type. If it is not, don't bother.
+  if (!TLI || TLI->isOperationLegalOrCustom(ISD::STORE,
+                                            TLI->getValueType(*DL, StoreTy)))
+    return false;
----------------
I'm surprised this doesn't just fall out from the logic in `extendBitRange`.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:575
+/// is got from a load or there is an existing store for the LargeVal.
+bool MemAccessShrinkingPass::splitStoreIntoTwo(StoreShrunkInfo &SInfo) {
+  BitRange &BR = SInfo.BR;
----------------
Is `StoreShrunkInfo` buying much here? It seems to mostly be used in arguments, why not just pass the argument directly? The first bit of the code seems to just unpack everything into local variables.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:587-588
+  unsigned Align = SI.getAlignment();
+  if (StoreByteOffset)
+    Align = MinAlign(StoreByteOffset, Align);
+
----------------
Doesn't MinAlign handle 0 correctly so that you can just do this unconditionally?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:593-595
+  // Check if we need to split the original store and generate a new
+  // store for the LargeVal.
+  if (!isa<LoadInst>(SInfo.LargeVal) && needNewStore(*SInfo.LargeVal, SI)) {
----------------
Isn't this only called when we need to insert two stores?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:596-606
+    StoreInst *NewSI = cast<StoreInst>(SI.clone());
+    NewSI->setOperand(0, SInfo.LargeVal);
+    NewSI->setOperand(1, Ptr);
+    Builder.Insert(NewSI);
+    DEBUG(dbgs() << "MemShrink: Insert" << *NewSI << " before" << SI << "\n");
+    // MemorySSA update for the new store.
+    MemoryDef *OldMemAcc = cast<MemoryDef>(MSSA->getMemoryAccess(&SI));
----------------
It feels like all of this could be factored into an 'insertStore' method? In particular, the clone doesn't seem to buy you much as you rewrite most parts of the store anyways.

This could handle all of the MemorySSA updating, logging, etc.
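Roughly (a sketch only; the helper name is mine and the MemorySSA update is elided):

  StoreInst *MemAccessShrinkingPass::insertStore(Value *Val, Value *Ptr,
                                                 unsigned Align,
                                                 StoreInst &OldSI,
                                                 IRBuilder<> &Builder) {
    StoreInst *NewSI = Builder.CreateAlignedStore(Val, Ptr, Align);
    DEBUG(dbgs() << "MemShrink: Insert" << *NewSI << " before" << OldSI
                 << "\n");
    // ... create the MemoryDef for NewSI and update MemorySSA here, exactly
    // as the existing code does for the cloned store ...
    return NewSI;
  }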


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:662-663
+
+  LoadInst *NewLI = cast<LoadInst>(cast<LoadInst>(SInfo.LargeVal)->clone());
+  NewLI->mutateType(NewTy);
+  NewLI->setOperand(0, NewPtr);
----------------
Rather than cloning and mutating, why not just build a new load? The IRBuilder has a helpful API here.
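E.g. (sketch; `NewAlign` stands for whatever alignment the current code computes for the shrunk load):

  LoadInst *NewLI = Builder.CreateAlignedLoad(NewPtr, NewAlign, "load.trunc");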


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:679-687
+    if (auto MVCst = dyn_cast<ConstantInt>(SInfo.SmallVal)) {
+      ModifiedCst = MVCst->getValue().lshr(BR.Shift).trunc(BR.Width);
+      Trunc = ConstantInt::get(*Context, ModifiedCst);
+    } else {
+      Value *ShiftedVal =
+          BR.Shift ? Builder.CreateLShr(SInfo.SmallVal, BR.Shift, "lshr")
+                   : SInfo.SmallVal;
----------------
There is no need to handle constants specially. The constant folder will do all the work for you.
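The generic path already does the right thing when `SInfo.SmallVal` is a `ConstantInt` (sketch using the names already in this function):

  // If SInfo.SmallVal is a ConstantInt, both of these calls fold to constants.
  Value *ShiftedVal = BR.Shift
                          ? Builder.CreateLShr(SInfo.SmallVal, BR.Shift, "lshr")
                          : SInfo.SmallVal;
  Value *Trunc = Builder.CreateTrunc(ShiftedVal, NewTy, "trunc");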


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:690-705
+  } else {
+    BinaryOperator *BOP = cast<BinaryOperator>(Val);
+    switch (BOP->getOpcode()) {
+    default:
+      llvm_unreachable("BOP can only be And/Or/Xor");
+    case Instruction::And:
+      SmallVal = Builder.CreateAnd(NewLI, NewCst, "and.trunc");
----------------
I think the amount of code that is special cased here for one caller of this routine or the other is an indication that there is a better factoring of the code.

If you had load insertion and store insertion factored out, then each caller could cleanly insert the narrow load, compute the narrow store (differently), and then insert it.

Does that make sense? Maybe there is a reason why that doesn't work well?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:729-730
+
+/// AI contains a consecutive run of 1. Check the range \p BR of \p AI are all
+/// 1s.
+static bool isRangeofAIAllOnes(BitRange &BR, APInt &AI) {
----------------
Might read more easily as: "Assuming that \p AI contains a single sequence of bits set to 1, check whether the range \p BR is covered by that sequence."


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:732-733
+static bool isRangeofAIAllOnes(BitRange &BR, APInt &AI) {
+  if (BR.Shift < AI.countTrailingZeros() ||
+      BR.Width + BR.Shift > (AI.getBitWidth() - AI.countLeadingZeros()))
+    return false;
----------------
It seems more obvious to me to test this the other way:

  BR.Shift >= AI.countTrailingZeros() &&
  BR.Shift + BR.Width <= (AI.getBitWidth() - AI.countLeadingZeros())

Is this not equivalent for some reason? (Maybe my brain is off...)

The reason I find this easier to read is because it seems to more directly test: "is the start of the BitRange after the start of the 1s, and is the end of the BitRange before the end of the 1s.".


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:742-745
+/// Check if there is no instruction between \p From and \p To which may
+/// clobber the MemoryLocation \p Mem. However, if there are clobbers and
+/// all the clobber instructions between \p From and \p To are in the same
+/// block as \p To, We will set \p AllClobberInToBlock to true.
----------------
There is no comment about the cost of this routine.

It looks *really* expensive. It appears to walk all transitive predecessors of the block containing `To`. So worst case, every basic block in the function. I see this called in several places from inside of for-loops. Is this really a reasonable approach?

Why aren't we just walking the def-use chain from MemorySSA to figure this kind of thing out in a much lower time complexity bound? Like, shouldn't we just be able to walk up defs until we either see a clobber or `From`?
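Something along these lines is what I'd expect (just a sketch, not tested; I'm assuming the pass keeps an `AA` member, and the exact mod/ref query and the MemoryPhi handling need more thought):

  MemoryAccess *Cur = MSSA->getMemoryAccess(&To)->getDefiningAccess();
  while (Cur != FromAccess && !MSSA->isLiveOnEntryDef(Cur)) {
    auto *MD = dyn_cast<MemoryDef>(Cur);
    if (!MD)
      return true; // Hit a MemoryPhi: give up, or walk all incoming defs.
    Instruction *I = MD->getMemoryInst();
    if (I && (AA->getModRefInfo(I, Mem) & MRI_Mod))
      return true; // Found a clobber of Mem between From and To.
    Cur = MD->getDefiningAccess();
  }
  return false; // Reached From (or the function entry) without a clobber.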


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:762-763
+    for (const BasicBlock *Pred : predecessors(BB)) {
+      if (!BBSet.count(Pred)) {
+        BBSet.insert(Pred);
+        Worklist.push_back(Pred);
----------------
You can just insert -- that will return whether it succeeded in inserting the block.
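i.e.:

  if (BBSet.insert(Pred).second)
    Worklist.push_back(Pred);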


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:770
+  // Check the DefsList inside of each BB.
+  bool hasClobber = false;
+  AllClobberInToBlock = true;
----------------
Naming convention.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:780-781
+        Instruction *Inst = MD->getMemoryInst();
+        if ((FBB == Inst->getParent() && DT->dominates(Inst, &From)) ||
+            (TBB == Inst->getParent() && DT->dominates(&To, Inst)))
+          continue;
----------------
So, each of these `dominates` queries are *incredibly* slow. They require linearly walking every instruction in the basic block (worst case).

Why doesn't MemorySSA handle this for you? (Maybe my comment above about using MemorySSA will obviate this comment though.)


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:946
+                                              bool ReplaceAllUses) {
+  if (!MultiUsesSeen) {
+    // If the Inst has multiple uses and the current shrinking cannot replace
----------------
It would be much more clear for this to be a parameter rather than an implicit parameter via class member. For example, multiple uses *of what*?


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:967-997
+/// When the input instruction \p IN is and(Val, Cst) or trunc, it indicates
+/// only a portion of its input value has been used. We will walk through the
+/// Def-Use chain, track the range of value which will be used, remember the
+/// operations contributing to the used value range, and skip operations which
+/// changes value range that is not to be used, until a load is found.
+///
+/// So we start from and or trunc operations, then try to find a sequence of
----------------
Rather than re-implementing all of this logic, can you re-use the existing demanded bits facilities in LLVM?

For example, I think you can use the `DemandedBits` analysis, walk all loads in the function, and then narrow them based on the demanded bits it has computed. Because of how `DemandedBits` works, it is both efficient and very powerful. It can handle many more patterns.

Thinking about this, I suspect you'll want to do two passes essentially. First, narrow all the *stores* that you can. This will likely be iterative. Once that finishes, it seems like you'll be able to then do a single walk over the loads with a fresh `DemandedBits` analysis and narrow all of those left. You'll probably want to narrow the stores first because that may make bits stop being demanded. But I don't see any way for the reverse to be true, so there should be a good sequencing.

To make the analysis invalidation stuff easier, you may actually need this to actually be two passes so that the store pass can invalidate the `DemandedBits` analysis, and the load pass can recompute it fresh.

Does that make sense?

If so, I would suggest getting just the store shrinking into this patch, and adding the load shrinking in a follow-up patch. I'm happy for them to be implemented in a single file as they are very similar, and it's good for people to realize they likely want *both* passes.
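For a rough idea of what the load side could look like with `DemandedBits` (sketch only; the legality and profitability checks are elided):

  DemandedBits &DB = getAnalysis<DemandedBitsWrapperPass>().getDemandedBits();
  for (Instruction &I : instructions(Fn)) {
    auto *LI = dyn_cast<LoadInst>(&I);
    if (!LI || !LI->getType()->isIntegerTy())
      continue;
    APInt Demanded = DB.getDemandedBits(LI);
    if (Demanded.isAllOnesValue())
      continue; // Every bit is used; nothing to shrink.
    unsigned Shift = Demanded.countTrailingZeros();
    unsigned Width =
        Demanded.getBitWidth() - Shift - Demanded.countLeadingZeros();
    // Shrink LI to a Width-bit load at bit offset Shift if that is legal
    // and profitable for the target.
  }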


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:998
+///
+bool MemAccessShrinkingPass::tryShrinkLoadUse(Instruction &IN) {
+  // If the instruction is actually dead, skip it.
----------------
`Inst` would seem like a more common variable name here.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:1144-1145
+  bool MadeChange = false;
+  for (BasicBlock *BB : post_order(&Fn)) {
+    for (auto InstI = BB->rbegin(), InstE = BB->rend(); InstI != InstE;) {
+      Instruction &Inst = *InstI++;
----------------
Comments would be good explaining the very particular iteration order.


================
Comment at: lib/CodeGen/MemAccessShrinking.cpp:1183-1184
+  while (true) {
+    if (!tryShrink(Fn) && !removeDeadInsts())
+      break;
+    MadeChange = true;
----------------
Do you want to run `tryShrink` again just because you removed dead instructions?

If so, do you want to remove dead instructions on each iteration instead of just once `tryShrink` doesn't make a change?


Repository:
  rL LLVM

https://reviews.llvm.org/D30416




