[llvm] r227149 - Commoning of target specific load/store intrinsics in Early CSE.

Hal Finkel hfinkel at anl.gov
Thu Feb 26 14:33:56 PST 2015


Hi Sanjin, et al.,

Since I've now done a lot of typing in support of this commit, I also need to be the bearer of bad news: There is a bug...

The problem is that when we process a store intrinsic, we add its pointer and value to the AvailableLoads map, just as we do for a regular store instruction:

        // Okay, we just invalidated anything we knew about loaded values.  Try
        // to salvage *something* by remembering that the stored value is a live
        // version of the pointer.  It is safe to forward from volatile stores
        // to non-volatile loads, so we don't have to check for volatility of
        // the store.
        AvailableLoads.insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
                                                    Inst, CurrentGeneration));

If we then see a later load intrinsic, we look up its pointer in AvailableLoads to see if the value is already available:

      // If we have an available version of this load, and if it is the right
      // generation, replace this instruction.
      std::pair<Value *, unsigned> InVal =
          AvailableLoads.lookup(MemInst.getPtr());
      if (InVal.first != nullptr && InVal.second == CurrentGeneration) {
        Value *Op = getOrCreateResult(InVal.first, Inst->getType());
        if (Op != nullptr) {
          DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
                       << "  to: " << *InVal.first << '\n');
          if (!Inst->use_empty())
            Inst->replaceAllUsesWith(Op);
          Inst->eraseFromParent();
      ...

and if not, we then add that pointer/value to the map of available loads:

      // Otherwise, remember that we have this instruction.
      AvailableLoads.insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
                                                  Inst, CurrentGeneration));

but this process completely ignores the 'MatchingId' of the intrinsic being processed. As a result, we might CSE loads of the same pointer that use incompatible intrinsics, and we might do store-to-load forwarding between store and load intrinsics that are not logical inverses of each other.
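
To make this concrete, here is a hypothetical snippet (not taken from the committed tests; %a, %b and %p are just placeholders) showing the kind of pattern a pointer-only key cannot distinguish. With only %p as the lookup key, the ld3 below is matched against the entry recorded for the st2, and we end up relying on the target's getOrCreateResultFromMemIntrinsic to reject the mismatch rather than on the MatchingId check:

      call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %a, <4 x i32> %b, i8* %p)
      %vld3 = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8(i8* %p)

(The test_nocse2 test in the patch covers essentially this case, and it appears to pass only because the AArch64 implementation notices the element-count mismatch when building the result, not because the MatchingIds differ.)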

One possible solution here is to change AvailableLoads to use std::pair<Value *, unsigned /*MatchingId*/> as the key type, instead of just Value*. But since the intrinsic case is relatively rare, we might want to keep a side map of (Value* -> MatchingId), and simply record in the AvailableLoads map whether or not there is some relevant MatchingId to fetch (I'm somewhat worried that just changing the key type to something more complicated will slow down the code for the benefit of a rare case).
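
To sketch what the side-map option might look like (rough, untested pseudo-code; 'MatchingIds' is a made-up name, and the scoping/invalidation that AvailableLoads gets from its ScopedHashTable is glossed over here):

      // Hypothetical side table from pointer to the recorded MatchingId
      // (regular loads/stores would record -1).
      DenseMap<Value *, int> MatchingIds;

      // When recording a load or a store:
      AvailableLoads.insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
                                                  Inst, CurrentGeneration));
      MatchingIds[MemInst.getPtr()] = MemInst.getMatchingId();

      // When looking up a later load, also require the ids to agree:
      std::pair<Value *, unsigned> InVal =
          AvailableLoads.lookup(MemInst.getPtr());
      if (InVal.first != nullptr && InVal.second == CurrentGeneration &&
          MatchingIds.lookup(MemInst.getPtr()) == MemInst.getMatchingId()) {
        // ... proceed with getOrCreateResult() and the replacement as before ...
      }

Whichever form the fix takes, both the load-CSE path and the store-to-load forwarding path need to consult the id.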

In any case, this needs to be fixed.

 -Hal

----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Chandler Carruth" <chandlerc at google.com>
> Cc: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Tuesday, January 27, 2015 6:07:57 AM
> Subject: Re: [llvm] r227149 - Commoning of target specific load/store	intrinsics in Early CSE.
> 
> ----- Original Message -----
> > From: "Chandler Carruth" <chandlerc at google.com>
> > To: "Chad Rosier" <mcrosier at codeaurora.org>
> > Cc: "Commit Messages and Patches for LLVM"
> > <llvm-commits at cs.uiuc.edu>
> > Sent: Tuesday, January 27, 2015 2:24:19 AM
> > Subject: Re: [llvm] r227149 - Commoning of target specific
> > load/store	intrinsics in Early CSE.
> > 
> > 
> > 
> > I know it's a pain to have post-commit design review, but it's
> > impossible to catch everything pre-commit.
> > 
> > 
> > On Mon, Jan 26, 2015 at 2:51 PM, Chad Rosier <mcrosier at codeaurora.org> wrote:
> > 
> > 
> > Author: mcrosier
> > Date: Mon Jan 26 16:51:15 2015
> > New Revision: 227149
> > 
> > URL: http://llvm.org/viewvc/llvm-project?rev=227149&view=rev
> > Log:
> > Commoning of target specific load/store intrinsics in Early CSE.
> > 
> > Phabricator revision: http://reviews.llvm.org/D7121
> > Patch by Sanjin Sijaric <ssijaric at codeaurora.org>!
> > 
> > 
> > 
> > The thing I completely missed when skimming the subject was the fact
> > that this is not just target-specific, but leveraging TTI...
> > 
> > 
> > So, first question: why is this desirable? Why do we need this? There
> > seems to be no real justification for this in the commit log,
> > comments, or anywhere.
> > 
> > 
> > EarlyCSE is a very high-level pass. I don't know why it is reasonable
> > to teach it how to deal with every kind of target-specific load
> > intrinsic we come up with. This is really contrary to the entire
> > IR's design.
> > 
> > 
> > The point of TTI was to expose *cost models* from the code generator
> > to the IR pass. This is something completely different and subverts
> > that design in a really fundamental way. I'm pretty opposed to this
> > entire direction unless there is some deep and fundamental reason
> > why we *have* to do this and so far I'm not seeing it. I could
> > speculate about any number of other ways you might solve similar or
> > related problems, but it seems pointless until there is a clear and
> > precise description of the problem this was intended to solve.
> > 
> > So on multiple levels I feel like this is not the right design.
> > 
> 
> I disagree, and furthermore, I'd like to encourage the general design
> paradigm explored by this change:
> 
>  1. I don't believe that TTI is or needs to be *just* about cost
>  modeling. It should be, generically, our mechanism to enforce a
>  proper separation of concerns between the IR-level optimizers and
>  the providers of target-specific knowledge. Cost modeling is one
>  area where this is important, but handling of target-specific
>  intrinsics is another.
> 
>  2. Currently, several IR-level passes, InstCombine, BasicAA, etc.
>  have built-in knowledge of some target-specific intrinsics. IMHO,
>  this is a pretty-awful design, but we also did not have any
>  alternative when the code was first written. Now we do, and we can
>  refactor this code so that knowledge regarding the properties of
>  target-specific intrinsics lives in the targets, as it should. In
>  addition to yielding a better separation of concerns, and being
>  friendlier to out-of-tree targets, it also lowers the bar for
>  extending this knowledge (when it must live in the IR transforms,
>  the benefit needs to feel really large to justify the clutter).
> 
>  3. Forcing all optimizations on target-specific intrinsics, or their
>  associated instructions, into the backend will generically produce
>  sub-optimal code. The SDAG/MI-level optimizers are not as powerful
>  as the IR-level ones (MachineLICM does not track alias sets, for
>  example), and you can't recover completely at that level. Maybe one
>  day this will be different, but I'm not sure providing parity here
>  is a worthwhile aim.
> 
> With this change, there is something of a conceptual completeness
> problem because the same capabilities have not been added to GVN.
> But touching GVN has a much higher cost, and I'm fine with letting
> this new interface settle a bit and leaving GVN changes for some
> later time (if ever).
> 
> Thanks again,
> Hal
> 
> > 
> > -Chandler
> > 
> > 
> > 
> > 
> > Added:
> > llvm/trunk/test/Transforms/EarlyCSE/AArch64/
> > llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll
> > llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg
> > Modified:
> > llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h
> > llvm/trunk/lib/Analysis/TargetTransformInfo.cpp
> > llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
> > llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp
> > 
> > Modified: llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h
> > URL:
> > http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h?rev=227149&r1=227148&r2=227149&view=diff
> > ==============================================================================
> > --- llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h (original)
> > +++ llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h Mon Jan 26 16:51:15 2015
> > @@ -23,6 +23,7 @@
> > #define LLVM_ANALYSIS_TARGETTRANSFORMINFO_H
> > 
> > #include "llvm/IR/Intrinsics.h"
> > +#include "llvm/IR/IntrinsicInst.h"
> > #include "llvm/Pass.h"
> > #include "llvm/Support/DataTypes.h"
> > 
> > @@ -35,6 +36,20 @@ class Type;
> > class User;
> > class Value;
> > 
> > +/// \brief Information about a load/store intrinsic defined by the target.
> > +struct MemIntrinsicInfo {
> > + MemIntrinsicInfo()
> > + : ReadMem(false), WriteMem(false), Vol(false), MatchingId(0),
> > + NumMemRefs(0), PtrVal(nullptr) {}
> > + bool ReadMem;
> > + bool WriteMem;
> > + bool Vol;
> > + // Same Id is set by the target for corresponding load/store intrinsics.
> > + unsigned short MatchingId;
> > + int NumMemRefs;
> > + Value *PtrVal;
> > +};
> > +
> > /// TargetTransformInfo - This pass provides access to the codegen
> > /// interfaces that are needed for IR-level transformations.
> > class TargetTransformInfo {
> > @@ -443,6 +458,20 @@ public:
> > /// any callee-saved registers, so would require a spill and fill.
> > virtual unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type*> Tys) const;
> > 
> > + /// \returns True if the intrinsic is a supported memory intrinsic. Info
> > + /// will contain additional information - whether the intrinsic may write
> > + /// or read to memory, volatility and the pointer. Info is undefined
> > + /// if false is returned.
> > + virtual bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > + MemIntrinsicInfo &Info) const;
> > +
> > + /// \returns A value which is the result of the given memory intrinsic. New
> > + /// instructions may be created to extract the result from the given intrinsic
> > + /// memory operation. Returns nullptr if the target cannot create a result
> > + /// from the given intrinsic.
> > + virtual Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > + Type *ExpectedType) const;
> > +
> > /// @}
> > 
> > /// Analysis group identification.
> > 
> > Modified: llvm/trunk/lib/Analysis/TargetTransformInfo.cpp
> > URL:
> > http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Analysis/TargetTransformInfo.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > ==============================================================================
> > --- llvm/trunk/lib/Analysis/TargetTransformInfo.cpp (original)
> > +++ llvm/trunk/lib/Analysis/TargetTransformInfo.cpp Mon Jan 26 16:51:15 2015
> > @@ -254,6 +254,16 @@ unsigned TargetTransformInfo::getCostOfK
> > return PrevTTI->getCostOfKeepingLiveOverCall(Tys);
> > }
> > 
> > +Value *TargetTransformInfo::getOrCreateResultFromMemIntrinsic(
> > + IntrinsicInst *Inst, Type *ExpectedType) const {
> > + return PrevTTI->getOrCreateResultFromMemIntrinsic(Inst, ExpectedType);
> > +}
> > +
> > +bool TargetTransformInfo::getTgtMemIntrinsic(IntrinsicInst *Inst,
> > + MemIntrinsicInfo &Info) const {
> > + return PrevTTI->getTgtMemIntrinsic(Inst, Info);
> > +}
> > +
> > namespace {
> > 
> > struct NoTTI final : ImmutablePass, TargetTransformInfo {
> > @@ -656,6 +666,15 @@ struct NoTTI final : ImmutablePass, Targ
> > return 0;
> > }
> > 
> > + bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > + MemIntrinsicInfo &Info) const override {
> > + return false;
> > + }
> > +
> > + Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > + Type *ExpectedType) const override {
> > + return nullptr;
> > + }
> > };
> > 
> > } // end anonymous namespace
> > 
> > Modified:
> > llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
> > URL:
> > http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > ==============================================================================
> > --- llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (original)
> > +++ llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp Mon Jan 26 16:51:15 2015
> > @@ -44,6 +44,12 @@ class AArch64TTI final : public Immutabl
> > /// are set if the result needs to be inserted and/or extracted from vectors.
> > unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract) const;
> > 
> > + enum MemIntrinsicType {
> > + VECTOR_LDST_TWO_ELEMENTS,
> > + VECTOR_LDST_THREE_ELEMENTS,
> > + VECTOR_LDST_FOUR_ELEMENTS
> > + };
> > +
> > public:
> > AArch64TTI() : ImmutablePass(ID), TM(nullptr), ST(nullptr),
> > TLI(nullptr) {
> > llvm_unreachable("This pass cannot be directly constructed");
> > @@ -131,6 +137,11 @@ public:
> > void getUnrollingPreferences(const Function *F, Loop *L,
> > UnrollingPreferences &UP) const override;
> > 
> > + Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > + Type *ExpectedType) const override;
> > +
> > + bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > + MemIntrinsicInfo &Info) const override;
> > 
> > /// @}
> > };
> > @@ -554,3 +565,83 @@ void AArch64TTI::getUnrollingPreferences
> > // Disable partial & runtime unrolling on -Os.
> > UP.PartialOptSizeThreshold = 0;
> > }
> > +
> > +Value *AArch64TTI::getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > + Type *ExpectedType) const {
> > + switch (Inst->getIntrinsicID()) {
> > + default:
> > + return nullptr;
> > + case Intrinsic::aarch64_neon_st2:
> > + case Intrinsic::aarch64_neon_st3:
> > + case Intrinsic::aarch64_neon_st4: {
> > + // Create a struct type
> > + StructType *ST = dyn_cast<StructType>(ExpectedType);
> > + if (!ST)
> > + return nullptr;
> > + unsigned NumElts = Inst->getNumArgOperands() - 1;
> > + if (ST->getNumElements() != NumElts)
> > + return nullptr;
> > + for (unsigned i = 0, e = NumElts; i != e; ++i) {
> > + if (Inst->getArgOperand(i)->getType() != ST->getElementType(i))
> > + return nullptr;
> > + }
> > + Value *Res = UndefValue::get(ExpectedType);
> > + IRBuilder<> Builder(Inst);
> > + for (unsigned i = 0, e = NumElts; i != e; ++i) {
> > + Value *L = Inst->getArgOperand(i);
> > + Res = Builder.CreateInsertValue(Res, L, i);
> > + }
> > + return Res;
> > + }
> > + case Intrinsic::aarch64_neon_ld2:
> > + case Intrinsic::aarch64_neon_ld3:
> > + case Intrinsic::aarch64_neon_ld4:
> > + if (Inst->getType() == ExpectedType)
> > + return Inst;
> > + return nullptr;
> > + }
> > +}
> > +
> > +bool AArch64TTI::getTgtMemIntrinsic(IntrinsicInst *Inst,
> > + MemIntrinsicInfo &Info) const {
> > + switch (Inst->getIntrinsicID()) {
> > + default:
> > + break;
> > + case Intrinsic::aarch64_neon_ld2:
> > + case Intrinsic::aarch64_neon_ld3:
> > + case Intrinsic::aarch64_neon_ld4:
> > + Info.ReadMem = true;
> > + Info.WriteMem = false;
> > + Info.Vol = false;
> > + Info.NumMemRefs = 1;
> > + Info.PtrVal = Inst->getArgOperand(0);
> > + break;
> > + case Intrinsic::aarch64_neon_st2:
> > + case Intrinsic::aarch64_neon_st3:
> > + case Intrinsic::aarch64_neon_st4:
> > + Info.ReadMem = false;
> > + Info.WriteMem = true;
> > + Info.Vol = false;
> > + Info.NumMemRefs = 1;
> > + Info.PtrVal = Inst->getArgOperand(Inst->getNumArgOperands() - 1);
> > + break;
> > + }
> > +
> > + switch (Inst->getIntrinsicID()) {
> > + default:
> > + return false;
> > + case Intrinsic::aarch64_neon_ld2:
> > + case Intrinsic::aarch64_neon_st2:
> > + Info.MatchingId = VECTOR_LDST_TWO_ELEMENTS;
> > + break;
> > + case Intrinsic::aarch64_neon_ld3:
> > + case Intrinsic::aarch64_neon_st3:
> > + Info.MatchingId = VECTOR_LDST_THREE_ELEMENTS;
> > + break;
> > + case Intrinsic::aarch64_neon_ld4:
> > + case Intrinsic::aarch64_neon_st4:
> > + Info.MatchingId = VECTOR_LDST_FOUR_ELEMENTS;
> > + break;
> > + }
> > + return true;
> > +}
> > 
> > Modified: llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp
> > URL:
> > http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > ==============================================================================
> > --- llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp (original)
> > +++ llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp Mon Jan 26 16:51:15 2015
> > @@ -18,6 +18,7 @@
> > #include "llvm/ADT/Statistic.h"
> > #include "llvm/Analysis/AssumptionCache.h"
> > #include "llvm/Analysis/InstructionSimplify.h"
> > +#include "llvm/Analysis/TargetTransformInfo.h"
> > #include "llvm/IR/DataLayout.h"
> > #include "llvm/IR/Dominators.h"
> > #include "llvm/IR/Instructions.h"
> > @@ -273,6 +274,7 @@ class EarlyCSE : public FunctionPass {
> > public:
> > const DataLayout *DL;
> > const TargetLibraryInfo *TLI;
> > + const TargetTransformInfo *TTI;
> > DominatorTree *DT;
> > AssumptionCache *AC;
> > typedef RecyclingAllocator<
> > @@ -383,14 +385,83 @@ private:
> > bool Processed;
> > };
> > 
> > + /// \brief Wrapper class to handle memory instructions, including loads,
> > + /// stores and intrinsic loads and stores defined by the target.
> > + class ParseMemoryInst {
> > + public:
> > + ParseMemoryInst(Instruction *Inst, const TargetTransformInfo *TTI)
> > + : Load(false), Store(false), Vol(false), MayReadFromMemory(false),
> > + MayWriteToMemory(false), MatchingId(-1), Ptr(nullptr) {
> > + MayReadFromMemory = Inst->mayReadFromMemory();
> > + MayWriteToMemory = Inst->mayWriteToMemory();
> > + if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst)) {
> > + MemIntrinsicInfo Info;
> > + if (!TTI->getTgtMemIntrinsic(II, Info))
> > + return;
> > + if (Info.NumMemRefs == 1) {
> > + Store = Info.WriteMem;
> > + Load = Info.ReadMem;
> > + MatchingId = Info.MatchingId;
> > + MayReadFromMemory = Info.ReadMem;
> > + MayWriteToMemory = Info.WriteMem;
> > + Vol = Info.Vol;
> > + Ptr = Info.PtrVal;
> > + }
> > + } else if (LoadInst *LI = dyn_cast<LoadInst>(Inst)) {
> > + Load = true;
> > + Vol = !LI->isSimple();
> > + Ptr = LI->getPointerOperand();
> > + } else if (StoreInst *SI = dyn_cast<StoreInst>(Inst)) {
> > + Store = true;
> > + Vol = !SI->isSimple();
> > + Ptr = SI->getPointerOperand();
> > + }
> > + }
> > + bool isLoad() { return Load; }
> > + bool isStore() { return Store; }
> > + bool isVolatile() { return Vol; }
> > + bool isMatchingMemLoc(const ParseMemoryInst &Inst) {
> > + return Ptr == Inst.Ptr && MatchingId == Inst.MatchingId;
> > + }
> > + bool isValid() { return Ptr != nullptr; }
> > + int getMatchingId() { return MatchingId; }
> > + Value *getPtr() { return Ptr; }
> > + bool mayReadFromMemory() { return MayReadFromMemory; }
> > + bool mayWriteToMemory() { return MayWriteToMemory; }
> > +
> > + private:
> > + bool Load;
> > + bool Store;
> > + bool Vol;
> > + bool MayReadFromMemory;
> > + bool MayWriteToMemory;
> > + // For regular (non-intrinsic) loads/stores, this is set to -1. For
> > + // intrinsic loads/stores, the id is retrieved from the corresponding
> > + // field in the MemIntrinsicInfo structure. That field contains
> > + // non-negative values only.
> > + int MatchingId;
> > + Value *Ptr;
> > + };
> > +
> > bool processNode(DomTreeNode *Node);
> > 
> > void getAnalysisUsage(AnalysisUsage &AU) const override {
> > AU.addRequired<AssumptionCacheTracker>();
> > AU.addRequired<DominatorTreeWrapperPass>();
> > AU.addRequired<TargetLibraryInfoWrapperPass>();
> > + AU.addRequired<TargetTransformInfo>();
> > AU.setPreservesCFG();
> > }
> > +
> > + Value *getOrCreateResult(Value *Inst, Type *ExpectedType) const {
> > + if (LoadInst *LI = dyn_cast<LoadInst>(Inst))
> > + return LI;
> > + else if (StoreInst *SI = dyn_cast<StoreInst>(Inst))
> > + return SI->getValueOperand();
> > + assert(isa<IntrinsicInst>(Inst) && "Instruction not supported");
> > + return TTI->getOrCreateResultFromMemIntrinsic(cast<IntrinsicInst>(Inst),
> > + ExpectedType);
> > + }
> > };
> > }
> > 
> > @@ -420,7 +491,7 @@ bool EarlyCSE::processNode(DomTreeNode *
> > /// as long as there in no instruction that reads memory. If we see a store
> > /// to the same location, we delete the dead store. This zaps trivial dead
> > /// stores which can occur in bitfield code among other things.
> > - StoreInst *LastStore = nullptr;
> > + Instruction *LastStore = nullptr;
> > 
> > bool Changed = false;
> > 
> > @@ -475,10 +546,11 @@ bool EarlyCSE::processNode(DomTreeNode *
> > continue;
> > }
> > 
> > + ParseMemoryInst MemInst(Inst, TTI);
> > // If this is a non-volatile load, process it.
> > - if (LoadInst *LI = dyn_cast<LoadInst>(Inst)) {
> > + if (MemInst.isValid() && MemInst.isLoad()) {
> > // Ignore volatile loads.
> > - if (!LI->isSimple()) {
> > + if (MemInst.isVolatile()) {
> > LastStore = nullptr;
> > continue;
> > }
> > @@ -486,27 +558,35 @@ bool EarlyCSE::processNode(DomTreeNode *
> > // If we have an available version of this load, and if it is the right
> > // generation, replace this instruction.
> > std::pair<Value *, unsigned> InVal =
> > - AvailableLoads->lookup(Inst->getOperand(0));
> > + AvailableLoads->lookup(MemInst.getPtr());
> > if (InVal.first != nullptr && InVal.second == CurrentGeneration) {
> > - DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
> > - << " to: " << *InVal.first << '\n');
> > - if (!Inst->use_empty())
> > - Inst->replaceAllUsesWith(InVal.first);
> > - Inst->eraseFromParent();
> > - Changed = true;
> > - ++NumCSELoad;
> > - continue;
> > + Value *Op = getOrCreateResult(InVal.first, Inst->getType());
> > + if (Op != nullptr) {
> > + DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
> > + << " to: " << *InVal.first << '\n');
> > + if (!Inst->use_empty())
> > + Inst->replaceAllUsesWith(Op);
> > + Inst->eraseFromParent();
> > + Changed = true;
> > + ++NumCSELoad;
> > + continue;
> > + }
> > }
> > 
> > // Otherwise, remember that we have this instruction.
> > - AvailableLoads->insert(Inst->getOperand(0), std::pair<Value *, unsigned>(
> > - Inst, CurrentGeneration));
> > + AvailableLoads->insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
> > + Inst, CurrentGeneration));
> > LastStore = nullptr;
> > continue;
> > }
> > 
> > // If this instruction may read from memory, forget LastStore.
> > - if (Inst->mayReadFromMemory())
> > + // Load/store intrinsics will indicate both a read and a write to
> > + // memory. The target may override this (e.g. so that a store intrinsic
> > + // does not read from memory, and thus will be treated the same as a
> > + // regular store for commoning purposes).
> > + if (Inst->mayReadFromMemory() &&
> > + !(MemInst.isValid() && !MemInst.mayReadFromMemory()))
> > LastStore = nullptr;
> > 
> > // If this is a read-only call, process it.
> > @@ -537,17 +617,19 @@ bool EarlyCSE::processNode(DomTreeNode *
> > if (Inst->mayWriteToMemory()) {
> > ++CurrentGeneration;
> > 
> > - if (StoreInst *SI = dyn_cast<StoreInst>(Inst)) {
> > + if (MemInst.isValid() && MemInst.isStore()) {
> > // We do a trivial form of DSE if there are two stores to the same
> > // location with no intervening loads. Delete the earlier store.
> > - if (LastStore &&
> > - LastStore->getPointerOperand() == SI->getPointerOperand()) {
> > - DEBUG(dbgs() << "EarlyCSE DEAD STORE: " << *LastStore
> > - << " due to: " << *Inst << '\n');
> > - LastStore->eraseFromParent();
> > - Changed = true;
> > - ++NumDSE;
> > - LastStore = nullptr;
> > + if (LastStore) {
> > + ParseMemoryInst LastStoreMemInst(LastStore, TTI);
> > + if (LastStoreMemInst.isMatchingMemLoc(MemInst)) {
> > + DEBUG(dbgs() << "EarlyCSE DEAD STORE: " << *LastStore
> > + << " due to: " << *Inst << '\n');
> > + LastStore->eraseFromParent();
> > + Changed = true;
> > + ++NumDSE;
> > + LastStore = nullptr;
> > + }
> > // fallthrough - we can exploit information about this store
> > }
> > 
> > @@ -556,13 +638,12 @@ bool EarlyCSE::processNode(DomTreeNode *
> > // version of the pointer. It is safe to forward from volatile stores
> > // to non-volatile loads, so we don't have to check for volatility of
> > // the store.
> > - AvailableLoads->insert(SI->getPointerOperand(),
> > - std::pair<Value *, unsigned>(
> > - SI->getValueOperand(), CurrentGeneration));
> > + AvailableLoads->insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
> > + Inst, CurrentGeneration));
> > 
> > // Remember that this was the last store we saw for DSE.
> > - if (SI->isSimple())
> > - LastStore = SI;
> > + if (!MemInst.isVolatile())
> > + LastStore = Inst;
> > }
> > }
> > }
> > @@ -584,6 +665,7 @@ bool EarlyCSE::runOnFunction(Function &F
> > DataLayoutPass *DLP = getAnalysisIfAvailable<DataLayoutPass>();
> > DL = DLP ? &DLP->getDataLayout() : nullptr;
> > TLI = &getAnalysis<TargetLibraryInfoWrapperPass>().getTLI();
> > + TTI = &getAnalysis<TargetTransformInfo>();
> > DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
> > AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
> > 
> > 
> > Added: llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll
> > URL:
> > http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll?rev=227149&view=auto
> > ==============================================================================
> > --- llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll (added)
> > +++ llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll Mon Jan 26 16:51:15 2015
> > @@ -0,0 +1,231 @@
> > +; RUN: opt < %s -S -mtriple=aarch64-none-linux-gnu -mattr=+neon -early-cse | FileCheck %s
> > +
> > +define <4 x i32> @test_cse(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > +entry:
> > +; Check that @llvm.aarch64.neon.ld2 is optimized away by Early CSE.
> > +; CHECK-LABEL: @test_cse
> > +; CHECK-NOT: call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8
> > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > + br label %for.cond
> > +
> > +for.cond: ; preds = %for.body, %entry
> > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > + %cmp = icmp slt i32 %i.0, %n
> > + br i1 %cmp, label %for.body, label %for.end
> > +
> > +for.body: ; preds = %for.cond
> > + %0 = bitcast i32* %a to i8*
> > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > + %4 = bitcast <16 x i8> %2 to <4 x i32>
> > + call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > + %5 = bitcast i32* %a to i8*
> > + %vld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > + %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
> > + %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
> > + %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld2.fca.0.extract, <4 x i32> %vld2.fca.0.extract)
> > + %inc = add nsw i32 %i.0, 1
> > + br label %for.cond
> > +
> > +for.end: ; preds = %for.cond
> > + ret <4 x i32> %res.0
> > +}
> > +
> > +define <4 x i32> @test_cse2(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > +entry:
> > +; Check that the first @llvm.aarch64.neon.st2 is optimized away by Early CSE.
> > +; CHECK-LABEL: @test_cse2
> > +; CHECK-NOT: call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %3, i8* %0)
> > +; CHECK: call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > + br label %for.cond
> > +
> > +for.cond: ; preds = %for.body, %entry
> > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > + %cmp = icmp slt i32 %i.0, %n
> > + br i1 %cmp, label %for.body, label %for.end
> > +
> > +for.body: ; preds = %for.cond
> > + %0 = bitcast i32* %a to i8*
> > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > + %4 = bitcast <16 x i8> %2 to <4 x i32>
> > + call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %3, i8* %0)
> > + call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > + %5 = bitcast i32* %a to i8*
> > + %vld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > + %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
> > + %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
> > + %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld2.fca.0.extract, <4 x i32> %vld2.fca.0.extract)
> > + %inc = add nsw i32 %i.0, 1
> > + br label %for.cond
> > +
> > +for.end: ; preds = %for.cond
> > + ret <4 x i32> %res.0
> > +}
> > +
> > +define <4 x i32> @test_cse3(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) #0 {
> > +entry:
> > +; Check that the first @llvm.aarch64.neon.ld2 is optimized away by Early CSE.
> > +; CHECK-LABEL: @test_cse3
> > +; CHECK: call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8
> > +; CHECK-NOT: call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8
> > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > + br label %for.cond
> > +
> > +for.cond: ; preds = %for.body, %entry
> > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > + %cmp = icmp slt i32 %i.0, %n
> > + br i1 %cmp, label %for.body, label %for.end
> > +
> > +for.body: ; preds = %for.cond
> > + %0 = bitcast i32* %a to i8*
> > + %vld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %0)
> > + %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
> > + %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
> > + %1 = bitcast i32* %a to i8*
> > + %vld22 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %1)
> > + %vld22.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld22, 0
> > + %vld22.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld22, 1
> > + %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld2.fca.0.extract, <4 x i32> %vld22.fca.0.extract)
> > + %inc = add nsw i32 %i.0, 1
> > + br label %for.cond
> > +
> > +for.end: ; preds = %for.cond
> > + ret <4 x i32> %res.0
> > +}
> > +
> > +
> > +define <4 x i32> @test_nocse(i32* %a, i32* %b, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > +entry:
> > +; Check that the store prevents @llvm.aarch64.neon.ld2 from being optimized
> > +; away by Early CSE.
> > +; CHECK-LABEL: @test_nocse
> > +; CHECK: call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8
> > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > + br label %for.cond
> > +
> > +for.cond: ; preds = %for.body, %entry
> > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > + %cmp = icmp slt i32 %i.0, %n
> > + br i1 %cmp, label %for.body, label %for.end
> > +
> > +for.body: ; preds = %for.cond
> > + %0 = bitcast i32* %a to i8*
> > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > + %4 = bitcast <16 x i8> %2 to <4 x i32>
> > + call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > + store i32 0, i32* %b, align 4
> > + %5 = bitcast i32* %a to i8*
> > + %vld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > + %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
> > + %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
> > + %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld2.fca.0.extract, <4 x i32> %vld2.fca.0.extract)
> > + %inc = add nsw i32 %i.0, 1
> > + br label %for.cond
> > +
> > +for.end: ; preds = %for.cond
> > + ret <4 x i32> %res.0
> > +}
> > +
> > +define <4 x i32> @test_nocse2(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > +entry:
> > +; Check that @llvm.aarch64.neon.ld3 is not optimized away by Early CSE due
> > +; to mismatch between st2 and ld3.
> > +; CHECK-LABEL: @test_nocse2
> > +; CHECK: call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8
> > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > + br label %for.cond
> > +
> > +for.cond: ; preds = %for.body, %entry
> > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > + %cmp = icmp slt i32 %i.0, %n
> > + br i1 %cmp, label %for.body, label %for.end
> > +
> > +for.body: ; preds = %for.cond
> > + %0 = bitcast i32* %a to i8*
> > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > + %4 = bitcast <16 x i8> %2 to <4 x i32>
> > + call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > + %5 = bitcast i32* %a to i8*
> > + %vld3 = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8(i8* %5)
> > + %vld3.fca.0.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } %vld3, 0
> > + %vld3.fca.2.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } %vld3, 2
> > + %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld3.fca.0.extract, <4 x i32> %vld3.fca.2.extract)
> > + %inc = add nsw i32 %i.0, 1
> > + br label %for.cond
> > +
> > +for.end: ; preds = %for.cond
> > + ret <4 x i32> %res.0
> > +}
> > +
> > +define <4 x i32> @test_nocse3(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > +entry:
> > +; Check that @llvm.aarch64.neon.st3 is not optimized away by Early CSE due to
> > +; mismatch between st2 and st3.
> > +; CHECK-LABEL: @test_nocse3
> > +; CHECK: call void @llvm.aarch64.neon.st3.v4i32.p0i8
> > +; CHECK: call void @llvm.aarch64.neon.st2.v4i32.p0i8
> > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > + br label %for.cond
> > +
> > +for.cond: ; preds = %for.body, %entry
> > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > + %cmp = icmp slt i32 %i.0, %n
> > + br i1 %cmp, label %for.body, label %for.end
> > +
> > +for.body: ; preds = %for.cond
> > + %0 = bitcast i32* %a to i8*
> > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > + %4 = bitcast <16 x i8> %2 to <4 x i32>
> > + call void @llvm.aarch64.neon.st3.v4i32.p0i8(<4 x i32> %4, <4 x i32> %3, <4 x i32> %3, i8* %0)
> > + call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %3, i8* %0)
> > + %5 = bitcast i32* %a to i8*
> > + %vld3 = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8(i8* %5)
> > + %vld3.fca.0.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } %vld3, 0
> > + %vld3.fca.1.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } %vld3, 1
> > + %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld3.fca.0.extract, <4 x i32> %vld3.fca.0.extract)
> > + %inc = add nsw i32 %i.0, 1
> > + br label %for.cond
> > +
> > +for.end: ; preds = %for.cond
> > + ret <4 x i32> %res.0
> > +}
> > +
> > +; Function Attrs: nounwind
> > +declare void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32>, <4 x i32>, i8* nocapture)
> > +
> > +; Function Attrs: nounwind
> > +declare void @llvm.aarch64.neon.st3.v4i32.p0i8(<4 x i32>, <4 x i32>, <4 x i32>, i8* nocapture)
> > +
> > +; Function Attrs: nounwind readonly
> > +declare { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8*)
> > +
> > +; Function Attrs: nounwind readonly
> > +declare { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8(i8*)
> > +
> > +define internal fastcc <4 x i32> @vaddq_s32(<4 x i32> %__p0, <4 x i32> %__p1) {
> > +entry:
> > + %add = add <4 x i32> %__p0, %__p1
> > + ret <4 x i32> %add
> > +}
> > 
> > Added: llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg
> > URL:
> > http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg?rev=227149&view=auto
> > ==============================================================================
> > --- llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg (added)
> > +++ llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg Mon Jan 26 16:51:15 2015
> > @@ -0,0 +1,5 @@
> > +config.suffixes = ['.ll']
> > +
> > +targets = set(config.root.targets_to_build.split())
> > +if not 'AArch64' in targets:
> > + config.unsupported = True
> > 
> > 
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > 
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


