[llvm] r227149 - Commoning of target specific load/store intrinsics in Early CSE.

Sanjin Sijaric ssijaric at codeaurora.org
Fri Feb 27 12:55:37 PST 2015



-----Original Message-----
From: Hal Finkel [mailto:hfinkel at anl.gov] 
Sent: Thursday, February 26, 2015 5:19 PM
To: Sanjin Sijaric
Cc: Commit Messages and Patches for LLVM; Pete Cooper; Chandler Carruth; Philip Reames
Subject: Re: [llvm] r227149 - Commoning of target specific load/store intrinsics in Early CSE.

----- Original Message -----
> From: "Sanjin Sijaric" <ssijaric at codeaurora.org>
> To: "Hal Finkel" <hfinkel at anl.gov>, "Chandler Carruth" <chandlerc at google.com>, "Philip Reames"
> <listmail at philipreames.com>
> Cc: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, 
> "Pete Cooper" <peter_cooper at apple.com>
> Sent: Thursday, February 26, 2015 7:02:13 PM
> Subject: RE: [llvm] r227149 - Commoning of target specific load/store	intrinsics in Early CSE.
> 
> Hi Hal,
> 
> Thanks for catching this.
> 
> In the tests I've done on AArch64, mismatching loads and stores were 
> handled correctly due to the following check (test_nocse2 in
> test/Transforms/EarlyCSE/AArch64/intrinsics.ll):
> 
> Value *Op = getOrCreateResult(InVal.first, Inst->getType());
> if (Op != nullptr) {
>   DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
>                << "  to: " << *InVal.first << '\n');
>   ....
>   ....
> 
> For example, we may have st2 and ld3 off the same pointer, ptr, and 
> something like
> 
> st2  <=== add ptr to Available Loads
> ...
> ...
> ld3 <=== we have a matching pointer in AvailableLoads, so consider 
> commoning
> 
> The function getOrCreateResult(InVal.first, Inst->getType()) will 
> return nullptr, as the types will not match.  No commoning will occur.
> 
> Using MatchingId would be more consistent as it is used for commoning 
> stores.  Am I missing a case where this check is not sufficient?

If all of your intrinsics with different matching ids also have different types, and if those are also equivalent to regular loads/stores of that same type, then you'll never see the problem. But if you need the MatchingId to distinguish between different intrinsics with the same type, or between the intrinsics and regular loads/stores, then you'll have a problem. When I looked at implementing these interfaces for PowerPC, I ran into this problem.

 -Hal
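
To make the hazard concrete, here is a minimal standalone sketch (plain C++ with made-up names; not the actual ScopedHashTable-based EarlyCSE code): if the table records only the pointer and the generation, two intrinsics that return the same type through the same pointer but carry different matching ids look interchangeable.

#include <cstdio>
#include <map>

// Each available value is remembered only as (defining "instruction", generation);
// the matching id of the intrinsic that produced it is lost.
struct RecordedValue {
  const char *Inst;    // stand-in for the recorded Instruction*
  unsigned Generation;
};

int main() {
  std::map<const void *, RecordedValue> AvailableLoads; // keyed by pointer only
  int Object = 0;
  const void *Ptr = &Object;
  unsigned CurrentGeneration = 1;

  // A "ldA"-style intrinsic (matching id 1) makes a value of type T available.
  AvailableLoads[Ptr] = {"ldA(Ptr)", CurrentGeneration};

  // A later "ldB"-style intrinsic (matching id 2) also returns type T from Ptr.
  // A lookup that checks only pointer, generation and result type matches,
  // so the two incompatible intrinsics would be CSE'd.
  auto It = AvailableLoads.find(Ptr);
  if (It != AvailableLoads.end() && It->second.Generation == CurrentGeneration)
    std::printf("would reuse %s for ldB(Ptr) -- wrong, the matching ids differ\n",
                It->second.Inst);
  return 0;
}

The existing type check happens to catch the AArch64 st2/ld3 case, but it cannot catch this one.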


I see.  Thanks.  I'll move the check for MatchingId into LoadSite/StoreSite.  That way we don't have to change the type of AvailableLoads.

Alternatively, I can change the type of AvailableLoads from

typedef ScopedHashTable<Value*, std::pair<Value*, unsigned>,
                          DenseMapInfo<Value*>, LoadMapAllocator> LoadHTType;

to

typedef ScopedHashTable<Value*, std::tuple<Value*, unsigned, short>,
                          DenseMapInfo<Value*>, LoadMapAllocator> LoadHTType;


We don't change the key type, but this increases the size of AvailableLoads for cases that don't need the MatchingId (which is the majority).  I'm leaning towards the first approach, as we can just do the comparison via LoadSite or StoreSite.  Something like

if (LoadSite LS = LoadSite(Inst, TTI)) {
  ...
  ...
  if (LS.MatchId(InVal.first)) { ....
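
Modeled standalone, the idea looks roughly like this (plain C++; LoadSite and matches() are stand-ins for the wrapper that is still to be written, and std::map stands in for the ScopedHashTable). The point is only that the matching-id comparison can live in the wrapper while AvailableLoads keeps its current pair-shaped value:

#include <cstdio>
#include <initializer_list>
#include <map>
#include <utility>

// Stand-in for the proposed LoadSite/StoreSite wrapper: it carries the pointer
// and the target-provided matching id (-1 for plain loads/stores).  In the real
// pass the wrapper would be rebuilt from the recorded instruction; here it is
// stored directly to keep the model small.
struct LoadSite {
  const void *Ptr;
  int MatchingId;
  bool matches(const LoadSite &Other) const {
    return Ptr == Other.Ptr && MatchingId == Other.MatchingId;
  }
};

int main() {
  // The table's value type stays a (recorded site, generation) pair, as today.
  std::map<const void *, std::pair<LoadSite, unsigned>> AvailableLoads;

  int Object = 0;
  unsigned CurrentGeneration = 1;
  LoadSite St2{&Object, /*MatchingId=*/2}; // e.g. an st2-style store intrinsic
  LoadSite Ld3{&Object, /*MatchingId=*/3}; // an ld3-style load: must not match
  LoadSite Ld2{&Object, /*MatchingId=*/2}; // an ld2-style load: may match

  AvailableLoads[St2.Ptr] = {St2, CurrentGeneration};

  for (const LoadSite &LS : {Ld3, Ld2}) {
    auto It = AvailableLoads.find(LS.Ptr);
    bool Reuse = It != AvailableLoads.end() &&
                 It->second.second == CurrentGeneration &&
                 It->second.first.matches(LS); // the extra matching-id check
    std::printf("forward to load with matching id %d: %s\n", LS.MatchingId,
                Reuse ? "yes" : "no");
  }
  return 0;
}

The tuple (or a side map keyed by the recorded value) would avoid re-deriving the id at lookup time, but pays for a wider entry or an extra lookup structure even in the common non-intrinsic case, which is the trade-off mentioned above.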


Thanks,
Sanjin

> 
> I started work on the LoadSite and StoreSite wrappers that Philip suggested,
> and will upload it shortly.  I can either incorporate this change into that
> patch or send it as a separate patch.
> 
> Thanks again,
> Sanjin
> 
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Thursday, February 26, 2015 2:34 PM
> To: Chandler Carruth; Sanjin Sijaric; Philip Reames
> Cc: Commit Messages and Patches for LLVM; Pete Cooper
> Subject: Re: [llvm] r227149 - Commoning of target specific load/store 
> intrinsics in Early CSE.
> 
> Hi Sanjin, et al.,
> 
> Since I've now done a lot of typing in support of this commit, I also 
> need to be the bearer of bad news: There is a bug...
> 
> The problem is that when we process a store intrinsic, we add its 
> pointer/value to the set of AvailableLoads, just as we do with a 
> regular store instruction:
> 
>         // Okay, we just invalidated anything we knew about loaded values.  Try
>         // to salvage *something* by remembering that the stored value is a live
>         // version of the pointer.  It is safe to forward from volatile stores
>         // to non-volatile loads, so we don't have to check for volatility of
>         // the store.
>         AvailableLoads.insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
>                                                     Inst, CurrentGeneration));
> 
> If we then see a later load intrinsic, we look up its pointer in 
> AvailableLoads to see if the value is already available:
> 
>       // If we have an available version of this load, and if it is the right
>       // generation, replace this instruction.
>       std::pair<Value *, unsigned> InVal =
>           AvailableLoads.lookup(MemInst.getPtr());
>       if (InVal.first != nullptr && InVal.second == CurrentGeneration) {
>         Value *Op = getOrCreateResult(InVal.first, Inst->getType());
>         if (Op != nullptr) {
>           DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
>                        << "  to: " << *InVal.first << '\n');
>           if (!Inst->use_empty())
>             Inst->replaceAllUsesWith(Op);
>           Inst->eraseFromParent();
>       ...
> 
> and if not, we then add that pointer/value to the map of available
> loads:
> 
>       // Otherwise, remember that we have this instruction.
>       AvailableLoads.insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
>                                                   Inst, CurrentGeneration));
> 
> but this process completely ignores the 'MatchingId' of the intrinsic
> that had been processed. As a result, we might CSE loads of the same
> pointer that use incompatible intrinsics, and we might do store-to-load
> forwarding between store and load intrinsics that are not logical
> inverses of each other.
> 
> One possible solution here is to change AvailableLoads to use 
> std::pair<Value *, unsigned /*MatchingId*/> as the key type, instead 
> of just Value*. But since the intrinsic case is relatively rare, we 
> might want to keep a side map of (Value* -> MatchingId), and simply 
> record in the AvailableLoads map whether or not there is some relevant 
> MatchingId to fetch (I'm somewhat worried that just changing the key 
> type to something more complicated will slow down the code for the 
> benefit of a rare case).
> 
> In any case, this needs to be fixed.
> 
>  -Hal
> 
> ----- Original Message -----
> > From: "Hal Finkel" <hfinkel at anl.gov>
> > To: "Chandler Carruth" <chandlerc at google.com>
> > Cc: "Commit Messages and Patches for LLVM"
> > <llvm-commits at cs.uiuc.edu>
> > Sent: Tuesday, January 27, 2015 6:07:57 AM
> > Subject: Re: [llvm] r227149 - Commoning of target specific
> > load/store	intrinsics in Early CSE.
> > 
> > ----- Original Message -----
> > > From: "Chandler Carruth" <chandlerc at google.com>
> > > To: "Chad Rosier" <mcrosier at codeaurora.org>
> > > Cc: "Commit Messages and Patches for LLVM"
> > > <llvm-commits at cs.uiuc.edu>
> > > Sent: Tuesday, January 27, 2015 2:24:19 AM
> > > Subject: Re: [llvm] r227149 - Commoning of target specific
> > > load/store	intrinsics in Early CSE.
> > > 
> > > 
> > > 
> > > I know it's a pain to have post-commit design review, but it's
> > > impossible to catch everything pre-commit.
> > > 
> > > 
> > > On Mon, Jan 26, 2015 at 2:51 PM, Chad Rosier < 
> > > mcrosier at codeaurora.org > wrote:
> > > 
> > > 
> > > Author: mcrosier
> > > Date: Mon Jan 26 16:51:15 2015
> > > New Revision: 227149
> > > 
> > > URL: http://llvm.org/viewvc/llvm-project?rev=227149&view=rev
> > > Log:
> > > Commoning of target specific load/store intrinsics in Early CSE.
> > > 
> > > Phabricator revision: http://reviews.llvm.org/D7121
> > > Patch by Sanjin Sijaric <ssijaric at codeaurora.org>!
> > > 
> > > 
> > > 
> > > The thing I completely missed when skimming the subject was the 
> > > fact that this is not just target-specific, but leveraging TTI...
> > > 
> > > 
> > > So, first question: why is this desirable? Why do we need this?
> > > There
> > > seems to be no real justification for this in the commit log, 
> > > comments, or anywhere.
> > > 
> > > 
> > > EarlyCSE is a very high-level pass. I don't know why it is 
> > > reasonable to teach it how to deal with every kind of 
> > > target-specific load intrinsic we come up with. This is really 
> > > contrary to the entire IR's design.
> > > 
> > > 
> > > The point of TTI was to expose *cost models* from the code 
> > > generator to the IR pass. This is something completely different 
> > > and subverts that design in a really fundamental way. I'm pretty 
> > > opposed to this entire direction unless there is some deep and 
> > > fundamental reason why we *have* to do this and so far I'm not 
> > > seeing it. I could speculate about any number of other ways you 
> > > might solve similar or related problems, but it seems pointless 
> > > until there is a clear and precise description of the problem this 
> > > was intended to solve.
> > > 
> > > So on multiple levels I feel like this is not the right design.
> > > 
> > 
> > I disagree, and furthermore, I'd like to encourage the general 
> > design paradigm explored by this change:
> > 
> >  1. I don't believe that TTI is or needs to be *just* about cost 
> > modeling. It should be, generically, our mechanism to enforce a 
> > proper separation of concerns between the IR-level optimizers and  
> > the providers of target-specific knowledge. Cost modeling is one  
> > area where this is important, but handling of target-specific  
> > intrinsics is another.
> > 
> >  2. Currently, several IR-level passes (InstCombine, BasicAA, etc.)
> >  have built-in knowledge of some target-specific intrinsics. IMHO,
> > this is a pretty awful design, but we also did not have any
> > alternative when the code was first written. Now we do, and we can
> > refactor this code so that knowledge of the properties of
> > target-specific intrinsics lives in the targets, as it should. In
> > addition to yielding a better separation of concerns, and being
> > friendlier to out-of-tree targets, it also reduces the burden of
> > extending this knowledge (the benefit needs to feel really large to
> > make it worthwhile cluttering the IR transforms with it).
> > 
> >  3. Forcing all optimizations on target-specific intrinsics, or  
> > their associated instructions, into the backend will generically 
> > produce sub-optimal code. The SDAG/MI-level optimizers are not as 
> > powerful  as the IR-level ones (MachineLICM does not track alias 
> > sets, for example), and you can't recover completely at that level. 
> > Maybe one day this will be different, but I'm not sure providing 
> > parity here  is a worthwhile aim.
> > 
> > With this change, there is something of a conceptual completeness 
> > problem because the same capabilities have not been added to GVN.
> > But touching GVN has a much higher cost, and I'm fine with letting 
> > this new interface settle a bit and leaving GVN changes for some 
> > later time (if ever).
> > 
> > Thanks again,
> > Hal
> > 
> > > 
> > > -Chandler
> > > 
> > > 
> > > 
> > > 
> > > Added:
> > > llvm/trunk/test/Transforms/EarlyCSE/AArch64/
> > > llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll
> > > llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg
> > > Modified:
> > > llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h
> > > llvm/trunk/lib/Analysis/TargetTransformInfo.cpp
> > > llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
> > > llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp
> > > 
> > > Modified: llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h
> > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h?rev=227149&r1=227148&r2=227149&view=diff
> > > ==============================================================================
> > > --- llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h
> > > (original)
> > > +++ llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h Mon
> > > Jan
> > > 26
> > > 16:51:15 2015
> > > @@ -23,6 +23,7 @@
> > > #define LLVM_ANALYSIS_TARGETTRANSFORMINFO_H
> > > 
> > > #include "llvm/IR/Intrinsics.h"
> > > +#include "llvm/IR/IntrinsicInst.h"
> > > #include "llvm/Pass.h"
> > > #include "llvm/Support/DataTypes.h"
> > > 
> > > @@ -35,6 +36,20 @@ class Type;
> > > class User;
> > > class Value;
> > > 
> > > +/// \brief Information about a load/store intrinsic defined by the target.
> > > +struct MemIntrinsicInfo {
> > > +  MemIntrinsicInfo()
> > > +      : ReadMem(false), WriteMem(false), Vol(false), MatchingId(0),
> > > +        NumMemRefs(0), PtrVal(nullptr) {}
> > > +  bool ReadMem;
> > > +  bool WriteMem;
> > > +  bool Vol;
> > > +  // Same Id is set by the target for corresponding load/store intrinsics.
> > > +  unsigned short MatchingId;
> > > +  int NumMemRefs;
> > > +  Value *PtrVal;
> > > +};
> > > +
> > > /// TargetTransformInfo - This pass provides access to the codegen 
> > > /// interfaces that are needed for IR-level transformations.
> > > class TargetTransformInfo {
> > > @@ -443,6 +458,20 @@ public:
> > > /// any callee-saved registers, so would require a spill and fill.
> > > virtual unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type*> Tys) const;
> > > 
> > > +  /// \returns True if the intrinsic is a supported memory intrinsic.  Info
> > > +  /// will contain additional information - whether the intrinsic may write
> > > +  /// or read to memory, volatility and the pointer.  Info is undefined
> > > +  /// if false is returned.
> > > +  virtual bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > +                                  MemIntrinsicInfo &Info) const;
> > > +
> > > +  /// \returns A value which is the result of the given memory intrinsic.  New
> > > +  /// instructions may be created to extract the result from the given intrinsic
> > > +  /// memory operation. Returns nullptr if the target cannot create a result
> > > +  /// from the given intrinsic.
> > > +  virtual Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > > +                                                   Type *ExpectedType) const;
> > > +
> > > /// @}
> > > 
> > > /// Analysis group identification.
> > > 
> > > Modified: llvm/trunk/lib/Analysis/TargetTransformInfo.cpp
> > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Analysis/TargetTransformInfo.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > > ==============================================================================
> > > --- llvm/trunk/lib/Analysis/TargetTransformInfo.cpp (original)
> > > +++ llvm/trunk/lib/Analysis/TargetTransformInfo.cpp Mon Jan 26
> > > 16:51:15 2015
> > > @@ -254,6 +254,16 @@ unsigned TargetTransformInfo::getCostOfK
> > > return PrevTTI->getCostOfKeepingLiveOverCall(Tys);
> > > }
> > > 
> > > +Value *TargetTransformInfo::getOrCreateResultFromMemIntrinsic(
> > > +    IntrinsicInst *Inst, Type *ExpectedType) const {
> > > +  return PrevTTI->getOrCreateResultFromMemIntrinsic(Inst, ExpectedType);
> > > +}
> > > +
> > > +bool TargetTransformInfo::getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > +                                             MemIntrinsicInfo &Info) const {
> > > +  return PrevTTI->getTgtMemIntrinsic(Inst, Info);
> > > +}
> > > +
> > > namespace {
> > > 
> > > struct NoTTI final : ImmutablePass, TargetTransformInfo {
> > > @@ -656,6 +666,15 @@ struct NoTTI final : ImmutablePass, Targ
> > > return 0;
> > > }
> > > 
> > > +  bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > +                          MemIntrinsicInfo &Info) const override {
> > > +    return false;
> > > +  }
> > > +
> > > +  Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > > +                                           Type *ExpectedType) const override {
> > > +    return nullptr;
> > > +  }
> > > };
> > > 
> > > } // end anonymous namespace
> > > 
> > > Modified:
> > > llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
> > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > > ==============================================================================
> > > --- llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
> > > (original)
> > > +++ llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
> > > Mon
> > > Jan 26 16:51:15 2015
> > > @@ -44,6 +44,12 @@ class AArch64TTI final : public Immutabl
> > > /// are set if the result needs to be inserted and/or extracted from vectors.
> > > unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract) const;
> > > 
> > > +  enum MemIntrinsicType {
> > > +    VECTOR_LDST_TWO_ELEMENTS,
> > > +    VECTOR_LDST_THREE_ELEMENTS,
> > > +    VECTOR_LDST_FOUR_ELEMENTS
> > > +  };
> > > +
> > > public:
> > > AArch64TTI() : ImmutablePass(ID), TM(nullptr), ST(nullptr), TLI(nullptr) {
> > > llvm_unreachable("This pass cannot be directly constructed");
> > > @@ -131,6 +137,11 @@ public:
> > > void getUnrollingPreferences(const Function *F, Loop *L,
> > > UnrollingPreferences &UP) const override;
> > > 
> > > +  Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > > +                                           Type *ExpectedType) const override;
> > > +
> > > +  bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > +                          MemIntrinsicInfo &Info) const override;
> > > 
> > > /// @}
> > > };
> > > @@ -554,3 +565,83 @@ void AArch64TTI::getUnrollingPreferences
> > > // Disable partial & runtime unrolling on -Os.
> > > UP.PartialOptSizeThreshold = 0;
> > > }
> > > +
> > > +Value *AArch64TTI::getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > > +                                                     Type *ExpectedType) const {
> > > +  switch (Inst->getIntrinsicID()) {
> > > +  default:
> > > +    return nullptr;
> > > +  case Intrinsic::aarch64_neon_st2:
> > > +  case Intrinsic::aarch64_neon_st3:
> > > +  case Intrinsic::aarch64_neon_st4: {
> > > +    // Create a struct type
> > > +    StructType *ST = dyn_cast<StructType>(ExpectedType);
> > > +    if (!ST)
> > > +      return nullptr;
> > > +    unsigned NumElts = Inst->getNumArgOperands() - 1;
> > > +    if (ST->getNumElements() != NumElts)
> > > +      return nullptr;
> > > +    for (unsigned i = 0, e = NumElts; i != e; ++i) {
> > > +      if (Inst->getArgOperand(i)->getType() != ST->getElementType(i))
> > > +        return nullptr;
> > > +    }
> > > +    Value *Res = UndefValue::get(ExpectedType);
> > > +    IRBuilder<> Builder(Inst);
> > > +    for (unsigned i = 0, e = NumElts; i != e; ++i) {
> > > +      Value *L = Inst->getArgOperand(i);
> > > +      Res = Builder.CreateInsertValue(Res, L, i);
> > > +    }
> > > +    return Res;
> > > +  }
> > > +  case Intrinsic::aarch64_neon_ld2:
> > > +  case Intrinsic::aarch64_neon_ld3:
> > > +  case Intrinsic::aarch64_neon_ld4:
> > > +    if (Inst->getType() == ExpectedType)
> > > +      return Inst;
> > > +    return nullptr;
> > > +  }
> > > +}
> > > +
> > > +bool AArch64TTI::getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > +                                    MemIntrinsicInfo &Info) const {
> > > +  switch (Inst->getIntrinsicID()) {
> > > +  default:
> > > +    break;
> > > +  case Intrinsic::aarch64_neon_ld2:
> > > +  case Intrinsic::aarch64_neon_ld3:
> > > +  case Intrinsic::aarch64_neon_ld4:
> > > +    Info.ReadMem = true;
> > > +    Info.WriteMem = false;
> > > +    Info.Vol = false;
> > > +    Info.NumMemRefs = 1;
> > > +    Info.PtrVal = Inst->getArgOperand(0);
> > > +    break;
> > > +  case Intrinsic::aarch64_neon_st2:
> > > +  case Intrinsic::aarch64_neon_st3:
> > > +  case Intrinsic::aarch64_neon_st4:
> > > +    Info.ReadMem = false;
> > > +    Info.WriteMem = true;
> > > +    Info.Vol = false;
> > > +    Info.NumMemRefs = 1;
> > > +    Info.PtrVal = Inst->getArgOperand(Inst->getNumArgOperands() - 1);
> > > +    break;
> > > +  }
> > > +
> > > +  switch (Inst->getIntrinsicID()) {
> > > +  default:
> > > +    return false;
> > > +  case Intrinsic::aarch64_neon_ld2:
> > > +  case Intrinsic::aarch64_neon_st2:
> > > +    Info.MatchingId = VECTOR_LDST_TWO_ELEMENTS;
> > > +    break;
> > > +  case Intrinsic::aarch64_neon_ld3:
> > > +  case Intrinsic::aarch64_neon_st3:
> > > +    Info.MatchingId = VECTOR_LDST_THREE_ELEMENTS;
> > > +    break;
> > > +  case Intrinsic::aarch64_neon_ld4:
> > > +  case Intrinsic::aarch64_neon_st4:
> > > +    Info.MatchingId = VECTOR_LDST_FOUR_ELEMENTS;
> > > +    break;
> > > +  }
> > > +  return true;
> > > +}
> > > 
> > > Modified: llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp
> > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > > ==============================================================================
> > > --- llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp (original)
> > > +++ llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp Mon Jan 26
> > > 16:51:15
> > > 2015
> > > @@ -18,6 +18,7 @@
> > > #include "llvm/ADT/Statistic.h"
> > > #include "llvm/Analysis/AssumptionCache.h"
> > > #include "llvm/Analysis/InstructionSimplify.h"
> > > +#include "llvm/Analysis/TargetTransformInfo.h"
> > > #include "llvm/IR/DataLayout.h"
> > > #include "llvm/IR/Dominators.h"
> > > #include "llvm/IR/Instructions.h"
> > > @@ -273,6 +274,7 @@ class EarlyCSE : public FunctionPass {
> > > public:
> > > const DataLayout *DL;
> > > const TargetLibraryInfo *TLI;
> > > + const TargetTransformInfo *TTI;
> > > DominatorTree *DT;
> > > AssumptionCache *AC;
> > > typedef RecyclingAllocator<
> > > @@ -383,14 +385,83 @@ private:
> > > bool Processed;
> > > };
> > > 
> > > +  /// \brief Wrapper class to handle memory instructions, including loads,
> > > +  /// stores and intrinsic loads and stores defined by the target.
> > > +  class ParseMemoryInst {
> > > +  public:
> > > +    ParseMemoryInst(Instruction *Inst, const TargetTransformInfo *TTI)
> > > +        : Load(false), Store(false), Vol(false), MayReadFromMemory(false),
> > > +          MayWriteToMemory(false), MatchingId(-1), Ptr(nullptr) {
> > > +      MayReadFromMemory = Inst->mayReadFromMemory();
> > > +      MayWriteToMemory = Inst->mayWriteToMemory();
> > > +      if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst)) {
> > > +        MemIntrinsicInfo Info;
> > > +        if (!TTI->getTgtMemIntrinsic(II, Info))
> > > +          return;
> > > +        if (Info.NumMemRefs == 1) {
> > > +          Store = Info.WriteMem;
> > > +          Load = Info.ReadMem;
> > > +          MatchingId = Info.MatchingId;
> > > +          MayReadFromMemory = Info.ReadMem;
> > > +          MayWriteToMemory = Info.WriteMem;
> > > +          Vol = Info.Vol;
> > > +          Ptr = Info.PtrVal;
> > > +        }
> > > +      } else if (LoadInst *LI = dyn_cast<LoadInst>(Inst)) {
> > > +        Load = true;
> > > +        Vol = !LI->isSimple();
> > > +        Ptr = LI->getPointerOperand();
> > > +      } else if (StoreInst *SI = dyn_cast<StoreInst>(Inst)) {
> > > +        Store = true;
> > > +        Vol = !SI->isSimple();
> > > +        Ptr = SI->getPointerOperand();
> > > +      }
> > > +    }
> > > +    bool isLoad() { return Load; }
> > > +    bool isStore() { return Store; }
> > > +    bool isVolatile() { return Vol; }
> > > +    bool isMatchingMemLoc(const ParseMemoryInst &Inst) {
> > > +      return Ptr == Inst.Ptr && MatchingId == Inst.MatchingId;
> > > +    }
> > > +    bool isValid() { return Ptr != nullptr; }
> > > +    int getMatchingId() { return MatchingId; }
> > > +    Value *getPtr() { return Ptr; }
> > > +    bool mayReadFromMemory() { return MayReadFromMemory; }
> > > +    bool mayWriteToMemory() { return MayWriteToMemory; }
> > > +
> > > +  private:
> > > +    bool Load;
> > > +    bool Store;
> > > +    bool Vol;
> > > +    bool MayReadFromMemory;
> > > +    bool MayWriteToMemory;
> > > +    // For regular (non-intrinsic) loads/stores, this is set to -1. For
> > > +    // intrinsic loads/stores, the id is retrieved from the corresponding
> > > +    // field in the MemIntrinsicInfo structure. That field contains
> > > +    // non-negative values only.
> > > +    int MatchingId;
> > > +    Value *Ptr;
> > > +  };
> > > +
> > > bool processNode(DomTreeNode *Node);
> > > 
> > > void getAnalysisUsage(AnalysisUsage &AU) const override { 
> > > AU.addRequired<AssumptionCacheTracker>();
> > > AU.addRequired<DominatorTreeWrapperPass>();
> > > AU.addRequired<TargetLibraryInfoWrapperPass>();
> > > + AU.addRequired<TargetTransformInfo>();
> > > AU.setPreservesCFG();
> > > }
> > > +
> > > +  Value *getOrCreateResult(Value *Inst, Type *ExpectedType) const {
> > > +    if (LoadInst *LI = dyn_cast<LoadInst>(Inst))
> > > +      return LI;
> > > +    else if (StoreInst *SI = dyn_cast<StoreInst>(Inst))
> > > +      return SI->getValueOperand();
> > > +    assert(isa<IntrinsicInst>(Inst) && "Instruction not supported");
> > > +    return TTI->getOrCreateResultFromMemIntrinsic(cast<IntrinsicInst>(Inst),
> > > +                                                  ExpectedType);
> > > +  }
> > > };
> > > }
> > > 
> > > @@ -420,7 +491,7 @@ bool EarlyCSE::processNode(DomTreeNode * /// 
> > > as long as there in no instruction that reads memory. If we see a 
> > > store /// to the same location, we delete the dead store. This 
> > > zaps trivial dead /// stores which can occur in bitfield code 
> > > among other things.
> > > - StoreInst *LastStore = nullptr;
> > > + Instruction *LastStore = nullptr;
> > > 
> > > bool Changed = false;
> > > 
> > > @@ -475,10 +546,11 @@ bool EarlyCSE::processNode(DomTreeNode * 
> > > continue; }
> > > 
> > > + ParseMemoryInst MemInst(Inst, TTI);
> > > // If this is a non-volatile load, process it.
> > > - if (LoadInst *LI = dyn_cast<LoadInst>(Inst)) {
> > > + if (MemInst.isValid() && MemInst.isLoad()) {
> > > // Ignore volatile loads.
> > > - if (!LI->isSimple()) {
> > > + if (MemInst.isVolatile()) {
> > > LastStore = nullptr;
> > > continue;
> > > }
> > > @@ -486,27 +558,35 @@ bool EarlyCSE::processNode(DomTreeNode * // 
> > > If we have an available version of this load, and if it is the 
> > > right // generation, replace this instruction.
> > > std::pair<Value *, unsigned> InVal =
> > > - AvailableLoads->lookup(Inst->getOperand(0));
> > > + AvailableLoads->lookup(MemInst.getPtr());
> > > if (InVal.first != nullptr && InVal.second == CurrentGeneration) {
> > > - DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
> > > - << " to: " << *InVal.first << '\n');
> > > - if (!Inst->use_empty())
> > > - Inst->replaceAllUsesWith(InVal.first);
> > > - Inst->eraseFromParent();
> > > - Changed = true;
> > > - ++NumCSELoad;
> > > - continue;
> > > + Value *Op = getOrCreateResult(InVal.first, Inst->getType()); if 
> > > + (Op != nullptr) {
> > > + DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst << " to: " << 
> > > + *InVal.first << '\n'); if (!Inst->use_empty())
> > > + Inst->replaceAllUsesWith(Op);
> > > + Inst->eraseFromParent();
> > > + Changed = true;
> > > + ++NumCSELoad;
> > > + continue;
> > > + }
> > > }
> > > 
> > > // Otherwise, remember that we have this instruction.
> > > - AvailableLoads->insert(Inst->getOperand(0), std::pair<Value *,
> > > unsigned>(
> > > - Inst, CurrentGeneration));
> > > + AvailableLoads->insert(MemInst.getPtr(), std::pair<Value *,
> > > unsigned>(
> > > + Inst, CurrentGeneration));
> > > LastStore = nullptr;
> > > continue;
> > > }
> > > 
> > > // If this instruction may read from memory, forget LastStore.
> > > - if (Inst->mayReadFromMemory())
> > > + // Load/store intrinsics will indicate both a read and a write
> > > to
> > > + // memory. The target may override this (e.g. so that a store
> > > intrinsic
> > > + // does not read from memory, and thus will be treated the same
> > > as
> > > a
> > > + // regular store for commoning purposes).
> > > + if (Inst->mayReadFromMemory() &&
> > > + !(MemInst.isValid() && !MemInst.mayReadFromMemory()))
> > > LastStore = nullptr;
> > > 
> > > // If this is a read-only call, process it.
> > > @@ -537,17 +617,19 @@ bool EarlyCSE::processNode(DomTreeNode * if
> > > (Inst->mayWriteToMemory()) {
> > > ++CurrentGeneration;
> > > 
> > > - if (StoreInst *SI = dyn_cast<StoreInst>(Inst)) {
> > > + if (MemInst.isValid() && MemInst.isStore()) {
> > > // We do a trivial form of DSE if there are two stores to the same 
> > > // location with no intervening loads. Delete the earlier store.
> > > - if (LastStore &&
> > > - LastStore->getPointerOperand() == SI->getPointerOperand()) {
> > > - DEBUG(dbgs() << "EarlyCSE DEAD STORE: " << *LastStore
> > > - << " due to: " << *Inst << '\n');
> > > - LastStore->eraseFromParent();
> > > - Changed = true;
> > > - ++NumDSE;
> > > - LastStore = nullptr;
> > > + if (LastStore) {
> > > + ParseMemoryInst LastStoreMemInst(LastStore, TTI); if
> > > + (LastStoreMemInst.isMatchingMemLoc(MemInst)) {
> > > + DEBUG(dbgs() << "EarlyCSE DEAD STORE: " << *LastStore << " due
> > > to:
> > > + " << *Inst << '\n');
> > > + LastStore->eraseFromParent();
> > > + Changed = true;
> > > + ++NumDSE;
> > > + LastStore = nullptr;
> > > + }
> > > // fallthrough - we can exploit information about this store }
> > > 
> > > @@ -556,13 +638,12 @@ bool EarlyCSE::processNode(DomTreeNode * // 
> > > version of the pointer. It is safe to forward from volatile stores 
> > > // to non-volatile loads, so we don't have to check for volatility 
> > > of // the store.
> > > - AvailableLoads->insert(SI->getPointerOperand(),
> > > - std::pair<Value *, unsigned>(
> > > - SI->getValueOperand(), CurrentGeneration));
> > > + AvailableLoads->insert(MemInst.getPtr(), std::pair<Value *,
> > > unsigned>(
> > > + Inst, CurrentGeneration));
> > > 
> > > // Remember that this was the last store we saw for DSE.
> > > - if (SI->isSimple())
> > > - LastStore = SI;
> > > + if (!MemInst.isVolatile())
> > > + LastStore = Inst;
> > > }
> > > }
> > > }
> > > @@ -584,6 +665,7 @@ bool EarlyCSE::runOnFunction(Function &F 
> > > DataLayoutPass *DLP = getAnalysisIfAvailable<DataLayoutPass>();
> > > DL = DLP ? &DLP->getDataLayout() : nullptr; TLI = 
> > > &getAnalysis<TargetLibraryInfoWrapperPass>().getTLI();
> > > + TTI = &getAnalysis<TargetTransformInfo>();
> > > DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
> > > AC =
> > > &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
> > > 
> > > 
> > > Added: llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll
> > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll?rev=227149&view=auto
> > > ==============================================================================
> > > --- llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll
> > > (added)
> > > +++ llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll Mon
> > > Jan
> > > 26 16:51:15 2015
> > > @@ -0,0 +1,231 @@
> > > +; RUN: opt < %s -S -mtriple=aarch64-none-linux-gnu -mattr=+neon
> > > -early-cse | FileCheck %s
> > > +
> > > +define <4 x i32> @test_cse(i32* %a, [2 x <4 x i32>] %s.coerce,
> > > i32
> > > %n) {
> > > +entry:
> > > +; Check that @llvm.aarch64.neon.ld2 is optimized away by Early
> > > CSE.
> > > +; CHECK-LABEL: @test_cse
> > > +; CHECK-NOT: call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8
> > > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 0
> > > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 1
> > > + br label %for.cond
> > > +
> > > +for.cond: ; preds = %for.body, %entry
> > > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ] 
> > > +%cmp = icmp slt i32 %i.0, %n  br i1 %cmp, label %for.body, label 
> > > +%for.end
> > > +
> > > +for.body: ; preds = %for.cond
> > > + %0 = bitcast i32* %a to i8*
> > > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > + %4 = bitcast <16 x i8> %2 to <4 x i32>  call void
> > > + at llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x
> > > i32>
> > > %4, i8* %0)
> > > + %5 = bitcast i32* %a to i8*
> > > + %vld2 = call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > > + %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld2,
> > > 0
> > > + %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld2,
> > > 1
> > > + %call = call <4 x i32> @vaddq_s32(<4 x i32>
> > > %vld2.fca.0.extract,
> > > <4
> > > x i32> %vld2.fca.0.extract)
> > > + %inc = add nsw i32 %i.0, 1
> > > + br label %for.cond
> > > +
> > > +for.end: ; preds = %for.cond
> > > + ret <4 x i32> %res.0
> > > +}
> > > +
> > > +define <4 x i32> @test_cse2(i32* %a, [2 x <4 x i32>] %s.coerce,
> > > i32
> > > %n) {
> > > +entry:
> > > +; Check that the first @llvm.aarch64.neon.st2 is optimized away
> > > by
> > > Early CSE.
> > > +; CHECK-LABEL: @test_cse2
> > > +; CHECK-NOT: call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x
> > > i32>
> > > %3, <4 x i32> %3, i8* %0)
> > > +; CHECK: call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32>
> > > %3,
> > > <4 x i32> %4, i8* %0)
> > > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 0
> > > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 1
> > > + br label %for.cond
> > > +
> > > +for.cond: ; preds = %for.body, %entry
> > > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ] 
> > > +%cmp = icmp slt i32 %i.0, %n  br i1 %cmp, label %for.body, label 
> > > +%for.end
> > > +
> > > +for.body: ; preds = %for.cond
> > > + %0 = bitcast i32* %a to i8*
> > > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > + %4 = bitcast <16 x i8> %2 to <4 x i32>  call void
> > > + at llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x
> > > i32>
> > > %3, i8* %0)
> > > + call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x
> > > i32>
> > > %4, i8* %0)
> > > + %5 = bitcast i32* %a to i8*
> > > + %vld2 = call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > > + %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld2,
> > > 0
> > > + %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld2,
> > > 1
> > > + %call = call <4 x i32> @vaddq_s32(<4 x i32>
> > > %vld2.fca.0.extract,
> > > <4
> > > x i32> %vld2.fca.0.extract)
> > > + %inc = add nsw i32 %i.0, 1
> > > + br label %for.cond
> > > +
> > > +for.end: ; preds = %for.cond
> > > + ret <4 x i32> %res.0
> > > +}
> > > +
> > > +define <4 x i32> @test_cse3(i32* %a, [2 x <4 x i32>] %s.coerce,
> > > i32
> > > %n) #0 {
> > > +entry:
> > > +; Check that the first @llvm.aarch64.neon.ld2 is optimized away
> > > by
> > > Early CSE.
> > > +; CHECK-LABEL: @test_cse3
> > > +; CHECK: call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8
> > > +; CHECK-NOT: call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8
> > > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 0
> > > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 1
> > > + br label %for.cond
> > > +
> > > +for.cond: ; preds = %for.body, %entry
> > > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ] 
> > > +%cmp = icmp slt i32 %i.0, %n  br i1 %cmp, label %for.body, label 
> > > +%for.end
> > > +
> > > +for.body: ; preds = %for.cond
> > > + %0 = bitcast i32* %a to i8*
> > > + %vld2 = call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %0)
> > > + %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld2,
> > > 0
> > > + %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld2,
> > > 1
> > > + %1 = bitcast i32* %a to i8*
> > > + %vld22 = call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %1)
> > > + %vld22.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld22, 0
> > > + %vld22.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld22, 1
> > > + %call = call <4 x i32> @vaddq_s32(<4 x i32>
> > > %vld2.fca.0.extract,
> > > <4
> > > x i32> %vld22.fca.0.extract)
> > > + %inc = add nsw i32 %i.0, 1
> > > + br label %for.cond
> > > +
> > > +for.end: ; preds = %for.cond
> > > + ret <4 x i32> %res.0
> > > +}
> > > +
> > > +
> > > +define <4 x i32> @test_nocse(i32* %a, i32* %b, [2 x <4 x i32>]
> > > %s.coerce, i32 %n) {
> > > +entry:
> > > +; Check that the store prevents @llvm.aarch64.neon.ld2 from
> > > being
> > > optimized
> > > +; away by Early CSE.
> > > +; CHECK-LABEL: @test_nocse
> > > +; CHECK: call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8
> > > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 0
> > > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 1
> > > + br label %for.cond
> > > +
> > > +for.cond: ; preds = %for.body, %entry
> > > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ] 
> > > +%cmp = icmp slt i32 %i.0, %n  br i1 %cmp, label %for.body, label 
> > > +%for.end
> > > +
> > > +for.body: ; preds = %for.cond
> > > + %0 = bitcast i32* %a to i8*
> > > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > + %4 = bitcast <16 x i8> %2 to <4 x i32>  call void
> > > + at llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x
> > > i32>
> > > %4, i8* %0)
> > > + store i32 0, i32* %b, align 4
> > > + %5 = bitcast i32* %a to i8*
> > > + %vld2 = call { <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > > + %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld2,
> > > 0
> > > + %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> }
> > > %vld2,
> > > 1
> > > + %call = call <4 x i32> @vaddq_s32(<4 x i32>
> > > %vld2.fca.0.extract,
> > > <4
> > > x i32> %vld2.fca.0.extract)
> > > + %inc = add nsw i32 %i.0, 1
> > > + br label %for.cond
> > > +
> > > +for.end: ; preds = %for.cond
> > > + ret <4 x i32> %res.0
> > > +}
> > > +
> > > +define <4 x i32> @test_nocse2(i32* %a, [2 x <4 x i32>]
> > > %s.coerce,
> > > i32 %n) {
> > > +entry:
> > > +; Check that @llvm.aarch64.neon.ld3 is not optimized away by
> > > Early
> > > CSE due
> > > +; to mismatch between st2 and ld3.
> > > +; CHECK-LABEL: @test_nocse2
> > > +; CHECK: call { <4 x i32>, <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld3.v4i32.p0i8
> > > + %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 0
> > > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 1
> > > + br label %for.cond
> > > +
> > > +for.cond: ; preds = %for.body, %entry
> > > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ] 
> > > +%cmp = icmp slt i32 %i.0, %n  br i1 %cmp, label %for.body, label 
> > > +%for.end
> > > +
> > > +for.body: ; preds = %for.cond
> > > + %0 = bitcast i32* %a to i8*
> > > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > + %4 = bitcast <16 x i8> %2 to <4 x i32>  call void
> > > + at llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x
> > > i32>
> > > %4, i8* %0)
> > > + %5 = bitcast i32* %a to i8*
> > > + %vld3 = call { <4 x i32>, <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld3.v4i32.p0i8(i8* %5)
> > > + %vld3.fca.0.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x
> > > i32> } %vld3, 0
> > > + %vld3.fca.2.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x
> > > i32> } %vld3, 2
> > > + %call = call <4 x i32> @vaddq_s32(<4 x i32>
> > > %vld3.fca.0.extract,
> > > <4
> > > x i32> %vld3.fca.2.extract)
> > > + %inc = add nsw i32 %i.0, 1
> > > + br label %for.cond
> > > +
> > > +for.end: ; preds = %for.cond
> > > + ret <4 x i32> %res.0
> > > +}
> > > +
> > > +define <4 x i32> @test_nocse3(i32* %a, [2 x <4 x i32>]
> > > %s.coerce,
> > > i32 %n) {
> > > +entry:
> > > +; Check that @llvm.aarch64.neon.st3 is not optimized away by
> > > Early
> > > CSE due to
> > > +; mismatch between st2 and st3.
> > > +; CHECK-LABEL: @test_nocse3
> > > +; CHECK: call void @llvm.aarch64.neon.st3.v4i32.p0i8 ; CHECK:
> > > call
> > > +void @llvm.aarch64.neon.st2.v4i32.p0i8  %s.coerce.fca.0.extract
> > > =
> > > +extractvalue [2 x <4 x i32>] %s.coerce,
> > > 0
> > > + %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>]
> > > %s.coerce,
> > > 1
> > > + br label %for.cond
> > > +
> > > +for.cond: ; preds = %for.body, %entry
> > > + %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > + %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ] 
> > > +%cmp = icmp slt i32 %i.0, %n  br i1 %cmp, label %for.body, label 
> > > +%for.end
> > > +
> > > +for.body: ; preds = %for.cond
> > > + %0 = bitcast i32* %a to i8*
> > > + %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > + %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > + %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > + %4 = bitcast <16 x i8> %2 to <4 x i32>  call void
> > > + at llvm.aarch64.neon.st3.v4i32.p0i8(<4 x i32> %4, <4 x
> > > i32>
> > > %3, <4 x i32> %3, i8* %0)
> > > + call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x
> > > i32>
> > > %3, i8* %0)
> > > + %5 = bitcast i32* %a to i8*
> > > + %vld3 = call { <4 x i32>, <4 x i32>, <4 x i32> }
> > > @llvm.aarch64.neon.ld3.v4i32.p0i8(i8* %5)
> > > + %vld3.fca.0.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x
> > > i32> } %vld3, 0
> > > + %vld3.fca.1.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x
> > > i32> } %vld3, 1
> > > + %call = call <4 x i32> @vaddq_s32(<4 x i32>
> > > %vld3.fca.0.extract,
> > > <4
> > > x i32> %vld3.fca.0.extract)
> > > + %inc = add nsw i32 %i.0, 1
> > > + br label %for.cond
> > > +
> > > +for.end: ; preds = %for.cond
> > > + ret <4 x i32> %res.0
> > > +}
> > > +
> > > +; Function Attrs: nounwind
> > > +declare void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32>, <4 x
> > > i32>,
> > > i8* nocapture)
> > > +
> > > +; Function Attrs: nounwind
> > > +declare void @llvm.aarch64.neon.st3.v4i32.p0i8(<4 x i32>, <4 x
> > > i32>,
> > > <4 x i32>, i8* nocapture)
> > > +
> > > +; Function Attrs: nounwind readonly declare { <4 x i32>, <4 x
> > > i32>
> > > +}
> > > @llvm.aarch64.neon.ld2.v4i32.p0i8(i8*)
> > > +
> > > +; Function Attrs: nounwind readonly declare { <4 x i32>, <4 x
> > > i32>,
> > > +<4 x i32> }
> > > @llvm.aarch64.neon.ld3.v4i32.p0i8(i8*)
> > > +
> > > +define internal fastcc <4 x i32> @vaddq_s32(<4 x i32> %__p0, <4
> > > x
> > > i32> %__p1) {
> > > +entry:
> > > + %add = add <4 x i32> %__p0, %__p1  ret <4 x i32> %add }
> > > 
> > > Added: llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg
> > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg?rev=227149&view=auto
> > > ==============================================================================
> > > --- llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg
> > > (added)
> > > +++ llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg Mon
> > > Jan
> > > 26 16:51:15 2015
> > > @@ -0,0 +1,5 @@
> > > +config.suffixes = ['.ll']
> > > +
> > > +targets = set(config.root.targets_to_build.split())
> > > +if not 'AArch64' in targets:
> > > + config.unsupported = True
> > > 
> > > 
> > > _______________________________________________
> > > llvm-commits mailing list
> > > llvm-commits at cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > 
> > > 
> > > 
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory




