[llvm] r227149 - Commoning of target specific load/store intrinsics in Early CSE.

Hal Finkel hfinkel at anl.gov
Fri Feb 27 13:20:41 PST 2015


----- Original Message -----
> From: "Sanjin Sijaric" <ssijaric at codeaurora.org>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "Pete Cooper" <peter_cooper at apple.com>,
> "Chandler Carruth" <chandlerc at google.com>, "Philip Reames" <listmail at philipreames.com>
> Sent: Friday, February 27, 2015 2:55:37 PM
> Subject: RE: [llvm] r227149 - Commoning of target specific load/store intrinsics in Early CSE.
> 
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Thursday, February 26, 2015 5:19 PM
> To: Sanjin Sijaric
> Cc: Commit Messages and Patches for LLVM; Pete Cooper; Chandler
> Carruth; Philip Reames
> Subject: Re: [llvm] r227149 - Commoning of target specific load/store
> intrinsics in Early CSE.
> 
> ----- Original Message -----
> > From: "Sanjin Sijaric" <ssijaric at codeaurora.org>
> > To: "Hal Finkel" <hfinkel at anl.gov>, "Chandler Carruth"
> > <chandlerc at google.com>, "Philip Reames"
> > <listmail at philipreames.com>
> > Cc: "Commit Messages and Patches for LLVM"
> > <llvm-commits at cs.uiuc.edu>,
> > "Pete Cooper" <peter_cooper at apple.com>
> > Sent: Thursday, February 26, 2015 7:02:13 PM
> > Subject: RE: [llvm] r227149 - Commoning of target specific
> > load/store intrinsics in Early CSE.
> > 
> > Hi Hal,
> > 
> > Thanks for catching this.
> > 
> > In the tests I've done on AArch64, mismatching loads and stores
> > were
> > handled correctly due to the following check (test_nocse2 in
> > test/Transforms/EarlyCSE/AArch64/intrinsics.ll):
> > 
> > Value *Op = getOrCreateResult(InVal.first, Inst->getType());
> > if (Op != nullptr) {
> >   DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
> >                << "  to: " << *InVal.first << '\n');
> >   ...
> >   ...
> > 
> > For example, we may have st2 and ld3 off the same pointer, ptr, and
> > something like
> > 
> > st2  <=== add ptr to Available Loads
> > ...
> > ...
> > ld3 <=== we have a matching pointer in AvailableLoads, so consider
> > commoning
> > 
> > The function getOrCreateResult(InVal.first, Inst->getType()) will
> > return nullptr, as the types will not match.  No commoning will
> > occur.
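> > 
> > (Abridged from that test: the st2 stores two <4 x i32> values, so
> > getOrCreateResult can only build a two-element struct from them, which
> > can never match ld3's three-element result type:)
> > 
> >   call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> >   ...
> >   %vld3 = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8(i8* %5)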
> > 
> > Using MatchingId would be more consistent as it is used for
> > commoning
> > stores.  Am I missing a case where this check is not sufficient?
> 
> If all of your intrinsics with different matching ids also have
> different types, and if those are also equivalent to regular
> loads/stores of that same type, then you'll never see the problem.
> But if you need the MatchingId to distinguish between different
> intrinsics with the same type, or between the intrinsics and regular
> loads/stores, then you'll have a problem. When I looked at
> implementing these interfaces for PowerPC, I ran into this problem.
> 
>  -Hal
> 
> 
> I see.  Thanks.  I'll move the check for MatchingId into
> LoadSite/StoreSite.  That way we don't have to change the type of
> AvailableLoads.
> 
> Alternatively, I can change the type of AvailableLoads from
> 
> typedef ScopedHashTable<Value*, std::pair<Value*, unsigned>,
>                           DenseMapInfo<Value*>, LoadMapAllocator>
>                           LoadHTType;
> 
> to
> 
> typedef ScopedHashTable<Value*, std::tuple<Value*, unsigned, short>,
>                           DenseMapInfo<Value*>, LoadMapAllocator>
>                           LoadHTType;
> 
> 
> We don't change the key type, but this increases the size of
> AvailableLoads for cases that don't need the MatchingId (which is
> the majority).  I'm leaning towards the first approach, as we can
> just do the comparison via LoadSite or StoreSite.  Something like
> 
> if (LoadSite LS = LoadSite(Inst, TTI)) {
>   ...
>   ...
>   if (LS.MatchId(InVal.first)) { ...
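> 
> To make the invariant concrete, here is a toy model (plain C++, not the
> LLVM code; all names are made up) of why the lookup must compare
> MatchingId no matter where the check ends up living:
> 
>   #include <cassert>
>   #include <unordered_map>
> 
>   // Stands in for the (Value*, generation) pair kept in AvailableLoads,
>   // extended with the id of the defining access.
>   struct AvailEntry {
>     const void *Inst;    // the available load/store instruction
>     unsigned Generation;
>     int MatchingId;      // -1 for plain loads/stores
>   };
> 
>   int main() {
>     std::unordered_map<const void *, AvailEntry> AvailableLoads;
>     int St2Id = 2, Ld3Id = 3;  // e.g. VECTOR_LDST_{TWO,THREE}_ELEMENTS
>     const void *Ptr = &St2Id;  // the shared pointer operand
>     AvailableLoads[Ptr] = AvailEntry{&St2Id, 1u, St2Id};  // record the st2
> 
>     // A later ld3 from the same pointer hits the same key and the same
>     // generation, so only the id comparison can reject the forwarding:
>     auto It = AvailableLoads.find(Ptr);
>     assert(It != AvailableLoads.end() && It->second.Generation == 1);
>     assert(It->second.MatchingId != Ld3Id && "must not forward st2 to ld3");
>     (void)It;
>   }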

Alright, let's try that. I'd prefer not to change AvailableLoads if not necessary, because that will likely have negative performance implications.

 -Hal

> 
> 
> Thanks,
> Sanjin
> 
> > 
> > I started work on LoadSite and StoreSite that Philip suggested, and
> > will upload it shortly.  I can either incorporate this change with
> > that patch, or as a separate patch.
> > 
> > Thanks again,
> > Sanjin
> > 
> > -----Original Message-----
> > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > Sent: Thursday, February 26, 2015 2:34 PM
> > To: Chandler Carruth; Sanjin Sijaric; Philip Reames
> > Cc: Commit Messages and Patches for LLVM; Pete Cooper
> > Subject: Re: [llvm] r227149 - Commoning of target specific
> > load/store
> > intrinsics in Early CSE.
> > 
> > Hi Sanjin, et al.,
> > 
> > Since I've now done a lot of typing in support of this commit, I
> > also
> > need to be the bearer of bad news: There is a bug...
> > 
> > The problem is that when we process a store intrinsic, we add its
> > pointer/value to the set of AvailableLoads, just as we do with a
> > regular store instruction:
> > 
> >         // Okay, we just invalidated anything we knew about loaded
> >         // values.  Try to salvage *something* by remembering that the
> >         // stored value is a live version of the pointer.  It is safe to
> >         // forward from volatile stores to non-volatile loads, so we
> >         // don't have to check for volatility of the store.
> >         AvailableLoads.insert(MemInst.getPtr(),
> >             std::pair<Value *, unsigned>(Inst, CurrentGeneration));
> > 
> > If we then see a later load intrinsic, we look up its pointer in
> > AvailableLoads to see if the value is already available:
> > 
> >       // If we have an available version of this load, and if it is the
> >       // right generation, replace this instruction.
> >       std::pair<Value *, unsigned> InVal =
> >           AvailableLoads.lookup(MemInst.getPtr());
> >       if (InVal.first != nullptr && InVal.second == CurrentGeneration) {
> >         Value *Op = getOrCreateResult(InVal.first, Inst->getType());
> >         if (Op != nullptr) {
> >           DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
> >                        << "  to: " << *InVal.first << '\n');
> >           if (!Inst->use_empty())
> >             Inst->replaceAllUsesWith(Op);
> >           Inst->eraseFromParent();
> >       ...
> > 
> > and if not, we then add that pointer/value to the map of available
> > loads:
> > 
> >       // Otherwise, remember that we have this instruction.
> >       AvailableLoads.insert(MemInst.getPtr(),
> >           std::pair<Value *, unsigned>(Inst, CurrentGeneration));
> > 
> > but this process completely ignores the 'MatchingId' of the
> > intrinsic
> > that had been processed. As a result, we might CSE loads of the
> > same
> > pointer using incompatible intrinsics, and we might do
> > store-to-load
> > forwarding from store and load intrinsics that are not logical
> > inverses of each other.
> > 
> > One possible solution here is to change AvailableLoads to use
> > std::pair<Value *, unsigned /*MatchingId*/> as the key type,
> > instead
> > of just Value*. But since the intrinsic case is relatively rare, we
> > might want to keep a side map of (Value* -> MatchingId), and simply
> > record in the AvailableLoads map whether or not there is some
> > relevant
> > MatchingId to fetch (I'm somewhat worried that just changing the
> > key
> > type to something more complicated will slow down the code for the
> > benefit of a rare case).
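> > 
> > For illustration, a rough sketch of that side-map idea (plain C++, not
> > a patch; all names here are made up):
> > 
> >   #include <map>
> > 
> >   struct AvailEntry {
> >     void *Inst;          // the available load/store instruction
> >     unsigned Generation;
> >     bool HasMatchingId;  // set only for target load/store intrinsics
> >   };
> > 
> >   struct AvailableLoadsModel {
> >     std::map<void *, AvailEntry> Entries;          // hot path, unchanged
> >     std::map<void *, unsigned short> MatchingIds;  // rare, intrinsics only
> > 
> >     // QueryId is -1 for a plain load/store, the target's id otherwise.
> >     bool canForward(void *Ptr, unsigned CurGen, int QueryId) const {
> >       auto It = Entries.find(Ptr);
> >       if (It == Entries.end() || It->second.Generation != CurGen)
> >         return false;
> >       if (!It->second.HasMatchingId)
> >         return QueryId == -1;  // plain access matches plain access only
> >       auto Id = MatchingIds.find(It->second.Inst);
> >       return Id != MatchingIds.end() && Id->second == QueryId;
> >     }
> >   };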
> > 
> > In any case, this needs to be fixed.
> > 
> >  -Hal
> > 
> > ----- Original Message -----
> > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > To: "Chandler Carruth" <chandlerc at google.com>
> > > Cc: "Commit Messages and Patches for LLVM"
> > > <llvm-commits at cs.uiuc.edu>
> > > Sent: Tuesday, January 27, 2015 6:07:57 AM
> > > Subject: Re: [llvm] r227149 - Commoning of target specific
> > > load/store intrinsics in Early CSE.
> > > 
> > > ----- Original Message -----
> > > > From: "Chandler Carruth" <chandlerc at google.com>
> > > > To: "Chad Rosier" <mcrosier at codeaurora.org>
> > > > Cc: "Commit Messages and Patches for LLVM"
> > > > <llvm-commits at cs.uiuc.edu>
> > > > Sent: Tuesday, January 27, 2015 2:24:19 AM
> > > > Subject: Re: [llvm] r227149 - Commoning of target specific
> > > > load/store intrinsics in Early CSE.
> > > > 
> > > > 
> > > > 
> > > > I know it's a pain to have post-commit design review, but it's
> > > > impossible to catch everything pre-commit.
> > > > 
> > > > 
> > > > On Mon, Jan 26, 2015 at 2:51 PM, Chad Rosier <
> > > > mcrosier at codeaurora.org > wrote:
> > > > 
> > > > 
> > > > Author: mcrosier
> > > > Date: Mon Jan 26 16:51:15 2015
> > > > New Revision: 227149
> > > > 
> > > > URL: http://llvm.org/viewvc/llvm-project?rev=227149&view=rev
> > > > Log:
> > > > Commoning of target specific load/store intrinsics in Early
> > > > CSE.
> > > > 
> > > > Phabricator revision: http://reviews.llvm.org/D7121
> > > > Patch by Sanjin Sijaric <ssijaric at codeaurora.org>!
> > > > 
> > > > 
> > > > 
> > > > The thing I completely missed when skimming the subject was the
> > > > fact that this is not just target-specific, but leveraging
> > > > TTI...
> > > > 
> > > > 
> > > > So, first question: why is this desirable? Why do we need this?
> > > > There
> > > > seems to be no real justification for this in the commit log,
> > > > comments, or anywhere.
> > > > 
> > > > 
> > > > EarlyCSE is a very high-level pass. I don't know why it is
> > > > reasonable to teach it how to deal with every kind of
> > > > target-specific load intrinsic we come up with. This is really
> > > > contrary to the entire IR's design.
> > > > 
> > > > 
> > > > The point of TTI was to expose *cost models* from the code
> > > > generator to the IR pass. This is something completely
> > > > different
> > > > and subverts that design in a really fundamental way. I'm
> > > > pretty
> > > > opposed to this entire direction unless there is some deep and
> > > > fundamental reason why we *have* to do this and so far I'm not
> > > > seeing it. I could speculate about any number of other ways you
> > > > might solve similar or related problems, but it seems pointless
> > > > until there is a clear and precise description of the problem
> > > > this
> > > > was intended to solve.
> > > > 
> > > > So on multiple levels I feel like this is not the right design.
> > > > 
> > > 
> > > I disagree, and furthermore, I'd like to encourage the general
> > > design paradigm explored by this change:
> > > 
> > >  1. I don't believe that TTI is or needs to be *just* about cost
> > > modeling. It should be, generically, our mechanism to enforce a
> > > proper separation of concerns between the IR-level optimizers and
> > > the providers of target-specific knowledge. Cost modeling is one
> > > area where this is important, but handling of target-specific
> > > intrinsics is another.
> > > 
> > >  2. Currently, several IR-level passes (InstCombine, BasicAA, etc.)
> > > have built-in knowledge of some target-specific intrinsics. IMHO,
> > > this is a pretty awful design, but we also did not have any
> > > alternative when the code was first written. Now we do, and we can
> > > refactor this code so that knowledge regarding the properties of
> > > target-specific intrinsics lives in the targets, as it should. In
> > > addition to yielding a better separation of concerns, and being
> > > friendlier to out-of-tree targets, it also reduces the burden of
> > > extending this knowledge (with the knowledge embedded in the IR
> > > transforms, the benefit needed to feel really large to make the
> > > clutter worthwhile).
> > > 
> > >  3. Forcing all optimizations on target-specific intrinsics, or
> > > their associated instructions, into the backend will generically
> > > produce sub-optimal code. The SDAG/MI-level optimizers are not as
> > > powerful as the IR-level ones (MachineLICM does not track alias
> > > sets, for example), and you can't recover completely at that level.
> > > Maybe one day this will be different, but I'm not sure providing
> > > parity here is a worthwhile aim.
> > > 
> > > With this change, there is something of a conceptual completeness
> > > problem because the same capabilities have not been added to GVN.
> > > But touching GVN has a much higher cost, and I'm fine with
> > > letting
> > > this new interface settle a bit and leaving GVN changes for some
> > > later time (if ever).
> > > 
> > > Thanks again,
> > > Hal
> > > 
> > > > 
> > > > -Chandler
> > > > 
> > > > 
> > > > 
> > > > 
> > > > Added:
> > > > llvm/trunk/test/Transforms/EarlyCSE/AArch64/
> > > > llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll
> > > > llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg
> > > > Modified:
> > > > llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h
> > > > llvm/trunk/lib/Analysis/TargetTransformInfo.cpp
> > > > llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
> > > > llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp
> > > > 
> > > > Modified:
> > > > llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h
> > > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h?rev=227149&r1=227148&r2=227149&view=diff
> > > > ==============================================================================
> > > > --- llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h (original)
> > > > +++ llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h Mon Jan 26 16:51:15 2015
> > > > @@ -23,6 +23,7 @@
> > > > #define LLVM_ANALYSIS_TARGETTRANSFORMINFO_H
> > > > 
> > > > #include "llvm/IR/Intrinsics.h"
> > > > +#include "llvm/IR/IntrinsicInst.h"
> > > > #include "llvm/Pass.h"
> > > > #include "llvm/Support/DataTypes.h"
> > > > 
> > > > @@ -35,6 +36,20 @@ class Type;
> > > > class User;
> > > > class Value;
> > > > 
> > > > +/// \brief Information about a load/store intrinsic defined by the target.
> > > > +struct MemIntrinsicInfo {
> > > > +  MemIntrinsicInfo()
> > > > +      : ReadMem(false), WriteMem(false), Vol(false), MatchingId(0),
> > > > +        NumMemRefs(0), PtrVal(nullptr) {}
> > > > +  bool ReadMem;
> > > > +  bool WriteMem;
> > > > +  bool Vol;
> > > > +  // Same Id is set by the target for corresponding load/store intrinsics.
> > > > +  unsigned short MatchingId;
> > > > +  int NumMemRefs;
> > > > +  Value *PtrVal;
> > > > +};
> > > > +
> > > > /// TargetTransformInfo - This pass provides access to the
> > > > codegen
> > > > /// interfaces that are needed for IR-level transformations.
> > > > class TargetTransformInfo {
> > > > @@ -443,6 +458,20 @@ public:
> > > > /// any callee-saved registers, so would require a spill and
> > > > fill.
> > > > virtual unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type*>
> > > > Tys)
> > > > const;
> > > > 
> > > > + /// \returns True if the intrinsic is a supported memory intrinsic.
> > > > + /// Info will contain additional information - whether the intrinsic
> > > > + /// may write or read to memory, volatility and the pointer.  Info is
> > > > + /// undefined if false is returned.
> > > > + virtual bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > > +                                 MemIntrinsicInfo &Info) const;
> > > > +
> > > > + /// \returns A value which is the result of the given memory
> > > > + /// intrinsic.  New instructions may be created to extract the result
> > > > + /// from the given intrinsic memory operation.  Returns nullptr if the
> > > > + /// target cannot create a result from the given intrinsic.
> > > > + virtual Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > > > +                                                  Type *ExpectedType) const;
> > > > +
> > > > /// @}
> > > > 
> > > > /// Analysis group identification.
> > > > 
> > > > Modified: llvm/trunk/lib/Analysis/TargetTransformInfo.cpp
> > > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Analysis/TargetTransformInfo.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > > > ==============================================================================
> > > > --- llvm/trunk/lib/Analysis/TargetTransformInfo.cpp (original)
> > > > +++ llvm/trunk/lib/Analysis/TargetTransformInfo.cpp Mon Jan 26
> > > > 16:51:15 2015
> > > > @@ -254,6 +254,16 @@ unsigned TargetTransformInfo::getCostOfK
> > > >  return PrevTTI->getCostOfKeepingLiveOverCall(Tys);
> > > > }
> > > > 
> > > > +Value *TargetTransformInfo::getOrCreateResultFromMemIntrinsic(
> > > > +    IntrinsicInst *Inst, Type *ExpectedType) const {
> > > > +  return PrevTTI->getOrCreateResultFromMemIntrinsic(Inst, ExpectedType);
> > > > +}
> > > > +
> > > > +bool TargetTransformInfo::getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > > +                                             MemIntrinsicInfo &Info) const {
> > > > +  return PrevTTI->getTgtMemIntrinsic(Inst, Info);
> > > > +}
> > > > +
> > > > namespace {
> > > > 
> > > > struct NoTTI final : ImmutablePass, TargetTransformInfo {
> > > > 
> > > > @@ -656,6 +666,15 @@ struct NoTTI final : ImmutablePass, Targ
> > > >  return 0;
> > > > }
> > > > 
> > > > + bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > > +                         MemIntrinsicInfo &Info) const override {
> > > > +   return false;
> > > > + }
> > > > +
> > > > + Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > > > +                                          Type *ExpectedType) const override {
> > > > +   return nullptr;
> > > > + }
> > > > };
> > > > 
> > > > } // end anonymous namespace
> > > > 
> > > > Modified:
> > > > llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
> > > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > > > ==============================================================================
> > > > --- llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (original)
> > > > +++ llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp Mon Jan 26 16:51:15 2015
> > > > @@ -44,6 +44,12 @@ class AArch64TTI final : public Immutabl
> > > >  /// are set if the result needs to be inserted and/or extracted from
> > > >  /// vectors.
> > > >  unsigned getScalarizationOverhead(Type *Ty, bool Insert,
> > > >                                    bool Extract) const;
> > > > 
> > > > + enum MemIntrinsicType {
> > > > +   VECTOR_LDST_TWO_ELEMENTS,
> > > > +   VECTOR_LDST_THREE_ELEMENTS,
> > > > +   VECTOR_LDST_FOUR_ELEMENTS
> > > > + };
> > > > +
> > > > public:
> > > >  AArch64TTI() : ImmutablePass(ID), TM(nullptr), ST(nullptr), TLI(nullptr) {
> > > >    llvm_unreachable("This pass cannot be directly constructed");
> > > > 
> > > > @@ -131,6 +137,11 @@ public:
> > > >  void getUnrollingPreferences(const Function *F, Loop *L,
> > > >                               UnrollingPreferences &UP) const override;
> > > > 
> > > > + Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > > > +                                          Type *ExpectedType) const override;
> > > > +
> > > > + bool getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > > +                         MemIntrinsicInfo &Info) const override;
> > > > 
> > > >  /// @}
> > > > };
> > > > @@ -554,3 +565,83 @@ void AArch64TTI::getUnrollingPreferences
> > > >  // Disable partial & runtime unrolling on -Os.
> > > >  UP.PartialOptSizeThreshold = 0;
> > > > }
> > > > +
> > > > +Value *AArch64TTI::getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
> > > > +                                                     Type *ExpectedType) const {
> > > > +  switch (Inst->getIntrinsicID()) {
> > > > +  default:
> > > > +    return nullptr;
> > > > +  case Intrinsic::aarch64_neon_st2:
> > > > +  case Intrinsic::aarch64_neon_st3:
> > > > +  case Intrinsic::aarch64_neon_st4: {
> > > > +    // Create a struct type
> > > > +    StructType *ST = dyn_cast<StructType>(ExpectedType);
> > > > +    if (!ST)
> > > > +      return nullptr;
> > > > +    unsigned NumElts = Inst->getNumArgOperands() - 1;
> > > > +    if (ST->getNumElements() != NumElts)
> > > > +      return nullptr;
> > > > +    for (unsigned i = 0, e = NumElts; i != e; ++i) {
> > > > +      if (Inst->getArgOperand(i)->getType() != ST->getElementType(i))
> > > > +        return nullptr;
> > > > +    }
> > > > +    Value *Res = UndefValue::get(ExpectedType);
> > > > +    IRBuilder<> Builder(Inst);
> > > > +    for (unsigned i = 0, e = NumElts; i != e; ++i) {
> > > > +      Value *L = Inst->getArgOperand(i);
> > > > +      Res = Builder.CreateInsertValue(Res, L, i);
> > > > +    }
> > > > +    return Res;
> > > > +  }
> > > > +  case Intrinsic::aarch64_neon_ld2:
> > > > +  case Intrinsic::aarch64_neon_ld3:
> > > > +  case Intrinsic::aarch64_neon_ld4:
> > > > +    if (Inst->getType() == ExpectedType)
> > > > +      return Inst;
> > > > +    return nullptr;
> > > > +  }
> > > > +}
> > > > +
> > > > +bool AArch64TTI::getTgtMemIntrinsic(IntrinsicInst *Inst,
> > > > +                                    MemIntrinsicInfo &Info) const {
> > > > +  switch (Inst->getIntrinsicID()) {
> > > > +  default:
> > > > +    break;
> > > > +  case Intrinsic::aarch64_neon_ld2:
> > > > +  case Intrinsic::aarch64_neon_ld3:
> > > > +  case Intrinsic::aarch64_neon_ld4:
> > > > +    Info.ReadMem = true;
> > > > +    Info.WriteMem = false;
> > > > +    Info.Vol = false;
> > > > +    Info.NumMemRefs = 1;
> > > > +    Info.PtrVal = Inst->getArgOperand(0);
> > > > +    break;
> > > > +  case Intrinsic::aarch64_neon_st2:
> > > > +  case Intrinsic::aarch64_neon_st3:
> > > > +  case Intrinsic::aarch64_neon_st4:
> > > > +    Info.ReadMem = false;
> > > > +    Info.WriteMem = true;
> > > > +    Info.Vol = false;
> > > > +    Info.NumMemRefs = 1;
> > > > +    Info.PtrVal = Inst->getArgOperand(Inst->getNumArgOperands() - 1);
> > > > +    break;
> > > > +  }
> > > > +
> > > > +  switch (Inst->getIntrinsicID()) {
> > > > +  default:
> > > > +    return false;
> > > > +  case Intrinsic::aarch64_neon_ld2:
> > > > +  case Intrinsic::aarch64_neon_st2:
> > > > +    Info.MatchingId = VECTOR_LDST_TWO_ELEMENTS;
> > > > +    break;
> > > > +  case Intrinsic::aarch64_neon_ld3:
> > > > +  case Intrinsic::aarch64_neon_st3:
> > > > +    Info.MatchingId = VECTOR_LDST_THREE_ELEMENTS;
> > > > +    break;
> > > > +  case Intrinsic::aarch64_neon_ld4:
> > > > +  case Intrinsic::aarch64_neon_st4:
> > > > +    Info.MatchingId = VECTOR_LDST_FOUR_ELEMENTS;
> > > > +    break;
> > > > +  }
> > > > +  return true;
> > > > +}
> > > > 
> > > > Modified: llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp
> > > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp?rev=227149&r1=227148&r2=227149&view=diff
> > > > ==============================================================================
> > > > --- llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp (original)
> > > > +++ llvm/trunk/lib/Transforms/Scalar/EarlyCSE.cpp Mon Jan 26 16:51:15 2015
> > > > @@ -18,6 +18,7 @@
> > > > #include "llvm/ADT/Statistic.h"
> > > > #include "llvm/Analysis/AssumptionCache.h"
> > > > #include "llvm/Analysis/InstructionSimplify.h"
> > > > +#include "llvm/Analysis/TargetTransformInfo.h"
> > > > #include "llvm/IR/DataLayout.h"
> > > > #include "llvm/IR/Dominators.h"
> > > > #include "llvm/IR/Instructions.h"
> > > > @@ -273,6 +274,7 @@ class EarlyCSE : public FunctionPass {
> > > > public:
> > > > const DataLayout *DL;
> > > > const TargetLibraryInfo *TLI;
> > > > + const TargetTransformInfo *TTI;
> > > > DominatorTree *DT;
> > > > AssumptionCache *AC;
> > > > typedef RecyclingAllocator<
> > > > @@ -383,14 +385,83 @@ private:
> > > > bool Processed;
> > > > };
> > > > 
> > > > + /// \brief Wrapper class to handle memory instructions, including
> > > > + /// loads, stores and intrinsic loads and stores defined by the target.
> > > > + class ParseMemoryInst {
> > > > + public:
> > > > +   ParseMemoryInst(Instruction *Inst, const TargetTransformInfo *TTI)
> > > > +       : Load(false), Store(false), Vol(false), MayReadFromMemory(false),
> > > > +         MayWriteToMemory(false), MatchingId(-1), Ptr(nullptr) {
> > > > +     MayReadFromMemory = Inst->mayReadFromMemory();
> > > > +     MayWriteToMemory = Inst->mayWriteToMemory();
> > > > +     if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst)) {
> > > > +       MemIntrinsicInfo Info;
> > > > +       if (!TTI->getTgtMemIntrinsic(II, Info))
> > > > +         return;
> > > > +       if (Info.NumMemRefs == 1) {
> > > > +         Store = Info.WriteMem;
> > > > +         Load = Info.ReadMem;
> > > > +         MatchingId = Info.MatchingId;
> > > > +         MayReadFromMemory = Info.ReadMem;
> > > > +         MayWriteToMemory = Info.WriteMem;
> > > > +         Vol = Info.Vol;
> > > > +         Ptr = Info.PtrVal;
> > > > +       }
> > > > +     } else if (LoadInst *LI = dyn_cast<LoadInst>(Inst)) {
> > > > +       Load = true;
> > > > +       Vol = !LI->isSimple();
> > > > +       Ptr = LI->getPointerOperand();
> > > > +     } else if (StoreInst *SI = dyn_cast<StoreInst>(Inst)) {
> > > > +       Store = true;
> > > > +       Vol = !SI->isSimple();
> > > > +       Ptr = SI->getPointerOperand();
> > > > +     }
> > > > +   }
> > > > +   bool isLoad() { return Load; }
> > > > +   bool isStore() { return Store; }
> > > > +   bool isVolatile() { return Vol; }
> > > > +   bool isMatchingMemLoc(const ParseMemoryInst &Inst) {
> > > > +     return Ptr == Inst.Ptr && MatchingId == Inst.MatchingId;
> > > > +   }
> > > > +   bool isValid() { return Ptr != nullptr; }
> > > > +   int getMatchingId() { return MatchingId; }
> > > > +   Value *getPtr() { return Ptr; }
> > > > +   bool mayReadFromMemory() { return MayReadFromMemory; }
> > > > +   bool mayWriteToMemory() { return MayWriteToMemory; }
> > > > +
> > > > + private:
> > > > +   bool Load;
> > > > +   bool Store;
> > > > +   bool Vol;
> > > > +   bool MayReadFromMemory;
> > > > +   bool MayWriteToMemory;
> > > > +   // For regular (non-intrinsic) loads/stores, this is set to -1.  For
> > > > +   // intrinsic loads/stores, the id is retrieved from the corresponding
> > > > +   // field in the MemIntrinsicInfo structure.  That field contains
> > > > +   // non-negative values only.
> > > > +   int MatchingId;
> > > > +   Value *Ptr;
> > > > + };
> > > > +
> > > > bool processNode(DomTreeNode *Node);
> > > > 
> > > > void getAnalysisUsage(AnalysisUsage &AU) const override {
> > > > AU.addRequired<AssumptionCacheTracker>();
> > > > AU.addRequired<DominatorTreeWrapperPass>();
> > > > AU.addRequired<TargetLibraryInfoWrapperPass>();
> > > > + AU.addRequired<TargetTransformInfo>();
> > > > AU.setPreservesCFG();
> > > > }
> > > > +
> > > > + Value *getOrCreateResult(Value *Inst, Type *ExpectedType) const {
> > > > +   if (LoadInst *LI = dyn_cast<LoadInst>(Inst))
> > > > +     return LI;
> > > > +   else if (StoreInst *SI = dyn_cast<StoreInst>(Inst))
> > > > +     return SI->getValueOperand();
> > > > +   assert(isa<IntrinsicInst>(Inst) && "Instruction not supported");
> > > > +   return TTI->getOrCreateResultFromMemIntrinsic(cast<IntrinsicInst>(Inst),
> > > > +                                                 ExpectedType);
> > > > + }
> > > > };
> > > > }
> > > > 
> > > > @@ -420,7 +491,7 @@ bool EarlyCSE::processNode(DomTreeNode *
> > > >  /// as long as there is no instruction that reads memory.  If we see a
> > > >  /// store to the same location, we delete the dead store.  This zaps
> > > >  /// trivial dead stores which can occur in bitfield code among other
> > > >  /// things.
> > > > - StoreInst *LastStore = nullptr;
> > > > + Instruction *LastStore = nullptr;
> > > > 
> > > >  bool Changed = false;
> > > > 
> > > > @@ -475,10 +546,11 @@ bool EarlyCSE::processNode(DomTreeNode *
> > > >    continue;
> > > >  }
> > > > 
> > > > + ParseMemoryInst MemInst(Inst, TTI);
> > > >  // If this is a non-volatile load, process it.
> > > > - if (LoadInst *LI = dyn_cast<LoadInst>(Inst)) {
> > > > + if (MemInst.isValid() && MemInst.isLoad()) {
> > > >    // Ignore volatile loads.
> > > > -  if (!LI->isSimple()) {
> > > > +  if (MemInst.isVolatile()) {
> > > >      LastStore = nullptr;
> > > >      continue;
> > > >    }
> > > > @@ -486,27 +558,35 @@ bool EarlyCSE::processNode(DomTreeNode *
> > > >  // If we have an available version of this load, and if it is the
> > > >  // right generation, replace this instruction.
> > > >  std::pair<Value *, unsigned> InVal =
> > > > -     AvailableLoads->lookup(Inst->getOperand(0));
> > > > +     AvailableLoads->lookup(MemInst.getPtr());
> > > >  if (InVal.first != nullptr && InVal.second == CurrentGeneration) {
> > > > -   DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
> > > > -                << "  to: " << *InVal.first << '\n');
> > > > -   if (!Inst->use_empty())
> > > > -     Inst->replaceAllUsesWith(InVal.first);
> > > > -   Inst->eraseFromParent();
> > > > -   Changed = true;
> > > > -   ++NumCSELoad;
> > > > -   continue;
> > > > +   Value *Op = getOrCreateResult(InVal.first, Inst->getType());
> > > > +   if (Op != nullptr) {
> > > > +     DEBUG(dbgs() << "EarlyCSE CSE LOAD: " << *Inst
> > > > +                  << "  to: " << *InVal.first << '\n');
> > > > +     if (!Inst->use_empty())
> > > > +       Inst->replaceAllUsesWith(Op);
> > > > +     Inst->eraseFromParent();
> > > > +     Changed = true;
> > > > +     ++NumCSELoad;
> > > > +     continue;
> > > > +   }
> > > >  }
> > > > 
> > > >  // Otherwise, remember that we have this instruction.
> > > > - AvailableLoads->insert(Inst->getOperand(0), std::pair<Value *, unsigned>(
> > > > -                            Inst, CurrentGeneration));
> > > > + AvailableLoads->insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
> > > > +                            Inst, CurrentGeneration));
> > > >  LastStore = nullptr;
> > > >  continue;
> > > > }
> > > > 
> > > >  // If this instruction may read from memory, forget LastStore.
> > > > - if (Inst->mayReadFromMemory())
> > > > + // Load/store intrinsics will indicate both a read and a write to
> > > > + // memory.  The target may override this (e.g. so that a store
> > > > + // intrinsic does not read from memory, and thus will be treated the
> > > > + // same as a regular store for commoning purposes).
> > > > + if (Inst->mayReadFromMemory() &&
> > > > +     !(MemInst.isValid() && !MemInst.mayReadFromMemory()))
> > > >    LastStore = nullptr;
> > > > 
> > > > // If this is a read-only call, process it.
> > > > @@ -537,17 +617,19 @@ bool EarlyCSE::processNode(DomTreeNode *
> > > >  if (Inst->mayWriteToMemory()) {
> > > >    ++CurrentGeneration;
> > > > 
> > > > -  if (StoreInst *SI = dyn_cast<StoreInst>(Inst)) {
> > > > +  if (MemInst.isValid() && MemInst.isStore()) {
> > > >      // We do a trivial form of DSE if there are two stores to the same
> > > >      // location with no intervening loads.  Delete the earlier store.
> > > > -    if (LastStore &&
> > > > -        LastStore->getPointerOperand() == SI->getPointerOperand()) {
> > > > -      DEBUG(dbgs() << "EarlyCSE DEAD STORE: " << *LastStore
> > > > -                   << "  due to: " << *Inst << '\n');
> > > > -      LastStore->eraseFromParent();
> > > > -      Changed = true;
> > > > -      ++NumDSE;
> > > > -      LastStore = nullptr;
> > > > +    if (LastStore) {
> > > > +      ParseMemoryInst LastStoreMemInst(LastStore, TTI);
> > > > +      if (LastStoreMemInst.isMatchingMemLoc(MemInst)) {
> > > > +        DEBUG(dbgs() << "EarlyCSE DEAD STORE: " << *LastStore
> > > > +                     << "  due to: " << *Inst << '\n');
> > > > +        LastStore->eraseFromParent();
> > > > +        Changed = true;
> > > > +        ++NumDSE;
> > > > +        LastStore = nullptr;
> > > > +      }
> > > >      // fallthrough - we can exploit information about this store
> > > >    }
> > > > 
> > > > @@ -556,13 +638,12 @@ bool EarlyCSE::processNode(DomTreeNode *
> > > >    // version of the pointer.  It is safe to forward from volatile
> > > >    // stores to non-volatile loads, so we don't have to check for
> > > >    // volatility of the store.
> > > > -  AvailableLoads->insert(SI->getPointerOperand(),
> > > > -                         std::pair<Value *, unsigned>(
> > > > -                             SI->getValueOperand(), CurrentGeneration));
> > > > +  AvailableLoads->insert(MemInst.getPtr(), std::pair<Value *, unsigned>(
> > > > +                             Inst, CurrentGeneration));
> > > > 
> > > >    // Remember that this was the last store we saw for DSE.
> > > > -  if (SI->isSimple())
> > > > -    LastStore = SI;
> > > > +  if (!MemInst.isVolatile())
> > > > +    LastStore = Inst;
> > > >   }
> > > >  }
> > > > }
> > > > @@ -584,6 +665,7 @@ bool EarlyCSE::runOnFunction(Function &F
> > > >  DataLayoutPass *DLP = getAnalysisIfAvailable<DataLayoutPass>();
> > > >  DL = DLP ? &DLP->getDataLayout() : nullptr;
> > > >  TLI = &getAnalysis<TargetLibraryInfoWrapperPass>().getTLI();
> > > > + TTI = &getAnalysis<TargetTransformInfo>();
> > > >  DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
> > > >  AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
> > > > 
> > > > 
> > > > Added:
> > > > llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll
> > > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll?rev=227149&view=auto
> > > > ==============================================================================
> > > > --- llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll (added)
> > > > +++ llvm/trunk/test/Transforms/EarlyCSE/AArch64/intrinsics.ll Mon Jan 26 16:51:15 2015
> > > > @@ -0,0 +1,231 @@
> > > > +; RUN: opt < %s -S -mtriple=aarch64-none-linux-gnu -mattr=+neon -early-cse | FileCheck %s
> > > > +
> > > > +define <4 x i32> @test_cse(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > > > +entry:
> > > > +; Check that @llvm.aarch64.neon.ld2 is optimized away by Early CSE.
> > > > +; CHECK-LABEL: @test_cse
> > > > +; CHECK-NOT: call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8
> > > > +  %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > > > +  %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.cond:                                         ; preds = %for.body, %entry
> > > > +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > > +  %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > > > +  %cmp = icmp slt i32 %i.0, %n
> > > > +  br i1 %cmp, label %for.body, label %for.end
> > > > +
> > > > +for.body:                                         ; preds = %for.cond
> > > > +  %0 = bitcast i32* %a to i8*
> > > > +  %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > > +  %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > > +  %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > > +  %4 = bitcast <16 x i8> %2 to <4 x i32>
> > > > +  call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > > > +  %5 = bitcast i32* %a to i8*
> > > > +  %vld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > > > +  %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
> > > > +  %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
> > > > +  %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld2.fca.0.extract, <4 x i32> %vld2.fca.0.extract)
> > > > +  %inc = add nsw i32 %i.0, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.end:                                          ; preds = %for.cond
> > > > +  ret <4 x i32> %res.0
> > > > +}
> > > > +
> > > > +define <4 x i32> @test_cse2(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > > > +entry:
> > > > +; Check that the first @llvm.aarch64.neon.st2 is optimized away by Early CSE.
> > > > +; CHECK-LABEL: @test_cse2
> > > > +; CHECK-NOT: call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %3, i8* %0)
> > > > +; CHECK: call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > > > +  %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > > > +  %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.cond:                                         ; preds = %for.body, %entry
> > > > +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > > +  %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > > > +  %cmp = icmp slt i32 %i.0, %n
> > > > +  br i1 %cmp, label %for.body, label %for.end
> > > > +
> > > > +for.body:                                         ; preds = %for.cond
> > > > +  %0 = bitcast i32* %a to i8*
> > > > +  %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > > +  %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > > +  %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > > +  %4 = bitcast <16 x i8> %2 to <4 x i32>
> > > > +  call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %3, i8* %0)
> > > > +  call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > > > +  %5 = bitcast i32* %a to i8*
> > > > +  %vld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > > > +  %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
> > > > +  %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
> > > > +  %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld2.fca.0.extract, <4 x i32> %vld2.fca.0.extract)
> > > > +  %inc = add nsw i32 %i.0, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.end:                                          ; preds = %for.cond
> > > > +  ret <4 x i32> %res.0
> > > > +}
> > > > +
> > > > +define <4 x i32> @test_cse3(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) #0 {
> > > > +entry:
> > > > +; Check that the first @llvm.aarch64.neon.ld2 is optimized away by Early CSE.
> > > > +; CHECK-LABEL: @test_cse3
> > > > +; CHECK: call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8
> > > > +; CHECK-NOT: call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8
> > > > +  %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > > > +  %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.cond:                                         ; preds = %for.body, %entry
> > > > +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > > +  %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > > > +  %cmp = icmp slt i32 %i.0, %n
> > > > +  br i1 %cmp, label %for.body, label %for.end
> > > > +
> > > > +for.body:                                         ; preds = %for.cond
> > > > +  %0 = bitcast i32* %a to i8*
> > > > +  %vld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %0)
> > > > +  %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
> > > > +  %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
> > > > +  %1 = bitcast i32* %a to i8*
> > > > +  %vld22 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %1)
> > > > +  %vld22.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld22, 0
> > > > +  %vld22.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld22, 1
> > > > +  %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld2.fca.0.extract, <4 x i32> %vld22.fca.0.extract)
> > > > +  %inc = add nsw i32 %i.0, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.end:                                          ; preds = %for.cond
> > > > +  ret <4 x i32> %res.0
> > > > +}
> > > > +
> > > > +
> > > > +define <4 x i32> @test_nocse(i32* %a, i32* %b, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > > > +entry:
> > > > +; Check that the store prevents @llvm.aarch64.neon.ld2 from being
> > > > +; optimized away by Early CSE.
> > > > +; CHECK-LABEL: @test_nocse
> > > > +; CHECK: call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8
> > > > +  %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > > > +  %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.cond:                                         ; preds = %for.body, %entry
> > > > +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > > +  %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > > > +  %cmp = icmp slt i32 %i.0, %n
> > > > +  br i1 %cmp, label %for.body, label %for.end
> > > > +
> > > > +for.body:                                         ; preds = %for.cond
> > > > +  %0 = bitcast i32* %a to i8*
> > > > +  %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > > +  %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > > +  %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > > +  %4 = bitcast <16 x i8> %2 to <4 x i32>
> > > > +  call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > > > +  store i32 0, i32* %b, align 4
> > > > +  %5 = bitcast i32* %a to i8*
> > > > +  %vld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8* %5)
> > > > +  %vld2.fca.0.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
> > > > +  %vld2.fca.1.extract = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
> > > > +  %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld2.fca.0.extract, <4 x i32> %vld2.fca.0.extract)
> > > > +  %inc = add nsw i32 %i.0, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.end:                                          ; preds = %for.cond
> > > > +  ret <4 x i32> %res.0
> > > > +}
> > > > +
> > > > +define <4 x i32> @test_nocse2(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > > > +entry:
> > > > +; Check that @llvm.aarch64.neon.ld3 is not optimized away by Early CSE due
> > > > +; to mismatch between st2 and ld3.
> > > > +; CHECK-LABEL: @test_nocse2
> > > > +; CHECK: call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8
> > > > +  %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > > > +  %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.cond:                                         ; preds = %for.body, %entry
> > > > +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > > +  %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > > > +  %cmp = icmp slt i32 %i.0, %n
> > > > +  br i1 %cmp, label %for.body, label %for.end
> > > > +
> > > > +for.body:                                         ; preds = %for.cond
> > > > +  %0 = bitcast i32* %a to i8*
> > > > +  %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > > +  %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > > +  %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > > +  %4 = bitcast <16 x i8> %2 to <4 x i32>
> > > > +  call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %4, i8* %0)
> > > > +  %5 = bitcast i32* %a to i8*
> > > > +  %vld3 = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8(i8* %5)
> > > > +  %vld3.fca.0.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } %vld3, 0
> > > > +  %vld3.fca.2.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } %vld3, 2
> > > > +  %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld3.fca.0.extract, <4 x i32> %vld3.fca.2.extract)
> > > > +  %inc = add nsw i32 %i.0, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.end:                                          ; preds = %for.cond
> > > > +  ret <4 x i32> %res.0
> > > > +}
> > > > +
> > > > +define <4 x i32> @test_nocse3(i32* %a, [2 x <4 x i32>] %s.coerce, i32 %n) {
> > > > +entry:
> > > > +; Check that @llvm.aarch64.neon.st3 is not optimized away by Early CSE due to
> > > > +; mismatch between st2 and st3.
> > > > +; CHECK-LABEL: @test_nocse3
> > > > +; CHECK: call void @llvm.aarch64.neon.st3.v4i32.p0i8
> > > > +; CHECK: call void @llvm.aarch64.neon.st2.v4i32.p0i8
> > > > +  %s.coerce.fca.0.extract = extractvalue [2 x <4 x i32>] %s.coerce, 0
> > > > +  %s.coerce.fca.1.extract = extractvalue [2 x <4 x i32>] %s.coerce, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.cond:                                         ; preds = %for.body, %entry
> > > > +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > > +  %res.0 = phi <4 x i32> [ undef, %entry ], [ %call, %for.body ]
> > > > +  %cmp = icmp slt i32 %i.0, %n
> > > > +  br i1 %cmp, label %for.body, label %for.end
> > > > +
> > > > +for.body:                                         ; preds = %for.cond
> > > > +  %0 = bitcast i32* %a to i8*
> > > > +  %1 = bitcast <4 x i32> %s.coerce.fca.0.extract to <16 x i8>
> > > > +  %2 = bitcast <4 x i32> %s.coerce.fca.1.extract to <16 x i8>
> > > > +  %3 = bitcast <16 x i8> %1 to <4 x i32>
> > > > +  %4 = bitcast <16 x i8> %2 to <4 x i32>
> > > > +  call void @llvm.aarch64.neon.st3.v4i32.p0i8(<4 x i32> %4, <4 x i32> %3, <4 x i32> %3, i8* %0)
> > > > +  call void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32> %3, <4 x i32> %3, i8* %0)
> > > > +  %5 = bitcast i32* %a to i8*
> > > > +  %vld3 = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8(i8* %5)
> > > > +  %vld3.fca.0.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } %vld3, 0
> > > > +  %vld3.fca.1.extract = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } %vld3, 1
> > > > +  %call = call <4 x i32> @vaddq_s32(<4 x i32> %vld3.fca.0.extract, <4 x i32> %vld3.fca.0.extract)
> > > > +  %inc = add nsw i32 %i.0, 1
> > > > +  br label %for.cond
> > > > +
> > > > +for.end:                                          ; preds = %for.cond
> > > > +  ret <4 x i32> %res.0
> > > > +}
> > > > +
> > > > +; Function Attrs: nounwind
> > > > +declare void @llvm.aarch64.neon.st2.v4i32.p0i8(<4 x i32>, <4 x i32>, i8* nocapture)
> > > > +
> > > > +; Function Attrs: nounwind
> > > > +declare void @llvm.aarch64.neon.st3.v4i32.p0i8(<4 x i32>, <4 x i32>, <4 x i32>, i8* nocapture)
> > > > +
> > > > +; Function Attrs: nounwind readonly
> > > > +declare { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0i8(i8*)
> > > > +
> > > > +; Function Attrs: nounwind readonly
> > > > +declare { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0i8(i8*)
> > > > +
> > > > +define internal fastcc <4 x i32> @vaddq_s32(<4 x i32> %__p0, <4 x i32> %__p1) {
> > > > +entry:
> > > > +  %add = add <4 x i32> %__p0, %__p1
> > > > +  ret <4 x i32> %add
> > > > +}
> > > > 
> > > > Added:
> > > > llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg
> > > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg?rev=227149&view=auto
> > > > ==============================================================================
> > > > --- llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg (added)
> > > > +++ llvm/trunk/test/Transforms/EarlyCSE/AArch64/lit.local.cfg Mon Jan 26 16:51:15 2015
> > > > @@ -0,0 +1,5 @@
> > > > +config.suffixes = ['.ll']
> > > > +
> > > > +targets = set(config.root.targets_to_build.split())
> > > > +if not 'AArch64' in targets:
> > > > + config.unsupported = True
> > > > 
> > > > 
> > > > 
> > > 
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > > 
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> > 
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


