[llvm-commits] [LLVMdev] [PATCH] BasicBlock Autovectorization Pass

Fri Dec 2 09:32:24 PST 2011

On Fri, 2011-12-02 at 17:07 +0100, Tobias Grosser wrote:
> On 11/23/2011 05:52 PM, Hal Finkel wrote:
> > On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:
> >> >  On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
> >>> >  >  Tobias,
> >>> >  >
> >>> >  >  I've attached an updated patch. It contains a few bug fixes and many
> >>> >  >  (refactoring and coding-convention) changes inspired by your comments.
> >>> >  >
> >>> >  >  I'm currently trying to fix the bug responsible for causing a compile
> >>> >  >  failure when compiling
> >>> >  >  test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the
> >>> >  >  pass begins to fuse instructions in a basic block in this file, the
> >>> >  >  aliasing analysis starts producing different (more pessimistic) query
> >>> >  >  answers. I've never before seen it do this, and I'm not sure why it is
> >>> >  >  happening. Also odd, at the same time, the numeric labels that are
> >>> >  >  assigned to otherwise unnamed instructions, change. I don't think I've
> >>> >  >  seen this happen either (they generally seem to stay fixed once
> >>> >  >  assigned). I don't know if these two things are related, but if anyone
> >>> >  >  can provide some insight, I'd appreciate it.
> >> >
> >> >  I think that I see what is happening in this case (please someone tell
> >> >  me if this does not make sense). In the problematic basic block, there
> >> >  are some loads and stores that are independent. The default aliasing
> >> >  analysis can tell that these loads and stores don't reference the same
> >> >  memory region. Then, some of the inputs to the getelementptr
> >> >  instructions used for these loads and stores are fused by the
> >> >  vectorization. After this happens, the aliasing analysis loses its
> >> >  ability to tell that the loads and stores that make use of those
> >> >  vector-calculated indices are independent.
> > The attached patch works around this issue. Please review.
> 
> Hi Hal,
> 
> thanks for reworking the patch several times. This time the patch looks
> a lot better, with nicely structured helper functions and a lot more
> consistent style. It is a lot better to read and review.
> 
> Unfortunately I had today just two hours to look into this patch. I
> added a lot of remarks, but especially in the tree pruning part I run a
> little out of time such that the remarks are mostly obvious style
> remarks. I hope I have a little bit more time over the weekend to look
> into this again, but I wanted to get out the first junk of comments today.

Thanks! I appreciate you putting in that kind of time. [Also, there is a
more-recent version than the one you're looking at,
llvm_bb_vectorize-20111122.diff, make sure that you look at that one].

> 
> Especially with my style remarks, do not feel forced to follow all of my
> comments. Style stuff is highly personal and often other people may
> suggest other fixes. In case you believe your style is better just check
> with the LLVM coding reference or other parts of the LLVM source code
> and try to match this as good as possible.

I have tried to do that, but other eyes are always useful.

> 
> On the algorithmic point of view I am not yet completely convinced the
> algorithm itself is not worse than N^2. I really want to understand this
> part better and plan to look into problematic behaviour in functions not
> directly visible in the core routines.

Here are my thoughts:

getCandidatePairs - This is O(N^2) in the number of instructions in BB.
[This assumes that all of the external helper functions, especially
queries to alias and scalar-evolution analysis don't make the complexity
worse].

computeConnectedPairs - For each candidate pair, there are O(N^2) in the
worse case, this loops over all pairs of uses of the instructions in the
candidate pair. So if every instruction has U uses, then this has
complexity O(N^2 U^2). In practice, U is small, and so this is just
another O(N^2). Unfortunately, that is only true if determining if a
given use pair is also a candidate pair is sub-linear. It is not in this
case (it is linear because the multimap only indexes the first key, not
both), but that could be changed by using a different (or additional)
data structure to hold the candidate pairs. I should probably add a
candidate-pair DenseSet to clean this up, then it will be O(N^2) [note
that if an instruction has many uses, then it cannot be in many
candidate pairs].

buildDepMap - This records all uses of each pairable instruction; as
implemented, this is also O(N^2).

choosePairs - As you point out, this is the most complicated to figure
out. The reason is that it deals with connected pairs and that, as pairs
are selected, other pairs are dropped from contention. Fundamentally, it
is considering pair-to-pair interactions, and so that seems like O(N^4).
However, it only considers only the first possible tree that can be
built from connected pairs for each remaining candidate pair. This is
important for the following reason: If a block of N instructions had the
maximum number of candidate pairs: N*(N-1)/2, then it must be the case
that all instrucitons are independent, and, thus, none of them are
connected. Thus, in this regime, the algorithm is O(N^2). On the other
extreme, if the block has N candidate pairs, then they could all be
connected, but then there are only N of them. In this extreme, the
algorithm is also O(N^2). Because both extremes are O(N^2), I've
generally figured that this is also O(N^2), but I may want to do more
math for the paper I'm writing.

fuseChosenPairs - This is linear in the number of selected pairs, and
that number scales with the number of instructions in the BB. In the
worst case, after each fusion, a number of instructions need to be moved
(up to almost the total number in the block). Thus, this is also O(N^2).

> 
> Also, you choose a very high bound (4000) to limit the quadratic
> behaviour, because you stated otherwise optimizations will be missed.
> Can you send an example where a value smaller than 4000 would miss
> optimizations?

I may be able to find an example, but I don't have one immediately
available. I had run the test suite with various settings, and chose the
default value (4000) based on the approximate point where the average
performance stopped increasing. Thus, I'm surmising that choosing a
smaller value causes useful optimization opportunities to be missed.

I go through the syntax comments and provide a revised patch.

Thanks again,
Hal

> 
> Cheers
> Tobi
> 
> > diff --git a/lib/Transforms/Makefile b/lib/Transforms/Makefile
> > index e527be2..8b1df92 100644
> > --- a/lib/Transforms/Makefile
> > +++ b/lib/Transforms/Makefile
> > @@ -8,7 +8,7 @@
> >   ##===----------------------------------------------------------------------===##
> >
> >   LEVEL = ../..
> > -PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Hello
> > +PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Vectorize Hello
> >
> >   include $(LEVEL)/Makefile.config
> >
> > diff --git a/lib/Transforms/Vectorize/BBVectorize.cpp b/lib/Transforms/Vectorize/BBVectorize.cpp
> > new file mode 100644
> > index 0000000..5505502
> > --- /dev/null
> > +++ b/lib/Transforms/Vectorize/BBVectorize.cpp
> > @@ -0,0 +1,1644 @@
> > +//===- BBVectorize.cpp - A Basic-Block Vectorizer -------------------------===//
> > +//
> > +//                     The LLVM Compiler Infrastructure
> > +//
> > +// This file is distributed under the University of Illinois Open Source
> > +// License. See LICENSE.TXT for details.
> > +//
> > +//===----------------------------------------------------------------------===//
> > +//
> > +// This file implements a basic-block vectorization pass. The algorithm was
> > +// inspired by that used by the Vienna MAP Vectorizor by Franchetti and Kral,
> > +// et al. It works by looking for chains of pairable operations and then
> > +// pairing them.
> > +//
> > +//===----------------------------------------------------------------------===//
> > +
> > +#define BBV_NAME "bb-vectorize"
> > +#define DEBUG_TYPE BBV_NAME
> > +#include "llvm/Constants.h"
> > +#include "llvm/DerivedTypes.h"
> > +#include "llvm/Function.h"
> > +#include "llvm/Instructions.h"
> > +#include "llvm/IntrinsicInst.h"
> > +#include "llvm/Intrinsics.h"
> > +#include "llvm/LLVMContext.h"
> > +#include "llvm/Pass.h"
> > +#include "llvm/Type.h"
> > +#include "llvm/ADT/DenseMap.h"
> > +#include "llvm/ADT/DenseSet.h"
> > +#include "llvm/ADT/SmallVector.h"
> > +#include "llvm/ADT/Statistic.h"
> > +#include "llvm/ADT/StringExtras.h"
> > +#include "llvm/Analysis/AliasAnalysis.h"
> > +#include "llvm/Analysis/AliasSetTracker.h"
> > +#include "llvm/Analysis/ValueTracking.h"
> This comes to lines later, if you want to sort it alphabetically.

What do you mean? I had run the include statements through sort at some
point, so I think they're all sorted.

> 
> > +namespace {
> > +  struct BBVectorize : public BasicBlockPass {
> > +    static char ID; // Pass identification, replacement for typeid
> > +    BBVectorize() : BasicBlockPass(ID) {
> > +      initializeBBVectorizePass(*PassRegistry::getPassRegistry());
> > +    }
> > +
> > +    typedef std::pair<Value *, Value *>  ValuePair;
> > +    typedef std::pair<ValuePair, size_t>  ValuePairWithDepth;
> > +    typedef std::pair<ValuePair, ValuePair>  VPPair; // A ValuePair pair
> > +    typedef std::pair<std::multimap<Value *, Value *>::iterator,
> > +              std::multimap<Value *, Value *>::iterator>  VPIteratorPair;
> > +    typedef std::pair<std::multimap<ValuePair, ValuePair>::iterator,
> > +              std::multimap<ValuePair, ValuePair>::iterator>
> > +                VPPIteratorPair;
> > +
> > +    // FIXME: const correct?
> > +
> > +    bool vectorizePairs(BasicBlock&BB);
> > +
> > +    void getCandidatePairs(BasicBlock&BB,
> > +                       std::multimap<Value *, Value *>  &CandidatePairs,
> > +                       std::vector<Value *>  &PairableInsts);
> > +
> > +    vojd computeConnectedPairs(std::multimap<Value *, Value *>  &CandidatePairs,
> > +                       std::vector<Value *>  &PairableInsts,
> > +                       std::multimap<ValuePair, ValuePair>  &ConnectedPairs);
> > +
> > +    void buildDepMap(BasicBlock&BB,
> > +                       std::multimap<Value *, Value *>  &CandidatePairs,
> > +                       std::vector<Value *>  &PairableInsts,
> > +                       DenseSet<ValuePair>  &PairableInstUsers);
> > +
> > +    void choosePairs(std::multimap<Value *, Value *>  &CandidatePairs,
> > +                        std::vector<Value *>  &PairableInsts,
> > +                        std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                        DenseSet<ValuePair>  &PairableInstUsers,
> > +                        DenseMap<Value *, Value *>&  ChosenPairs);
> > +
> > +    void fuseChosenPairs(BasicBlock&BB,
> > +                     std::vector<Value *>  &PairableInsts,
> > +                     DenseMap<Value *, Value *>&  ChosenPairs);
> > +
> > +    bool isInstVectorizable(Instruction *I, bool&IsSimpleLoadStore);
> > +
> > +    bool areInstsCompatible(Instruction *I, Instruction *J,
> > +                       bool IsSimpleLoadStore);
> > +
> > +    void trackUsesOfI(DenseSet<Value *>  &Users,
> > +                      AliasSetTracker&WriteSet, Instruction *I,
> > +                      Instruction *J, bool&UsesI, bool UpdateUsers = true,
> > +                      std::multimap<Value *, Value *>  *LoadMoveSet = 0);
> > +
> > +    void computePairsConnectedTo(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      ValuePair P);
> > +
> > +    bool pairsConflict(ValuePair P, ValuePair Q,
> > +                      DenseSet<ValuePair>  &PairableInstUsers);
> > +
> > +    void pruneTreeFor(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      DenseSet<ValuePair>  &PairableInstUsers,
> > +                      DenseMap<Value *, Value *>  &ChosenPairs,
> > +                      DenseMap<ValuePair, size_t>  &Tree,
> > +                      DenseSet<ValuePair>  &PrunedTree, ValuePair J);
> > +
> > +    void buildInitialTreeFor(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      DenseSet<ValuePair>  &PairableInstUsers,
> > +                      DenseMap<Value *, Value *>  &ChosenPairs,
> > +                      DenseMap<ValuePair, size_t>  &Tree, ValuePair J);
> > +
> > +    void findBestTreeFor(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      DenseSet<ValuePair>  &PairableInstUsers,
> > +                      DenseMap<Value *, Value *>  &ChosenPairs,
> > +                      DenseSet<ValuePair>  &BestTree, size_t&BestMaxDepth,
> > +                      size_t&BestEffSize, VPIteratorPair ChoiceRange);
> > +
> > +    Value *getReplacementPointerInput(LLVMContext&  Context, Instruction *I,
> > +                     Instruction *J, unsigned o, bool&FlipMemInputs);
> > +
> > +    Value *getReplacementShuffleMask(LLVMContext&  Context, Instruction *I,
> > +                     Instruction *J, unsigned o);
> > +
> > +    Value *getReplacementInput(LLVMContext&  Context, Instruction *I,
> > +                     Instruction *J, unsigned o, bool FlipMemInputs);
> > +
> > +    void getReplacementInputsForPair(LLVMContext&  Context, Instruction *I,
> > +                     Instruction *J, SmallVector<Value *, 3>  &ReplacedOperands,
> > +                     bool&FlipMemInputs);
> > +
> > +    void replaceOutputsOfPair(LLVMContext&  Context, Instruction *I,
> > +                     Instruction *J, Instruction *K,
> > +                     Instruction *&InsertionPt, Instruction *&K1,
> > +                     Instruction *&K2, bool&FlipMemInputs);
> > +
> > +    void collectPairLoadMoveSet(BasicBlock&BB,
> > +                     DenseMap<Value *, Value *>  &ChosenPairs,
> > +                     std::multimap<Value *, Value *>  &LoadMoveSet,
> > +                     Instruction *I, Instruction *J);
> > +
> > +    void collectLoadMoveSet(BasicBlock&BB,
> > +                     std::vector<Value *>  &PairableInsts,
> > +                     DenseMap<Value *, Value *>  &ChosenPairs,
> > +                     std::multimap<Value *, Value *>  &LoadMoveSet);
> > +
> > +    void moveUsesOfIAfterJ(BasicBlock&BB,
> > +                     DenseMap<Value *, Value *>&  ChosenPairs,
> > +                     std::multimap<Value *, Value *>  &LoadMoveSet,
> > +                     Instruction *&InsertionPt,
> > +                     Instruction *I, Instruction *J);
> 
> Here I am somehow unsatisfied with the indentation, but can currently
> not suggest anything better. The single operands are just too long. ;-)
> 
> > +
> > +    virtual bool runOnBasicBlock(BasicBlock&BB) {
> > +      bool changed = false;
> > +      // Iterate a sufficient number of times to merge types of
> > +      // size 1, then 2, then 4, etc. up to half of the
> > +      // target vector width of the target vector register.
> Use all 80 columns.
> 
> Also. What is a type of size 1? Do you mean 1 bit, one byte or what?
> 
> > +      for (unsigned v = 2, n = 1; v<= VectorBits&&
> > +             (!MaxIter || n<= MaxIter); v *= 2, ++n) {
>                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
>                  This fits in the previous row.
> Also, having the whole condition in one row seems easier to read.
> 
> > +        DEBUG(dbgs()<<  "BBV: fusing loop #"<<  n<<
> > +              " for "<<  BB.getName()<<  " in "<<
> > +              BB.getParent()->getName()<<  "...\n");
> > +        if (vectorizePairs(BB)) {
> > +          changed = true;
> > +        } else {
> > +          break;
> > +        }
> Braces not needed here.
> 
> > +      }
> > +
> > +      DEBUG(dbgs()<<  "BBV: done!\n");
> > +      return changed;
> > +    }
> > +
> > +    virtual void getAnalysisUsage(AnalysisUsage&AU) const {
> > +      BasicBlockPass::getAnalysisUsage(AU);
> > +      AU.addRequired<AliasAnalysis>();
> > +      AU.addRequired<ScalarEvolution>();
> > +      AU.addPreserved<AliasAnalysis>();
> > +      AU.addPreserved<ScalarEvolution>();
> > +    }
> > +
> > +    // This returns the vector type that holds a pair of the provided type.
> > +    // If the provided type is already a vector, then its length is doubled.
> > +    static inline VectorType *getVecType(Type *VElemType) {
> 
> What about naming this getVecTypeForPair()?
> 
> I think ElemType is sufficient. No need for the 'V'.
> 
> > +      if (VElemType->isVectorTy()) {
> > +        unsigned numElem = cast<VectorType>(VElemType)->getNumElements();
> > +        return VectorType::get(VElemType->getScalarType(), numElem*2);
> > +      } else {
> > +        return VectorType::get(VElemType, 2);
> > +      }
> > +    }
> 
> This can be:
> 
> if (VectorType *VectorTy = dyn_cast<VectorType>(ElemType)) {
>    unsigned numElem = VectorTy->getNumElements();
>    return VectorType::get(VElemType->getScalarType(), numElem*2);
> } else {
>    return VectorType::get(VElemType, 2);
> }
> 
> > +    // Returns the weight associated with the provided value. A chain of
> > +    // candidate pairs has a length given by the sum of the weights of its
> > +    // members (one weight per pair; the weight of each member of the pair
> > +    // is assumed to be the same). This length is then compared to the
> > +    // chain-length threshold to determine if a given chain is significant
> > +    // enough to be vectorized. The length is also used in comparing
> > +    // candidate chains where longer chains are considered to be better.
> > +    // Note: when this function returns 0, the resulting instructions are
> > +    // not actually fused.
> > +    static inline size_t getDepthFactor(Value *V) {
> > +      if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V)) {
> 
> Why have InsertElementInst, ExtractElementInst instructions a weight of
> zero?

Two reasons: First, they cannot be usefully fused. Second, because the
pass generates a lot of these, they can confuse the simple metric used
to compare the trees in the next iteration. Thus, giving them a weight
of zero allows the pass to essentially ignore them in subsequent
iterations when looking for vectorization opportunities while still
tracking dependency chains that flow through those instructions.

> 
> > +        return 0;
> > +      }
> No need for braces.
> 
> > +
> > +      return 1;
> > +    }
> > +
> > +    // This determines the relative offset of two loads or stores, returning
> > +    // true if the offset could be determined to be some constant value.
> > +    // For example, if OffsetInElmts == 1, then J accesses the memory directly
> > +    // after I; if OffsetInElmts == -1 then I accesses the memory
> > +    // directly after J. This function assumes that both instructions
> > +    // have the same type.
> You may want to mention more explicit that this function returns parts
> of the result in OffsetInElmts.
> 
> I think 'Offset' is enough, no need for 'OffsetInElmts'.
> 
> > +    bool getPairPtrInfo(Instruction *I, Instruction *J, const TargetData&TD,
> > +        Value *&IPtr, Value *&JPtr, unsigned&IAlignment, unsigned&JAlignment,
> > +        int64_t&OffsetInElmts) {
> > +      OffsetInElmts = 0;
> > +      if (isa<LoadInst>(I)) {
> > +        IPtr = cast<LoadInst>(I)->getPointerOperand();
> > +        JPtr = cast<LoadInst>(J)->getPointerOperand();
> > +        IAlignment = cast<LoadInst>(I)->getAlignment();
> > +        JAlignment = cast<LoadInst>(J)->getAlignment();
> > +      } else {
> > +        IPtr = cast<StoreInst>(I)->getPointerOperand();
> > +        JPtr = cast<StoreInst>(J)->getPointerOperand();
> > +        IAlignment = cast<StoreInst>(I)->getAlignment();
> > +        JAlignment = cast<StoreInst>(J)->getAlignment();
> > +      }
> > +
> > +      ScalarEvolution&SE = getAnalysis<ScalarEvolution>();
> > +      const SCEV *IPtrSCEV = SE.getSCEV(IPtr);
> > +      const SCEV *JPtrSCEV = SE.getSCEV(JPtr);
> > +
> > +      // If this is a trivial offset, then we'll get something like
> > +      // 1*sizeof(type). With target data, which we need anyway, this will get
> > +      // constant folded into a number.
> > +      const SCEV *OffsetSCEV = SE.getMinusSCEV(JPtrSCEV, IPtrSCEV);
> > +      if (const SCEVConstant *ConstOffSCEV =
> > +            dyn_cast<SCEVConstant>(OffsetSCEV)) {
> > +        ConstantInt *IntOff = ConstOffSCEV->getValue();
> > +        int64_t Offset = IntOff->getSExtValue();
> > +
> > +        Type *VTy = cast<PointerType>(IPtr->getType())->getElementType();
> > +        int64_t VTyTSS = (int64_t) TD.getTypeStoreSize(VTy);
> > +
> > +        assert(VTy == cast<PointerType>(JPtr->getType())->getElementType());
> > +
> > +        OffsetInElmts = Offset/VTyTSS;
> > +        return (abs64(Offset) % VTyTSS) == 0;
> > +      }
> > +
> > +      return false;
> > +    }
> > +
> > +    // Returns true if the provided CallInst represents an intrinsic that can
> > +    // be vectorized.
> > +    bool isVectorizableIntrinsic(CallInst* I) {
> > +      bool VIntr = false;
> > +      if (Function *F = I->getCalledFunction()) {
> > +        if (unsigned IID = F->getIntrinsicID()) {
> 
> Use early exit.
> 
> Function *F = I->getCalledFunction();
> 
> if (!F)
>    return false;
> 
> unsigned IID = F->getIntrinsicID();
> 
> if (!IID)
>    return false;
> 
> switch(IID) {
> 	case Intrinsic::sqrt:
> 	case Intrinsic::powi:
> 	case Intrinsic::sin:
> 	case Intrinsic::cos:
> 	case Intrinsic::log:
> 	case Intrinsic::log2:
> 	case Intrinsic::log10:
> 	case Intrinsic::exp:
> 	case Intrinsic::exp2:
> 	case Intrinsic::pow:
> 		return !NoMath;
> 	case Intrinsic::fma:
> 		return !NoFMA;
> }
> 
> Is this switch exhaustive or are you missing a default case?

You're not looking at the most-recent version. You had (correctly)
commented on this last time, and I had corrected it.

> 
> > +    // Returns true if J is the second element in some ValuePair referenced by
> > +    // some multimap ValuePair iterator pair.
> > +    bool isSecondInVPIteratorPair(VPIteratorPair PairRange, Value *J) {
> > +      for (std::multimap<Value *, Value *>::iterator K = PairRange.first;
> > +             K != PairRange.second; ++K) {
> > +        if (K->second == J) return true;
> > +      }
> No need for braces.
> 
> > +
> > +      return false;
> > +    }
> > +  };
> > +
> > +  // This function implements one vectorization iteration on the provided
> > +  // basic block. It returns true if the block is changed.
> > +  bool BBVectorize::vectorizePairs(BasicBlock&BB) {
> > +    std::vector<Value *>  PairableInsts;
> > +    std::multimap<Value *, Value *>  CandidatePairs;
> > +    getCandidatePairs(BB, CandidatePairs, PairableInsts);
> > +    if (!PairableInsts.size()) return false;
> 
> if (PairableInsts.size() == 0) ...
> 
> I think it is more straightforward to compare a size to 0 than to force
> the brain to implicitly convert it to bool.
> 
> > +
> > +    // Now we have a map of all of the pairable instructions and we need to
>                we have a map of all pairable instructions
> 
> > +    // select the best possible pairing. A good pairing is one such that the
> > +    // Users of the pair are also paired. This defines a (directed) forest
> 	  users
>            ^
> > +    // over the pairs such that two pairs are connected iff the second pair
> > +    // uses the first.
> > +
> > +    // Note that it only matters that both members of the second pair use some
> > +    // element of the first pair (to allow for splatting).
> > +
> > +    std::multimap<ValuePair, ValuePair>  ConnectedPairs;
> > +    computeConnectedPairs(CandidatePairs, PairableInsts, ConnectedPairs);
> > +    if (!ConnectedPairs.size()) return false;
> 
> if (ConnectedPairs.size() == 0) ...
> 
> > +    // Build the pairable-instruction dependency map
> > +    DenseSet<ValuePair>  PairableInstUsers;
> > +    buildDepMap(BB, CandidatePairs, PairableInsts, PairableInstUsers);
> > +
> > +    // There is now a graph of the connected pairs. For each variable, pick the
> > +    // pairing with the largest tree meeting the depth requirement on at least
> > +    // one branch. Then select all pairings that are part of that tree and
> > +    // remove them from the list of available parings and pairable variables.
> 						 pairings
> 
> > +
> > +    DenseMap<Value *, Value *>  ChosenPairs;
> > +    choosePairs(CandidatePairs, PairableInsts, ConnectedPairs,
> > +      PairableInstUsers, ChosenPairs);
> > +
> > +    if (!ChosenPairs.size()) return false;
> > +    NumFusedOps += ChosenPairs.size();
> > +
> > +    // A set of chosen pairs has now been selected. It is now necessary to
>         // A set of pairs has been chosen.
> > +    // replace the paired functions with vector functions. For this procedure
> 
> You are only talking about functions. I thought the main objective is to
> pair LLVM-IR _instructions_.

Indeed, bad wording. I do mean instructions.

> 
> > +    // each argument much be replaced with a vector argument. This vector
>                          must
> 
> At least for instructions, I would use the term 'operand' instead of
> 'argument'.
> 
> > +    // is formed by using build_vector on the old arguments. The replaced
> > +    // values are then replaced with a vector_extract on the result.
> > +    // Subsequent optimization passes should coalesce the build/extract
> > +    // combinations.
> > +
> > +    fuseChosenPairs(BB, PairableInsts, ChosenPairs);
> > +
> > +    return true;
> > +  }
> > +
> > +  // This function returns true if the provided instruction is capable of being
> > +  // fused into a vector instruction. This determination is based only on the
> > +  // type and other attributes of the instruction.
> > +  bool BBVectorize::isInstVectorizable(Instruction *I,
> > +                                         bool&IsSimpleLoadStore) {
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> 
> AA can become a member of the BBVectorize class and it can be
> initialized once per basic block.
> 
> > +    bool VIntr = false;
> > +    if (isa<CallInst>(I)) {
> > +      VIntr = isVectorizableIntrinsic(cast<CallInst>(I));
> > +    }
> Remove VIntr and use early exit.
> 
> if (CallInst *C = dyn_cast<CallInst>(I)) {
>    if(!isVectorizableIntrinsic(cast<CallInst>(I))
>      return false;
> }
> 
> Also, no braces if not necessary in at least one branch.
> 
> > +    // Vectorize simple loads and stores if possbile:
> > +    IsSimpleLoadStore = false;
> > +    if (isa<LoadInst>(I)) {
> > +      IsSimpleLoadStore = cast<LoadInst>(I)->isSimple();
> > +    } else if (isa<StoreInst>(I)) {
> > +      IsSimpleLoadStore = cast<StoreInst>(I)->isSimple();
> > +    }
> Again, no braces needed and use early exit.
> 
> } else if (LoadInst *L = dyn_cast<LoadInst>(I)) {
>    if (!L->isSimple() || NoMemOps) {
>      IsSimpleLoadStore = false;
>      return false;
>    }
> } else if (StorInst *S = dyn_cast<StoreInst>(S)) {
>    if (!S->isSimple() || NoMemOps) {
>      IsSimpleLoadStore = false;
>      return false;
>    }
> }
> 
> > +    // We can vectorize casts, but not casts of pointer types, etc.
> > +    bool IsCast = false;
> > +    if (I->isCast()) {
> > +      IsCast = true;
> > +      if (!cast<CastInst>(I)->getSrcTy()->isSingleValueType()) {
> > +        IsCast = false;
> > +      } else if (!cast<CastInst>(I)->getDestTy()->isSingleValueType()) {
> > +        IsCast = false;
> > +      } else if (cast<CastInst>(I)->getSrcTy()->isPointerTy()) {
> > +        IsCast = false;
> > +      } else if (cast<CastInst>(I)->getDestTy()->isPointerTy()) {
> > +        IsCast = false;
> > +      }
> > +    }
> 
> 
> } else if (CastInst *C = dyn_cast<CastInst>(I)) {
>    if (NoCasts)
>      return false;
> 
>    Type *SrcTy = C->getSrcTy()
> 
>    if (!SrcTy->isSingleValueType() || SrcTy->isPointerTy)
>      return false;
> 
>    Type *DestTy = C->getDestTy()
> 
>    if (!DestTy->isSingleValueType() || DestTy->isPointerTy)
>      return false;
> }
> 
> > +
> > +    if (!(I->isBinaryOp() || isa<ShuffleVectorInst>(I) ||
> > +            isa<ExtractElementInst>(I) || isa<InsertElementInst>(I) ||
> > +            (!NoCasts&&  IsCast) || VIntr ||
> > +            (!NoMemOps&&  IsSimpleLoadStore))) {
> > +      return false;
> > +    }
> 
> else if (I->isBinaryOp
> 	 || isa<ShuffleVectorInst>(I)
>           || isa<ExtractElementInst>(I)
>           || isa<InsertElementInst>(I)) {
>    // Those instructions are always OK.
> } else {
>    // Everything else can not be handled.
>    return false;
> }
> 
> 
> > +
> > +    // We can't vectorize memory operations without target data
> > +    if (AA.getTargetData() == 0&&  IsSimpleLoadStore) {
> > +      return false;
> 
> Is this check needed at this place? I had the feeling the whole pass
> does not work without target data. At least the offset calculation seems
> to require it.

Yes, it is needed. Without target data, no loads or stores are selected
for vectorization, and so no offsets are calculated. The pass should
otherwise work, however, without target data. I tested this at some
point, if it does not degrade gracefully, then I consider that to be a
bug.

> 
> > +    }
> > +
> > +    Type *T1, *T2;
> > +    if (isa<StoreInst>(I)) {
> > +      // For stores, it is the value type, not the pointer type that matters
> > +      // because the value is what will come from a vector register.
> > +
> > +      Value *IVal = cast<StoreInst>(I)->getValueOperand();
> > +      T1 = IVal->getType();
> > +    } else {
> > +      T1 = I->getType();
> > +    }
> 
> > +
> > +    if (I->isCast()) {
> > +      T2 = cast<CastInst>(I)->getSrcTy();
> > +    } else {
> > +      T2 = T1;
> > +    }
> No braces needed.
> 
> > +
> > +    // Not every type can be vectorized...
> > +    if (!(VectorType::isValidElementType(T1) || T1->isVectorTy()) ||
> > +        !(VectorType::isValidElementType(T2) || T2->isVectorTy()))
> > +      return false;
> > +
> > +    if (NoInts&&  (T1->isIntOrIntVectorTy() || T2->isIntOrIntVectorTy()))
> > +      return false;
> > +
> > +    if (NoFloats&&  (T1->isFPOrFPVectorTy() || T2->isFPOrFPVectorTy()))
> > +      return false;
> > +
> > +    if (T1->getPrimitiveSizeInBits()>  VectorBits/2 ||
> > +        T2->getPrimitiveSizeInBits()>  VectorBits/2)
> > +      return false;
> > +
> > +    return true;
> > +  }
> > +
> > +  // This function returns true if the two provided instructions are compatible
> > +  // (meaning that they can be fused into a vector instruction). This assumes
> > +  // that I has already been determined to be vectorizable and that J is not
> > +  // in the use tree of I.
> > +  bool BBVectorize::areInstsCompatible(Instruction *I, Instruction *J,
> > +                       bool IsSimpleLoadStore) {
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> AA can become a member of the BBVectorize class and it can be
> initialized once per basic block.
> 
> > +    bool IsCompat = J->isSameOperationAs(I);
> > +    // FIXME: handle addsub-type operations!
> > +
> > +    // Loads and stores can be merged if they have different alignments,
> > +    // but are otherwise the same.
> > +    if (!IsCompat&&  isa<LoadInst>(I)&&  isa<LoadInst>(J)) {
> > +      if (I->getType() == J->getType()) {
> > +        if (cast<LoadInst>(I)->getPointerOperand()->getType() ==
> > +              cast<LoadInst>(J)->getPointerOperand()->getType()&&
> > +            cast<LoadInst>(I)->isVolatile() ==
> > +              cast<LoadInst>(J)->isVolatile()&&
> > +            cast<LoadInst>(I)->getOrdering() ==
> > +              cast<LoadInst>(J)->getOrdering()&&
> > +            cast<LoadInst>(I)->getSynchScope() ==
> > +              cast<LoadInst>(J)->getSynchScope()
> > +           ) {
> > +             IsCompat = true;
> > +         }
> > +      }
> > +    } else if (!IsCompat&&  isa<StoreInst>(I)&&  isa<StoreInst>(J)) {
> > +      if (cast<StoreInst>(I)->getValueOperand()->getType() ==
> > +            cast<StoreInst>(J)->getValueOperand()->getType()&&
> > +          cast<StoreInst>(I)->getPointerOperand()->getType() ==
> > +            cast<StoreInst>(J)->getPointerOperand()->getType()&&
> > +          cast<StoreInst>(I)->isVolatile() ==
> > +            cast<StoreInst>(J)->isVolatile()&&
> > +          cast<StoreInst>(I)->getOrdering() ==
> > +            cast<StoreInst>(J)->getOrdering()&&
> > +          cast<StoreInst>(I)->getSynchScope() ==
> > +            cast<StoreInst>(J)->getSynchScope()
> > +         ) {
> > +           IsCompat = true;
> > +       }
> > +    }
> > +
> > +    if (!IsCompat) return false;
> > +
> > +    if (IsSimpleLoadStore) {
> > +      const TargetData&TD = *AA.getTargetData();
> > +
> > +      Value *IPtr, *JPtr;
> > +      unsigned IAlignment, JAlignment;
> > +      int64_t OffsetInElmts = 0;
> > +      if (getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment,
> > +            OffsetInElmts)&&  abs64(OffsetInElmts) == 1) {
> > +        if (AlignedOnly) {
> > +          Type *aType = isa<StoreInst>(I) ?
> > +            cast<StoreInst>(I)->getValueOperand()->getType() : I->getType();
> > +          // An aligned load or store is possible only if the instruction
> > +          // with the lower offset has an alignment suitable for the
> > +          // vector type.
> > +
> > +          unsigned BottomAlignment = IAlignment;
> > +          if (OffsetInElmts<  0) BottomAlignment = JAlignment;
> > +
> > +          Type *VType = getVecType(aType);
> > +          unsigned VecAlignment = TD.getPrefTypeAlignment(VType);
> > +          if (BottomAlignment<  VecAlignment) {
> > +            IsCompat = false;
> > +          }
> > +        }
> > +      } else {
> > +        IsCompat = false;
> > +      }
> > +    } else if (isa<ShuffleVectorInst>(I)) {
> > +      // Only merge two shuffles if they're both constant
> > +      // or both not constant.
> > +      IsCompat = isa<Constant>(I->getOperand(2))&&
> > +                 isa<Constant>(J->getOperand(2));
> > +      // FIXME: We may want to vectorize non-constant shuffles also.
> > +    }
> > +
> > +    return IsCompat;
> 
> Here again for the whole function. Can you try to use early exit to
> simplify the code http://llvm.org/docs/CodingStandards.html#hl_earlyexit
> You might be able to remove the use of IsCompat completely.
> 
> > +  }
> > +
> > +  // Figure out whether or not J uses I and update the users and write-set
> > +  // structures associated with I.
> 
> You may want to give I and J more descriptive names. I remember I asked
> you the last time to rename One and Two to I and J, but in this case it
> seems both have very specific roles which could be used to describe
> them. What about 'Inst' and 'User'?
> 
> Also what are the arguments of this function? Can you explain them a
> little?
> 
> > +  void BBVectorize::trackUsesOfI(DenseSet<Value *>  &Users,
> > +                       AliasSetTracker&WriteSet, Instruction *I,
> > +                       Instruction *J, bool&UsesI, bool UpdateUsers,
> 
> You may want to return UsesI as a function return value not as a
> reference parameter, though you may need to updated the function name
> accordingly.
> 
> > +                       std::multimap<Value *, Value *>  *LoadMoveSet) {
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> > +
> > +    UsesI = false;
> > +
> > +    // This instruction may already be marked as a user due, for example, to
> > +    // being a member of a selected pair.
> > +    if (Users.count(J))
> > +      UsesI = true;
> > +
> > +    if (!UsesI) for (User::op_iterator JU = J->op_begin(), e = J->op_end();
> Use a new line for this 'for'.
> 
> > +           JU != e; ++JU) {
> > +      Value *V = *JU;
> > +      if (I == V || Users.count(V)) {
> > +        UsesI = true;
> > +        break;
> > +      }
> > +    }
> > +    if (!UsesI&&  J->mayReadFromMemory()) {
> > +      if (LoadMoveSet) {
> > +        VPIteratorPair JPairRange = LoadMoveSet->equal_range(J);
> > +        UsesI = isSecondInVPIteratorPair(JPairRange, I);
> > +      }
> > +      else for (AliasSetTracker::iterator W = WriteSet.begin(),
> Use a new line for this 'for'.
> 
> > +            WE = WriteSet.end(); W != WE; ++W) {
> > +        for (AliasSet::iterator A = W->begin(), AE = W->end();
> > +               A != AE; ++A) {
> > +          AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(),
> > +            A->getTBAAInfo());
> Align this argument with the first argument.
> 
>            AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(),
>            				 A->getTBAAInfo());
> 
> > +          if (AA.getModRefInfo(J, ptrLoc) != AliasAnalysis::NoModRef) {
> > +            UsesI = true;
> > +            break;
> > +          }
> > +        }
> > +        if (UsesI) break;
> > +      }
> > +    }
> > +    if (UsesI&&  UpdateUsers) {
> > +      if (J->mayWriteToMemory()) WriteSet.add(J);
> > +      Users.insert(J);
> > +    }
> > +  }
> > +
> > +  // This function iterates over all instruction pairs in the provided
> > +  // basic block and collects all candidate pairs for vectorization.
> > +  void BBVectorize::getCandidatePairs(BasicBlock&BB,
> > +                       std::multimap<Value *, Value *>  &CandidatePairs,
> > +                       std::vector<Value *>  &PairableInsts) {
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> > +    BasicBlock::iterator E = BB.end();
> > +    for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) {
> > +      bool IsSimpleLoadStore;
> > +      if (!isInstVectorizable(I, IsSimpleLoadStore)) continue;
> > +
> > +      // Look for an instruction with which to pair instruction *I...
> > +      DenseSet<Value *>  Users;
> > +      AliasSetTracker WriteSet(AA);
> > +      BasicBlock::iterator J = I; ++J;
> > +      for (unsigned ss = 0; J != E&&  ss<= SearchLimit; ++J, ++ss) {
> > +        // Determine if J uses I, if so, exit the loop.
> > +        bool UsesI;
> > +        trackUsesOfI(Users, WriteSet, I, J, UsesI, !FastDep);
> 
> 'UsesI' can be eliminated if the corresponding value is passed as
> function return value.
> 
> > +        if (FastDep) {
> > +          // Note: For this heuristic to be effective, independent operations
> > +          // must tend to be intermixed. This is likely to be true from some
> > +          // kinds of grouped loop unrolling (but not the generic LLVM pass),
> > +          // but otherwise may require some kind of reordering pass.
> > +
> > +          // When using fast dependency analysis,
> > +          // stop searching after first use:
> > +          if (UsesI) break;
> > +        } else {
> > +          if (UsesI) continue;
> > +        }
> > +
> > +        // J does not use I, and comes before the first use of I, so it can be
> > +        // merged with I if the instructions are compatible.
> > +        if (!areInstsCompatible(I, J, IsSimpleLoadStore)) continue;
> > +
> > +        // J is a candidate for merging with I.
> > +        if (!PairableInsts.size() ||
> > +             PairableInsts[PairableInsts.size()-1] != I) {
> > +          PairableInsts.push_back(I);
> > +        }
> > +        CandidatePairs.insert(ValuePair(I, J));
> > +        DEBUG(dbgs()<<  "BBV: candidate pair "<<  *I<<
> > +                     "<->  "<<  *J<<  "\n");
> > +      }
> > +    }
> > +
> > +    DEBUG(dbgs()<<  "BBV: found "<<  PairableInsts.size()
> > +<<  " instructions with candidate pairs\n");
> > +  }
> > +
> > +  // Finds candidate pairs connected to the pair P =<PI, PJ>. This means that
> > +  // it looks for pairs such that both members have an input which is an
> > +  // output of PI or PJ.
> > +  void BBVectorize::computePairsConnectedTo(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      ValuePair P) {
> > +    // For each possible pairing for this variable, look at the uses of
> > +    // the first value...
> > +    for (Value::use_iterator I = P.first->use_begin(),
> > +         E = P.first->use_end(); I != E; ++I) {
> > +      VPIteratorPair IPairRange = CandidatePairs.equal_range(*I);
> > +
> > +      // For each use of the first variable, look for uses of the second
> > +      // variable...
> > +      for (Value::use_iterator J = P.second->use_begin(),
> > +           E2 = P.second->use_end(); J != E2; ++J) {
> > +        VPIteratorPair JPairRange = CandidatePairs.equal_range(*J);
> > +
> > +        // Look for<I, J>:
> > +        if (isSecondInVPIteratorPair(IPairRange, *J))
> > +          ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J)));
> > +
> > +        // Look for<J, I>:
> > +        if (isSecondInVPIteratorPair(JPairRange, *I))
> > +          ConnectedPairs.insert(VPPair(P, ValuePair(*J, *I)));
> > +      }
> > +
> > +      // Look for cases where just the first value in the pair is used by
> > +      // both members of another pair (splatting).
> > +      Value::use_iterator J = P.first->use_begin();
> > +      if (!SplatBreaksChain) for (; J != E; ++J) {
> This can become an early continue. Also please do not put an 'if' and a
> 'for' into the same line.
> 
> > +        if (isSecondInVPIteratorPair(IPairRange, *J))
> > +          ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J)));
> > +      }
> > +    }
> > +
> > +    // Look for cases where just the second value in the pair is used by
> > +    // both members of another pair (splatting).
> > +    if (!SplatBreaksChain)
> This can become an early return.
> 
> > +      for (Value::use_iterator I = P.second->use_begin(),
> > +         E = P.second->use_end(); I != E; ++I) {
> > +      VPIteratorPair IPairRange = CandidatePairs.equal_range(*I);
> > +
> > +      Value::use_iterator J = P.second->use_begin();
> > +      for (; J != E; ++J) {
> Why not move initialization into 'for'?
> 
> > +        if (isSecondInVPIteratorPair(IPairRange, *J))
> > +          ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J)));
> > +      }
> > +    }
> > +  }
> > +
> > +  // This function figures out which pairs are connected.  Two pairs are
> > +  // connected if some output of the first pair forms an input to both members
> > +  // of the second pair.
> > +  void BBVectorize::computeConnectedPairs(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs) {
> > +
> > +    for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> > +           PE = PairableInsts.end(); PI != PE; ++PI) {
> > +      VPIteratorPair choiceRange = CandidatePairs.equal_range(*PI);
> > +
> > +      for (std::multimap<Value *, Value *>::iterator P = choiceRange.first;
> > +           P != choiceRange.second; ++P) {
> > +        computePairsConnectedTo(CandidatePairs, PairableInsts,
> > +          ConnectedPairs, *P);
> Align CandidatePairs and ConnectedPairs.
> 
> > +      }
> No braces needed here.
> 
> > +    }
> > +
> > +    DEBUG(dbgs()<<  "BBV: found "<<  ConnectedPairs.size()
> > +<<  " pair connections.\n");
> > +  }
> > +
> > +  // This function builds a set of use tuples such that<A, B>  is in the set
> > +  // if B is in the use tree of A. If B is in the use tree of A, then B
> > +  // depends on the output of A.
> > +  void BBVectorize::buildDepMap(
> > +                      BasicBlock&BB,
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      DenseSet<ValuePair>  &PairableInstUsers) {
> > +    DenseSet<Value *>  IsInPair;
> > +    for (std::multimap<Value *, Value *>::iterator C = CandidatePairs.begin(),
> > +           E = CandidatePairs.end(); C != E; ++C) {
> > +      IsInPair.insert(C->first);
> > +      IsInPair.insert(C->second);
> > +    }
> > +
> > +    // Iterate through the basic block, recording all Users of each
> > +    // pairable instruction.
> > +
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> > +    BasicBlock::iterator E = BB.end();
> > +    for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) {
> > +      if (IsInPair.find(I) == IsInPair.end()) continue;
> > +
> > +      DenseSet<Value *>  Users;
> > +      AliasSetTracker WriteSet(AA);
> > +      BasicBlock::iterator J = I; ++J;
> > +      for (; J != E; ++J) {
> Move the initialization into the for loop.
> 
> 	for (BasicBlock::iterator J = I + 1; J != E; ++J)
> 
> > +        bool UsesI;
> > +        trackUsesOfI(Users, WriteSet, I, J, UsesI);
> > +      }
> > +
> > +      for (DenseSet<Value *>::iterator U = Users.begin(), E = Users.end();
> > +             U != E; ++U) {
> 	      ^^ Check indentation here.
> 
> > +        PairableInstUsers.insert(ValuePair(I, *U));
> > +      }
> No braces needed.
> 
> > +    }
> > +  }
> > +
> > +  // Returns true if an input to pair P is an output of pair Q and also an
> > +  // input of pair Q is an output of pair P. If this is the case, then these
> > +  // two pairs cannot be simultaneously fused.
> > +  bool BBVectorize::pairsConflict(ValuePair P, ValuePair Q,
> > +                      DenseSet<ValuePair>  &PairableInstUsers) {
> > +    // Two pairs are in conflict if they are mutual Users of eachother.
> > +    bool QUsesP = PairableInstUsers.count(ValuePair(P.first,  Q.first))  ||
> > +                  PairableInstUsers.count(ValuePair(P.first,  Q.second)) ||
> > +                  PairableInstUsers.count(ValuePair(P.second, Q.first))  ||
> > +                  PairableInstUsers.count(ValuePair(P.second, Q.second));
> > +    bool PUsesQ = PairableInstUsers.count(ValuePair(Q.first,  P.first))  ||
> > +                  PairableInstUsers.count(ValuePair(Q.first,  P.second)) ||
> > +                  PairableInstUsers.count(ValuePair(Q.second, P.first))  ||
> > +                  PairableInstUsers.count(ValuePair(Q.second, P.second));
> > +    return (QUsesP&&  PUsesQ);
> > +  }
> > +
> > +  // This function builds the initial tree of connected pairs with the
> > +  // pair J at the root.
> > +  void BBVectorize::buildInitialTreeFor(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      DenseSet<ValuePair>  &PairableInstUsers,
> > +                      DenseMap<Value *, Value *>  &ChosenPairs,
> > +                      DenseMap<ValuePair, size_t>  &Tree, ValuePair J) {
> > +    // Each of these pairs is viewed as the root node of a Tree. The Tree
> > +    // is then walked (depth-first). As this happens, we keep track of
> > +    // the pairs that compose the Tree and the maximum depth of the Tree.
> > +    std::vector<ValuePairWithDepth>  Q;
> > +    // General depth-first post-order traversal:
> > +    Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first)));
> > +    while (!Q.empty()) {
> > +      ValuePairWithDepth QTop = Q.back();
> > +
> > +      // Push each child onto the queue:
> > +      bool MoreChildren = false;
> > +      size_t MaxChildDepth = QTop.second;
> > +      VPPIteratorPair qtRange = ConnectedPairs.equal_range(QTop.first);
> > +      for (std::map<ValuePair, ValuePair>::iterator k = qtRange.first;
> > +           k != qtRange.second; ++k) {
> > +        // Make sure that this child pair is still a candidate:
> > +        bool IsStillCand = false;
> > +        VPIteratorPair checkRange =
> > +          CandidatePairs.equal_range(k->second.first);
> > +        for (std::multimap<Value *, Value *>::iterator m = checkRange.first;
> > +             m != checkRange.second; ++m) {
> > +          if (m->second == k->second.second) {
> > +            IsStillCand = true;
> > +            break;
> > +          }
> > +        }
> > +
> > +        if (IsStillCand) {
> > +          DenseMap<ValuePair, size_t>::iterator C = Tree.find(k->second);
> > +          if (C == Tree.end()) {
> > +            size_t d = getDepthFactor(k->second.first);
> > +            Q.push_back(ValuePairWithDepth(k->second, QTop.second+d));
> > +            MoreChildren = true;
> > +          } else {
> > +            MaxChildDepth = std::max(MaxChildDepth, C->second);
> > +          }
> > +        }
> > +      }
> > +
> > +      if (!MoreChildren) {
> > +        // Record the current pair as part of the Tree:
> > +        Tree.insert(ValuePairWithDepth(QTop.first, MaxChildDepth));
> > +        Q.pop_back();
> > +      }
> > +    }
> > +  }
> > +
> > +  // Given some initial tree, prune it by removing conflicting pairs (pairs
> > +  // that cannot be simultaneously chosen for vectorization).
> > +  void BBVectorize::pruneTreeFor(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      DenseSet<ValuePair>  &PairableInstUsers,
> > +                      DenseMap<Value *, Value *>  &ChosenPairs,
> > +                      DenseMap<ValuePair, size_t>  &Tree,
> > +                      DenseSet<ValuePair>  &PrunedTree, ValuePair J) {
> > +    std::vector<ValuePairWithDepth>  Q;
> > +    // General depth-first post-order traversal:
> > +    Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first)));
> > +    while (!Q.empty()) {
> > +      ValuePairWithDepth QTop = Q.back();
> > +      PrunedTree.insert(QTop.first);
> > +      Q.pop_back();
> > +
> > +      // Visit each child, pruning as necessary...
> > +      DenseMap<ValuePair, size_t>  BestChilden;
> > +      VPPIteratorPair QTopRange = ConnectedPairs.equal_range(QTop.first);
> > +      for (std::map<ValuePair, ValuePair>::iterator K = QTopRange.first;
> > +           K != QTopRange.second; ++K) {
> > +        DenseMap<ValuePair, size_t>::iterator C = Tree.find(K->second);
> > +        if (C == Tree.end()) continue;
> > +
> > +        // This child is in the Tree, now we need to make sure it is the
> > +        // best of any conflicting children. There could be multiple
> > +        // conflicting children, so first, determine if we're keeping
> > +        // this child, then delete conflicting children as necessary.
> > +
> > +        // It is also necessary to guard against pairing-induced
> > +        // dependencies. Consider instructions a .. x .. y .. b
> > +        // such that (a,b) are to be fused and (x,y) are to be fused
> > +        // but a is an input to x and b is an output from y. This
> > +        // means that y cannot be moved after b but x must be moved
> > +        // after b for (a,b) to be fused. In other words, after
> > +        // fusing (a,b) we have y .. a/b .. x where y is an input
> > +        // to a/b and x is an output to a/b: x and y can no longer
> > +        // be legally fused. To prevent this condition, we must
> > +        // make sure that a child pair added to the Tree is not
> > +        // both an input and output of an already-selected pair.
> > +
> > +        bool CanAdd = true;
> 
> The use of CanAdd can be improved with early continues.
> 
> > +        for (DenseMap<ValuePair, size_t>::iterator C2
> > +              = BestChilden.begin(), E2 = BestChilden.end();
> > +             C2 != E2; ++C2) {
> > +          if (C2->first.first == C->first.first ||
> > +              C2->first.first == C->first.second ||
> > +              C2->first.second == C->first.first ||
> > +              C2->first.second == C->first.second ||
> > +              pairsConflict(C2->first, C->first, PairableInstUsers)) {
> > +            if (C2->second>= C->second) {
> > +              CanAdd = false;
> > +              break;
> > +            }
> > +          }
> 
> This loop
> 
> > +        }
> > +
> > +        // Even worse, this child could conflict with another node already
> > +        // selected for the Tree. If that is the case, ignore this child.
> > +        if (CanAdd)
> > +          for (DenseSet<ValuePair>::iterator T = PrunedTree.begin(),
> > +                 E2 = PrunedTree.end(); T != E2; ++T) {
> > +            if (T->first == C->first.first ||
> > +                T->first == C->first.second ||
> > +                T->second == C->first.first ||
> > +                T->second == C->first.second ||
> > +                pairsConflict(*T, C->first, PairableInstUsers)) {
> > +              CanAdd = false;
> > +              break;
> > +            }
> > +          }
> 
> and this loop
> 
> > +
> > +        // And check the queue too...
> > +        if (CanAdd)
> > +          for (std::vector<ValuePairWithDepth>::iterator C2 = Q.begin(),
> > +                E2 = Q.end(); C2 != E2; ++C2) {
> > +            if (C2->first.first == C->first.first ||
> > +                C2->first.first == C->first.second ||
> > +                C2->first.second == C->first.first ||
> > +                C2->first.second == C->first.second ||
> > +                pairsConflict(C2->first, C->first, PairableInstUsers)) {
> > +              CanAdd = false;
> > +              break;
> > +            }
> 
> and this loop
> 
> > +          }
> > +
> > +        // Last but not least, check for a conflict with any of the
> > +        // already-chosen pairs.
> > +        if (CanAdd)
> > +          for (DenseMap<Value *, Value *>::iterator C2 =
> > +                ChosenPairs.begin(), E2 = ChosenPairs.end();
> > +               C2 != E2; ++C2) {
> > +            if (pairsConflict(*C2, C->first, PairableInstUsers)) {
> > +              CanAdd = false;
> > +              break;
> 
> (and almost this one)
> 
> > +            }
> > +          }
> > +
> > +        if (CanAdd) {
> > +          // This child can be added, but we may have chosen it in preference
> > +          // to an already-selected child. Check for this here, and if a
> > +          // conflict is found, then remove the previously-selected child
> > +          // before adding this one in its place.
> > +          for (DenseMap<ValuePair, size_t>::iterator C2
> > +                = BestChilden.begin(), E2 = BestChilden.end();
> > +               C2 != E2;) {
> > +            if (C2->first.first == C->first.first ||
> > +                C2->first.first == C->first.second ||
> > +                C2->first.second == C->first.first ||
> > +                C2->first.second == C->first.second ||
> > +                pairsConflict(C2->first, C->first, PairableInstUsers)) {
> > +              BestChilden.erase(C2++);
> > +            } else {
> > +              ++C2;
> and this loop look entirely the same. Can they be moved into a helper
> function?
> 
> 
> > +            }
> > +          }
> > +
> > +          BestChilden.insert(ValuePairWithDepth(C->first, C->second));
> > +        }
> > +      }
> > +
> > +      for (DenseMap<ValuePair, size_t>::iterator C
> > +            = BestChilden.begin(), E2 = BestChilden.end();
> > +           C != E2; ++C) {
> > +        size_t DepthF = getDepthFactor(C->first.first);
> > +        Q.push_back(ValuePairWithDepth(C->first, QTop.second+DepthF));
> > +      }
> > +    }
> > +  }
> > +
> > +  // This function finds the best tree of mututally-compatible connected
> > +  // pairs, given the choice of root pairs as an iterator range.
> > +  void BBVectorize::findBestTreeFor(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      DenseSet<ValuePair>  &PairableInstUsers,
> > +                      DenseMap<Value *, Value *>  &ChosenPairs,
> > +                      DenseSet<ValuePair>  &BestTree, size_t&BestMaxDepth,
> > +                      size_t&BestEffSize, VPIteratorPair ChoiceRange) {
> > +    for (std::multimap<Value *, Value *>::iterator J = ChoiceRange.first;
> > +         J != ChoiceRange.second; ++J) {
> > +
> > +      // Before going any further, make sure that this pair does not
> > +      // conflict with any already-selected pairs (see comment below
> > +      // near the Tree pruning for more details).
> > +      bool DoesConflict = false;
> > +      for (DenseMap<Value *, Value *>::iterator C = ChosenPairs.begin(),
> > +            E = ChosenPairs.end(); C != E; ++C) {
> > +        if (pairsConflict(*C, *J, PairableInstUsers)) {
> > +          DoesConflict = true;
> > +          break;
> > +        }
> > +      }
> > +      if (DoesConflict) continue;
> > +
> > +      DenseMap<ValuePair, size_t>  Tree;
> > +      buildInitialTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> > +        PairableInstUsers, ChosenPairs, Tree, *J);
> Fix indentation. (Align CandidatePairs and PairableInstUsers)
> 
> > +
> > +      // Because we'll keep the child with the largest depth, the largest
> > +      // depth is still the same in the unpruned Tree.
> > +      size_t MaxDepth = Tree.lookup(*J);
> > +
> > +      DEBUG(dbgs()<<  "BBV: found Tree for pair {"<<  *J->first<<
> > +                   "<->  "<<  *J->second<<  "} of depth "<<
> > +                   MaxDepth<<  " and size "<<  Tree.size()<<  "\n");
> > +
> > +      // At this point the Tree has been constructed, but, may contain
> > +      // contradictory children (meaning that different children of
> > +      // some tree node may be attempting to fuse the same instruction).
> > +      // So now we walk the tree again, in the case of a conflict,
> > +      // keep only the child with the largest depth. To break a tie,
> > +      // favor the first child.
> > +
> > +      DenseSet<ValuePair>  PrunedTree;
> > +      pruneTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> > +        PairableInstUsers, ChosenPairs, Tree, PrunedTree, *J);
> > +
> > +      size_t EffSize = 0;
> > +      for (DenseSet<ValuePair>::iterator S = PrunedTree.begin(),
> > +            E = PrunedTree.end(); S != E; ++S) {
> > +        EffSize += getDepthFactor(S->first);
> > +      }
> No braces.
> 
> > +
> > +      DEBUG(dbgs()<<  "BBV: found pruned Tree for pair {"<<  *J->first<<
> > +             "<->  "<<  *J->second<<  "} of depth "<<
> > +             MaxDepth<<  " and size "<<  PrunedTree.size()<<
> > +            " (effective size: "<<  EffSize<<  ")\n");
> > +      if (MaxDepth>= ReqChainDepth&&  EffSize>  BestEffSize) {
> > +        BestMaxDepth = MaxDepth;
> > +        BestEffSize = EffSize;
> > +        BestTree = PrunedTree;
> > +      }
> > +    }
> > +  }
> > +
> > +  // Given the list of candidate pairs, this function selects those
> > +  // that will be fused into vector instructions.
> > +  void BBVectorize::choosePairs(
> > +                      std::multimap<Value *, Value *>  &CandidatePairs,
> > +                      std::vector<Value *>  &PairableInsts,
> > +                      std::multimap<ValuePair, ValuePair>  &ConnectedPairs,
> > +                      DenseSet<ValuePair>  &PairableInstUsers,
> > +                      DenseMap<Value *, Value *>&  ChosenPairs) {
> > +    for (std::vector<Value *>::iterator I = PairableInsts.begin(),
> > +           E = PairableInsts.end(); I != E; ++I) {
> > +      // The number of possible pairings for this variable:
> > +      size_t NumChoices = CandidatePairs.count(*I);
> > +      if (!NumChoices) continue;
> > +
> > +      VPIteratorPair ChoiceRange = CandidatePairs.equal_range(*I);
> > +
> > +      // The best pair to choose and its tree:
> > +      size_t BestMaxDepth = 0, BestEffSize = 0;
> > +      DenseSet<ValuePair>  BestTree;
> > +      findBestTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> > +        PairableInstUsers, ChosenPairs, BestTree, BestMaxDepth,
> > +        BestEffSize, ChoiceRange);
> 
> Fix indentation. Align CandidatePairs, ...
> 
> > +
> > +      // A tree has been chosen (or not) at this point. If no tree was
> > +      // chosen, then this instruction, I, cannot be paired (and is no longer
> > +      // considered).
> > +
> > +      for (DenseSet<ValuePair>::iterator S = BestTree.begin(),
> > +           SE = BestTree.end(); S != SE; ++S) {
> > +        // Insert the members of this tree into the list of chosen pairs.
> > +        ChosenPairs.insert(ValuePair(S->first, S->second));
> > +        DEBUG(dbgs()<<  "BBV: selected pair: "<<  *S->first<<  "<->  "<<
> > +               *S->second<<  "\n");
> > +
> > +        // Remove all candidate pairs that have values in the chosen tree.
> > +        for (std::multimap<Value *, Value *>::iterator K =
> > +               CandidatePairs.begin(),
> > +             E2 = CandidatePairs.end(); K != E2;) {
> > +          if (K->first == S->first || K->second == S->first ||
> > +              K->second == S->second || K->first == S->second) {
> > +            // Don't remove the actual pair chosen so that it can be used
> > +            // in subsequent tree selections.
> > +            if (!(K->first == S->first&&  K->second == S->second)) {
> > +              CandidatePairs.erase(K++);
> > +            } else {
> > +              ++K;
> > +            }
> No braces needed.
> 
> > +          } else {
> > +            ++K;
> > +          }
> These are needed.
> 
> > +        }
> > +      }
> > +    }
> > +
> > +    DEBUG(dbgs()<<  "BBV: selected "<<  ChosenPairs.size()<<  " pairs.\n");
> > +  }
> > +
> > +  // Returns the value that is to be used as the pointer input to the vector
> > +  // instruction that fuses I with J.
> > +  Value *BBVectorize::getReplacementPointerInput(LLVMContext&  Context,
> > +                     Instruction *I, Instruction *J, unsigned o,
> > +                     bool&FlipMemInputs) {
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> > +    const TargetData&TD = *AA.getTargetData();
> > +
> > +    Value *IPtr, *JPtr;
> > +    unsigned IAlignment, JAlignment;
> > +    int64_t OffsetInElmts;
> > +    (void) getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment,
> > +             OffsetInElmts);
> Fix indentation.
> 
> > +
> > +    // The pointer value is taken to be the one with the lowest offset.
> > +    Value *VPtr;
> > +    if (OffsetInElmts>  0) {
> > +      VPtr = IPtr;
> > +    } else {
> > +      FlipMemInputs = true;
> > +      VPtr = JPtr;
> > +    }
> > +
> > +    Type *ArgType = cast<PointerType>(IPtr->getType())->getElementType();
> > +    Type *VArgType = getVecType(ArgType);
> > +    Type *VArgPtrType = PointerType::get(VArgType,
> > +      cast<PointerType>(IPtr->getType())->getAddressSpace());
> > +    return new BitCastInst(VPtr, VArgPtrType,
> > +                        I->hasName() ?
> > +                          I->getName() + ".v.i" + utostr(o) : "",
> > +                        /* insert before */ FlipMemInputs ? J : I);
> > +  }
> > +
> > +  // Returns the value that is to be used as the vector-shuffle mask to the
> > +  // vector instruction that fuses I with J.
> > +  Value *BBVectorize::getReplacementShuffleMask(LLVMContext&  Context,
> > +                     Instruction *I, Instruction *J, unsigned o) {
> > +    // This is the shuffle mask. We need to append the second
> > +    // mask to the first, and the numbers need to be adjusted.
> > +
> > +    Type *ArgType = I->getType();
> > +    Type *VArgType = getVecType(ArgType);
> > +
> > +    // Get the total number of elements in the fused vector type.
> > +    // By definition, this must equal the number of elements in
> > +    // the final mask.
> > +    unsigned NumElem = cast<VectorType>(VArgType)->getNumElements();
> > +    std::vector<Constant*>  Mask(NumElem);
> > +
> > +    Type *OpType = I->getOperand(0)->getType();
> > +    unsigned numInElem = cast<VectorType>(OpType)->getNumElements();
> > +
> > +    // For the mask from the first pair...
> > +    for (unsigned v = 0; v<  NumElem/2; ++v) {
> > +      int m = cast<ShuffleVectorInst>(I)->getMaskValue(v);
> > +      if (m<  0) {
> > +        Mask[v] = UndefValue::get(Type::getInt32Ty(Context));
> > +      } else {
> > +        unsigned vv = v;
> > +        if (m>= (int) numInElem) {
> > +          vv += (int) numInElem;
> > +        }
> No braces needed.
> 
> > +
> > +        Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), vv);
> > +      }
> > +    }
> > +
> > +    // For the mask from the second pair...
> > +    for (unsigned v = 0; v<  NumElem/2; ++v) {
> > +      int m = cast<ShuffleVectorInst>(J)->getMaskValue(v);
> > +      if (m<  0) {
> > +        Mask[v+NumElem/2] = UndefValue::get(Type::getInt32Ty(Context));
> > +      } else {
> > +        unsigned vv = v + (int) numInElem;
> > +        if (m>= (int) numInElem) {
> > +          vv += (int) numInElem;
> > +        }
> No braces needed.
> 
> > +
> > +        Mask[v+NumElem/2] =
> > +          ConstantInt::get(Type::getInt32Ty(Context), vv);
> > +      }
> > +    }
> 
> Those two for loops look very similar. Any chance they can be extracted
> into a helper function?
> 
> > +
> > +    return ConstantVector::get(Mask);
> > +  }
> > +
> > +  // Returns the value to be used as the specified operand of the vector
> > +  // instruction that fuses I with J.
> > +  Value *BBVectorize::getReplacementInput(LLVMContext&  Context, Instruction *I,
> > +                     Instruction *J, unsigned o, bool FlipMemInputs) {
> 
> Indentation.
> 
> > +    Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0);
> > +    Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1);
> > +
> > +      // Compute the fused vector type for this operand
> > +    Type *ArgType = I->getOperand(o)->getType();
> > +    VectorType *VArgType = getVecType(ArgType);
> > +
> > +    Instruction *L = I, *H = J;
> > +    if (FlipMemInputs) {
> > +      L = J;
> > +      H = I;
> > +    }
> > +
> > +    if (ArgType->isVectorTy()) {
> > +      unsigned numElem = cast<VectorType>(VArgType)->getNumElements();
> > +      std::vector<Constant*>  Mask(numElem);
> > +      for (unsigned v = 0; v<  numElem; ++v) {
> > +        Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), v);
> > +      }
> No braces.
> 
> > +
> > +      Instruction *BV = new ShuffleVectorInst(L->getOperand(o),
> > +                                              H->getOperand(o),
> > +                                              ConstantVector::get(Mask),
> > +                                              I->hasName() ?
> > +                                                I->getName() + ".v.i" +
> > +                                                  utostr(o) : "");
> 
> It may be better to not squeeze
> 
> I->hasName() ?  I->getName() + ".v.i" + utostr(o) : ""
> 
> into this operand.
> 
> > +      BV->insertBefore(J);
> > +      return BV;
> > +    }
> > +
> > +    // If these two inputs are the output of another vector instruction,
> > +    // then we should use that output directly. It might be necessary to
> > +    // permute it first. [When pairings are fused recursively, you can
> > +    // end up with cases where a large vector is decomposed into scalars
> > +    // using extractelement instructions, then built into size-2
> > +    // vectors using insertelement and the into larger vectors using
> > +    // shuffles. InstCombine does not simplify all of these cases well,
> > +    // and so we make sure that shuffles are generated here when possible.
> > +    ExtractElementInst *LEE
> > +      = dyn_cast<ExtractElementInst>(L->getOperand(o));
> > +    ExtractElementInst *HEE
> > +      = dyn_cast<ExtractElementInst>(H->getOperand(o));
> > +
> > +    if (LEE&&  HEE&&
> > +        LEE->getOperand(0)->getType() == VArgType) {
> > +      unsigned LowIndx = cast<ConstantInt>(LEE->getOperand(1))->getZExtValue();
> > +      unsigned HighIndx = cast<ConstantInt>(HEE->getOperand(1))->getZExtValue();
> > +      if (LEE->getOperand(0) == HEE->getOperand(0)) {
> > +        if (LowIndx == 0&&  HighIndx == 1)
> > +          return LEE->getOperand(0);
> > +
> > +        std::vector<Constant*>  Mask(2);
> > +        Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx);
> > +        Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx);
> > +
> > +        Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0),
> > +                                            UndefValue::get(VArgType),
> > +                                            ConstantVector::get(Mask),
> > +                                            I->hasName() ?
> > +                                              I->getName() + ".v.i" +
> > +                                                utostr(o) : "");
> 
> Do not squeeze
> 
> I->hasName() ?  I->getName() + ".v.i" + utostr(o) : ""
> 
> into the operand.
> 
> > +        BV->insertBefore(J);
> > +        return BV;
> > +      }
> > +
> > +      std::vector<Constant*>  Mask(2);
> > +      HighIndx += VArgType->getNumElements();
> > +      Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx);
> > +      Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx);
> > +
> > +      Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0),
> > +                                          HEE->getOperand(0),
> > +                                          ConstantVector::get(Mask),
> > +                                          I->hasName() ?
> > +                                            I->getName() + ".v.i" +
> > +                                              utostr(o) : "");
> No squeezing please.
> 
> > +      BV->insertBefore(J);
> > +      return BV;
> > +    }
> > +
> > +    Instruction *BV1 = InsertElementInst::Create(
> > +                                               UndefValue::get(VArgType),
> > +                                               L->getOperand(o), CV0,
> > +                                               I->hasName() ?
> > +                                                 I->getName() + ".v.i" +
> > +                                                   utostr(o) + ".1"
> > +                                               : "");
> No squeezing.
> 
> > +    BV1->insertBefore(I);
> > +    Instruction *BV2 = InsertElementInst::Create(BV1, H->getOperand(o),
> > +                                               CV1, I->hasName() ?
> > +                                                 I->getName() +
> > +                                                   ".v.i" + utostr(o)
> > +                                                   + ".2" : "");
> 
> No squeezing.
> 
> Also, the stuff that you squeeze in looks very similar or even
> identical. Do you really write the whole expression 4 times in this
> function? Is there no better option?
> 
> > +    BV2->insertBefore(J);
> > +    return BV2;
> > +  }
> > +
> > +  // This function creates an array of values that will be used as the inputs
> > +  // to the vector instruction that fuses I with J.
> > +  void BBVectorize::getReplacementInputsForPair(LLVMContext&  Context,
> > +                     Instruction *I, Instruction *J,
> > +                     SmallVector<Value *, 3>  &ReplacedOperands,
> > +                     bool&FlipMemInputs) {
> > +    FlipMemInputs = false;
> > +    unsigned NumOperands = I->getNumOperands();
> > +
> > +    for (unsigned p = 0, o = NumOperands-1; p<  NumOperands; ++p, --o) {
> > +      // Iterate backward so that we look at the store pointer
> > +      // first and know whether or not we need to flip the inputs.
> > +
> > +      if (isa<LoadInst>(I) || (o == 1&&  isa<StoreInst>(I))) {
> > +        // This is the pointer for a load/store instruction.
> > +        ReplacedOperands[o] = getReplacementPointerInput(Context, I, J, o,
> > +                                FlipMemInputs);
> > +        continue;
> > +      } else if (isa<CallInst>(I)&&  o == NumOperands-1) {
> > +        Function *F = cast<CallInst>(I)->getCalledFunction();
> > +        unsigned IID = F->getIntrinsicID();
> > +        BasicBlock&BB = *cast<Instruction>(I)->getParent();
> > +
> > +        Module *M = BB.getParent()->getParent();
> > +        Type *ArgType = I->getType();
> > +        Type *VArgType = getVecType(ArgType);
> > +
> > +        // FIXME: is it safe to do this here?
> > +        ReplacedOperands[o] = Intrinsic::getDeclaration(M,
> > +          (Intrinsic::ID) IID, VArgType);
> > +        continue;
> > +      } else if (isa<ShuffleVectorInst>(I)&&  o == NumOperands-1) {
> > +        ReplacedOperands[o] = getReplacementShuffleMask(Context, I, J, o);
> > +        continue;
> > +      }
> > +
> > +      ReplacedOperands[o] =
> > +        getReplacementInput(Context, I, J, o, FlipMemInputs);
> > +    }
> > +  }
> > +
> > +  // This function creates two values that represent the outputs of the
> > +  // original I and J instructions. These are generally vector shuffles
> > +  // or extracts. In many cases, these will end up being unused and, thus,
> > +  // eliminated by later passes.
> > +  void BBVectorize::replaceOutputsOfPair(LLVMContext&  Context, Instruction *I,
> > +                     Instruction *J, Instruction *K,
> > +                     Instruction *&InsertionPt,
> > +                     Instruction *&K1, Instruction *&K2,
> > +                     bool&FlipMemInputs) {
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> > +
> > +    Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0);
> > +    Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1);
> > +
> > +    if (isa<StoreInst>(I)) {
> > +      AA.replaceWithNewValue(I, K);
> > +      AA.replaceWithNewValue(J, K);
> > +    } else {
> > +      Type *IType = I->getType();
> > +      Type *VType = getVecType(IType);
> > +
> > +      if (IType->isVectorTy()) {
> > +          unsigned numElem = cast<VectorType>(IType)->getNumElements();
> > +          std::vector<Constant*>  Mask1(numElem), Mask2(numElem);
> > +          for (unsigned v = 0; v<  numElem; ++v) {
> > +            Mask1[v] = ConstantInt::get(Type::getInt32Ty(Context), v);
> > +            Mask2[v] = ConstantInt::get(Type::getInt32Ty(Context), numElem+v);
> > +          }
> > +
> > +          K1 = new ShuffleVectorInst(K, UndefValue::get(VType),
> > +                                       ConstantVector::get(
> > +                                         FlipMemInputs ? Mask2 : Mask1),
> > +                                       K->hasName() ?
> > +                                         K->getName() + ".v.r1" : "");
> > +          K2 = new ShuffleVectorInst(K, UndefValue::get(VType),
> > +                                       ConstantVector::get(
> > +                                         FlipMemInputs ? Mask1 : Mask2),
> > +                                       K->hasName() ?
> > +                                         K->getName() + ".v.r2" : "");
> > +      } else {
> > +        K1 = ExtractElementInst::Create(K, FlipMemInputs ? CV1 : CV0,
> > +                                          K->hasName() ?
> > +                                            K->getName() + ".v.r1" : "");
> > +        K2 = ExtractElementInst::Create(K, FlipMemInputs ? CV0 : CV1,
> > +                                          K->hasName() ?
> > +                                            K->getName() + ".v.r2" : "");
> > +      }
> > +
> > +      K1->insertAfter(K);
> > +      K2->insertAfter(K1);
> > +      InsertionPt = K2;
> > +    }
> > +  }
> > +
> > +  // Move all uses of the function I (including pairing-induced uses) after J.
> > +  void BBVectorize::moveUsesOfIAfterJ(BasicBlock&BB,
> > +                     DenseMap<Value *, Value *>&  ChosenPairs,
> > +                     std::multimap<Value *, Value *>  &LoadMoveSet,
> > +                     Instruction *&InsertionPt,
> > +                     Instruction *I, Instruction *J) {
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> > +
> > +    // Skip to the first instruction past I.
> > +    BasicBlock::iterator L = BB.begin();
> > +    for (; cast<Instruction>(L) != cast<Instruction>(I); ++L);
> Why don't you put the initialization into the 'for'? I do not see any
> need for the cast instructions.
> 
> > +    ++L;
> > +
> > +    DenseSet<Value *>  Users;
> > +    AliasSetTracker WriteSet(AA);
> > +    for (BasicBlock::iterator E = BB.end();
> > +          cast<Instruction>(L) != cast<Instruction>(J);) {
> Why do you need to cast here?
> 
> > +      bool UsesI;
> > +      trackUsesOfI(Users, WriteSet, I, L, UsesI, true,&LoadMoveSet);
> > +
> > +      if (UsesI) {
> > +        // Make sure that the order of selected pairs is not reversed.
> > +        // So if we come to a user instruction that has a pair,
> > +        // then mark the pair's tree as used as well.
> > +        DenseMap<Value *, Value *>::iterator LP = ChosenPairs.find(L);
> > +        if (LP != ChosenPairs.end())
> > +          Users.insert(LP->second);
> > +
> > +        // Move this instruction
> > +        Instruction *InstToMove = L; ++L;
> > +
> > +        DEBUG(dbgs()<<  "BBV: moving: "<<  *InstToMove<<
> > +                        " to after "<<  *InsertionPt<<  "\n");
> > +        InstToMove->removeFromParent();
> > +        InstToMove->insertAfter(InsertionPt);
> > +        InsertionPt = InstToMove;
> > +      } else {
> > +        ++L;
> > +      }
> > +    }
> > +  }
> > +
> > +  // Collect all load instruction that are in the move set of a given pair.
> > +  // These loads depend on the first instruction, I, and so need to be moved
> > +  // after J when the pair is fused.
> > +  void BBVectorize::collectPairLoadMoveSet(BasicBlock&BB,
> > +                     DenseMap<Value *, Value *>  &ChosenPairs,
> > +                     std::multimap<Value *, Value *>  &LoadMoveSet,
> > +                     Instruction *I, Instruction *J) {
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> > +
> > +    // Skip to the first instruction past I.
> > +    BasicBlock::iterator L = BB.begin();
> > +    for (; cast<Instruction>(L) != cast<Instruction>(I); ++L);
> Why not initialize in the 'for' loop. Why cast here?
> 
> > +    ++L;
> > +
> > +    DenseSet<Value *>  Users;
> > +    AliasSetTracker WriteSet(AA);
> > +    for (BasicBlock::iterator E = BB.end();
> This 'E' is never used in the loop.
> 
> > +          cast<Instruction>(L) != cast<Instruction>(J); ++L) {
> Why cast here?
> 
> > +      bool UsesI;
> > +      trackUsesOfI(Users, WriteSet, I, L, UsesI);
> > +
> > +      if (UsesI) {
> > +        // Make sure that the order of selected pairs is not reversed.
> > +        // So if we come to a user instruction that has a pair,
> > +        // then mark the pair's tree as used as well.
> > +        DenseMap<Value *, Value *>::iterator LP = ChosenPairs.find(L);
> > +        if (LP != ChosenPairs.end())
> > +          Users.insert(LP->second);
> > +
> > +        if (L->mayReadFromMemory())
> > +          LoadMoveSet.insert(ValuePair(L, I));
> > +      }
> > +    }
> > +  }
> > +
> > +  // In cases where both load/stores and the computation of their pointers
> > +  // are chosen for vectorization, we can end up in a situation where the
> > +  // aliasing analysis starts returning different query results as the
> > +  // process of fusing instruction pairs continues. Because the algorithm
> > +  // relies on finding the same use trees here as were found earlier, we'll
> > +  // need to precompute the necessary aliasing information here and then
> > +  // manually update it during the fusion process.
> > +  void BBVectorize::collectLoadMoveSet(BasicBlock&BB,
> > +                     std::vector<Value *>  &PairableInsts,
> > +                     DenseMap<Value *, Value *>  &ChosenPairs,
> > +                     std::multimap<Value *, Value *>  &LoadMoveSet) {
> > +    for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> > +         PIE = PairableInsts.end(); PI != PIE; ++PI) {
> > +      DenseMap<Value *, Value *>::iterator P = ChosenPairs.find(*PI);
> > +      if (P == ChosenPairs.end()) continue;
> > +
> > +      if (getDepthFactor(P->first) == 0) {
> > +        continue;
> > +      }
> No braces.
> 
> > +
> > +      Instruction *I = cast<Instruction>(P->first),
> > +        *J = cast<Instruction>(P->second);
> > +
> > +      collectPairLoadMoveSet(BB, ChosenPairs, LoadMoveSet, I, J);
> > +    }
> > +  }
> > +
> > +  // This function fuses the chosen instruction pairs into vector instructions,
> > +  // taking care preserve any needed scalar outputs and, then, it reorders the
> > +  // remaining instructions as needed (users of the first member of the pair
> > +  // need to be moved to after the location of the second member of the pair
> > +  // because the vector instruction is inserted in the location of the pair's
> > +  // second member).
> > +  void BBVectorize::fuseChosenPairs(BasicBlock&BB,
> > +                     std::vector<Value *>  &PairableInsts,
> > +                     DenseMap<Value *, Value *>  &ChosenPairs) {
> > +    LLVMContext&  Context = BB.getContext();
> > +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> > +    ScalarEvolution&SE = getAnalysis<ScalarEvolution>();
> > +
> > +    std::multimap<Value *, Value *>  LoadMoveSet;
> > +    collectLoadMoveSet(BB, PairableInsts, ChosenPairs, LoadMoveSet);
> > +
> > +    DEBUG(dbgs()<<  "BBV: initial: \n"<<  BB<<  "\n");
> > +
> > +    for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> > +         PIE = PairableInsDenseMap<Vts.end(); PI != PIE; ++PI) {
> > +      DenseMap<Value *, Value *>::iterator P = ChosenPairs.find(*PI);
> > +      if (P == ChosenPairs.end()) continue;
> > +
> > +      if (getDepthFactor(P->first) == 0) {
> > +        // These instructions are not really fused, but are tracked as though
> > +        // they are. Any case in which it would be interesting to fuse them
> > +        // will be taken care of by InstCombine.
> > +        --NumFusedOps;
> > +        continue;
> > +      }
> > +
> > +      Instruction *I = cast<Instruction>(P->first),
> > +        *J = cast<Instruction>(P->second);
> 
> Why do we need to cast here? It seems ChosenPairs should actually be of
> type DenseMap<Instruction *, Instruction *>. Can PariableInsts also be
> of type std::vector<Instruction*>?
> 
> > +
> > +      DEBUG(dbgs()<<  "BBV: fusing: "<<  *I<<
> > +             "<->  "<<  *J<<  "\n");
> > +
> > +      bool FlipMemInputs;
> > +      unsigned NumOperands = I->getNumOperands();
> > +      SmallVector<Value *, 3>  ReplacedOperands(NumOperands);
> > +      getReplacementInputsForPair(Context, I, J, ReplacedOperands,
> > +        FlipMemInputs);
> > +
> > +      // Make a copy of the original operation, change its type to the vector
> > +      // type and replace its operands with the vector operands.
> > +      Instruction *K = I->clone();
> > +      if (I->hasName()) K->takeName(I);
> > +
> > +      if (!isa<StoreInst>(K)) {
> > +        K->mutateType(getVecType(I->getType()));
> > +      }
> No braces.
> 
> > +
> > +      for (unsigned o = 0; o<  NumOperands; ++o) {
> > +        K->setOperand(o, ReplacedOperands[o]);
> > +      }
> No braces.
> 
> > +
> > +      // If we've flipped the memory inputs, make sure that we take the correct
> > +      // alignment.
> > +      if (FlipMemInputs) {
> > +        if (isa<StoreInst>(K))
> > +          cast<StoreInst>(K)->setAlignment(cast<StoreInst>(J)->getAlignment());
> > +        else
> > +          cast<LoadInst>(K)->setAlignment(cast<LoadInst>(J)->getAlignment());
> > +      }
> > +
> > +      K->insertAfter(J);
> > +
> > +      // Instruction insertion point:
> > +      Instruction *InsertionPt = K;
> > +      Instruction *K1 = 0, *K2 = 0;
> > +      replaceOutputsOfPair(Context, I, J, K, InsertionPt, K1, K2,
> > +        FlipMemInputs);
> > +
> > +      // The use tree of the first original instruction must be moved to after
> > +      // the location of the second instruction. The entire use tree of the
> > +      // first instruction is disjoint from the input tree of the second
> > +      // (by definition), and so commutes with it.
> > +
> > +      moveUsesOfIAfterJ(BB, ChosenPairs, LoadMoveSet, InsertionPt,
> > +        cast<Instruction>(I), cast<Instruction>(J));
> > +
> > +      if (!isa<StoreInst>(I)) {
> > +        I->replaceAllUsesWith(K1);
> > +        J->replaceAllUsesWith(K2);
> > +        AA.replaceWithNewValue(I, K1);
> > +        AA.replaceWithNewValue(J, K2);
> > +      }
> > +
> > +      // Instructions that may read from memory may be in the load move set.
> > +      // Once an instruction is fused, we no longer need its move set, and so
> > +      // the values of the map never need to be updated. However, when a load
> > +      // is fused, we need to merge the entries from both instructions in the
> > +      // pair in case those instructions were in the move set of some other
> > +      // yet-to-be-fused pair. The loads in question are the keys of the map.
> > +      if (I->mayReadFromMemory()) {
> > +        std::vector<ValuePair>  NewSetMembers;
> > +        VPIteratorPair IPairRange = LoadMoveSet.equal_range(I);
> > +        VPIteratorPair JPairRange = LoadMoveSet.equal_range(J);
> > +        for (std::multimap<Value *, Value *>::iterator N = IPairRange.first;
> > +               N != IPairRange.second; ++N) {
> Indentation.
> 
> > +          NewSetMembers.push_back(ValuePair(K, N->second));
> > +        }
> No braces.
> 
> > +        for (std::multimap<Value *, Value *>::iterator N = JPairRange.first;
> > +               N != JPairRange.second; ++N) {
> Indentation.
> 
> > +          NewSetMembers.push_back(ValuePair(K, N->second));
> > +        }
> No braces.
> 
> > +        for (std::vector<ValuePair>::iterator A = NewSetMembers.begin(),
> > +               AE = NewSetMembers.end(); A != AE; ++A) {
> Indentation.
> 
> > +          LoadMoveSet.insert(*A);
> > +        }
> No braces.
> 
> > +      }
> > +
> > +      SE.forgetValue(I);
> > +      SE.forgetValue(J);
> > +      I->eraseFromParent();
> > +      J->eraseFromParent();
> > +    }
> > +
> > +    DEBUG(dbgs()<<  "BBV: final: \n"<<  BB<<  "\n");
> > +  }
> > +}
> 
> > --- /dev/null
> > +++ b/test/Transforms/BBVectorize/ld1.ll
> > @@ -0,0 +1,28 @@
> > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> > +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s
> > +
> > +define double @test1(double* %a, double* %b, double* %c) nounwind uwtable readonly {
> > +entry:
> > +  %i0 = load double* %a, align 8
> > +  %i1 = load double* %b, align 8
> > +  %mul = fmul double %i0, %i1
> > +  %i2 = load double* %c, align 8
> > +  %add = fadd double %mul, %i2
> > +  %arrayidx3 = getelementptr inbounds double* %a, i64 1
> > +  %i3 = load double* %arrayidx3, align 8
> > +  %arrayidx4 = getelementptr inbounds double* %b, i64 1
> > +  %i4 = load double* %arrayidx4, align 8
> > +  %mul5 = fmul double %i3, %i4
> > +  %arrayidx6 = getelementptr inbounds double* %c, i64 1
> > +  %i5 = load double* %arrayidx6, align 8
> > +  %add7 = fadd double %mul5, %i5
> > +  %mul9 = fmul double %add, %i1
> > +  %add11 = fadd double %mul9, %i2
> > +  %mul13 = fmul double %add7, %i4
> > +  %add15 = fadd double %mul13, %i5
> > +  %mul16 = fmul double %add11, %add15
> > +  ret double %mul16
> > +; CHECK: @test1
> > +; CHECK:<2 x double>
> 
> For me this CHECK looks very short. Should all instructions be
> vectorized or is it sufficient if one is vectorized. Also, it would be
> nice if I could understand the code that is generated just by looking
> at the test case. Any reason not to CHECK: for the entire sequence of
> expected instructions (This applies to all test cases).
> 
> 
> > +}
> > +
> > diff --git a/test/Transforms/BBVectorize/loop1.ll b/test/Transforms/BBVectorize/loop1.ll
> > new file mode 100644
> > index 0000000..4c60b42
> > --- /dev/null
> > +++ b/test/Transforms/BBVectorize/loop1.ll
> > @@ -0,0 +1,39 @@
> > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> > +target triple = "x86_64-unknown-linux-gnu"
> > +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s
> > +; RUN: opt<  %s -loop-unroll -unroll-allow-partial -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s
> Why are you checking here with -loop-unroll? Do you expect something
> special. In case you do it may make sense to use another -check-prefix
> 
> > --- /dev/null
> > +++ b/test/Transforms/BBVectorize/req-depth.ll
> > @@ -0,0 +1,17 @@
> > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
> > +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth 3 -S | FileCheck %s -check-prefix=CHECK-RD3
> > +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth 2 -S | FileCheck %s -check-prefix=CHECK-RD2
> > +
> > +define double @test1(double %A1, double %A2, double %B1, double %B2) {
> > +	%X1 = fsub double %A1, %B1
> > +	%X2 = fsub double %A2, %B2
> > +	%Y1 = fmul double %X1, %A1
> > +	%Y2 = fmul double %X2, %A2
> > +	%R  = fmul double %Y1, %Y2
> > +	ret double %R
> > +; CHECK-RD3: @test1
> > +; CHECK-RD2: @test1
> > +; CHECK-RD3-NOT:<2 x double>
> > +; CHECK-RD2:<2 x double>
> What about sorting them by check prefix?
> 
> > +++ b/test/Transforms/BBVectorize/simple-ldstr.ll
> > @@ -0,0 +1,69 @@
> > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> > +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s
> > +
> > +; Simple 3-pair chain with loads and stores
> > +define void @test1(double* %a, double* %b, double* %c) nounwind uwtable readonly {
> > +entry:
> > +  %i0 = load double* %a, align 8
> > +  %i1 = load double* %b, align 8
> > +  %mul = fmul double %i0, %i1
> > +  %arrayidx3 = getelementptr inbounds double* %a, i64 1
> > +  %i3 = load double* %arrayidx3, align 8
> > +  %arrayidx4 = getelementptr inbounds double* %b, i64 1
> > +  %i4 = load double* %arrayidx4, align 8
> > +  %mul5 = fmul double %i3, %i4
> > +  store double %mul, double* %c, align 8
> > +  %arrayidx5 = getelementptr inbounds double* %c, i64 1
> > +  store double %mul5, double* %arrayidx5, align 8
> > +  ret void
> > +; CHECK: @test1
> > +; CHECK:<2 x double>
> 
> Especially here the entire vectorization sequence is interesting as I
> would love to see the alignment you produce for load/store instructions.
> 
> > +; Basic depth-3 chain (last pair permuted)
> > +define double @test2(double %A1, double %A2, double %B1, double %B2) {
> > +	%X1 = fsub double %A1, %B1
> > +	%X2 = fsub double %A2, %B2
> > +	%Y1 = fmul double %X1, %A1
> > +	%Y2 = fmul double %X2, %A2
> > +	%Z1 = fadd double %Y2, %B1
> > +	%Z2 = fadd double %Y1, %B2
> > +	%R  = fmul double %Z1, %Z2
> > +	ret double %R
> > +; CHECK: @test2
> > +; CHECK:<2 x double>
> 
> Again, the entire sequence would show how you express a permutation.
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory