[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Fri Dec 2 11:17:50 PST 2011

On Fri, 2011-12-02 at 19:13 +0100, Tobias Grosser wrote:
> On 12/02/2011 06:32 PM, Hal Finkel wrote:
> > On Fri, 2011-12-02 at 17:07 +0100, Tobias Grosser wrote:
> >> On 11/23/2011 05:52 PM, Hal Finkel wrote:
> >>> On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:
> >>>>>   On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
> >>>>>>   >   Tobias,
> >>>>>>   >
> >>>>>>   >   I've attached an updated patch. It contains a few bug fixes and many
> >>>>>>   >   (refactoring and coding-convention) changes inspired by your comments.
> >>>>>>   >
> >>>>>>   >   I'm currently trying to fix the bug responsible for causing a compile
> >>>>>>   >   failure when compiling
> >>>>>>   >   test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the
> >>>>>>   >   pass begins to fuse instructions in a basic block in this file, the
> >>>>>>   >   aliasing analysis starts producing different (more pessimistic) query
> >>>>>>   >   answers. I've never before seen it do this, and I'm not sure why it is
> >>>>>>   >   happening. Also odd, at the same time, the numeric labels that are
> >>>>>>   >   assigned to otherwise unnamed instructions, change. I don't think I've
> >>>>>>   >   seen this happen either (they generally seem to stay fixed once
> >>>>>>   >   assigned). I don't know if these two things are related, but if anyone
> >>>>>>   >   can provide some insight, I'd appreciate it.
> >>>>>
> >>>>>   I think that I see what is happening in this case (please someone tell
> >>>>>   me if this does not make sense). In the problematic basic block, there
> >>>>>   are some loads and stores that are independent. The default aliasing
> >>>>>   analysis can tell that these loads and stores don't reference the same
> >>>>>   memory region. Then, some of the inputs to the getelementptr
> >>>>>   instructions used for these loads and stores are fused by the
> >>>>>   vectorization. After this happens, the aliasing analysis loses its
> >>>>>   ability to tell that the loads and stores that make use of those
> >>>>>   vector-calculated indices are independent.
> >>> The attached patch works around this issue. Please review.
> >>
> >> Hi Hal,
> >>
> >> thanks for reworking the patch several times. This time the patch looks
> >> a lot better, with nicely structured helper functions and a lot more
> >> consistent style. It is a lot better to read and review.
> >>
> >> Unfortunately I had today just two hours to look into this patch. I
> >> added a lot of remarks, but especially in the tree pruning part I run a
> >> little out of time such that the remarks are mostly obvious style
> >> remarks. I hope I have a little bit more time over the weekend to look
> >> into this again, but I wanted to get out the first junk of comments today.
> >
> > Thanks! I appreciate you putting in that kind of time. [Also, there is a
> > more-recent version than the one you're looking at,
> > llvm_bb_vectorize-20111122.diff, make sure that you look at that one].
> 
> I looked at the one that was in the email I replied to. This was
> 'llvm_bb_vectorize-20111122.diff'. I could not see any new one or did I 
> overlook anything.

The more-recent version of the patch has that has:
          case Intrinsic::pow:
            VIntr = !NoMath;
            break;
          case Intrinsic::fma:
            VIntr = !NoFMA;
            break;
          }
(which was a fix in response to a bug you pointed out in your first
review).

You can grab it here:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20111121/132206.html

> 
> >> Especially with my style remarks, do not feel forced to follow all of my
> >> comments. Style stuff is highly personal and often other people may
> >> suggest other fixes. In case you believe your style is better just check
> >> with the LLVM coding reference or other parts of the LLVM source code
> >> and try to match this as good as possible.
> >
> > I have tried to do that, but other eyes are always useful.
> 
> You were very successful. The code looks already very nice.
> 
> >> On the algorithmic point of view I am not yet completely convinced the
> >> algorithm itself is not worse than N^2. I really want to understand this
> >> part better and plan to look into problematic behaviour in functions not
> >> directly visible in the core routines.
> >
> > Here are my thoughts:
> >
> > getCandidatePairs - This is O(N^2) in the number of instructions in BB.
> > [This assumes that all of the external helper functions, especially
> > queries to alias and scalar-evolution analysis don't make the complexity
> > worse].
> 
> Yes. Alias analysis and scalar-evolution are queries that may not be 
> fast enough (complexity wise). Dan Gohman proposed to check them and 
> until today I did not find the time to verify they yield worse complexity.
> 
> > computeConnectedPairs - For each candidate pair, there are O(N^2) in the
> > worse case, this loops over all pairs of uses of the instructions in the
> > candidate pair. So if every instruction has U uses, then this has
> > complexity O(N^2 U^2). In practice, U is small, and so this is just
> > another O(N^2). Unfortunately, that is only true if determining if a
> > given use pair is also a candidate pair is sub-linear. It is not in this
> > case (it is linear because the multimap only indexes the first key, not
> > both), but that could be changed by using a different (or additional)
> > data structure to hold the candidate pairs. I should probably add a
> > candidate-pair DenseSet to clean this up, then it will be O(N^2) [note
> > that if an instruction has many uses, then it cannot be in many
> > candidate pairs].
> >
> > buildDepMap - This records all uses of each pairable instruction; as
> > implemented, this is also O(N^2).
> >
> > choosePairs - As you point out, this is the most complicated to figure
> > out. The reason is that it deals with connected pairs and that, as pairs
> > are selected, other pairs are dropped from contention. Fundamentally, it
> > is considering pair-to-pair interactions, and so that seems like O(N^4).
> > However, it only considers only the first possible tree that can be
> > built from connected pairs for each remaining candidate pair. This is
> > important for the following reason: If a block of N instructions had the
> > maximum number of candidate pairs: N*(N-1)/2, then it must be the case
> > that all instrucitons are independent, and, thus, none of them are
> > connected. Thus, in this regime, the algorithm is O(N^2). On the other
> > extreme, if the block has N candidate pairs, then they could all be
> > connected, but then there are only N of them. In this extreme, the
> > algorithm is also O(N^2). Because both extremes are O(N^2), I've
> > generally figured that this is also O(N^2), but I may want to do more
> > math for the paper I'm writing.
> >
> > fuseChosenPairs - This is linear in the number of selected pairs, and
> > that number scales with the number of instructions in the BB. In the
> > worst case, after each fusion, a number of instructions need to be moved
> > (up to almost the total number in the block). Thus, this is also O(N^2).
> 
> Thanks for this more detailed analysis. It definitely helps to 
> understand your complexity calculation.
> 
> >> Also, you choose a very high bound (4000) to limit the quadratic
> >> behaviour, because you stated otherwise optimizations will be missed.
> >> Can you send an example where a value smaller than 4000 would miss
> >> optimizations?
> >
> > I may be able to find an example, but I don't have one immediately
> > available. I had run the test suite with various settings, and chose the
> > default value (4000) based on the approximate point where the average
> > performance stopped increasing. Thus, I'm surmising that choosing a
> > smaller value causes useful optimization opportunities to be missed.
> 
> I was just surprised that 4000 is that value, because I don't believe 
> -partial-unroll would create basic blocks containing so many instructions.

I was also quite surprised.

 -Hal

> 
> >>> +#define BBV_NAME "bb-vectorize"
> >>> +#define DEBUG_TYPE BBV_NAME
> >>> +#include "llvm/Constants.h"
> >>> +#include "llvm/DerivedTypes.h"
> >>> +#include "llvm/Function.h"
> >>> +#include "llvm/Instructions.h"
> >>> +#include "llvm/IntrinsicInst.h"
> >>> +#include "llvm/Intrinsics.h"
> >>> +#include "llvm/LLVMContext.h"
> >>> +#include "llvm/Pass.h"
> >>> +#include "llvm/Type.h"
> >>> +#include "llvm/ADT/DenseMap.h"
> >>> +#include "llvm/ADT/DenseSet.h"
> >>> +#include "llvm/ADT/SmallVector.h"
> >>> +#include "llvm/ADT/Statistic.h"
> >>> +#include "llvm/ADT/StringExtras.h"
> >>> +#include "llvm/Analysis/AliasAnalysis.h"
> >>> +#include "llvm/Analysis/AliasSetTracker.h"
> >>> +#include "llvm/Analysis/ValueTracking.h"
> >> This comes to lines later, if you want to sort it alphabetically.
> >
> > What do you mean? I had run the include statements through sort at some
> > point, so I think they're all sorted.
> 
> #include "llvm/Analysis/AliasAnalysis.h"
> #include "llvm/Analysis/AliasSetTracker.h"
> #include "llvm/Analysis/ValueTracking.h"
> #include "llvm/Analysis/ScalarEvolution.h"
> #include "llvm/Analysis/ScalarEvolutionExpressions.h"
> 
> I removed the last part of the list, so you don't see the problem. All, 
> except 'ValueTracking' are sorted.
> 
> >>> +    // Returns the weight associated with the provided value. A chain of
> >>> +    // candidate pairs has a length given by the sum of the weights of its
> >>> +    // members (one weight per pair; the weight of each member of the pair
> >>> +    // is assumed to be the same). This length is then compared to the
> >>> +    // chain-length threshold to determine if a given chain is significant
> >>> +    // enough to be vectorized. The length is also used in comparing
> >>> +    // candidate chains where longer chains are considered to be better.
> >>> +    // Note: when this function returns 0, the resulting instructions are
> >>> +    // not actually fused.
> >>> +    static inline size_t getDepthFactor(Value *V) {
> >>> +      if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V)) {
> >>
> >> Why have InsertElementInst, ExtractElementInst instructions a weight of
> >> zero?
> >
> > Two reasons: First, they cannot be usefully fused. Second, because the
> > pass generates a lot of these, they can confuse the simple metric used
> > to compare the trees in the next iteration. Thus, giving them a weight
> > of zero allows the pass to essentially ignore them in subsequent
> > iterations when looking for vectorization opportunities while still
> > tracking dependency chains that flow through those instructions.
> 
> OK. Please put this in a comment. I think it is useful to know.
> 
> >>> +    // Returns true if the provided CallInst represents an intrinsic that can
> >>> +    // be vectorized.
> >>> +    bool isVectorizableIntrinsic(CallInst* I) {
> >>> +      bool VIntr = false;
> >>> +      if (Function *F = I->getCalledFunction()) {
> >>> +        if (unsigned IID = F->getIntrinsicID()) {
> >>
> >> Use early exit.
> >>
> >> Function *F = I->getCalledFunction();
> >>
> >> if (!F)
> >>     return false;
> >>
> >> unsigned IID = F->getIntrinsicID();
> >>
> >> if (!IID)
> >>     return false;
> >>
> >> switch(IID) {
> >> 	case Intrinsic::sqrt:
> >> 	case Intrinsic::powi:
> >> 	case Intrinsic::sin:
> >> 	case Intrinsic::cos:
> >> 	case Intrinsic::log:
> >> 	case Intrinsic::log2:
> >> 	case Intrinsic::log10:
> >> 	case Intrinsic::exp:
> >> 	case Intrinsic::exp2:
> >> 	case Intrinsic::pow:
> >> 		return !NoMath;
> >> 	case Intrinsic::fma:
> >> 		return !NoFMA;
> >> }
> >>
> >> Is this switch exhaustive or are you missing a default case?
> >
> > You're not looking at the most-recent version. You had (correctly)
> > commented on this last time, and I had corrected it.
> 
> I look at llvm_bb_vectorize-20111122.diff, which was according to my 
> mail client posted at Date: Wed, 23 Nov 2011 10:52:37 -0600.
> In your mail from today you said:
> 
> " I know that I've sent a large number of revisions, but I
> think that the version that I posted on 11/23 should be reviewed."
> 
> Did I get something wrong?
> 
> Cheers
> Tobi

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory