[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Fri Dec 2 10:13:45 PST 2011

On 12/02/2011 06:32 PM, Hal Finkel wrote:
> On Fri, 2011-12-02 at 17:07 +0100, Tobias Grosser wrote:
>> On 11/23/2011 05:52 PM, Hal Finkel wrote:
>>> On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:
>>>>>   On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
>>>>>>   >   Tobias,
>>>>>>   >
>>>>>>   >   I've attached an updated patch. It contains a few bug fixes and many
>>>>>>   >   (refactoring and coding-convention) changes inspired by your comments.
>>>>>>   >
>>>>>>   >   I'm currently trying to fix the bug responsible for causing a compile
>>>>>>   >   failure when compiling
>>>>>>   >   test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the
>>>>>>   >   pass begins to fuse instructions in a basic block in this file, the
>>>>>>   >   aliasing analysis starts producing different (more pessimistic) query
>>>>>>   >   answers. I've never before seen it do this, and I'm not sure why it is
>>>>>>   >   happening. Also odd, at the same time, the numeric labels that are
>>>>>>   >   assigned to otherwise unnamed instructions, change. I don't think I've
>>>>>>   >   seen this happen either (they generally seem to stay fixed once
>>>>>>   >   assigned). I don't know if these two things are related, but if anyone
>>>>>>   >   can provide some insight, I'd appreciate it.
>>>>>
>>>>>   I think that I see what is happening in this case (please someone tell
>>>>>   me if this does not make sense). In the problematic basic block, there
>>>>>   are some loads and stores that are independent. The default aliasing
>>>>>   analysis can tell that these loads and stores don't reference the same
>>>>>   memory region. Then, some of the inputs to the getelementptr
>>>>>   instructions used for these loads and stores are fused by the
>>>>>   vectorization. After this happens, the aliasing analysis loses its
>>>>>   ability to tell that the loads and stores that make use of those
>>>>>   vector-calculated indices are independent.
>>> The attached patch works around this issue. Please review.
>>
>> Hi Hal,
>>
>> thanks for reworking the patch several times. This time the patch looks
>> a lot better, with nicely structured helper functions and a lot more
>> consistent style. It is a lot better to read and review.
>>
>> Unfortunately I had today just two hours to look into this patch. I
>> added a lot of remarks, but especially in the tree pruning part I run a
>> little out of time such that the remarks are mostly obvious style
>> remarks. I hope I have a little bit more time over the weekend to look
>> into this again, but I wanted to get out the first junk of comments today.
>
> Thanks! I appreciate you putting in that kind of time. [Also, there is a
> more-recent version than the one you're looking at,
> llvm_bb_vectorize-20111122.diff, make sure that you look at that one].

I looked at the one that was in the email I replied to. This was
'llvm_bb_vectorize-20111122.diff'. I could not see any new one or did I 
overlook anything.

>> Especially with my style remarks, do not feel forced to follow all of my
>> comments. Style stuff is highly personal and often other people may
>> suggest other fixes. In case you believe your style is better just check
>> with the LLVM coding reference or other parts of the LLVM source code
>> and try to match this as good as possible.
>
> I have tried to do that, but other eyes are always useful.

You were very successful. The code looks already very nice.

>> On the algorithmic point of view I am not yet completely convinced the
>> algorithm itself is not worse than N^2. I really want to understand this
>> part better and plan to look into problematic behaviour in functions not
>> directly visible in the core routines.
>
> Here are my thoughts:
>
> getCandidatePairs - This is O(N^2) in the number of instructions in BB.
> [This assumes that all of the external helper functions, especially
> queries to alias and scalar-evolution analysis don't make the complexity
> worse].

Yes. Alias analysis and scalar-evolution are queries that may not be 
fast enough (complexity wise). Dan Gohman proposed to check them and 
until today I did not find the time to verify they yield worse complexity.

> computeConnectedPairs - For each candidate pair, there are O(N^2) in the
> worse case, this loops over all pairs of uses of the instructions in the
> candidate pair. So if every instruction has U uses, then this has
> complexity O(N^2 U^2). In practice, U is small, and so this is just
> another O(N^2). Unfortunately, that is only true if determining if a
> given use pair is also a candidate pair is sub-linear. It is not in this
> case (it is linear because the multimap only indexes the first key, not
> both), but that could be changed by using a different (or additional)
> data structure to hold the candidate pairs. I should probably add a
> candidate-pair DenseSet to clean this up, then it will be O(N^2) [note
> that if an instruction has many uses, then it cannot be in many
> candidate pairs].
>
> buildDepMap - This records all uses of each pairable instruction; as
> implemented, this is also O(N^2).
>
> choosePairs - As you point out, this is the most complicated to figure
> out. The reason is that it deals with connected pairs and that, as pairs
> are selected, other pairs are dropped from contention. Fundamentally, it
> is considering pair-to-pair interactions, and so that seems like O(N^4).
> However, it only considers only the first possible tree that can be
> built from connected pairs for each remaining candidate pair. This is
> important for the following reason: If a block of N instructions had the
> maximum number of candidate pairs: N*(N-1)/2, then it must be the case
> that all instrucitons are independent, and, thus, none of them are
> connected. Thus, in this regime, the algorithm is O(N^2). On the other
> extreme, if the block has N candidate pairs, then they could all be
> connected, but then there are only N of them. In this extreme, the
> algorithm is also O(N^2). Because both extremes are O(N^2), I've
> generally figured that this is also O(N^2), but I may want to do more
> math for the paper I'm writing.
>
> fuseChosenPairs - This is linear in the number of selected pairs, and
> that number scales with the number of instructions in the BB. In the
> worst case, after each fusion, a number of instructions need to be moved
> (up to almost the total number in the block). Thus, this is also O(N^2).

Thanks for this more detailed analysis. It definitely helps to 
understand your complexity calculation.

>> Also, you choose a very high bound (4000) to limit the quadratic
>> behaviour, because you stated otherwise optimizations will be missed.
>> Can you send an example where a value smaller than 4000 would miss
>> optimizations?
>
> I may be able to find an example, but I don't have one immediately
> available. I had run the test suite with various settings, and chose the
> default value (4000) based on the approximate point where the average
> performance stopped increasing. Thus, I'm surmising that choosing a
> smaller value causes useful optimization opportunities to be missed.

I was just surprised that 4000 is that value, because I don't believe 
-partial-unroll would create basic blocks containing so many instructions.

>>> +#define BBV_NAME "bb-vectorize"
>>> +#define DEBUG_TYPE BBV_NAME
>>> +#include "llvm/Constants.h"
>>> +#include "llvm/DerivedTypes.h"
>>> +#include "llvm/Function.h"
>>> +#include "llvm/Instructions.h"
>>> +#include "llvm/IntrinsicInst.h"
>>> +#include "llvm/Intrinsics.h"
>>> +#include "llvm/LLVMContext.h"
>>> +#include "llvm/Pass.h"
>>> +#include "llvm/Type.h"
>>> +#include "llvm/ADT/DenseMap.h"
>>> +#include "llvm/ADT/DenseSet.h"
>>> +#include "llvm/ADT/SmallVector.h"
>>> +#include "llvm/ADT/Statistic.h"
>>> +#include "llvm/ADT/StringExtras.h"
>>> +#include "llvm/Analysis/AliasAnalysis.h"
>>> +#include "llvm/Analysis/AliasSetTracker.h"
>>> +#include "llvm/Analysis/ValueTracking.h"
>> This comes to lines later, if you want to sort it alphabetically.
>
> What do you mean? I had run the include statements through sort at some
> point, so I think they're all sorted.

#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/AliasSetTracker.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"

I removed the last part of the list, so you don't see the problem. All, 
except 'ValueTracking' are sorted.

>>> +    // Returns the weight associated with the provided value. A chain of
>>> +    // candidate pairs has a length given by the sum of the weights of its
>>> +    // members (one weight per pair; the weight of each member of the pair
>>> +    // is assumed to be the same). This length is then compared to the
>>> +    // chain-length threshold to determine if a given chain is significant
>>> +    // enough to be vectorized. The length is also used in comparing
>>> +    // candidate chains where longer chains are considered to be better.
>>> +    // Note: when this function returns 0, the resulting instructions are
>>> +    // not actually fused.
>>> +    static inline size_t getDepthFactor(Value *V) {
>>> +      if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V)) {
>>
>> Why have InsertElementInst, ExtractElementInst instructions a weight of
>> zero?
>
> Two reasons: First, they cannot be usefully fused. Second, because the
> pass generates a lot of these, they can confuse the simple metric used
> to compare the trees in the next iteration. Thus, giving them a weight
> of zero allows the pass to essentially ignore them in subsequent
> iterations when looking for vectorization opportunities while still
> tracking dependency chains that flow through those instructions.

OK. Please put this in a comment. I think it is useful to know.

>>> +    // Returns true if the provided CallInst represents an intrinsic that can
>>> +    // be vectorized.
>>> +    bool isVectorizableIntrinsic(CallInst* I) {
>>> +      bool VIntr = false;
>>> +      if (Function *F = I->getCalledFunction()) {
>>> +        if (unsigned IID = F->getIntrinsicID()) {
>>
>> Use early exit.
>>
>> Function *F = I->getCalledFunction();
>>
>> if (!F)
>>     return false;
>>
>> unsigned IID = F->getIntrinsicID();
>>
>> if (!IID)
>>     return false;
>>
>> switch(IID) {
>> 	case Intrinsic::sqrt:
>> 	case Intrinsic::powi:
>> 	case Intrinsic::sin:
>> 	case Intrinsic::cos:
>> 	case Intrinsic::log:
>> 	case Intrinsic::log2:
>> 	case Intrinsic::log10:
>> 	case Intrinsic::exp:
>> 	case Intrinsic::exp2:
>> 	case Intrinsic::pow:
>> 		return !NoMath;
>> 	case Intrinsic::fma:
>> 		return !NoFMA;
>> }
>>
>> Is this switch exhaustive or are you missing a default case?
>
> You're not looking at the most-recent version. You had (correctly)
> commented on this last time, and I had corrected it.

I look at llvm_bb_vectorize-20111122.diff, which was according to my 
mail client posted at Date: Wed, 23 Nov 2011 10:52:37 -0600.
In your mail from today you said:

" I know that I've sent a large number of revisions, but I
think that the version that I posted on 11/23 should be reviewed."

Did I get something wrong?

Cheers
Tobi