[LLVMdev] RFC: generation of PSAD instruction

Arnold aschwaighofer at apple.com
Tue Feb 3 07:14:04 PST 2015


Comment below.

> On Feb 2, 2015, at 9:25 PM, Muthyala, Vj <Vj.Muthyala at amd.com> wrote:
> 
> Hi,
> 
> Thank you for your feedback. Fews comments inlined.[VJ]
> 
> 
> -Vijender
> 
> -----Original Message-----
> From: aschwaighofer at apple.com [mailto:aschwaighofer at apple.com] 
> Sent: Monday, February 02, 2015 11:25 PM
> To: Muthyala, Vj; Hal Finkel
> Cc: LLVM Dev; James Molloy; Michael Zolotukhin; Nadav Rotem; suyog sarda; ahmed.bougacha at gmail.com
> Subject: RE: [LLVMdev] RFC: generation of PSAD instruction
> 
> Hi Vijender,
> 
> thanks for working on this! Comments below. 
> 
> 
>> On 02/02/15, "Muthyala, Vj"   wrote:
>> Hi all,
>> 
>> I have done some work on implementing generation of SAD in LoopVectorizer. I am sharing two patches which should help to understand my idea. There are many places where I am not sure about the right way, which I want to put forward for discussion. Let me talk about each patch individually.
>> 
>> LoopVectorizerSAD.patch
>> 
>> This patch contains:
>> -  Introduction of new ReductionKind "RK_IntegerSad”. 
> 
> 
> Sounds good.
> 
>> -  Identify the sad pattern which is a sequence of SUB, ICMP, NEGATE, SELECT and ADD in isSadPattern() function . 
> 
> I suspect that there will be other patterns than SAD. We should probably look for at least one other similar reduction case and then generalize the logic such that it handles both cases. What I mean by this is that most of the code should not talk about isSADPattern, rather, the specific pattern should be hidden behind a generalized function, for example, isSpecialPattern().
> 
> [VJ] I tired looking for other patterns which are similar but couldn’t find. Do let me know if you find any other pattern. 
> 
> But in principle this is fine.
> 
> I think that we should probably only match an SAD pattern if the target supports it.
> 
>> -  Introduction of new list into the reduction Descriptor to store the SAD pattern instructions. This list is needed for cost computations and also needed for removal once the LLVM intrinsic is created.
> This should be a set. We should test this set when we vectorize instructions (no need to vectorize and then erase again) and for cost computation (no need to count and then subtract again).
> [VJ] Sure I will work on it.
> 
> We also need to make sure that instructions in the pattern are not used other than by the pattern (I have not looked closely whether you did that).
> [VJ] yes I have already checked that in the code. 
> 
>> -  Modification of the getWidestType function to return 8 if SAD pattern is present. This is because for SAD in x86, the minimum vectorization factor needed is 8. In the present implementation of cost model, the highest VF which is tested is WidestRegisterSize/WidestType. If WidesrRegisterSize is 128 and if there is any variable in the basic block which is i32 which is not actually related to vector data then also max VF will come to be 4 (128/32). So if I replace widestType with 8 then max VF will come to 16. My thought is that higher vectorization factors like 8 or 16 also needs to be given chance for computing its cost which makes sense for instructions like SAD. To make this happen we can divide the widestRegisterSize with SmallestType. I am not sure about the circumstances of this. We can also get target information and decide.
> 
> 
> In principle fine but this also needs some kind of abstraction: getWidestTypeForPattern or something the like. The pattern can then decide. For an SAD it will be the type of the inputs to the SAD. You don’t need target info for this.
> 
>> -  For the cost calculations, the SAD intrinsic is not yet present in the basic block. So the relevant PHI node is passed to getInstructionCost() function. Inside this function, a dummy SAD intrinsic is passed and cost is computed.
> Keying off the PHI node is fine.
> 
>> -  Right now the Basic Block created by LoopVectorizer is modified .This code works fine for known trip count. If the trip count is not known at compile time then we may need to have multiple versions of the loop body.
> I don't understand why the TripCount is important. The only thing you need is the vectorization factor VF. The loop vectorizer already generates a skeleton that will jump to a scalar copy of the loop if that is not the case.
> 
> [VJ] - I think I need to explain in detail. You are right that it will jump to a scalar copy but the issue here is, if for instance VF comes out to be 16 then all instructions will be widened to 16. Now in this situation if the TripCount comes to be 15 then the whole code is executed in scalar copy. But the right way is to execute a body with VF 8 and then the remaining  iterations in scalar body ( or could be body of VF 4 followed by scalar body). That is why I am talking about multiple versions of the loop, but yes there could be overheads with branching instructions. This problem wont arise if the TripCount is known and considered in selecting VF.
> 


You are right that if the vectorizer decides that VF 16 has the best cost and the loop has a non constant upper bound than we might not execute the vectorized loop. This holds true for not only PSAD type loops. In theory, we could emit multiple versions of vectorized loops but this would further increase code size with diminishing gains.
Selecting VF 8 vs VF 16 vs ... should be done based on the cost model.



More information about the llvm-dev mailing list