[llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Mon Dec 10 04:51:52 PST 2018

Hi Eli,

>>>>
>>>> I have noticed some loops that build a wider element by loading 
>>>> small elements, zero-extending them, shifting them (with different 
>>>> amounts) to then 'or' them all together. They are either equivalent 
>>>> of a wider load, or to that of a byte-swapped one.
>>>>
>>>> DAGCombiner::MatchLoadCombine() will combine this to a single wide 
>>>> load, but only in the scalar cases of i16, i32 and i64. The result 
>>>> is that these loops (I have seen a dozen or so on SPEC) get 
>>>> vectorized with a lot of ugly code.
>>>>
>>>> I have begun to experiment with handling the vectorized loop also, 
>>>> and would like to know if people think this would be a good idea? 
>>>> Also, am I right to assume that it probably should be run before 
>>>> type legalization?
>>>>
>>> You mean, trying to merge some combination of vector loads and 
>>> shuffles into a single vector load in DAGCombine?  That seems sort 
>>> of late, given the cost modeling involved in vectorization.
>>>
>> What happens specifically is that a scalar loop that is written (most 
>> likely at source level) like
>>
>> ///  i8 *a = ...
>> ///  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
>>
>> becomes during DAGCombine (this is from the comment in 
>> DAGCombiner.cpp:5750)
>>
>> /// =>
>> ///  i32 val = *((i32)a)
>>
>> I also wondered why this is done so late but found in the commit 
>> message (b52af07 / r293036) that there had been a discussion where 
>> this had intentionally been addressed late: "...We came to the 
>> conclusion that we want to do this transformation late in the 
>> pipeline because in presence of atomic loads load widening is 
>> irreversible transformation and it might hinder other optimizations..."
>
> Yes.  (I'm not completely convinced that we need to delay the 
> transform all the way to DAGCombine, though...)
>
>> The loop vectorizer looks at such a loop and thinks the scalar loop 
>> costs 13 and VF=4 costs 5. The cost for the vectorized loop is about 
>> right, but the scalar loop becomes much better with the help of 
>> MatchLoadCombine(): just 2 instructions (load + store). The vector 
>> loop could also have been just 2 instructions: Vector Load + Vector 
>> Store, but instead it does Vector Load + vector shuffles, vector zero 
>> extensions, vector element shifts, vector ors and the Vector Store :-(
>>
>> It would be very far-fetched to correct the cost for the scalar loop 
>> in the LoopVectorizer. It seems like a better idea to implement the 
>> handling for the vectorized loop also in DAGCombiner. As an 
>> alternative, we could just ignore this I guess...
>
> The current vectorizer doesn't really have infrastructure for this, at 
> least.
>
>>
>> My patch does a recursive search starting at an ISD::Or node and then 
>> tries to prove that the sub-DAG with all the ors, shifts and extends 
>> is equivalent to the original Vector Load, possibly in reversed 
>> order. I think that's a somewhat simpler problem than what the scalar 
>> loop handling of MatchLoadCombine() is doing.
>
> Oh, it's simpler because it doesn't actually have to deal with 
> memory?  Makes sense.  Still seems kind of awkward to generate the 
> terrible vector loop and recover it later, though; it's fragile based 
> on the shuffle costs.
>
> If you are going to do this, probably simpler to do in instcombine?
>
I have tried previously to extend instcombine with optimizations of 
vector shuffles. I was told then that instcombine should be very careful 
with optimizing permutations since the backends may then produce worse 
code. I suppose that in this case, if we get just the vector load 
instruction instead of the shifts and ors etc, that might be acceptable?

The reason I chose the DAGCombine was due to the explanation for the 
scalar case to deliberately make this very late. I thought I might as 
well put it in the same spot...

I am not sure I want to do this seriously without at least one other 
person thinking it would be valuable and willing to review, and also not 
before I know what pass should do it. I have not seen any real need for 
this in benchmarks, it's just an observation...

> The existing SLP pass actually could handle this sort of sequence, but 
> we currently suppress it if the result would be too narrow.  See also 
> https://reviews.llvm.org/D48725 .

Could SLP really handle this case so as to get rid of the shifts and ors?

/Jonas