[llvm-dev] MatchLoadCombine(): handling for vectorized loop.
Jonas Paulsson via llvm-dev
llvm-dev at lists.llvm.org
Mon Dec 10 04:51:52 PST 2018
>>>> I have noticed some loops that build a wider element by loading
>>>> small elements, zero-extending them, shifting them (with different
>>>> amounts) to then 'or' them all together. They are either equivalent
>>>> of a wider load, or to that of a byte-swapped one.
>>>> DAGCombiner::MatchLoadCombine() will combine this to a single wide
>>>> load, but only in the scalar cases of i16, i32 and i64. The result
>>>> is that these loops (I have seen a dozen or so on SPEC) get
>>>> vectorized with a lot of ugly code.
>>>> I have begun to experiment with handling the vectorized loop also,
>>>> and would like to know if people think this would be a good idea?
>>>> Also, am I right to assume that it probably should be run before
>>>> type legalization?
>>> You mean, trying to merge some combination of vector loads and
>>> shuffles into a single vector load in DAGCombine? That seems sort
>>> of late, given the cost modeling involved in vectorization.
>> What happens specifically is that a scalar loop that is written (most
>> likely at source level) like
>> /// i8 *a = ...
>> /// i32 val = a | (a << 8) | (a << 16) | (a << 24)
>> becomes during DAGCombine (this is from the comment in
>> /// =>
>> /// i32 val = *((i32)a)
>> I also wondered why this is done so late but found in the commit
>> message (b52af07 / r293036) that there had been a discussion where
>> this had intentionally been addressed late: "...We came to the
>> conclusion that we want to do this transformation late in the
>> pipeline because in presence of atomic loads load widening is
>> irreversible transformation and it might hinder other optimizations..."
> Yes. (I'm not completely convinced that we need to delay the
> transform all the way to DAGCombine, though...)
>> The loop vectorizer looks at such a loop and thinks the scalar loop
>> costs 13 and VF=4 costs 5. The cost for the vectorized loop is about
>> right, but the scalar loop becomes much better with the help of
>> MatchLoadCombine(): just 2 instructions (load + store). The vector
>> loop could also have been just 2 instructions: Vector Load + Vector
>> Store, but instead it does Vector Load + vector shuffles, vector zero
>> extensions, vector element shifts, vector ors and the Vector Store :-(
>> It would be very far-fetched to correct the cost for the scalar loop
>> in the LoopVectorizer. It seems like a better idea to implement the
>> handling for the vectorized loop also in DAGCombiner. As an
>> alternative, we could just ignore this I guess...
> The current vectorizer doesn't really have infrastructure for this, at
>> My patch does a recursive search starting at an ISD::Or node and then
>> tries to prove that the sub-DAG with all the ors, shifts and extends
>> is equivalent to the original Vector Load, possibly in reversed
>> order. I think that's a somewhat simpler problem than what the scalar
>> loop handling of MatchLoadCombine() is doing.
> Oh, it's simpler because it doesn't actually have to deal with
> memory? Makes sense. Still seems kind of awkward to generate the
> terrible vector loop and recover it later, though; it's fragile based
> on the shuffle costs.
> If you are going to do this, probably simpler to do in instcombine?
I have tried previously to extend instcombine with optimizations of
vector shuffles. I was told then that instcombine should be very careful
with optimizing permutations since the backends may then produce worse
code. I suppose that in this case, if we get just the vector load
instruction instead of the shifts and ors etc, that might be acceptable?
The reason I chose the DAGCombine was due to the explanation for the
scalar case to deliberately make this very late. I thought I might as
well put it in the same spot...
I am not sure I want to do this seriously without at least one other
person thinking it would be valuable and willing to review, and also not
before I know what pass should do it. I have not seen any real need for
this in benchmarks, it's just an observation...
> The existing SLP pass actually could handle this sort of sequence, but
> we currently suppress it if the result would be too narrow. See also
> https://reviews.llvm.org/D48725 .
Could SLP really handle this case so as to get rid of the shifts and ors?
More information about the llvm-dev