[PATCH] Calculate vectorization factor using the narrowest type instead of widest type

Chandler Carruth chandlerc at google.com
Wed Apr 15 03:04:17 PDT 2015


On Tue, Apr 14, 2015 at 11:25 AM Cong Hou <congh at google.com> wrote:

> On Tue, Apr 14, 2015 at 8:49 AM, Chandler Carruth <chandlerc at google.com>
> wrote:
>
>> I've replied to some of the higher level concerns already, but I wanted
>> to point out one specific thing:
>>
>> On Fri, Apr 10, 2015 at 3:30 AM Cong Hou <congh at google.com> wrote:
>>
>>> LLVM uses the widest type to calculate the maximum vectorization factor,
>>> which greatly limits the bandwidth of calculations and of loads/stores
>>> from SIMD instructions. One example is converting 8-bit integers to 32-bit
>>> integers from arrays in a loop: currently the VF of this simple loop is
>>> decided by the 32-bit integer type, and for SSE2 it will be 4. Then we
>>> have 1 load and 1 store for every 4 iterations. If we calculate the VF
>>> based on the 8-bit integer type, it will be 16, and we will have 1 load
>>> and 4 stores for every 16 iterations, saving many loads.
>>>
>>
>> While I'm generally in favor of this kind of change, I think the test
>> case you're looking at is actually a separate issue that I've written up
>> several times w.r.t. our vectorizer.
>>
>
> You mean fp64_to_uint32-cost-model.ll? I think you are right. My patch
> invalidates this test and that is why I need to change the test criteria.
>

No, I meant the benchmarks you're looking at. But I'm guessing which
benchmarks, so it's completely possible I've guessed incorrectly! =D Anyways,
it seems we're on the same page here...
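
For concreteness, the loop shape in question compiles to roughly the
following IR (a sketch only; the function and value names are illustrative
rather than taken from any actual benchmark):

define void @widen(i8* %src, i32* %dst, i64 %n) {
entry:
  br label %loop
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %src.addr = getelementptr inbounds i8, i8* %src, i64 %i
  %byte = load i8, i8* %src.addr                    ; 8-bit element
  %wide = zext i8 %byte to i32                      ; widening conversion
  %dst.addr = getelementptr inbounds i32, i32* %dst, i64 %i
  store i32 %wide, i32* %dst.addr                   ; 32-bit element
  %i.next = add nuw nsw i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop
exit:
  ret void
}

Deriving the VF from the widest type (i32) gives <4 x i32> on SSE2, so the
i8 loads run at a quarter of their available width; deriving it from the
narrowest type (i8) gives one <16 x i8> load feeding four <4 x i32> stores
per vector iteration, as described above.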


>
>
>>
>> Because C does integer promotion from 8-bit integer types to 32-bit
>> integer types, we very commonly see things that are vectorized with 32-bit
>> integer math when they don't need to be.
>>
>
> Yes, the promotion to 32-bit integers is quite annoying to the vectorizer:
> it generates many packing/unpacking instructions that could be eliminated
> whenever operating directly on the 8-bit integers wouldn't affect the
> results.
>

It also causes a bandwidth limitation (or a register pressure hit). I wonder
whether fixing this issue would also fix the bandwidth issues you've seen.
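
As one concrete illustration, for a C statement like dst[i] =
(uint8_t)(a[i] + b[i]), the frontend emits something like the following
after C's integer promotion (again a sketch; the value names are
illustrative):

%xa = load i8, i8* %a.addr
%xb = load i8, i8* %b.addr
%wa = zext i8 %xa to i32                            ; promoted to int
%wb = zext i8 %xb to i32                            ; promoted to int
%sum = add nsw i32 %wa, %wb                         ; 32-bit math
%res = trunc i32 %sum to i8                         ; demoted at the store
store i8 %res, i8* %dst.addr

Vectorized at the i32 width, three quarters of every vector register holds
bits that the trunc discards, which is exactly the bandwidth/register
pressure hit mentioned above.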


>
>>
>> The IR can't narrow these operations from 32-bit integer operations to
>> 8-bit integer operations without losing information, because in 8 bits the
>> operations might overflow. But when vectorizing, we don't care about this.
>> We should aggressively narrow operations above a trunc whenever we can
>> hoist the trunc above them by stripping overflow flags while building the
>> vectorizable operation tree, so that we can fit more operations into a
>> single vector. Does that make sense?
>>
>
> That is also what I am thinking about. If LLVM supported pattern
> recognition (as GCC does), we could recognize this
> type-promotion-then-demotion as a pattern and then generate better
> vectorized code. A pattern recognizer could also help generate better SIMD
> code for dot-product/SAD/widening operations. I am not sure how the SAD
> patch is implemented, and I hope we can have a general way to detect those
> patterns.
>

I'm not sure what you mean by supporting pattern recognition. We already do
a great deal of pattern matching on the IR.

I don't know that you need to specifically match
type-promotion-then-demotion. I think that's unnecessarily narrow. If the
vectorizer sees code:

...
%x = add nsw i32 %a, %a
%y = trunc i32 %x to i8
store i8 %y, i8* %ptr

And it would form:

...
%x.v = add nsw <4 x i32> %a.v, %a.v
%y.v = trunc <4 x i32> %x.v to <4 x i8>
store <4 x i8> %y.v, <4 x i8>* %ptr.v

It seems likely beneficial to instead teach the vectorizer to hoist the
trunc over the "add nsw", removing the "nsw" to preserve semantics. Sure,
you wouldn't want to do this if it would increase the number of trunc
instructions, or if the operations aren't supported on the target (or have
a very high cost). But if it doesn't increase the number of instructions
(either because we have a single input, or because it allows the vectorizer
to use a wider vector), it seems generally good. Maybe the case where there
is a matching zext is the only easy case to prove, but it seems worth
looking at from a general perspective.
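
Concretely, hoisting the trunc in the snippet above and letting the
vectorizer widen would produce something like this (a sketch; the value
names are illustrative):

...
%a.t = trunc <16 x i32> %a.v to <16 x i8>     ; trunc hoisted above the add
%y.v = add <16 x i8> %a.t, %a.t               ; "nsw" dropped: i8 math may wrap
store <16 x i8> %y.v, <16 x i8>* %ptr.v

Dropping the "nsw" is what keeps this sound: truncation commutes with
addition modulo 2^8, and a wrapping add without the flag is well-defined.
The payoff is the 4x wider vector.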


> Cong
>
>
>>
>> -Chandler
>>
>>
>>>
>>> This patch mainly changes the function getWidestType() to
>>> getNarrowestType(), and uses it to calculate VF.
>>>
>>> http://reviews.llvm.org/D8943
>>>
>>> Files:
>>>   lib/Target/X86/X86TargetTransformInfo.cpp
>>>   lib/Transforms/Vectorize/LoopVectorize.cpp
>>>   test/Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll
>>>   test/Transforms/LoopVectorize/X86/vector_ptr_load_store.ll
>>>
>>>
>>