[PATCH] Calculate vectorization factor using the narrowest type instead of the widest type

Cong Hou congh at google.com
Wed Apr 15 17:16:09 PDT 2015


On Wed, Apr 15, 2015 at 3:04 AM, Chandler Carruth <chandlerc at google.com>
wrote:

> On Tue, Apr 14, 2015 at 11:25 AM Cong Hou <congh at google.com> wrote:
>
>> On Tue, Apr 14, 2015 at 8:49 AM, Chandler Carruth <chandlerc at google.com>
>> wrote:
>>
>>> I've replied to some of the higher level concerns already, but I wanted
>>> to point out one specific thing:
>>>
>>> On Fri, Apr 10, 2015 at 3:30 AM Cong Hou <congh at google.com> wrote:
>>>
>>>> LLVM uses the widest type to calculate the maximum vectorization
>>>> factor, which greatly limits the bandwidth of either calculations or
>>>> loads/stores from SIMD instructions. One example is converting 8-bit
>>>> integers to 32-bit integers from arrays in a loop: currently the VF of
>>>> this simple loop is decided by the 32-bit integer type, so for SSE2 it
>>>> will be 4, and we will have 1 load and 1 store in every 4 iterations. If
>>>> we calculate the VF based on the 8-bit integer type, it will be 16, and
>>>> we will have 1 load and 4 stores in every 16 iterations, saving many
>>>> loads.
>>>>
>>>
>>> While I'm generally in favor of this kind of change, I think the test
>>> case you're looking at is actually a separate issue that I've written up
>>> several times w.r.t. our vectorizer.
>>>
>>
>> You mean fp64_to_uint32-cost-model.ll? I think you are right. My patch
>> invalidates this test and that is why I need to change the test criteria.
>>
>
> No, I meant the benchmarks you're looking at. But I'm guessing which
> benchmarks, so it's completely possible I've guessed incorrectly! =D Anyways,
> it seems we're on the same page here...
>

OK. Now I know what you mean...
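
For reference, the loop shape from my original description looks roughly like
this in IR (a minimal sketch; all names are made up):

  ; for (i = 0; i < n; ++i) dst[i] = (int) src[i];
  loop:
    %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
    %src.addr = getelementptr inbounds i8, i8* %src, i64 %i
    %b = load i8, i8* %src.addr
    %w = zext i8 %b to i32
    %dst.addr = getelementptr inbounds i32, i32* %dst, i64 %i
    store i32 %w, i32* %dst.addr
    %i.next = add nuw nsw i64 %i, 1
    %done = icmp eq i64 %i.next, %n
    br i1 %done, label %exit, label %loop

With the VF decided by the widest type (i32), SSE2 gives VF = 4: 1 load and 1
store per 4 iterations. With the VF decided by the narrowest type (i8), VF =
16: one <16 x i8> load feeding four <4 x i32> stores per 16 iterations.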


>>>
>>> Because C does integer promotion from 8-bit integer types to 32-bit
>>> integer types, we very commonly see things that are vectorized with 32-bit
>>> integer math when they don't need to be.
>>>
>>
>> Yes, the promotion to 32-bit integers is quite annoying to the vectorizer:
>> too many packing/unpacking instructions are generated, which could be
>> eliminated if doing the operations directly on 8-bit integers wouldn't
>> affect the results.
>>
>
> It also causes a bandwidth limitation (or a register pressure hit). I just
> wonder if fixing this issue would also fix the bandwidth issues you've seen.
>

>>>
>>> The IR can't narrow these operations from 32-bit integer operations to
>>> 8-bit integer operations without losing information because in 8-bits the
>>> operations might overflow. But when vectorizing, we don't care about this.
>>> We should aggressively narrow operations above a trunc which we could hoist
>>> the trunc above by stripping overflow flags while building the vectorizable
>>> operation tree so that we can fit more operations into a single vector.
>>> Does that make sense?
>>>
>>
>> That is also what I am thinking about. If LLVM supported pattern
>> recognition (like GCC does), we could recognize this
>> type-promotion-then-demotion as a pattern and then generate better
>> vectorized code. The pattern recognizer could also help generate better
>> SIMD code for dot-product/SAD/widening operations. I am not sure how the
>> SAD patch is implemented, and I hope we could have a general way to detect
>> those patterns.
>>
>
> I'm not sure what you mean by supporting pattern recognition. We do a
> great deal of pattern matching on the IR?
>
> I don't know that you need to specifically match
> type-promotion-then-demotion. I think that's unnecessarily narrow. If the
> vectorizer sees code:
>
> ...
> %x = add nsw i32 %a, %a
> %y = trunc i32 %x to i8
> store i8 %y, i8* %ptr
>
> And it would form:
>
> ...
> %x.v = add nsw <4 x i32> %a.v, %a.v
> %y.v = trunc <4 x i32> %x.v to <4 x i8>
> store <4 x i8> %y.v, <4 x i8>* %ptr.v
>
> It seems likely beneficial to instead teach the vectorizer to hoist the
> trunc over the "add nsw", removing the "nsw" to preserve semantics. Sure,
> you wouldn't want to do this if it would increase the number of trunc
> instructions, or if the operations aren't supported on the target (or have
> a very high cost). But if it doesn't increase the number of instructions
> (either because we have a single input, or because it allows the
> vectorizer to use a wider vector) it seems generally good. Maybe the case
> where there is a matching zext is the only easy case to prove, but it seems
> worth looking at from a general perspective.
>

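To make the trade-off concrete, hoisting over a two-operand add, say "%x =
add nsw i32 %a, %b" (a hypothetical second operand %b in place of the
repeated %a above), would look something like this sketch:

%a.t = trunc i32 %a to i8
%b.t = trunc i32 %b to i8
%x = add i8 %a.t, %b.t    ; "nsw" dropped, since the i8 add may wrap
store i8 %x, i8* %ptr
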
As the sketch above shows, if we hoist the "trunc" over the "add nsw" we
need more trunc operations, because "add nsw" has two operands. This may not
always be beneficial. However, if those operands are obtained from type
promotion from i8 to i32, then we don't need those trunc operations anymore,
because both operands are already available as i8 (e.g. <4 x i8>) values.
This can be done by the type-promotion-then-demotion pattern recognition I
mentioned. One advantage of this method is that it makes it easier to
calculate a more precise cost in our cost model.
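
For example (a sketch with made-up values), when both operands come from i8
promotions:

%a = zext i8 %s to i32
%b = zext i8 %t to i32
%x = add nsw i32 %a, %b
%y = trunc i32 %x to i8
store i8 %y, i8* %ptr

the whole computation can be narrowed and done directly in i8:

%x.n = add i8 %s, %t    ; trunc(zext(%s) + zext(%t)) == %s + %t (mod 2^8)
store i8 %x.n, i8* %ptr

and then vectorized as, e.g., <16 x i8>, with no trunc or zext left in the
loop.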


>> Cong
>>
>>
>>>
>>> -Chandler
>>>
>>>
>>>>
>>>> This patch mainly changes the function getWidestType() to
>>>> getNarrowestType(), and uses it to calculate VF.
>>>>
>>>> http://reviews.llvm.org/D8943
>>>>
>>>> Files:
>>>>   lib/Target/X86/X86TargetTransformInfo.cpp
>>>>   lib/Transforms/Vectorize/LoopVectorize.cpp
>>>>   test/Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll
>>>>   test/Transforms/LoopVectorize/X86/vector_ptr_load_store.ll
>>>>
>>>