[PATCH] Calculate vectorization factor using the narrowest type instead of widest type

Chandler Carruth chandlerc at google.com
Wed Apr 15 05:21:59 PDT 2015


On Wed, Apr 15, 2015 at 5:17 AM James Molloy <james at jamesmolloy.co.uk>
wrote:

> Hi,
>
> There seem several ways to implement this. The ones I've seen mooted so
> far are:
>   * Pattern-matching "SAD-style" in the vectorizer - see Intel's SAD patch
> for details.
>   * Demanded-bits analysis allowing type demotion (my idea).
>   * What Chandler just suggested.
>
> Does anyone have views on which is best? I agree that this is a problem
> that really needs a solution.
>

We need all three IMO.

When we have really precise patterns (SAD, CRC, etc.), we need to match them directly.

We should also do demanded-bits analysis to enable type demotion.

We should also pursue other strategies for narrowing types.

While clearly we should prioritize these based on specific problems that
need solving, I think long term we're really going to need to cover all of
our bases here. Getting the maximum bandwidth out of the vector unit when
we've found code that can run inside the unit is too important to miss
anything significant IMO.

-Chandler


>
> Cheers,
>
> James
>
> On Wed, 15 Apr 2015 at 11:08 Chandler Carruth <chandlerc at google.com>
> wrote:
>
>> On Tue, Apr 14, 2015 at 11:25 AM Cong Hou <congh at google.com> wrote:
>>
>>> On Tue, Apr 14, 2015 at 8:49 AM, Chandler Carruth <chandlerc at google.com>
>>> wrote:
>>>
>>>> I've replied to some of the higher level concerns already, but I wanted
>>>> to point out one specific thing:
>>>>
>>>> On Fri, Apr 10, 2015 at 3:30 AM Cong Hou <congh at google.com> wrote:
>>>>
>>>>> LLVM uses the widest type to calculate the maximum vectorization
>>>>> factor, which greatly limits the bandwidth of either calculations or
>>>>> loads/stores in SIMD instructions. One example is converting 8-bit
>>>>> integers to 32-bit integers from arrays in a loop: currently the VF of
>>>>> this simple loop is decided by the 32-bit integer type, so for SSE2 it
>>>>> will be 4, giving 1 load and 1 store every 4 iterations. If we instead
>>>>> calculate the VF based on the 8-bit integer type, it will be 16, giving
>>>>> 1 load and 4 stores every 16 iterations and saving many loads.
>>>>>
>>>>
>>>> While I'm generally in favor of this kind of change, I think the test
>>>> case you're looking at is actually a separate issue that I've written up
>>>> several times w.r.t. our vectorizer.
>>>>
>>>
>>> You mean fp64_to_uint32-cost-model.ll? I think you are right. My patch
>>> invalidates this test and that is why I need to change the test criteria.
>>>
>>
>> No, I meant the benchmarks you're looking at. But I'm guessing which
>> benchmarks, so it's completely possible I've guessed incorrectly! =D
>> Anyways, it seems we're on the same page here...
>>
>>
>>>
>>>
>>>>
>>>> Because C does integer promotion from 8-bit integer types to 32-bit
>>>> integer types, we very commonly see things that are vectorized with 32-bit
>>>> integer math when they don't need to.
>>>>
>>>
>>> Yes, the promotion to 32-bit integers is quite annoying to the
>>> vectorizer: too many packing/unpacking instructions are generated, which
>>> could be eliminated if doing the operations directly on 8-bit integers
>>> would not affect the results.
>>>
>>
>> It also causes a bandwidth limitation (or a register-pressure hit). I
>> just wonder if fixing this issue would also fix the bandwidth issues
>> you've seen.
>>
>>
>>>
>>>>
>>>> The IR can't narrow these operations from 32-bit integer operations to
>>>> 8-bit integer operations without losing information, because in 8 bits
>>>> the operations might overflow. But when vectorizing, we don't care
>>>> about this. We should aggressively narrow operations above a trunc
>>>> whenever we can hoist the trunc above them by stripping overflow flags
>>>> while building the vectorizable operation tree, so that we can fit more
>>>> operations into a single vector. Does that make sense?
>>>>
>>>
That is also what I am thinking about. If LLVM supported pattern
>>> recognition (like GCC does), we could recognize this
>>> type-promotion-then-demotion as a pattern and then generate better
>>> vectorized code. The pattern recognizer could also help generate better
>>> SIMD code for dot-product/SAD/widening operations. I am not sure how the
>>> SAD patch is implemented and hope we could have a general way to detect
>>> those patterns.
>>>
>>
>> I'm not sure what you mean by supporting pattern recognition. We already
>> do a great deal of pattern matching on the IR.
>>
>> I don't know that you need to specifically match
>> type-promotion-then-demotion. I think that's unnecessarily narrow. If the
>> vectorizer sees code:
>>
>> ...
>> %x = add nsw i32 %a, %a
>> %y = trunc i32 %x to i8
>> store i8 %y, i8* %ptr
>>
>> And it would form:
>>
>> ...
>> %x.v = add nsw <4 x i32> %a.v, %a.v
>> %y.v = trunc <4 x i32> %x.v to <4 x i8>
>> store <4 x i8> %y.v, <4 x i8>* %ptr.v
>>
>> It seems likely beneficial to instead teach the vectorizer to hoist the
>> trunc over the "add nsw", removing the "nsw" to preserve semantics. Sure,
>> you wouldn't want to do this if it would increase the number of trunc
>> instructions, or if the operations aren't supported on the target (or have
>> very high cost). But if it doesn't increase the number of instructions
>> (either because we have a single input, or because it allows the
>> vectorizer to use a wider vector) it seems generally good. Maybe the case
>> where there is a matching zext is the only easy case to prove, but it seems
>> worth looking at from a general perspective.
>>
>>
>>> Cong
>>>
>>>
>>>>
>>>> -Chandler
>>>>
>>>>
>>>>>
>>>>> This patch mainly changes the function getWidestType() to
>>>>> getNarrowestType(), and uses it to calculate VF.
>>>>>
>>>>> http://reviews.llvm.org/D8943
>>>>>
>>>>> Files:
>>>>>   lib/Target/X86/X86TargetTransformInfo.cpp
>>>>>   lib/Transforms/Vectorize/LoopVectorize.cpp
>>>>>   test/Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll
>>>>>   test/Transforms/LoopVectorize/X86/vector_ptr_load_store.ll
>>>>>
>>>>> EMAIL PREFERENCES
>>>>>   http://reviews.llvm.org/settings/panel/emailpreferences/
>>>>> _______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>
>>
>
