[PATCH] Calculate vectorization factor using the narrowest type instead of widest type

James Molloy james at jamesmolloy.co.uk
Wed Apr 15 05:17:37 PDT 2015


Hi,

There seem to be several ways to implement this. The ones I've seen mooted so
far are:
  * Pattern-matching "SAD-style" in the vectorizer - see Intel's SAD patch
for details.
  * Demanded-bits analysis allowing type demotion (my idea).
  * What Chandler just suggested.

Does anyone have views on which is best? I agree that this is a problem
that really needs a solution.

Cheers,

James

On Wed, 15 Apr 2015 at 11:08 Chandler Carruth <chandlerc at google.com> wrote:

> On Tue, Apr 14, 2015 at 11:25 AM Cong Hou <congh at google.com> wrote:
>
>> On Tue, Apr 14, 2015 at 8:49 AM, Chandler Carruth <chandlerc at google.com>
>> wrote:
>>
>>> I've replied to some of the higher level concerns already, but I wanted
>>> to point out one specific thing:
>>>
>>> On Fri, Apr 10, 2015 at 3:30 AM Cong Hou <congh at google.com> wrote:
>>>
>>>> LLVM uses the widest type to calculate the maximum vectorization
>>>> factor, which greatly limits the bandwidth of SIMD calculations and
>>>> loads/stores. One example is converting 8-bit integers to 32-bit
>>>> integers from arrays in a loop: currently the VF of this simple loop is
>>>> decided by the 32-bit integer type, and for SSE2 it will be 4. Then we
>>>> have 1 load and 1 store in every 4 iterations. If we calculate the VF
>>>> based on the 8-bit integer type, it will be 16, and we will have 1 load
>>>> and 4 stores in every 16 iterations, saving many loads.
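>>>>
>>>> For concreteness, a minimal scalar sketch of the loop body described
>>>> above (names are illustrative, not taken from the patch):
>>>>
>>>>   %t = load i8, i8* %src.i           ; 8-bit element
>>>>   %w = zext i8 %t to i32             ; widen to 32 bits
>>>>   store i32 %w, i32* %dst.i          ; 32-bit element
>>>>
>>>> Sizing the VF by the i32 store gives 4 lanes per vector iteration;
>>>> sizing it by the i8 load would give 16.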
>>>>
>>>
>>> While I'm generally in favor of this kind of change, I think the test
>>> case you're looking at is actually a separate issue that I've written up
>>> several times w.r.t. our vectorizer.
>>>
>>
>> You mean fp64_to_uint32-cost-model.ll? I think you are right. My patch
>> invalidates this test and that is why I need to change the test criteria.
>>
>
> No, I meant the benchmarks you're looking at. But I'm guessing which
> benchmarks, so it's completely possible I've guessed incorrectly! =D
> Anyways, it seems we're on the same page here...
>
>
>>
>>
>>>
>>> Because C does integer promotion from 8-bit integer types to 32-bit
>>> integer types, we very commonly see things that are vectorized with
>>> 32-bit integer math when they don't need to be.
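>>>
>>> As a hedged sketch, the IR a C frontend typically emits for
>>> "c[i] = a[i] + b[i]" on unsigned char arrays looks roughly like this
>>> (names illustrative):
>>>
>>>   %a.w = zext i8 %a to i32           ; C promotes both operands to int
>>>   %b.w = zext i8 %b to i32
>>>   %s = add nsw i32 %a.w, %b.w        ; 32-bit math on 8-bit data
>>>   %r = trunc i32 %s to i8            ; result narrowed straight back
>>>   store i8 %r, i8* %c.i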
>>>
>>
>> Yes, the promotion to 32-bit integers is quite annoying to the
>> vectorizer: too many packing/unpacking instructions are generated, which
>> could be eliminated if doing the operations directly on 8-bit integers
>> wouldn't affect the results.
>>
>
> It also causes a bandwidth limitation (or a register-pressure hit). I just
> wonder whether fixing this issue would also fix the bandwidth issues
> you've seen.
>
>
>>
>>>
>>> The IR can't narrow these operations from 32-bit integer operations to
>>> 8-bit integer operations without losing information because in 8-bits the
>>> operations might overflow. But when vectorizing, we don't care about this.
>>> We should aggressively narrow the operations above a trunc, hoisting the
>>> trunc over them by stripping their overflow flags while building the
>>> vectorizable operation tree, so that we can fit more operations into a
>>> single vector.
>>> Does that make sense?
>>>
>>
>> That is also what I am thinking about. If LLVM supported pattern
>> recognition (as GCC does), we could recognize this
>> type-promotion-then-demotion as a pattern and then generate better
>> vectorized code. The pattern recognizer could also help generate better
>> SIMD code for dot-product/SAD/widening operations. I am not sure how the
>> SAD patch is implemented, and I hope we could have a general way to
>> detect those patterns.
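>>
>> For reference, the scalar shape a SAD recognizer would have to match is
>> roughly (a sketch only, not necessarily how the actual SAD patch does it):
>>
>>   %d = sub nsw i32 %a.w, %b.w                ; operands zext'd from i8
>>   %neg = sub nsw i32 0, %d
>>   %cmp = icmp sgt i32 %d, -1
>>   %abs = select i1 %cmp, i32 %d, i32 %neg    ; |a - b|
>>   %sum = add nsw i32 %abs, %sum.prev         ; reduction across the loop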
>>
>
> I'm not sure what you mean by supporting pattern recognition. Don't we
> already do a great deal of pattern matching on the IR?
>
> I don't know that you need to specifically match
> type-promotion-then-demotion. I think that's unnecessarily narrow. If the
> vectorizer sees code:
>
> ...
> %x = add nsw i32 %a, %a
> %y = trunc i32 %x to i8
> store i8 %y, i8* %ptr
>
> And it would form:
>
> ...
> %x.v = add nsw <4 x i32> %a.v, %a.v
> %y.v = trunc <4 x i32> %x.v to <4 x i8>
> store <4 x i8> %y.v, <4 x i8>* %ptr.v
>
> It seems likely beneficial to instead teach the vectorizer to hoist the
> trunc over the "add nsw", removing the "nsw" to preserve semantics. Sure,
> you wouldn't want to do this if it would increase the number of trunc
> instructions, or if the operations aren't supported on the target (or have
> very high cost). But if it doesn't increase the number of instructions
> (either because we have a single input, or because it allows the
> vectorizer to use a wider vector) it seems generally good. Maybe the case
> where there is a matching zext is the only easy case to prove, but it seems
> worth looking at from a general perspective.
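>
> A sketch of the result, assuming the common case where %a is itself zext'd
> from an i8 load so the hoisted trunc folds away entirely (vector width
> illustrative):
>
>   %a.v = load <16 x i8>, <16 x i8>* %src.v
>   %x.v = add <16 x i8> %a.v, %a.v            ; "nsw" dropped
>   store <16 x i8> %x.v, <16 x i8>* %ptr.v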
>
>
>> Cong
>>
>>
>>>
>>> -Chandler
>>>
>>>
>>>>
>>>> This patch mainly changes the function getWidestType() to
>>>> getNarrowestType(), and uses it to calculate VF.
>>>>
>>>> http://reviews.llvm.org/D8943
>>>>
>>>> Files:
>>>>   lib/Target/X86/X86TargetTransformInfo.cpp
>>>>   lib/Transforms/Vectorize/LoopVectorize.cpp
>>>>   test/Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll
>>>>   test/Transforms/LoopVectorize/X86/vector_ptr_load_store.ll
>>>>