[llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types

Thu Jun 16 00:20:18 PDT 2016

Sorry, you're right, that really wasn't clear.
When I wrote "for free", I meant "without having code in the vectorizer
dealing specifically with interleaving".

Consider a simple loop, like:

void hot(int *a, int *b) {
#pragma clang loop vectorize_width(4) interleave_count(2)
#pragma nounroll
  for (int i = 0; i < 1000; i++) {
    a[i] += b[i];
  }
  return;
}

We'll get a vector loop with 4-element vectors, interleaved by 2, which,
when compiling for SSE, gets lowered to:
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqu -16(%rsi,%rax,4), %xmm0
movdqu (%rsi,%rax,4), %xmm1
movdqu -16(%rdi,%rax,4), %xmm2
movdqu (%rdi,%rax,4), %xmm3
paddd %xmm0, %xmm2
paddd %xmm1, %xmm3
movdqu %xmm2, -16(%rdi,%rax,4)
movdqu %xmm3, (%rdi,%rax,4)
addq $8, %rax
cmpq $1004, %rax             # imm = 0x3EC
jne .LBB0_3

If we instead have
#pragma clang loop vectorize_width(8) interleave_count(1)

We'll get an 8-wide IR vector loop, but end up with almost the same
lowering:
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqu 16(%rsi,%rax,4), %xmm0
movdqu (%rsi,%rax,4), %xmm1
movdqu 16(%rdi,%rax,4), %xmm2
movdqu (%rdi,%rax,4), %xmm3
paddd %xmm1, %xmm3
paddd %xmm0, %xmm2
movdqu %xmm2, 16(%rdi,%rax,4)
movdqu %xmm3, (%rdi,%rax,4)
addq $8, %rax
cmpq $1000, %rax             # imm = 0x3E8
jne .LBB0_3

Legalization splits each 8-wide operation into two 4-wide operations,
achieving almost the same result as vectorizing by a factor of 4 and
unrolling by 2.
The question is whether the legalizer is actually up to doing this well in
general.
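
To make the splitting concrete at the source level, here is a minimal
sketch using the GCC/Clang vector extensions. It is only an illustration of
the idea, not what the legalizer literally does; the type and function names
(v8i32, v4i32, add_wide, add_split, split_matches_wide) are made up for this
example. It checks that one 8 x i32 add gives the same result as two
4 x i32 adds over the low and high halves.

```c
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extensions: 8 x i32 (256 bits) and 4 x i32 (128 bits).
 * These typedef names are hypothetical, chosen to mirror LLVM IR types. */
typedef int32_t v8i32 __attribute__((vector_size(32)));
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* One 8-wide add: a[0..7] += b[0..7]. On an SSE-only target the 256-bit
 * type is illegal, so the compiler has to split the operation. */
static void add_wide(int32_t *a, const int32_t *b) {
    v8i32 va, vb;
    memcpy(&va, a, sizeof va);   /* unaligned load */
    memcpy(&vb, b, sizeof vb);
    va += vb;
    memcpy(a, &va, sizeof va);   /* unaligned store */
}

/* The same work written by hand as two 4-wide adds (low half, then high
 * half), i.e. roughly VF=4 with an interleave count of 2. */
static void add_split(int32_t *a, const int32_t *b) {
    for (int half = 0; half < 2; half++) {
        v4i32 va, vb;
        memcpy(&va, a + 4 * half, sizeof va);
        memcpy(&vb, b + 4 * half, sizeof vb);
        va += vb;
        memcpy(a + 4 * half, &va, sizeof va);
    }
}

/* Returns 1 if the wide and the split versions produce identical results. */
int split_matches_wide(void) {
    int32_t a1[8], a2[8], b[8];
    for (int i = 0; i < 8; i++) {
        a1[i] = a2[i] = i;
        b[i] = 10 * i;
    }
    add_wide(a1, b);
    add_split(a2, b);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Compiling add_wide for a target with only 128-bit vector registers and
looking at the assembly should be a quick way to see how a single wide
operation gets split in practice.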

On Wed, Jun 15, 2016 at 11:46 PM, Das, Dibyendu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> It's not clear how you would get ‘interleaving for free’.
>
>
>
> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] *On Behalf Of *Michael
> Kuperstein via llvm-dev
> *Sent:* Thursday, June 16, 2016 4:18 AM
> *To:* Hal Finkel <hfinkel at anl.gov>; Nadav Rotem <nadav.rotem at me.com>;
> Ayal Zaks <ayal.zaks at intel.com>; Demikhovsky, Elena <
> elena.demikhovsky at intel.com>; Adam Nemet <anemet at apple.com>; Sanjoy Das <
> sanjoy at playingwithpointers.com>; James Molloy <james.molloy at arm.com>;
> Matthew Simpson <mssimpso at codeaurora.org>; Sanjay Patel <
> spatel at rotateright.com>; Chandler Carruth <chandlerc at google.com>; David
> Li <davidxl at google.com>; Wei Mi <wmi at google.com>; Dehao Chen <
> dehao at google.com>; Cong Hou <congh at google.com>
> *Cc:* Llvm Dev <llvm-dev at lists.llvm.org>
> *Subject:* [llvm-dev] [RFC] Allow loop vectorizer to choose vector widths
> that generate illegal types
>
>
>
> Hello,
>
>
> Currently the loop vectorizer will, by default, not consider vectorization
> factors that would make it generate types that do not fit into the target
> platform's vector registers. That is, if the widest scalar type in the
> scalar loop is i64, and the platform's largest vector register is 256-bit
> wide, we will not consider a VF above 4.
>
> We have a command line option (-mllvm -vectorizer-maximize-bandwidth),
> that will choose VFs for consideration based on the narrowest scalar type
> instead of the widest one, but I don't believe it has been widely tested.
> If anyone has had an opportunity to play around with it, I'd love to hear
> about the results.
>
> What I'd like to do is:
>
> Step 1: Make -vectorizer-maximize-bandwidth the default. This should
> improve the performance of loops that contain mixed-width types.
> Step 2: Remove the artificial width limitation altogether, and base the
> vectorization factor decision purely on the cost model. This should allow
> us to get rid of the interleaving code in the loop vectorizer, and get
> interleaving for "free" from the legalizer instead.
>
>
>
> There are two potential road-blocks I see - the cost-model, and the
> legalizer. To make this work, we need to:
>
> a) Model the cost of operations on illegal types better. Right now, what
> we get is sometimes completely ridiculous (e.g. see
> http://reviews.llvm.org/D21251).
>
> b) Make sure the cost model actually stops us when the VF becomes too
> large. This is mostly a question of correctly estimating the register
> pressure. In theory, that should not be an issue - we already rely on this
> estimate to choose the interleaving factor, so using the same logic to
> upper-bound the VF directly shouldn't make things worse.
>
> c) Ensure the legalizer is up to the task of emitting good code for overly
> wide vectors. I've talked about this with Chandler, and his opinion
> (Chandler, please correct me if I'm wrong) is that on x86, the legalizer is
> likely to be able to handle this. This may not be true for other platforms.
> So, I'd like to try to make this the default on a platform-by-platform
> basis, starting with x86.
>
>
>
> What do you think? Does this seem like a step in the right direction?
> Anything important I'm missing?
>
>
>
> Thanks,
>
>   Michael
>