[llvm-dev] Fwd: AVX Scheduling and Parallelism

Fri Jun 23 19:25:38 PDT 2017

---------- Forwarded message ----------
From: hameeza ahmed <hahmed2305 at gmail.com>
Date: Sat, Jun 24, 2017 at 7:21 AM
Subject: Re: [llvm-dev] AVX Scheduling and Parallelism
To: Hal Finkel <hfinkel at anl.gov>

int a[100351], b[100351], c[100351];
void foo(void) {
    int i;
    for (i = 0; i < 100351; i++) {
        a[i] = b[i] + c[i];
    }
}
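
On the pointer-aliasing question quoted below: a, b and c here are distinct global arrays, so the compiler can already see that they do not overlap. Aliasing would matter if the same loop took pointer parameters instead. A minimal sketch of that variant follows, assuming C99 `restrict`; the function name `add_arrays` and its parameter names are hypothetical and not part of my original code.

```c
#include <stddef.h>

/* Hypothetical pointer-parameter variant of the loop above (the name
 * add_arrays is illustrative only). Without `restrict`, the compiler has
 * to assume dst may overlap x or y, and a vectorizer typically needs a
 * runtime overlap check or a scalar fallback. With `restrict`, the caller
 * promises the three ranges do not overlap. */
void add_arrays(int *restrict dst, const int *restrict x,
                const int *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = x[i] + y[i];
    }
}
```

Calling `add_arrays(a, b, c, 100351)` would compute the same result as `foo()` above.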

On Sat, Jun 24, 2017 at 7:16 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> It is possible that scheduling is constrained by pointer-aliasing
> assumptions. Could you share the source for the loop in question?
>
> RIP-relative indexing, as I recall, is a feature of position-independent
> code. Based on what's below, it might cause problems by making the
> instruction encodings large. cc'ing some Intel folks for further comments.
>
>  -Hal
> On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote:
>
> Hello,
>
> After generating AVX code for a loop with a large number of iterations, I
> found that it still uses only two registers, zmm0 and zmm1, even when the
> loop unroll factor is 1024.
>
> I wonder whether this register allocation allows operations to run in
> parallel.
>
> I also know that all the elements within a single vector instruction are
> computed in parallel, but are multiple instructions also executed in
> parallel? For example, are two vmov instructions that use different
> registers executed in parallel? That seems possible because each core has
> an AVX unit. Does the compiler exploit this?
>
>
> Secondly, I am generating assembly in Intel syntax, and the memory operands
> contain offsets such as the rip register or an added constant. Why is that?
> eg. 1
>
> vmovdqu32 zmm0, zmmword ptr [rip + c]
> vpaddd zmm0, zmm0, zmmword ptr [rip + b]
> vmovdqu32 zmmword ptr [rip + a], zmm0
> vmovdqu32 zmm0, zmmword ptr [rip + c+64]
> vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
>
>
> and
>
> eg. 2
>
> mov rax, -393216
> .p2align 4, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
> vmovdqu32 zmm1, zmmword ptr [rax + c+401344]    # zmm1 <- load from c
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280]    # zmm0 <- load from c
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] # zmm1 <- zmm1 + b
> vmovdqu32 zmmword ptr [rax + a+401344], zmm1    # store zmm1 to a
> vmovdqu32 zmm1, zmmword ptr [rax + c+401216]
> vpaddd zmm0, zmm0, zmmword ptr [rax + b+401280] # zmm0 <- zmm0 + b
> vmovdqu32 zmmword ptr [rax + a+401280], zmm0    # store zmm0 to a
> vmovdqu32 zmm0, zmmword ptr [rax + c+401152]
> ........ the remaining instructions also use only zmm0 and zmm1.
>
> As you can see in the above examples, more registers could be used. I also
> doubt whether the repeating instructions in eg. 2 are executed in
> parallel. Why are only zmm0 and zmm1 reused? Could more zmm registers be
> used, so that all the instructions without dependencies run in parallel?
> For example, above, the blue instructions depend on each other and the red
> instructions depend on each other, so the instructions within each group
> cannot be executed in parallel, but can the blue and red groups be
> executed in parallel with each other?
>
>
>
> Please correct me if I am wrong.
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
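
To work through the constants in eg. 2 quoted above: each memory operand has the form [rax + symbol + constant], so the byte offset into the array is simply rax plus the constant. The sketch below only does that arithmetic. It assumes rax holds the initial value from `mov rax, -393216` and that int is 4 bytes on the target; the helper names are hypothetical.

```c
#include <stdio.h>

/* Byte offset from the start of the array for an operand of the form
 * [index + symbol + displacement], where `symbol` is the array address. */
long byte_offset(long index, long displacement)
{
    return index + displacement;
}

/* Element index, assuming 4-byte int elements (an assumption about the
 * target, not something stated in the listing). */
long element_index(long index, long displacement)
{
    return byte_offset(index, displacement) / 4;
}

/* Print where the first three loads of eg. 2 land on the first iteration. */
void print_first_loads(void)
{
    const long rax = -393216;                     /* from `mov rax, -393216` */
    const long disp[] = {401344, 401280, 401216}; /* constants seen in eg. 2 */

    for (int i = 0; i < 3; i++) {
        printf("disp %ld -> byte %ld, element %ld\n", disp[i],
               byte_offset(rax, disp[i]), element_index(rax, disp[i]));
    }
}
```

Under those assumptions the first load in the loop body reads 64 bytes starting at byte 8128 of c, and consecutive operands are 64 bytes apart, which is one zmm register's worth of data.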