[llvm-dev] AVX Scheduling and Parallelism
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Fri Jun 23 19:16:34 PDT 2017
It is possible that the scheduling is being constrained by
pointer-aliasing assumptions. Could you share the source for the loop in
question?
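
To make the aliasing point concrete, here is a minimal sketch, assuming a
simple element-wise add. The function names and the use of C99 `restrict`
are illustrative assumptions, not taken from the loop in question, whose
source is not shown here. In the first function the compiler has to allow
for `dst` overlapping `x` or `y`, which limits how far loads can be moved
ahead of earlier stores; in the second the caller promises there is no
overlap.

```c
#include <assert.h>
#include <stddef.h>

/* As far as the compiler knows, dst, x and y may overlap, so a store to
   dst[i] could change a value that a later iteration loads. */
void add_may_alias(int *dst, const int *x, const int *y, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = x[i] + y[i];
}

/* restrict is a promise from the caller that the three arrays do not
   overlap, so loads may be moved ahead of earlier stores. */
void add_no_alias(int *restrict dst, const int *restrict x,
                  const int *restrict y, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = x[i] + y[i];
}

/* Usage: with non-overlapping arrays both versions give the same result. */
void demo_add(void) {
    int x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40};
    int d1[4], d2[4];
    add_may_alias(d1, x, y, 4);
    add_no_alias(d2, x, y, 4);
    for (size_t i = 0; i < 4; ++i)
        assert(d1[i] == d2[i]);
}
```

Comparing the generated assembly for the two functions is one way to see
whether aliasing assumptions are what limits the schedule.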
RIP-relative addressing, as I recall, is a feature of position-independent
code. Based on what's below, it might cause problems by making the
instruction encodings large. cc'ing some Intel folks for further comments.
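
As a sketch of where operands like `[rip + c+64]` and `[rax + c+401344]`
can come from, assume the loop adds two global arrays into a third. The
array names follow the assembly quoted below; the size `N` and the function
names are hypothetical, since the original source is not shown. The
addresses of global arrays are fixed at link time, so each access can be
encoded as a symbol plus a constant offset, and an unrolled loop gets one
such constant per vector. The second function mirrors what eg. 2 appears to
do: the index starts negative and presumably counts up to zero, with the
distance to the end of each array folded into the constant.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024 /* hypothetical size; the original post does not give it */

/* Global arrays: their addresses are link-time constants. */
int a[N], b[N], c[N];

/* Plain form of the loop: a[i] = c[i] + b[i]. */
void add_globals(void) {
    for (size_t i = 0; i < N; ++i)
        a[i] = c[i] + b[i];
}

/* Same computation with the index running from -N up to 0, so every
   access is "end of array + negative index" and the loop ends at zero. */
void add_globals_negative_index(void) {
    int *a_end = a + N;
    const int *b_end = b + N;
    const int *c_end = c + N;
    for (ptrdiff_t i = -(ptrdiff_t)N; i != 0; ++i)
        a_end[i] = c_end[i] + b_end[i];
}

/* Usage: both forms write the same values into a. */
void demo_globals(void) {
    for (size_t i = 0; i < N; ++i) {
        b[i] = (int)i;
        c[i] = 2 * (int)i;
    }
    add_globals();
    int first = a[1], last = a[N - 1];
    add_globals_negative_index();
    assert(a[1] == first && a[N - 1] == last);
}
```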
-Hal
On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote:
> Hello,
>
> After generating AVX code for a large number of iterations, I came to
> realize that it still uses only 2 registers, zmm0 and zmm1, when the
> loop unroll factor is 1024.
>
> I wonder whether this register allocation allows operations in parallel?
>
> Also, I know all the elements within a single vector instruction are
> computed in parallel, but are the elements of multiple instructions
> computed in parallel? For example, are 2 vmov instructions with
> different registers executed in parallel? They could be, because each
> core has an AVX unit. Does the compiler exploit it?
>
>
> Secondly, I am generating assembly for Intel and there are some offsets,
> like the rip register or some constant addition in the memory index.
> Why is that so?
> eg. 1
>
>     vmovdqu32 zmm0, zmmword ptr [rip + c]
>     vpaddd    zmm0, zmm0, zmmword ptr [rip + b]
>     vmovdqu32 zmmword ptr [rip + a], zmm0
>     vmovdqu32 zmm0, zmmword ptr [rip + c+64]
>     vpaddd    zmm0, zmm0, zmmword ptr [rip + b+64]
>
>
> and
>
> eg. 2
>
>     mov       rax, -393216
>     .p2align  4, 0x90
> .LBB0_1:                         # %vector.body
>                                  # =>This Inner Loop Header: Depth=1
>     vmovdqu32 zmm1, zmmword ptr [rax + c+401344]       ; load c[401344] in zmm1
>     vmovdqu32 zmm0, zmmword ptr [rax + c+401280]       ; load c[401280] in zmm0
>     vpaddd    zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1<-zmm1+b[401344]
>     vmovdqu32 zmmword ptr [rax + a+401344], zmm1       ; store zmm1 in a[401344]
>     vmovdqu32 zmm1, zmmword ptr [rax + c+401216]
>     vpaddd    zmm0, zmm0, zmmword ptr [rax + b+401280] ; zmm0<-zmm0+b[401280]
>     vmovdqu32 zmmword ptr [rax + a+401280], zmm0       ; store zmm0 in a[401280]
>     vmovdqu32 zmm0, zmmword ptr [rax + c+401152]
> ........ the remaining instructions also use only zmm0 and zmm1.
>
> As you can see in the above examples, multiple registers could be used.
> Also, I doubt whether the repeating instructions in eg. 2 are executed
> in parallel. And why repeat zmm0 and zmm1? Can't more zmm registers be
> used, with all the instructions that have no dependency running in
> parallel? For example, in the above example the blue instructions depend
> on each other and the red instructions depend on each other, so those
> can't be executed in parallel, but can blue and red be executed in
> parallel with each other?
>
>
>
> Please correct me if I am wrong.
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory