[llvm-dev] AVX Scheduling and Parallelism

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Fri Jun 23 19:16:34 PDT 2017

It is possible that the scheduling is constrained by pointer-aliasing 
assumptions. Could you share the source for the loop?

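To make the aliasing concern concrete, here is a minimal sketch. The 
source of the loop has not been posted, so the function names and the 
pointer-parameter form below are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a, b and c may overlap, so a store to a[i] could change
   a later b[j] or c[j]. The compiler has to respect that, either with a
   runtime overlap check or by keeping later loads behind earlier stores. */
void add_may_alias(int32_t *a, const int32_t *b, const int32_t *c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* Hypothetical: restrict promises the arrays do not overlap, so loads
   from later iterations may legally be moved ahead of earlier stores. */
void add_no_alias(int32_t *restrict a, const int32_t *restrict b,
                  const int32_t *restrict c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* Returns 1 if both versions agree on non-overlapping inputs. */
int check_add(void) {
    enum { N = 64 };
    int32_t b[N], c[N], x[N], y[N];
    for (int i = 0; i < N; ++i) {
        b[i] = i;
        c[i] = 2 * i;
    }
    add_may_alias(x, b, c, N);
    add_no_alias(y, b, c, N);
    for (int i = 0; i < N; ++i)
        assert(x[i] == y[i] && x[i] == 3 * i);
    return 1;
}
```

Both functions compute the same result when the arrays are distinct; 
the difference is only in how much freedom the compiler has to reorder 
the loads and stores of an unrolled loop body.
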
RIP-relative indexing, as I recall, is a feature of position-independent 
code. Based on what's below, it might cause problems by making the 
instruction encodings large. cc'ing some Intel folks for further comments.
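
As a rough illustration of where operands such as [rip + c+64] or 
[rax + c+401344] can come from, here is a sketch of a loop over global 
arrays. The array size and the function name are assumptions, not taken 
from the original message.

```c
#include <stdint.h>

enum { N = 1024 };          /* hypothetical size, not from the original message */
int32_t a[N], b[N], c[N];   /* globals: addressed by symbol, not through a pointer */

/* Each global is referenced by its symbol. Any constant byte offset is
   folded into the displacement, giving operands like "c+64".
   One zmm register holds 16 int32 values (64 bytes), so consecutive
   vector accesses are 64 bytes apart. */
void add_globals(void) {
    for (int i = 0; i < N; ++i)
        a[i] = b[i] + c[i];
}
```

Compiling something like this with clang -O3 -mavx512f -S -masm=intel 
should show the same kind of addressing. In the [rip + ...] form the 
address is relative to the instruction pointer; in the [rax + ...] form 
rax is the running loop offset and the symbol plus constant is a fixed 
displacement.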


On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote:
> Hello,
> After generating AVX code for a large number of iterations, I realized 
> that it still uses only 2 registers, zmm0 and zmm1, when the loop 
> unroll factor is 1024.
> I wonder if this register allocation allows operations in parallel?
> Also, I know all the elements within a single vector instruction are 
> computed in parallel, but are the elements of multiple instructions 
> computed in parallel? For example, are 2 vmov instructions with 
> different registers executed in parallel? It could be, because each 
> core has an AVX unit. Does the compiler exploit it?
> Secondly, I am generating assembly for Intel, and there are some 
> offsets, such as the rip register or a constant added to the memory 
> index. Why is that so?
> eg. 1
> vmovdqu32 zmm0, zmmword ptr [rip + c]
> vpaddd zmm0, zmm0, zmmword ptr [rip + b]
> vmovdqu32 zmmword ptr [rip + a], zmm0
> vmovdqu32 zmm0, zmmword ptr [rip + c+64]
> vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
> and
> eg. 2
> mov rax, -393216
> .p2align 4, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
> vmovdqu32 zmm1, zmmword ptr [rax + c+401344]      ; load c[401344] into zmm1
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280]      ; load c[401280] into zmm0
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344]   ; zmm1 <- zmm1 + b[401344]
> vmovdqu32 zmmword ptr [rax + a+401344], zmm1      ; store zmm1 in a[401344]
> vmovdqu32 zmm1, zmmword ptr [rax + c+401216]
> vpaddd zmm0, zmm0, zmmword ptr [rax + b+401280]   ; zmm0 <- zmm0 + b[401280]
> vmovdqu32 zmmword ptr [rax + a+401280], zmm0      ; store zmm0 in a[401280]
> vmovdqu32 zmm0, zmmword ptr [rax + c+401152]
> ........ The remaining instructions also use only zmm0 and zmm1.
> As you can see in the above examples, multiple registers could be 
> used. Also, I doubt whether the repeating instructions in eg. 2 are 
> executed in parallel. Why are zmm0 and zmm1 reused? Can't more zmm 
> registers be used, so that all the instructions without a dependency 
> run in parallel? For example, in the above example the blue 
> instructions depend on each other and the red instructions depend on 
> each other, so neither group can be executed in parallel internally, 
> but can blue and red be executed in parallel with each other?
> Please correct me if I am wrong.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
