[llvm-dev] AVX Scheduling and Parallelism

hameeza ahmed via llvm-dev llvm-dev at lists.llvm.org
Fri Jun 23 19:02:22 PDT 2017


Hello,

After generating AVX code for a loop with a large number of iterations, I
noticed that it still uses only two registers, zmm0 and zmm1, even when the
loop unroll factor is 1024.

I wonder whether this register allocation allows operations to run in
parallel.

Also, I know that all the elements within a single vector instruction are
computed in parallel, but are the elements of multiple instructions also
computed in parallel? For example, are two vmov instructions that use
different registers executed in parallel? That seems possible, because each
core has an AVX unit. Does the compiler exploit this?


Secondly, I am generating assembly for Intel, and the memory operands
contain offsets such as the rip register or an added constant. Why is that
so?
eg. 1

vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
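
For reference, the kind of source that would lead to a listing like eg. 1 can be sketched in C. This is only a sketch under assumptions: the original source is not shown, so the element type (32-bit integers, suggested by vpaddd), the array size N and the helper group_displacement are all hypothetical. It shows where a displacement such as +64 would come from: one zmm register holds 16 four-byte elements, so consecutive groups are 64 bytes apart.

```c
#include <stddef.h>
#include <stdint.h>

#define N 1024      /* hypothetical size; the real one is not given */
#define LANES 16    /* 512-bit zmm register / 32-bit element (assumed) */

/* Assumed to be globals, which is why the operands name a, b and c directly. */
uint32_t a[N], b[N], c[N];

/* Element-wise add; each group of LANES elements corresponds to one
   load / vpaddd / store triple in the listing. */
void add_arrays(void) {
    for (size_t i = 0; i < N; ++i)
        a[i] = b[i] + c[i];
}

/* Byte displacement of the k-th group of LANES elements: 0, 64, 128, ... */
size_t group_displacement(size_t k) {
    return k * LANES * sizeof(uint32_t);
}
```

With these assumptions, group_displacement(1) is 64, matching the c+64 and b+64 operands in the second group of instructions.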


and

eg. 2

mov rax, -393216
.p2align 4, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
vmovdqu32 zmm1, zmmword ptr [rax + c+401344]             ; load c[401344] in zmm1
vmovdqu32 zmm0, zmmword ptr [rax + c+401280]             ; load c[401280] in zmm0
vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344]          ; zmm1<-zmm1+b[401344]
vmovdqu32 zmmword ptr [rax + a+401344], zmm1             ; store zmm1 in a[401344]
vmovdqu32 zmm1, zmmword ptr [rax + c+401216]
vpaddd zmm0, zmm0, zmmword ptr [rax + b+401280]          ; zmm0<-zmm0+b[401280]
vmovdqu32 zmmword ptr [rax + a+401280], zmm0             ; store zmm0 in a[401280]
vmovdqu32 zmm0, zmmword ptr [rax + c+401152]
........ (the remaining instructions also use only zmm0 and zmm1)
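
To make the addressing in eg. 2 concrete, the offset arithmetic can be written out in C. This is a sketch, assuming that rax holds a byte offset starting at the value loaded by `mov rax, -393216` and that the elements are 32-bit integers; the function names are hypothetical.

```c
/* Hypothetical helper: byte offset from the start of an array for an
   operand written as [rax + sym + disp]. */
long effective_offset(long rax, long disp) {
    return rax + disp;
}

/* Element index for that byte offset, assuming 4-byte elements. */
long element_index(long rax, long disp) {
    return effective_offset(rax, disp) / 4;
}
```

Under these assumptions, on the first iteration effective_offset(-393216, 401344) is 8128 bytes (element 2032) and effective_offset(-393216, 401280) is 8064 bytes (element 2016), so the two loads are 64 bytes apart, which is the width of one zmm register.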

As you can see in the above examples, multiple registers could be used.
I also doubt whether the repeating instructions in eg. 2 are executed in
parallel. And why are zmm0 and zmm1 reused? Couldn't more zmm registers be
used, with all the instructions that have no dependencies running in
parallel? For example, in the listing above the blue instructions depend on
one another and the red instructions depend on one another, so the
instructions within each group cannot be executed in parallel; but can the
blue group and the red group be executed in parallel with each other?
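
The dependency structure being asked about can be modelled in scalar C. This is a sketch only: the temporaries r0 and r1 stand in for zmm0 and zmm1, the function name is hypothetical, and unsigned 32-bit elements are assumed. Each temporary carries its own load, add, store chain, and no statement in one chain reads a value produced by the other. Whether the hardware actually overlaps the two chains is the question above; the sketch only shows that they are independent.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the eg. 2 pattern: two interleaved chains,
   one through r1 and one through r0. */
void add_two_chains(uint32_t *a, const uint32_t *b, const uint32_t *c, size_t n) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t r1 = c[i + 1];  /* chain 1: load */
        uint32_t r0 = c[i];      /* chain 0: load */
        r1 += b[i + 1];          /* chain 1: add, depends only on r1 */
        a[i + 1] = r1;           /* chain 1: store */
        r0 += b[i];              /* chain 0: add, depends only on r0 */
        a[i] = r0;               /* chain 0: store */
    }
    if (i < n)
        a[i] = b[i] + c[i];      /* leftover element when n is odd */
}
```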



Please correct me if I am wrong.