[llvm-dev] AVX Scheduling and Parallelism
Rackover, Zvi via llvm-dev
llvm-dev at lists.llvm.org
Sun Jun 25 05:14:28 PDT 2017
From what can be seen in the code snippet you provided, the reuse of ZMM0 and ZMM1 across loop-unroll instances does not inhibit instruction-level parallelism.
Modern x86 processors use register renaming, which eliminates the false (write-after-write and write-after-read) dependencies that register reuse introduces in the instruction stream. In the example you provided, the processor should be able to identify the 2-vloads + vadd + vstore sequences as independent and pipeline their execution.
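To illustrate the renaming argument, here is a minimal sketch in C of a toy renamer. It is not a model of any real microarchitecture: the `struct insn` encoding, the `rename_regs` helper, and the two-register file are all hypothetical, and the instruction strings are abbreviated from the straight-line snippet in the quoted question. The only point is that each write to zmm0 receives a fresh physical register, so the second load/add group does not wait on the first.

```c
#include <stdio.h>

#define NUM_ARCH_REGS 2   /* zmm0 and zmm1 only; an illustrative simplification */

/* Hypothetical toy instruction. dst and src are architectural register
   numbers; -1 means the instruction does not write (or read) a vector
   register, e.g. a store has no dst and a plain load has no src. */
struct insn {
    const char *text;
    int dst;
    int src;
};

/* First five instructions of the straight-line example, abbreviated. */
static const struct insn example[5] = {
    { "vmovdqu32 zmm0, [c]",          0, -1 },
    { "vpaddd    zmm0, zmm0, [b]",    0,  0 },
    { "vmovdqu32 [a], zmm0",         -1,  0 },
    { "vmovdqu32 zmm0, [c+64]",       0, -1 },
    { "vpaddd    zmm0, zmm0, [b+64]", 0,  0 },
};

/* Rename a sequence: a read uses the current mapping of its architectural
   register, and every write is given a fresh physical register.
   Fills phys_dst/phys_src (-1 where unused) and returns the number of
   physical registers handed out, including the initial mapping. */
int rename_regs(const struct insn *code, int n, int *phys_dst, int *phys_src)
{
    int map[NUM_ARCH_REGS];
    int next_phys = 0;

    for (int r = 0; r < NUM_ARCH_REGS; r++)
        map[r] = next_phys++;              /* initial architectural state */

    for (int i = 0; i < n; i++) {
        /* Sources are looked up before the destination is remapped. */
        phys_src[i] = (code[i].src >= 0) ? map[code[i].src] : -1;
        if (code[i].dst >= 0) {
            map[code[i].dst] = next_phys++;
            phys_dst[i] = map[code[i].dst];
        } else {
            phys_dst[i] = -1;
        }
    }
    return next_phys;
}

/* Print the renamed form of a short sequence (at most 16 instructions). */
void print_renamed(const struct insn *code, int n)
{
    int pd[16], ps[16];
    if (n > 16)
        n = 16;
    rename_regs(code, n, pd, ps);
    for (int i = 0; i < n; i++)
        printf("%-30s dst=p%d src=p%d\n", code[i].text, pd[i], ps[i]);
}
```

Calling `print_renamed(example, 5)` shows the first group working on one pair of physical registers and the second group on a different pair (an unused operand prints as `p-1`). The only remaining dependencies are the true ones inside each group, which is why reusing zmm0 does not serialize the groups.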
From: Hal Finkel [mailto:hfinkel at anl.gov]
Sent: Saturday, June 24, 2017 05:17
To: hameeza ahmed <hahmed2305 at gmail.com>; llvm-dev at lists.llvm.org
Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Rackover, Zvi <zvi.rackover at intel.com>; Breger, Igor <igor.breger at intel.com>; craig.topper at gmail.com
Subject: Re: [llvm-dev] AVX Scheduling and Parallelism
It is possible that the scheduling is constrained by pointer-aliasing assumptions. Could you share the source for the loop in question?
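As a sketch of what an aliasing constraint looks like at the source level (the actual source for the loop has not been posted, so the function names and signatures here are purely hypothetical): when arrays arrive through plain pointers, the compiler has to assume a store to `a[i]` might change a later `b[j]` or `c[j]`, which limits how freely loads can be moved above stores unless it proves or checks that the arrays do not overlap. The C99 `restrict` qualifier is one way to promise non-overlap.

```c
#include <stddef.h>

/* Plain pointers: a, b and c may overlap as far as the compiler knows,
   so a load of b[i+1] cannot simply be hoisted above the store to a[i]
   without a proof or a run-time overlap check. */
void add_may_alias(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* restrict-qualified pointers: the caller promises that the three arrays
   do not overlap, so loads and stores from different iterations may be
   reordered freely. Passing overlapping arrays here is undefined behavior. */
void add_no_alias(int *restrict a, const int *restrict b,
                  const int *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Both functions compute the same result for non-overlapping inputs; the difference is only in what the optimizer is allowed to assume. Whether this applies to the loop in this thread cannot be determined without its source.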
RIP-relative indexing, as I recall, is a feature of position-independent code. Based on what's below, it might cause problems by making the instruction encodings large. cc'ing some Intel folks for further comments.
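For context on the `[rip + c]` and `c+64` operands, here is a hedged sketch of the kind of source that commonly yields them. The array size `N` and the helper `zmm_byte_offset` are assumptions for illustration only, not the poster's code. On x86-64, compilers commonly address a global such as `c` relative to the instruction pointer, and a constant such as `c+64` is the symbol plus a byte offset: 64 bytes is exactly one 512-bit zmm register, i.e. sixteen 32-bit elements.

```c
#include <stddef.h>
#include <stdint.h>

#define N 1024              /* hypothetical array length */
#define LANES_PER_ZMM 16    /* 512 bits / 32 bits per element */

/* Globals: their addresses are link-time symbols, which is why the
   assembly can name them directly as a, b and c. */
int32_t a[N], b[N], c[N];

/* Byte offset of the k-th 512-bit vector within an int32_t array.
   For k = 1 this is 64, matching the "+64" in "[rip + c+64]". */
size_t zmm_byte_offset(size_t k)
{
    return k * LANES_PER_ZMM * sizeof(int32_t);
}

/* Scalar form of the element-wise add; a vectorizing compiler may turn
   each group of 16 iterations into one load, one vpaddd and one store. */
void add_globals(void)
{
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}
```

In the loop form of the listing, operands such as `[rax + c+401344]` follow the same pattern: a symbol, a constant byte offset, and a register holding the running byte index.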
On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote:
After generating AVX code for a large number of iterations, I came to realize that it still uses only two registers, zmm0 and zmm1, when the loop unroll factor is 1024.
I wonder if this register allocation allows operations to run in parallel?
Also, I know all the elements within a single vector instruction are computed in parallel, but are the elements of multiple instructions computed in parallel? For example, are two vmovs with different registers executed in parallel? That could be the case because each core has an AVX unit. Does the compiler exploit it?
Secondly, I am generating assembly for Intel, and there are some offsets, like the rip register or some constant added in the memory index. Why is that so?
vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
mov rax, -393216
.p2align 4, 0x90
.LBB0_1: # %vector.body
# =>This Inner Loop Header: Depth=1
vmovdqu32 zmm1, zmmword ptr [rax + c+401344] ; load c in zmm1
vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ; load c in zmm0
vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1<-zmm1+b
vmovdqu32 zmmword ptr [rax + a+401344], zmm1 ; store zmm1 in a
vmovdqu32 zmm1, zmmword ptr [rax + c+401216]
vpaddd zmm0, zmm0, zmmword ptr [rax + b+401280] ; zmm0<-zmm0+b
vmovdqu32 zmmword ptr [rax + a+401280], zmm0 ; store zmm0 in a
vmovdqu32 zmm0, zmmword ptr [rax + c+401152]
........ the remaining instructions also use only zmm0 and zmm1.
As you can see in the above examples, multiple registers could be used. Also, I doubt whether the repeating instructions in example 2 are executed in parallel. Why are zmm0 and zmm1 repeated? Can't more zmm registers be used, so that all the instructions without dependencies run in parallel? For example, in the above example the blue instructions depend on one another and the red instructions depend on one another, so the instructions within each group can't be executed in parallel, but can blue and red be executed in parallel with each other?
Please correct me if I am wrong.
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory