[llvm-dev] AVX Scheduling and Parallelism

Sat Aug 26 11:58:38 PDT 2017

Hello,

I have defined 8 registers in my registerinfo.td file, in the following order:
R_0, R_1, R_2, R_3, R_4, R_5, R_6, R_7

But the generated assembly code uses only 2 of them. How can I get it to use
all 8? Also, can I control the allocation order, for example use R_5 right
after R_0, without changing registerinfo.td?

What changes are required here, and in which phase: scheduling or register
allocation?

P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+2048]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+2048]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+2048], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+4096]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+4096]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+4096], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+6144]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+6144]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+6144], R_0
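
One way to read the listing above: each load/load/add/store group is finished
with R_0 and R_1 before the next group starts, so at most two values are live
at any point, and an allocator that always takes the first free register in
its order keeps picking R_0 and R_1. The sketch below is a toy model of that
behaviour in Python, not LLVM's allocator; `allocate`, `used` and the tuple
format are hypothetical names made up for the illustration. It suggests that
the same four groups would need all 8 registers only if the loads were
scheduled ahead of the adds.

```python
# Toy "first free register" allocator. Illustration only, not LLVM code.
# An instruction is (dest, sources) over virtual value names; dest is None
# for a store.

REGS = [f"R_{i}" for i in range(8)]  # order taken from the post


def allocate(program, order=REGS):
    """Give every defined value the first free register in `order`."""
    last_use = {}
    for idx, (_dest, srcs) in enumerate(program):
        for s in srcs:
            last_use[s] = idx

    free = list(order)
    assigned = {}
    for idx, (dest, srcs) in enumerate(program):
        # A source read for the last time here releases its register before
        # the destination is chosen, so an add may reuse an operand register.
        for s in set(srcs):
            if last_use[s] == idx:
                free.append(assigned[s])
        free.sort(key=order.index)
        if dest is not None:
            assigned[dest] = free.pop(0)
    return assigned


def used(assigned):
    """Sorted list of the physical registers that were handed out."""
    return sorted(set(assigned.values()))


loads = [[(f"b{i}", ()), (f"c{i}", ())] for i in range(4)]
adds = [(f"s{i}", (f"b{i}", f"c{i}")) for i in range(4)]
stores = [(None, (f"s{i}",)) for i in range(4)]

# Schedule as in the listing: load, load, add, store, one group at a time.
sequential = [ins for i in range(4) for ins in (*loads[i], adds[i], stores[i])]
# Hypothetical alternative schedule: all eight loads first, then adds, stores.
interleaved = [ins for pair in loads for ins in pair] + adds + stores

# A made-up allocation order that puts R_5 right after R_0.
custom_order = ["R_0", "R_5", "R_1", "R_2", "R_3", "R_4", "R_6", "R_7"]

print(used(allocate(sequential)))
print(used(allocate(interleaved)))
print(used(allocate(sequential, order=custom_order)))
```

In this model the number of registers used is decided by the schedule (how
many values are live at once), while which registers are used is decided by
the allocation order. In LLVM itself the default allocation order follows the
order in which the registers are listed in the register class, so picking R_5
right after R_0 would normally mean reordering that list or, as far as I know,
giving an alternative order through AltOrders/AltOrderSelect in the same .td
file.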

Please help. I am stuck here.

Thank You

On Mon, Jun 26, 2017 at 2:12 PM, hameeza ahmed <hahmed2305 at gmail.com> wrote:

> Thank You
>
> On Sun, Jun 25, 2017 at 7:23 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>> Hi, Zvi,
>>
>> I agree. In the context of targeting the KNL, however, I'm a bit
>> concerned about the addressing, and specifically, the size of the resulting
>> encoding:
>>
>>     vmovdqu32   zmm0, zmmword ptr [rax + c+401280]        ; load c[401280] into zmm0
>>     vpaddd      zmm1, zmm1, zmmword ptr [rax + b+401344]  ; zmm1 <- zmm1 + b[401344]
>>
>> The KNL can only deliver 16 bytes per cycle from the icache to the
>> decoder. Essentially all of the instructions in the loop, as we seem to
>> generate it, have 10-byte encodings:
>>
>>   10:    62 f1 7e 48 6f 80 00     vmovdqu32 0x0(%rax),%zmm0
>>   17:    00 00 00
>>             16: R_X86_64_32S    c+0x61f00
>>
>> ...
>>   38:    62 f1 7d 48 fe 80 00     vpaddd 0x0(%rax),%zmm0,%zmm0
>>   3f:    00 00 00
>>             3e: R_X86_64_32S    b+0x61f00
>> ...
>>
>> and since this seems like a generic feature of how we generate code, it
>> seems like we can end up decoder limited (it might even be decoder limited
>> for this loop). We might want to be less aggressive in generating complex
>> addressing modes for the KNL. It seems like it would be better to
>> materialize the base array addresses into a register to make the encodings
>> shorter.
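
To put rough numbers on the paragraph above: the two dumps show a 4-byte EVEX
prefix, one opcode byte, one ModRM byte and a 4-byte displacement, which is
where the 10 bytes come from. The Python sketch below only does that
arithmetic. The 16 bytes/cycle figure is the one quoted above; the 7-byte case
is an assumption that the array base sits in a register and the remaining
offset fits EVEX's compressed 8-bit displacement (disp8 scaled by the 64-byte
operand size). I have not checked it against actual compiler output, and the
function names are made up for the example.

```python
FETCH_BYTES_PER_CYCLE = 16  # KNL icache-to-decoder bandwidth quoted above


def evex_length(disp_bytes):
    """EVEX prefix (4) + opcode (1) + ModRM (1) + displacement bytes."""
    return 4 + 1 + 1 + disp_bytes


def fetch_rate(length):
    """Instructions per cycle that the fetch bandwidth alone can supply."""
    return FETCH_BYTES_PER_CYCLE / length


def max_disp8_offset(operand_bytes=64):
    """Largest positive byte offset reachable with a scaled 8-bit displacement."""
    return 127 * operand_bytes


for name, disp in (("disp32, symbol+offset", 4), ("disp8, base in register", 1)):
    n = evex_length(disp)
    print(f"{name}: {n} bytes, {fetch_rate(n):.2f} instructions/cycle")
print("disp8 reach:", max_disp8_offset(), "bytes")
```

With 10-byte encodings the fetch path supplies 16 / 10 = 1.6 instructions per
cycle. A scaled disp8 reaches only 127 * 64 = 8128 bytes past the base, so a
heavily unrolled loop body would need the base register advanced as well.
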
>>
>>  -Hal
>>
>>
>> On 06/25/2017 07:14 AM, Rackover, Zvi wrote:
>>
>> Hi Ahmed,
>>
>>
>>
>> From what can be seen in the code snippet you provided, the reuse of ZMM0
>> and ZMM1 across loop-unroll instances does not inhibit instruction-level
>> parallelism.
>>
>> Modern X86 processors use register renaming that can eliminate the
>> dependencies in the instruction stream. In the example you provided, the
>> processor should be able to identify the 2-vloads + vadd + vstore sequences
>> as independent and pipeline their execution.
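
A toy model of the renaming described above, in Python. It is only a sketch:
`rename`, `dependency_groups` and the tuple format are made up for
illustration, memory dependencies are ignored (the stores go to different
addresses), and real hardware has a finite physical register file. Every write
to an architectural register gets a fresh physical register and reads use the
most recent mapping, so only true read-after-write dependencies remain.

```python
from itertools import count


def rename(program):
    """Rewrite (dest, srcs) from architectural to physical register names."""
    fresh = count()
    mapping = {}  # architectural name -> current physical name
    renamed = []
    for dest, srcs in program:
        phys_srcs = tuple(mapping[s] for s in srcs)  # read before the write
        phys_dest = None
        if dest is not None:
            phys_dest = f"p{next(fresh)}"  # fresh register for every write
            mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed


def dependency_groups(renamed):
    """Group instruction indices that are linked by producer/consumer edges."""
    parent = list(range(len(renamed)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    producer = {}
    for idx, (dest, srcs) in enumerate(renamed):
        for s in srcs:
            parent[find(idx)] = find(producer[s])
        if dest is not None:
            producer[dest] = idx

    groups = {}
    for idx in range(len(renamed)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Four unrolled copies of: load, load, add, store, all reusing zmm0 and zmm1.
program = []
for _ in range(4):
    program += [
        ("zmm0", ()),                # vector load
        ("zmm1", ()),                # vector load
        ("zmm0", ("zmm1", "zmm0")),  # vector add
        (None, ("zmm0",)),           # vector store
    ]

for group in dependency_groups(rename(program)):
    print(group)
```

Each unrolled copy ends up in its own group: after renaming, no instruction in
one copy reads a register written by another copy, even though the
architectural names are identical.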
>>
>>
>>
>> Thanks, Zvi
>>
>>
>>
>> *From:* Hal Finkel [mailto:hfinkel at anl.gov]
>> *Sent:* Saturday, June 24, 2017 05:17
>> *To:* hameeza ahmed <hahmed2305 at gmail.com>; llvm-dev at lists.llvm.org
>> *Cc:* Demikhovsky, Elena <elena.demikhovsky at intel.com>;
>> Rackover, Zvi <zvi.rackover at intel.com>;
>> Breger, Igor <igor.breger at intel.com>; craig.topper at gmail.com
>> *Subject:* Re: [llvm-dev] AVX Scheduling and Parallelism
>>
>>
>>
>> It is possible that the scheduling is constrained by pointer-aliasing
>> assumptions. Could you share the source for the loop in question?
>>
>> RIP-relative indexing, as I recall, is a feature of position-independent
>> code. Based on what's below, it might cause problems by making the
>> instruction encodings large. cc'ing some Intel folks for further comments.
>>
>>  -Hal
>>
>> On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote:
>>
>> Hello,
>>
>>
>>
>> After generating AVX code for a large number of iterations, I realized
>> that it still uses only 2 registers, zmm0 and zmm1, even when the loop
>> unroll factor is 1024.
>>
>> I wonder whether this register allocation allows operations to run in
>> parallel.
>>
>> Also, I know that all the elements within a single vector instruction are
>> computed in parallel, but are the elements of multiple instructions
>> computed in parallel too? For example, are 2 vmov instructions with
>> different registers executed in parallel? That seems possible because each
>> core has an AVX unit. Does the compiler exploit it?
>>
>> Secondly, I am generating assembly for Intel, and the memory operands
>> contain offsets such as the rip register or a constant added to the
>> address. Why is that so?
>>
>> eg. 1
>>
>>         vmovdqu32   zmm0, zmmword ptr [rip + c]
>>         vpaddd      zmm0, zmm0, zmmword ptr [rip + b]
>>         vmovdqu32   zmmword ptr [rip + a], zmm0
>>         vmovdqu32   zmm0, zmmword ptr [rip + c+64]
>>         vpaddd      zmm0, zmm0, zmmword ptr [rip + b+64]
>>
>> and
>>
>> eg. 2
>>
>>         mov         rax, -393216
>>         .p2align    4, 0x90
>> .LBB0_1:                            # %vector.body
>>                                     # =>This Inner Loop Header: Depth=1
>>         vmovdqu32   zmm1, zmmword ptr [rax + c+401344]        ; load c[401344] into zmm1
>>         vmovdqu32   zmm0, zmmword ptr [rax + c+401280]        ; load c[401280] into zmm0
>>         vpaddd      zmm1, zmm1, zmmword ptr [rax + b+401344]  ; zmm1 <- zmm1 + b[401344]
>>         vmovdqu32   zmmword ptr [rax + a+401344], zmm1        ; store zmm1 to a[401344]
>>         vmovdqu32   zmm1, zmmword ptr [rax + c+401216]
>>         vpaddd      zmm0, zmm0, zmmword ptr [rax + b+401280]  ; zmm0 <- zmm0 + b[401280]
>>         vmovdqu32   zmmword ptr [rax + a+401280], zmm0        ; store zmm0 to a[401280]
>>         vmovdqu32   zmm0, zmmword ptr [rax + c+401152]
>>
>> ... and the remaining instructions also use only zmm0 and zmm1.
>>
>>
>>
>> As the examples above show, more registers could have been used. I also
>> doubt whether the repeating instructions in eg. 2 are executed in
>> parallel. Why are zmm0 and zmm1 reused instead of spreading the work over
>> more zmm registers, so that everything without a dependency runs in
>> parallel? For example, in the listing above the blue instructions depend
>> on each other and the red instructions depend on each other, so
>> instructions of the same color cannot be executed in parallel, but can the
>> blue and red ones be executed in parallel with each other?
>>
>> Please correct me if I am wrong.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> ---------------------------------------------------------------------
>> Intel Israel (74) Limited
>>
>> This e-mail and any attachments may contain confidential material for
>> the sole use of the intended recipient(s). Any review or distribution
>> by others is strictly prohibited. If you are not the intended
>> recipient, please contact the sender and delete all copies.
>>
>>
>