<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <p>Hi, Zvi,</p>

    <p>I agree. In the context of targeting the KNL, however, I'm a bit

      concerned about the addressing, and specifically, the size of the

      resulting encoding:</p>

    <p>

      <blockquote type="cite">

        <div>

          <p class="MsoNormal"><span style="color:red">           

              vmovdqu32     zmm0, zmmword ptr [rax + c+401280]          

                 ; load c[401280] into zmm0</span><o:p></o:p></p>

        </div>

        <span style="color:blue">            vpaddd            zmm1,

          zmm1, zmmword ptr [rax + b+401344]          ;

          zmm1&lt;-zmm1+b[401344]</span></blockquote>

    </p>

    <p>The KNL can only deliver 16 bytes per cycle from the icache to

      the decoder. Essentially all of the instructions in the loop, as

      we seem to generate it, have 10-byte encodings:</p>

    <p>  10:    62 f1 7e 48 6f 80 00     vmovdqu32 0x0(%rax),%zmm0<br>

        17:    00 00 00 <br>

                  16: R_X86_64_32S    c+0x61f00<br>

    </p>

    <p>...<br>

    </p>

      38:    62 f1 7d 48 fe 80 00     vpaddd 0x0(%rax),%zmm0,%zmm0<br>

      3f:    00 00 00 <br>

                3e: R_X86_64_32S    b+0x61f00<br>

    ...<br>

    <br>

    and since this seems to be a generic feature of how we generate code,
    we can end up decoder limited (this loop itself might even be
    decoder limited). We might want to be less aggressive in
    generating complex addressing modes for the KNL. It would probably
    be better to materialize the base array addresses into
    registers to make the encodings shorter.<br>
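    <p>As a rough sketch of the arithmetic behind that concern (not a
      model of the actual hardware), the Python below tallies the bytes
      of an EVEX-encoded instruction with a memory operand and divides
      the 16-byte fetch width by that length. The helper names, the cap
      of two decoded instructions per cycle, and the idea that a
      register-based address would fit a one-byte (compressed)
      displacement are illustrative assumptions, not measurements.</p>
    <pre>
```python
# Back-of-the-envelope estimate only; all names here are hypothetical.

FETCH_BYTES_PER_CYCLE = 16  # icache-to-decoder bandwidth quoted above for KNL
DECODE_WIDTH = 2            # assumption: at most two instructions decoded per cycle


def evex_length(disp_bytes, has_sib=False):
    """Bytes in an EVEX instruction with a one-byte opcode, a memory
    operand, and no immediate."""
    prefix = 4                 # 0x62 plus three payload bytes
    opcode = 1
    modrm = 1
    sib = 1 if has_sib else 0  # only needed for base+index addressing
    return prefix + opcode + modrm + sib + disp_bytes


def fetch_limited_ipc(length):
    """Upper bound on instructions per cycle from fetch bandwidth alone."""
    return min(DECODE_WIDTH, FETCH_BYTES_PER_CYCLE / length)


# [rax + c+401280]: the symbol address travels in a 4-byte displacement.
current = evex_length(disp_bytes=4)

# Assumed alternative: array base held in a register, 1-byte displacement.
compact = evex_length(disp_bytes=1)

print(current, fetch_limited_ipc(current))
print(compact, fetch_limited_ipc(compact))
```
    </pre>
    <p>Under these assumptions, the 10-byte form tops out at 1.6
      instructions per cycle from fetch alone, whereas a 7-byte form
      would no longer be fetch limited at the assumed decode width.</p>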

    <br>

     -Hal<br>

    <br>

    <div class="moz-cite-prefix">On 06/25/2017 07:14 AM, Rackover, Zvi

      wrote:<br>

    </div>

    <blockquote

cite="mid:E9B72B0215CBD34C970FA39C0A83AFCA5440E8CD@hasmsx109.ger.corp.intel.com"

      type="cite">


      <div class="WordSection1">

        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D">Hi
            Ahmed,<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D">From
            what can be seen in the code snippet you provided, the reuse
            of ZMM0 and ZMM1 across loop-unroll instances does not
            inhibit instruction-level parallelism.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D">Modern
            x86 processors use register renaming, which can eliminate the
            false (write-after-read and write-after-write) dependencies
            that register reuse creates in the instruction stream. In the
            example you provided, the processor should be able to identify
            the 2-vloads + vadd + vstore sequences as independent and
            pipeline their execution.<o:p></o:p></span></p>

        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D">Thanks,
            Zvi<o:p></o:p></span></p>
        <p class="MsoNormal"><a moz-do-not-send="true"
            name="_MailEndCompose"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D"><o:p> </o:p></span></a></p>

        <div>

          <div style="border:none;border-top:solid #E1E1E1

            1.0pt;padding:3.0pt 0cm 0cm 0cm">

            <p class="MsoNormal"><a moz-do-not-send="true"
                name="_____replyseparator"></a><b><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:windowtext">From:</span></b><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:windowtext">
                Hal Finkel [<a class="moz-txt-link-freetext" href="mailto:hfinkel@anl.gov">mailto:hfinkel@anl.gov</a>]
                <br>
                <b>Sent:</b> Saturday, June 24, 2017 05:17<br>
                <b>To:</b> hameeza ahmed <a class="moz-txt-link-rfc2396E" href="mailto:hahmed2305@gmail.com">&lt;hahmed2305@gmail.com&gt;</a>;
                <a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
                <b>Cc:</b> Demikhovsky, Elena
                <a class="moz-txt-link-rfc2396E" href="mailto:elena.demikhovsky@intel.com">&lt;elena.demikhovsky@intel.com&gt;</a>; Rackover, Zvi
                <a class="moz-txt-link-rfc2396E" href="mailto:zvi.rackover@intel.com">&lt;zvi.rackover@intel.com&gt;</a>; Breger, Igor
                <a class="moz-txt-link-rfc2396E" href="mailto:igor.breger@intel.com">&lt;igor.breger@intel.com&gt;</a>; <a class="moz-txt-link-abbreviated" href="mailto:craig.topper@gmail.com">craig.topper@gmail.com</a><br>
                <b>Subject:</b> Re: [llvm-dev] AVX Scheduling and
                Parallelism<o:p></o:p></span></p>

          </div>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p>It is possible that scheduling is being constrained by
          pointer-aliasing assumptions. Could you share the
          source for the loop in question?<o:p></o:p></p>
        <p>RIP-relative addressing, as I recall, is a feature of
          position-independent code. Based on what's below, it might
          cause problems by making the instruction encodings large.
          Cc'ing some Intel folks for further comments.<o:p></o:p></p>

        <p> -Hal<o:p></o:p></p>

        <div>

          <p class="MsoNormal">On 06/23/2017 09:02 PM, hameeza ahmed via

            llvm-dev wrote:<o:p></o:p></p>

        </div>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <div>

            <p class="MsoNormal">Hello, <o:p></o:p></p>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">After generating AVX code for a large
                number of iterations, I realized that it still uses
                only two registers, zmm0 and zmm1, even when the loop
                unroll factor is 1024.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">I wonder whether this register allocation
                allows operations to execute in parallel.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Also, I know that all the elements within a
                single vector instruction are computed in parallel, but
                are the elements of multiple instructions also computed in
                parallel? For example, are two vmov instructions that use
                different registers executed in parallel? That seems
                possible, because each core has an AVX unit. Does the
                compiler exploit this?<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Second, I am generating assembly for
                Intel, and the memory operands contain offsets such as the
                rip register or a constant added to the index. Why is
                that?<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Example 1:<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <div>

                <p class="MsoNormal">            vmovdqu32     zmm0,

                  zmmword ptr [rip + c]<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">            vpaddd            zmm0,

                  zmm0, zmmword ptr [rip + b]<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">            vmovdqu32     zmmword

                  ptr [rip + a], zmm0<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">            vmovdqu32     zmm0,

                  zmmword ptr [rip + c+64]<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">            vpaddd            zmm0,

                  zmm0, zmmword ptr [rip + b+64]<o:p></o:p></p>

              </div>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">and <o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Example 2:<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <div>

                <p class="MsoNormal">mov     rax, -393216<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">            .p2align           4,

                  0x90<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">.LBB0_1:                          

                       # %vector.body<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">                                   

                      # =>This Inner Loop Header: Depth=1<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">            <span

                    style="color:blue">vmovdqu32     zmm1, zmmword ptr

                    [rax + c+401344]             ; load c[401344] in

                    zmm1</span><o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal"><span style="color:red">           

                    vmovdqu32     zmm0, zmmword ptr [rax + c+401280]    

                             ; load c[401280] into zmm0</span><o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal"><span style="color:blue">           

                    vpaddd            zmm1, zmm1, zmmword ptr [rax +

                    b+401344]          ; zmm1&lt;-zmm1+b[401344]</span><o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal"><span style="color:blue">           

                    vmovdqu32     zmmword ptr [rax + a+401344], zmm1    

                             ; store zmm1 to a[401344]</span><o:p></o:p></p>

              </div>

            </div>

            <div>

              <div>

                <p class="MsoNormal">            vmovdqu32     zmm1,

                  zmmword ptr [rax + c+401216]<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal"><span style="color:red">           

                    vpaddd            zmm0, zmm0, zmmword ptr [rax +

                    b+401280]           ; zmm0&lt;-zmm0+b[401280]</span><o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal"><span style="color:red">           

                    vmovdqu32     zmmword ptr [rax + a+401280], zmm0    

                              ; store zmm0 to a[401280]</span><o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal">            vmovdqu32     zmm0,

                  zmmword ptr [rax + c+401152]<o:p></o:p></p>

              </div>

            </div>

            <div>

              <p class="MsoNormal">........ the remaining
                instructions also use only zmm0 and zmm1.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">As you can see in the above examples,
                more registers could be used. I also doubt whether
                the repeating instructions in example 2 are
                executed in parallel. Why are zmm0 and zmm1 reused, rather
                than using more zmm registers so that all the instructions
                without dependencies run in parallel? For example, in the
                above code the blue instructions depend on one another and
                the red instructions depend on one another, so neither
                group can be parallelized internally, but can the blue and
                red groups be executed in parallel with each other?<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Please correct me if I am wrong.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

          </div>

          <p class="MsoNormal"><br>

            <br>

            <br>

            <o:p></o:p></p>

          <pre>_______________________________________________<o:p></o:p></pre>

          <pre>LLVM Developers mailing list<o:p></o:p></pre>

          <pre><a moz-do-not-send="true" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>

          <pre><a moz-do-not-send="true" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>

        </blockquote>

        <p class="MsoNormal"><br>

          <br>

          <o:p></o:p></p>

        <pre>-- <o:p></o:p></pre>

        <pre>Hal Finkel<o:p></o:p></pre>

        <pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>

        <pre>Leadership Computing Facility<o:p></o:p></pre>

        <pre>Argonne National Laboratory<o:p></o:p></pre>

      </div>


    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </body>

</html>