<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>It is possible that the scheduling is constrained by
pointer-aliasing assumptions. Could you share the source for
the loop in question?</p>
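<p>For illustration only, here is a minimal sketch of how aliasing
assumptions can be relaxed. It assumes the loop resembles
<code>a[i] = b[i] + c[i]</code> over 32-bit integers, which is inferred
from the <code>vpaddd</code> instructions quoted below. The function
name <code>add_arrays</code> and its pointer parameters are
hypothetical and not taken from the original source. With plain
pointers, the compiler has to assume that a store to <code>a</code>
might change a later load from <code>b</code> or <code>c</code>; C99
<code>restrict</code> promises that the arrays do not overlap, which
gives the scheduler more freedom to reorder loads and stores.</p>
<pre>
```c
/* Hypothetical stand-in for the loop under discussion; the real
 * source has not been posted. The int element type is an assumption
 * based on the vpaddd instructions in the quoted assembly. */
void add_arrays(int *restrict a, const int *restrict b,
                const int *restrict c, int n)
{
    /* restrict tells the compiler that a, b and c never overlap, so
     * loads from b and c may be moved across stores to a.
     * n must be non-negative. */
    for (int i = 0; i != n; ++i)
        a[i] = b[i] + c[i];
}
```
</pre>
<p>If the arrays are distinct globals, as the <code>[rip + c]</code>
operands suggest, the compiler can usually already see that they do not
overlap, so this may not apply to the loop in question.</p>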
<p>RIP-relative addressing, as I recall, is a feature of
position-independent code. Based on what's below, it might cause
problems by making the instruction encodings large. I'm cc'ing some
Intel folks for further comments.</p>
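<p>As a worked check of the constant offsets in the second quoted
example, the sketch below adds the index register to the displacement.
It assumes <code>rax</code> still holds the value set by
<code>mov rax, -393216</code> on the first iteration; the helper name
<code>operand_offset</code> is hypothetical.</p>
<pre>
```c
/* Byte offset, relative to the array symbol, touched by an operand of
 * the form [reg + sym + disp]. Illustrative arithmetic only. */
long operand_offset(long reg, long disp)
{
    return reg + disp;
}
```
</pre>
<p>With these inputs, <code>operand_offset(-393216, 401344)</code> is
8128 and <code>operand_offset(-393216, 401280)</code> is 8064, so the
large constants cancel against the negative starting index. Presumably
the index register then counts up toward zero, but the quoted excerpt
does not show the loop increment.</p>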
<p> -Hal<br>
</p>
<div class="moz-cite-prefix">On 06/23/2017 09:02 PM, hameeza ahmed
via llvm-dev wrote:<br>
</div>
<blockquote
cite="mid:CAFMPKeamGYw9fynq5vE7_KFRy3sHBG+cz7uQVGEK_mhTbSmQsg@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<div dir="ltr">Hello,
<div><br>
</div>
<div>After generating AVX-512 code for a loop with a large number of
iterations, I realized that it still uses only two registers, zmm0
and zmm1, even when the loop unroll factor is 1024.</div>
<div><br>
</div>
<div>I wonder whether this register allocation still allows
operations to execute in parallel.</div>
<div><br>
</div>
<div>Also, I know that all the elements within a single vector
instruction are computed in parallel, but are the elements of
multiple instructions computed in parallel too? For example, are
two vmov instructions with different registers executed in parallel?
That seems possible because each core has an AVX unit. Does the
compiler exploit it?</div>
<div><br>
</div>
<div><br>
</div>
<div>Secondly, I am generating assembly for Intel, and the memory
operands contain an offset such as the rip register or a constant
added to the index. Why is that?</div>
<div>eg. 1</div>
<div><br>
</div>
<div>
<div><span style="white-space:pre"> </span>vmovdqu32<span style="white-space:pre"> </span>zmm0,
zmmword ptr [rip + c]</div>
<div><span style="white-space:pre"> </span>vpaddd<span style="white-space:pre"> </span>zmm0,
zmm0, zmmword ptr [rip + b]</div>
<div><span style="white-space:pre"> </span>vmovdqu32<span style="white-space:pre"> </span>zmmword
ptr [rip + a], zmm0</div>
<div><span style="white-space:pre"> </span>vmovdqu32<span style="white-space:pre"> </span>zmm0,
zmmword ptr [rip + c+64]</div>
<div><span style="white-space:pre"> </span>vpaddd<span style="white-space:pre"> </span>zmm0,
zmm0, zmmword ptr [rip + b+64]</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div>and </div>
<div><br>
</div>
<div>eg. 2</div>
<div><br>
</div>
<div>
<div>mov<span style="white-space:pre"> </span>rax, -393216</div>
<div><span style="white-space:pre"> </span>.p2align<span style="white-space:pre"> </span>4,
0x90</div>
<div>.LBB0_1: # %vector.body</div>
<div> # =>This Inner
Loop Header: Depth=1</div>
<div><span style="white-space:pre"> </span><font
color="#0000ff">vmovdqu32<span style="white-space:pre"> </span>zmm1,
zmmword ptr [rax + c+401344] ; load c[401344]
in zmm1</font></div>
<div><font color="#ff0000"><span style="white-space:pre"> </span>vmovdqu32<span style="white-space:pre"> </span>zmm0,
zmmword ptr [rax + c+401280] ; load c[401280]
in zmm0</font></div>
<div><font color="#0000ff"><span style="white-space:pre"> </span>vpaddd<span style="white-space:pre"> </span>zmm1,
zmm1, zmmword ptr [rax + b+401344] ;
zmm1<-zmm1+b[401344]</font></div>
<div><font color="#0000ff"><span style="white-space:pre"> </span>vmovdqu32<span style="white-space:pre"> </span>zmmword
ptr [rax + a+401344], zmm1 ; store zmm1 in
a[401344]</font></div>
</div>
<div>
<div><font color="#000000"><span style="white-space:pre"> </span>vmovdqu32<span style="white-space:pre"> </span>zmm1,
zmmword ptr [rax + c+401216]</font></div>
<div><font color="#ff0000"><span style="white-space:pre"> </span>vpaddd<span style="white-space:pre"> </span>zmm0,
zmm0, zmmword ptr [rax + b+401280] ;
zmm0<-zmm0+b[401280]</font></div>
<div><font color="#ff0000"><span style="white-space:pre"> </span>vmovdqu32<span style="white-space:pre"> </span>zmmword
ptr [rax + a+401280], zmm0 ; store zmm0 in
a[401280]</font></div>
<div><font color="#000000"><span style="white-space:pre"> </span>vmovdqu32<span style="white-space:pre"> </span>zmm0,
zmmword ptr [rax + c+401152]</font></div>
</div>
<div>........ in the remaining instructions, too, only
zmm0 and zmm1 are used.</div>
<div><br>
</div>
<div>As the examples above show, more registers could be used. I
also wonder whether the repeating instructions in eg. 2 are
executed in parallel. Why are only zmm0 and zmm1 reused? Couldn't
more zmm registers be used, so that all the instructions without
dependencies run in parallel? For example, above, the blue
instructions depend on one another and the red instructions depend
on one another, so the instructions within each group cannot be
executed in parallel, but can the blue and red groups be executed
in parallel with each other?</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div>Please correct me if I am wrong.</div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>