<div dir="ltr">Further, I would like to walk through it with actual values, since it is very confusing.<div><br></div><div>vmovdqa64<span style="white-space:pre"> </span>zmm22, zmmword ptr [rip + .LCPI0_0] ; I assume this moves 64-bit values from the listed indexes, though I still believe each value needs to be 32-bit. The indexes are [8, 9, 10, 11, 12, 13, 14, 15]. When these indexes are added to rip, the result points to the values actually stored at those locations, so zmm22 will contain values, not indexes. Suppose [8]={1}, [9]={5}, [10]={4}, and so on; then zmm22={1, 5, 4, 3, 8, 7, 6, 2}. These are the 64-bit values loaded from those memory indexes. </div><div><br></div><div>vpbroadcastq<span style="white-space:pre"> </span>zmm2, qword ptr [rip + .LCPI0_2] ; here .LCPI0_2=4000 means broadcast the value at this index; for example, if this location contains 2, then zmm2={2,2,2,2.....2}.<br></div><div><br></div><div>vpmuludq<span style="white-space:pre"> </span>zmm14, zmm10, zmm2 ; this step multiplies values, not indexes. There seems to be no point in multiplying these values here, since we haven't used A and B yet?<br></div><div><br></div><div>Please clarify my understanding of these initial steps; only once they are clear will I be able to move forward.</div><div><br></div><div>Thank You</div><div><br></div><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Jul 1, 2017 at 3:47 AM, hameeza ahmed <span dir="ltr"><<a href="mailto:hahmed2305@gmail.com" target="_blank">hahmed2305@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail-HOEnZb"><div class="gmail-h5"><div dir="ltr"><br><div class="gmail_quote">---------- 
Forwarded message ----------<br>From: <b class="gmail_sendername">hameeza ahmed</b> <span dir="ltr"><<a href="mailto:hahmed2305@gmail.com" target="_blank">hahmed2305@gmail.com</a>></span><br>Date: Sat, Jul 1, 2017 at 3:46 AM<br>Subject: Re: [llvm-dev] KNL Assembly Code for Matrix Multiplication<br>To: Craig Topper <<a href="mailto:craig.topper@gmail.com" target="_blank">craig.topper@gmail.com</a>><br><br><br><div dir="ltr">Thank You. <div><br></div><div>In this step:</div><span><div><div><span style="white-space:pre-wrap"> </span>vmovdqa64<span style="white-space:pre-wrap"> </span>zmm22, zmmword ptr [rip + .LCPI0_0] # zmm22 = [8,9,10,11,12,13,14,15]</div></div></span><div>the indexes are 64-bit, but the element stored at each of these positions is 32-bit, since we are dealing with integers, and the IR also shows this.</div><div>Here we are loading 32-bit values from those 64-bit indexes, which means zmm22 will hold 32-bit values from these 64-bit positions, so there is capacity for 16 32-bit elements. Then why all this?</div><div><br></div><div>This appears in the IR as:</div><div><br></div><div><div> %5 = getelementptr inbounds [1000 x i32], [1000 x i32]* %0, i64 %indvars.iv34, i64 %4</div><div> %6 = bitcast i32* %5 to <16 x i32>*</div><div> %wide.load = load <16 x i32>, <16 x i32>* %6, align 4, !tbaa !1</div></div><div><br></div><div><br></div><div>Here the indvars are 64-bit values, but the values loaded from these indexes (step 3) are 32-bit?</div><div><br></div><div>Please correct me.</div><div><br></div></div><div class="gmail-m_-5512755164335353765HOEnZb"><div class="gmail-m_-5512755164335353765h5"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 30, 2017 at 8:59 PM, Craig Topper <span dir="ltr"><<a href="mailto:craig.topper@gmail.com" target="_blank">craig.topper@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div 
dir="ltr">Some comments inline, I'll need to look more later.<div class="gmail_extra"><br clear="all"><div><div class="gmail-m_-5512755164335353765m_156371529393160872m_-7887813073697790105gmail_signature">~Craig</div></div>
<br><div class="gmail_quote"><div><div class="gmail-m_-5512755164335353765m_156371529393160872h5">On Fri, Jun 30, 2017 at 5:28 AM, hameeza ahmed via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello, I want some help in understanding knl intel assembly of matrix multiplication code. some of the things are not clear;<div><br></div><div>here .c file:</div><div><br></div><div><div>#include <stdio.h></div><div>#define N 1000</div><div> </div><div>// This function multiplies A[][] and B[][], and stores</div><div>// the result in C[][]</div><div>void multiply(int A[][N], int B[][N], int C[][N])</div><div>{</div><div> int i, j, k, r;</div><div> for (i = 0; i < N; i++)</div><div> {</div><div> for (j = 0; j < N; j++)</div><div> {</div><div> r = 0;</div><div> for (k = 0; k < N; k++) {</div><div> r += A[i][k]*B[k][j];}</div><div> C[i][j] = r;</div><div><br></div><div> }</div><div> </div><div> }</div><div>}</div><div> </div></div><div>here .s file: <font color="#ff0000"><b> the code that i want to ask is in red color.</b></font></div><div><br></div><div><div><span style="white-space:pre-wrap"> </span>.text</div><div><span style="white-space:pre-wrap"> </span>.intel_syntax noprefix</div><div><span style="white-space:pre-wrap"> </span>.file<span style="white-space:pre-wrap"> </span>"matn_o3.ll"</div><div><span style="white-space:pre-wrap"> </span>.section<span style="white-space:pre-wrap"> </span>.rodata,"a",@progbits</div><div><span style="white-space:pre-wrap"> </span>.p2align<span style="white-space:pre-wrap"> </span>6</div><div>.LCPI0_0:</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>8 # 0x8</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>9 # 
0x9</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>10 # 0xa</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>11 # 0xb</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>12 # 0xc</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>13 # 0xd</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>14 # 0xe</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>15 # 0xf</div><div>.LCPI0_1:</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>0 # 0x0</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>1 # 0x1</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>2 # 0x2</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>3 # 0x3</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>4 # 0x4</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>5 # 0x5</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>6 # 0x6</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>7 # 0x7</div><div><span style="white-space:pre-wrap"> </span>.section<span style="white-space:pre-wrap"> </span>.rodata.cst8,"aM",@progbits,8</div><div><span style="white-space:pre-wrap"> </span>.p2align<span style="white-space:pre-wrap"> </span>3</div><div>.LCPI0_2:</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>4000 # 0xfa0</div><div>.LCPI0_3:</div><div><span style="white-space:pre-wrap"> 
</span>.quad<span style="white-space:pre-wrap"> </span>64000 # 0xfa00</div><div>.LCPI0_4:</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>128000 # 0x1f400</div><div>.LCPI0_5:</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>192000 # 0x2ee00</div><div>.LCPI0_6:</div><div><span style="white-space:pre-wrap"> </span>.quad<span style="white-space:pre-wrap"> </span>64 # 0x40</div><div><span style="white-space:pre-wrap"> </span>.text</div><div><span style="white-space:pre-wrap"> </span>.globl<span style="white-space:pre-wrap"> </span>multiply</div><div><span style="white-space:pre-wrap"> </span>.p2align<span style="white-space:pre-wrap"> </span>4, 0x90</div><div><span style="white-space:pre-wrap"> </span>.type<span style="white-space:pre-wrap"> </span>multiply,@function</div><div>multiply: # @multiply</div><div><span style="white-space:pre-wrap"> </span>.cfi_startproc</div><div># BB#0:</div><div><span style="white-space:pre-wrap"> </span>push<span style="white-space:pre-wrap"> </span>rbp</div><div>.Lcfi0:</div><div><span style="white-space:pre-wrap"> </span>.cfi_def_cfa_offset 16</div><div><span style="white-space:pre-wrap"> </span>push<span style="white-space:pre-wrap"> </span>r15</div><div>.Lcfi1:</div><div><span style="white-space:pre-wrap"> </span>.cfi_def_cfa_offset 24</div><div><span style="white-space:pre-wrap"> </span>push<span style="white-space:pre-wrap"> </span>r14</div><div>.Lcfi2:</div><div><span style="white-space:pre-wrap"> </span>.cfi_def_cfa_offset 32</div><div><span style="white-space:pre-wrap"> </span>push<span style="white-space:pre-wrap"> </span>r12</div><div>.Lcfi3:</div><div><span style="white-space:pre-wrap"> </span>.cfi_def_cfa_offset 40</div><div><span style="white-space:pre-wrap"> </span>push<span style="white-space:pre-wrap"> </span>rbx</div><div>.Lcfi4:</div><div><span style="white-space:pre-wrap"> </span>.cfi_def_cfa_offset 
48</div><div>.Lcfi5:</div><div><span style="white-space:pre-wrap"> </span>.cfi_offset rbx, -48</div><div>.Lcfi6:</div><div><span style="white-space:pre-wrap"> </span>.cfi_offset r12, -40</div><div>.Lcfi7:</div><div><span style="white-space:pre-wrap"> </span>.cfi_offset r14, -32</div><div>.Lcfi8:</div><div><span style="white-space:pre-wrap"> </span>.cfi_offset r15, -24</div><div>.Lcfi9:</div><div><span style="white-space:pre-wrap"> </span>.cfi_offset rbp, -16</div><div><span style="white-space:pre-wrap"> </span>lea<span style="white-space:pre-wrap"> </span>r8, [rdi + 3856]</div><div><span style="white-space:pre-wrap"> </span>xor<span style="white-space:pre-wrap"> </span>r9d, r9d</div><div><span style="white-space:pre-wrap"> </span>vmovdqa64<span style="white-space:pre-wrap"> </span>zmm22, zmmword ptr [rip + .LCPI0_0] # zmm22 = [8,9,10,11,12,13,14,15]</div><div><span style="white-space:pre-wrap"> </span>vmovdqa64<span style="white-space:pre-wrap"> </span>zmm23, zmmword ptr [rip + .LCPI0_1] # zmm23 = [0,1,2,3,4,5,6,7]</div><div><span style="white-space:pre-wrap"> </span>vpbroadcastq<span style="white-space:pre-wrap"> </span>zmm2, qword ptr [rip + .LCPI0_2]</div><div><span style="white-space:pre-wrap"> </span>vpbroadcastq<span style="white-space:pre-wrap"> </span>zmm3, rsi</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>rsi, 3856000</div><div><span style="white-space:pre-wrap"> </span>vpbroadcastq<span style="white-space:pre-wrap"> </span>zmm4, qword ptr [rip + .LCPI0_3]</div><div><span style="white-space:pre-wrap"> </span>vpbroadcastq<span style="white-space:pre-wrap"> </span>zmm5, qword ptr [rip + .LCPI0_4]</div><div><span style="white-space:pre-wrap"> </span>vpbroadcastq<span style="white-space:pre-wrap"> </span>zmm6, qword ptr [rip + .LCPI0_5]</div><div><span style="white-space:pre-wrap"> </span>kxnorw<span style="white-space:pre-wrap"> </span>k1, k0, k0</div><div><span style="white-space:pre-wrap"> 
</span>kshiftrw<span style="white-space:pre-wrap"> </span>k1, k1, 8</div><div><span style="white-space:pre-wrap"> </span>vpbroadcastq<span style="white-space:pre-wrap"> </span>zmm7, qword ptr [rip + .LCPI0_6]</div><div><span style="white-space:pre-wrap"> </span>.p2align<span style="white-space:pre-wrap"> </span>4, 0x90</div><div>.LBB0_1: # %.preheader26</div><div> # =>This Loop Header: Depth=1</div><div> # Child Loop BB0_2 Depth 2</div><div> # Child Loop BB0_3 Depth 3</div><div> # Child Loop BB0_5 Depth 3</div><div><span style="white-space:pre-wrap"> </span>xor<span style="white-space:pre-wrap"> </span>r11d, r11d</div><div><span style="white-space:pre-wrap"> </span>.p2align<span style="white-space:pre-wrap"> </span>4, 0x90</div><div>.LBB0_2: # %.preheader</div><div> # Parent Loop BB0_1 Depth=1</div><div> # => This Loop Header: Depth=2</div><div> # Child Loop BB0_3 Depth 3</div><div> # Child Loop BB0_5 Depth 3</div><div><span style="white-space:pre-wrap"> </span>vpxord<span style="white-space:pre-wrap"> </span>zmm8, zmm8, zmm8</div><div><span style="white-space:pre-wrap"> </span>mov<span style="white-space:pre-wrap"> </span>ecx, 960</div><div><span style="white-space:pre-wrap"> </span>vmovdqa64<span style="white-space:pre-wrap"> </span>zmm9, zmm23</div><div><span style="white-space:pre-wrap"> </span>vmovdqa64<span style="white-space:pre-wrap"> </span>zmm10, zmm22</div><div><span style="white-space:pre-wrap"> </span>vpxord<span style="white-space:pre-wrap"> </span>zmm11, zmm11, zmm11</div><div><span style="white-space:pre-wrap"> </span>vpxord<span style="white-space:pre-wrap"> </span>zmm12, zmm12, zmm12</div><div><span style="white-space:pre-wrap"> </span>vpxord<span style="white-space:pre-wrap"> </span>zmm13, zmm13, zmm13</div><div><span style="white-space:pre-wrap"> </span>.p2align<span style="white-space:pre-wrap"> </span>4, 0x90</div><div>.LBB0_3: # %vector.body</div><div> # Parent Loop BB0_1 Depth=1</div><div> # Parent Loop BB0_2 Depth=2</div><div> # => This 
Inner Loop Header: Depth=3</div><div> # this bb will run 15 times</div><div><span style="white-space:pre-wrap"> </span>vmovq<span style="white-space:pre-wrap"> </span>rax, xmm9</div><div><span style="white-space:pre-wrap"> </span>imul<span style="white-space:pre-wrap"> </span>r10, r9, 4000</div><div><span style="white-space:pre-wrap"> </span>lea<span style="white-space:pre-wrap"> </span>rbx, [rdi + r10]</div><div><span style="white-space:pre-wrap"> </span><b><font color="#ff0000">vpmuludq</font><span style="color:rgb(255,0,0);white-space:pre-wrap"> </span><font color="#ff0000">zmm14, zmm10, zmm2 ; </font><font color="#0000ff">this is the vector BB. Here we have to do a gather for B because of the arbitrary addresses, so zmm10=[8,9,10,11,12,13,14,15]. Does that mean zmm10 contains the 8 values present at these indexes? And zmm2=[4000, 4000,.....4000]. These are the indexes for B; we need to multiply the indexes by stride=4000. I know these indexes are 64-bit, but the values stored at these locations are 32-bit, so the load using the zmm10 index will give the 8 32-bit elements present at these locations. So do the registers contain 8 32-bit elements present at the specified indexes? After the multiplication we get the indexes for the higher 8 elements of B, i.e. [3200,3600,40000,.......54000]<wbr>.</font></b></div><div><b><font color="#0000ff"><br></font></b></div><div><b><span style="color:rgb(255,0,0);white-space:pre-wrap"> </span><font color="#ff0000">vpsrlq</font><span style="color:rgb(255,0,0);white-space:pre-wrap"> </span><font color="#ff0000">zmm15, zmm10, 32 ; </font><font color="#0000ff">I don't understand the need for this step; please explain the purpose of all these steps. Here vpsrlq will shift the zmm10 values right by 256 bits (32*8). zmm10 is initially </font></b><b><font color="#0000ff">[8,9,10,11,12,13,14,<wbr>15]. It will now become [0,0,0,0,8,9,10,11]. Am I correct? 
Please explain the purpose of this step.</font></b></div><div><b><span style="color:rgb(255,0,0);white-space:pre-wrap"> </span><font color="#ff0000">vpmuludq</font><span style="color:rgb(255,0,0);white-space:pre-wrap"> </span><font color="#ff0000">zmm15, zmm15, zmm2 ; </font><font color="#0000ff">similarly, I </font></b><b><font color="#0000ff">don't understand the need for this step.</font></b><b><font color="#0000ff"> </font><font color="#ff0000"> </font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpsllq<span style="white-space:pre-wrap"> </span>zmm15, zmm15, 32 ; </b></font><b><font color="#0000ff">I don't understand the need for this step</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm14, zmm14, zmm3 ; </b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm14, zmm15, zmm14 ; </b></font><b><font color="#0000ff">I don't understand the need for this step</font></b></div></div></div></blockquote><div><br></div></div></div><div>vpsrlq zmm15, zmm10, 32 shifts every 64-bit element in zmm10 right by 32 bits. I believe this effectively takes every odd-numbered 32-bit element and moves it down to the next lower even-numbered 32-bit element.</div><div><br></div><div>vpmuludq multiplies all even-numbered 32-bit elements and creates 64-bit results.</div><div><br></div><div>The combination of the shifts, vpmuludq, and vpaddq multiplies 64-bit elements and produces a 64-bit result per element. We don't have an instruction for this, so we have to multiply the low 32 bits and the high 32 bits of each element separately and add the results together. 
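<div><br></div><div>As a rough scalar sketch of that decomposition for a single 64-bit lane (the C function names here are made up for illustration; this only models the arithmetic and is not what the compiler emits), mul64_full is the general form and mul64_b_small drops the partial product that is zero when b fits in 32 bits:</div>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vpmuludq for one 64-bit lane: multiply the low 32 bits
   of each input and keep the full 64-bit product. */
static uint64_t pmuludq_lane(uint64_t a, uint64_t b) {
    return (uint64_t)(uint32_t)a * (uint64_t)(uint32_t)b;
}

/* General 64x64 -> low 64-bit multiply built from 32-bit partial products.
   The a_hi * b_hi term is omitted because it only affects bits >= 64. */
static uint64_t mul64_full(uint64_t a, uint64_t b) {
    uint64_t lo_lo = pmuludq_lane(a, b);        /* a_lo * b_lo */
    uint64_t hi_lo = pmuludq_lane(a >> 32, b);  /* a_hi * b_lo */
    uint64_t lo_hi = pmuludq_lane(a, b >> 32);  /* a_lo * b_hi */
    return lo_lo + ((hi_lo + lo_hi) << 32);
}

/* Shortened form, valid only when the high 32 bits of b are zero
   (for example b = 4000): the a_lo * b_hi product is zero and is skipped. */
static uint64_t mul64_b_small(uint64_t a, uint64_t b) {
    assert((b >> 32) == 0);
    uint64_t lo_lo = pmuludq_lane(a, b);        /* like vpmuludq zmm14, zmm10, zmm2 */
    uint64_t hi_lo = pmuludq_lane(a >> 32, b);  /* like vpsrlq, then vpmuludq */
    return lo_lo + (hi_lo << 32);               /* like vpsllq, then vpaddq */
}
```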
Looks like we determined that the high 32-bits of one of the inputs is all zeros so we skipped 1 of the multiplies and adds that would normally be required for this operation.</div><span><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpbroadcastq<span style="white-space:pre-wrap"> </span>zmm15, r11 ; </b></font><b><font color="#0000ff">r11 changes when loop variable j changes whats the need of this step?</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpsllq<span style="white-space:pre-wrap"> </span>zmm15, zmm15, 2 ; </b></font><b><font color="#0000ff">dont understand the need for this step</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm14, zmm14, zmm15 ; </b></font><b><font color="#0000ff">dont understand the need for this step</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpmuludq<span style="white-space:pre-wrap"> </span>zmm16, zmm9, zmm2 ; </b></font><b><font color="#0000ff">here same as before the lower 8 elements of B indexes are computed as Zmm16=[0,4000,8000,.......2800<wbr>0]</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpsrlq<span style="white-space:pre-wrap"> </span>zmm17, zmm9, 32 </b></font><font color="#ff0000"><b>; </b></font><b><font color="#0000ff">dont understand the need for this step</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpmuludq<span style="white-space:pre-wrap"> </span>zmm17, zmm17, zmm2 </b></font><font color="#ff0000"><b>; </b></font><b><font color="#0000ff">dont understand the need for this step</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> 
</span>vpsllq<span style="white-space:pre-wrap"> </span>zmm17, zmm17, 32 </b></font><font color="#ff0000"><b>; </b></font><b><font color="#0000ff">dont understand the need for this step</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm16, zmm16, zmm3 </b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm16, zmm17, zmm16 </b></font><font color="#ff0000"><b>; </b></font><b><font color="#0000ff">dont understand the need for this step</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm15, zmm16, zmm15 </b></font><font color="#ff0000"><b>; </b></font><b><font color="#0000ff">dont understand the need for this step</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm16, zmm15, zmm4</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm17, zmm14, zmm4</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm18, zmm15, zmm5</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm19, zmm14, zmm5</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm20, zmm15, zmm6</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm21, zmm14, zmm6</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>kmovw<span style="white-space:pre-wrap"> </span>k2, k1 </b></font><font 
color="#ff0000"><b>; </b></font><b><font color="#0000ff">I don't understand the need for this step</font></b></div></div></div></blockquote><div><br></div></span><div>The gather instruction requires a mask of which elements to read. When the gather completes, if there are no faults, it will have cleared the mask register to 0, so the mask needs to be reloaded for each gather.</div><div><div class="gmail-m_-5512755164335353765m_156371529393160872h5"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><b><span style="color:rgb(255,0,0);white-space:pre-wrap"> </span><font color="#ff0000">vpgatherqd</font><span style="color:rgb(255,0,0);white-space:pre-wrap"> </span><font color="#ff0000">ymm0 {k2}, zmmword ptr [zmm14] ; </font><font color="#0000ff">since zmm14 contains 8 indexes (or the values at these 8 indexes?), it will load 8 elements, not 16. Here it should be zmm14</font></b><b><font color="#0000ff">=[3200,3600,40000,.......<wbr>54000], but after the above computation these indexes have changed?</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>kxnorw<span style="white-space:pre-wrap"> </span>k2, k0, k0 </b></font><font color="#ff0000"><b>; </b></font><b><font color="#0000ff">I don't understand the need for this step</font></b></div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpgatherqd<span style="white-space:pre-wrap"> </span>ymm14 {k2}, zmmword ptr [zmm15] </b></font><font color="#ff0000"><b>; </b></font><b><font color="#0000ff">here again there is an issue with the index zmm15. 
it should be </font></b><b><font color="#0000ff">[0,4000,8000,.......28000] but its different due to above computation.</font></b></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vinserti64x4<span style="white-space:pre-wrap"> </span>zmm0, zmm14, ymm0, 1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>kmovw<span style="white-space:pre-wrap"> </span>k2, k1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpgatherqd<span style="white-space:pre-wrap"> </span>ymm14 {k2}, zmmword ptr [zmm17]</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>kxnorw<span style="white-space:pre-wrap"> </span>k2, k0, k0</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpgatherqd<span style="white-space:pre-wrap"> </span>ymm15 {k2}, zmmword ptr [zmm16]</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vinserti64x4<span style="white-space:pre-wrap"> </span>zmm14, zmm15, ymm14, 1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>kmovw<span style="white-space:pre-wrap"> </span>k2, k1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpgatherqd<span style="white-space:pre-wrap"> </span>ymm15 {k2}, zmmword ptr [zmm19]</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>kxnorw<span style="white-space:pre-wrap"> </span>k2, k0, k0</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpgatherqd<span style="white-space:pre-wrap"> </span>ymm16 {k2}, zmmword ptr [zmm18]</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vinserti64x4<span style="white-space:pre-wrap"> </span>zmm15, zmm16, ymm15, 1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> 
</span>kmovw<span style="white-space:pre-wrap"> </span>k2, k1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpgatherqd<span style="white-space:pre-wrap"> </span>ymm1 {k2}, zmmword ptr [zmm21]</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>kxnorw<span style="white-space:pre-wrap"> </span>k2, k0, k0</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpgatherqd<span style="white-space:pre-wrap"> </span>ymm16 {k2}, zmmword ptr [zmm20]</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vinserti64x4<span style="white-space:pre-wrap"> </span>zmm1, zmm16, ymm1, 1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpmulld<span style="white-space:pre-wrap"> </span>zmm0, zmm0, zmmword ptr [rbx + 4*rax]</b></font></div><div><span style="white-space:pre-wrap"> </span>vpmulld<span style="white-space:pre-wrap"> </span>zmm14, zmm14, zmmword ptr [rbx + 4*rax + 64]</div><div><span style="white-space:pre-wrap"> </span>vpmulld<span style="white-space:pre-wrap"> </span>zmm15, zmm15, zmmword ptr [rbx + 4*rax + 128]</div><div><span style="white-space:pre-wrap"> </span>vpmulld<span style="white-space:pre-wrap"> </span>zmm1, zmm1, zmmword ptr [rbx + 4*rax + 192]</div><div><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm8, zmm0, zmm8</div><div><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm11, zmm14, zmm11</div><div><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm12, zmm15, zmm12</div><div><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm13, zmm1, zmm13</div><div><span style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm9, zmm9, zmm7 #zmm7=64</div><div><span 
style="white-space:pre-wrap"> </span>vpaddq<span style="white-space:pre-wrap"> </span>zmm10, zmm10, zmm7</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>rcx, -64 #decrement counter by 64</div><div><span style="white-space:pre-wrap"> </span>jne<span style="white-space:pre-wrap"> </span>.LBB0_3 # if rcx is not equal to zero, go to .LBB0_3</div><div># BB#4: # %middle.block</div><div> # in Loop: Header=BB0_2 Depth=2</div><div><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm0, zmm11, zmm8</div><div><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm0, zmm12, zmm0</div><div><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm0, zmm13, zmm0</div><div><span style="white-space:pre-wrap"> </span><b><font color="#ff0000">vshufi64x2</font><span style="color:rgb(255,0,0);white-space:pre-wrap"> </span><font color="#ff0000">zmm1, zmm0, zmm0, 14 # zmm1 = zmm0[4,5,6,7,0,1,0,1] </font><font color="#0000ff">; please explain how the shuffle instructions work here. I know the LLVM IR shuffle, but these assembly ones are difficult for me to understand</font></b></div></div></div></blockquote><div><br></div></div></div><div>You have to look at the size of the register being mentioned and the number of elements in brackets. In this case the register is 512 bits and the number of elements is 8. 512/8 is 64, so it's a shuffle of a v8i64 vector. Then we read the element numbers from left to right, just like the shuffle IR instruction.</div><div><br></div><div>So element 0 of zmm1 gets the value of element 4 of zmm0. 
Element 1 of zmm1 gets the value of element 5 of zmm0, etc.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div class="gmail-m_-5512755164335353765m_156371529393160872h5"><div dir="ltr"><div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm0, zmm0, zmm1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vshufi64x2<span style="white-space:pre-wrap"> </span>zmm1, zmm0, zmm0, 1 # zmm1 = zmm0[2,3,0,1,0,1,0,1]</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm0, zmm0, zmm1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpshufd<span style="white-space:pre-wrap"> </span>zmm1, zmm0, 238 # zmm1 = zmm0[2,3,2,3,6,7,6,7,10,11,10,<wbr>11,14,15,14,15]</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm0, zmm0, zmm1</b></font></div><div><font color="#ff0000"><b><span style="white-space:pre-wrap"> </span>vpshufd<span style="white-space:pre-wrap"> </span>zmm1, zmm0, 229 # zmm1 = zmm0[1,1,2,3,5,5,6,7,9,9,10,11<wbr>,13,13,14,15]</b></font></div><div><span style="white-space:pre-wrap"> </span>vpaddd<span style="white-space:pre-wrap"> </span>zmm0, zmm0, zmm1</div><div><span style="white-space:pre-wrap"> </span>vmovd<span style="white-space:pre-wrap"> </span>ebx, xmm0</div><div><span style="white-space:pre-wrap"> </span>mov<span style="white-space:pre-wrap"> </span>rax, r8</div><div><span style="white-space:pre-wrap"> </span>xor<span style="white-space:pre-wrap"> </span>r14d, r14d</div><div><span style="white-space:pre-wrap"> </span>.p2align<span style="white-space:pre-wrap"> </span>4, 0x90</div><div>.LBB0_5: # Parent Loop BB0_1 Depth=1</div><div> # Parent Loop 
BB0_2 Depth=2</div><div> # => This Inner Loop Header: Depth=3</div><div><span style="white-space:pre-wrap"> </span>lea<span style="white-space:pre-wrap"> </span>r15, [rsi + r14]</div><div><span style="white-space:pre-wrap"> </span>mov<span style="white-space:pre-wrap"> </span>r12d, dword ptr [r15 + 4*r11 - 16000]</div><div><span style="white-space:pre-wrap"> </span>imul<span style="white-space:pre-wrap"> </span>r12d, dword ptr [rax - 16]</div><div><span style="white-space:pre-wrap"> </span>mov<span style="white-space:pre-wrap"> </span>ecx, dword ptr [r15 + 4*r11 - 12000]</div><div><span style="white-space:pre-wrap"> </span>imul<span style="white-space:pre-wrap"> </span>ecx, dword ptr [rax - 12]</div><div><span style="white-space:pre-wrap"> </span>mov<span style="white-space:pre-wrap"> </span>ebp, dword ptr [r15 + 4*r11 - 8000]</div><div><span style="white-space:pre-wrap"> </span>imul<span style="white-space:pre-wrap"> </span>ebp, dword ptr [rax - 8]</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>r12d, ebx</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>ecx, r12d</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>ebp, ecx</div><div><span style="white-space:pre-wrap"> </span>mov<span style="white-space:pre-wrap"> </span>ecx, dword ptr [r15 + 4*r11 - 4000]</div><div><span style="white-space:pre-wrap"> </span>imul<span style="white-space:pre-wrap"> </span>ecx, dword ptr [rax - 4]</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>ecx, ebp</div><div><span style="white-space:pre-wrap"> </span>mov<span style="white-space:pre-wrap"> </span>ebx, dword ptr [r15 + 4*r11]</div><div><span style="white-space:pre-wrap"> </span>imul<span style="white-space:pre-wrap"> </span>ebx, dword ptr [rax]</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> 
</span>ebx, ecx</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>r14, 20000</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>rax, 20</div><div><span style="white-space:pre-wrap"> </span>cmp<span style="white-space:pre-wrap"> </span>r14, 160000</div><div><span style="white-space:pre-wrap"> </span>jne<span style="white-space:pre-wrap"> </span>.LBB0_5</div><div># BB#6: # %.loopexit</div><div> # in Loop: Header=BB0_2 Depth=2</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>r10, rdx #rdx is c[][]</div><div><span style="white-space:pre-wrap"> </span>mov<span style="white-space:pre-wrap"> </span>dword ptr [r10 + 4*r11], ebx</div><div><span style="white-space:pre-wrap"> </span>inc<span style="white-space:pre-wrap"> </span>r11</div><div><span style="white-space:pre-wrap"> </span>cmp<span style="white-space:pre-wrap"> </span>r11, 1000</div><div><span style="white-space:pre-wrap"> </span>jne<span style="white-space:pre-wrap"> </span>.LBB0_2</div><div># BB#7: # in Loop: Header=BB0_1 Depth=1</div><div><span style="white-space:pre-wrap"> </span>inc<span style="white-space:pre-wrap"> </span>r9</div><div><span style="white-space:pre-wrap"> </span>add<span style="white-space:pre-wrap"> </span>r8, 4000</div><div><span style="white-space:pre-wrap"> </span>cmp<span style="white-space:pre-wrap"> </span>r9, 1000</div><div><span style="white-space:pre-wrap"> </span>jne<span style="white-space:pre-wrap"> </span>.LBB0_1</div><div># BB#8:</div><div><span style="white-space:pre-wrap"> </span>pop<span style="white-space:pre-wrap"> </span>rbx</div><div><span style="white-space:pre-wrap"> </span>pop<span style="white-space:pre-wrap"> </span>r12</div><div><span style="white-space:pre-wrap"> </span>pop<span style="white-space:pre-wrap"> </span>r14</div><div><span style="white-space:pre-wrap"> </span>pop<span style="white-space:pre-wrap"> 
</span>r15</div><div><span style="white-space:pre-wrap"> </span>pop<span style="white-space:pre-wrap"> </span>rbp</div><div><span style="white-space:pre-wrap"> </span>ret</div><div><br></div><div><br></div><div>Looking forward to your reply</div><div><br></div><div>Thank You</div><div><br></div></div><div><br></div></div>
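<div>For reference, here is a rough sketch of what the scalar inner loop at .LBB0_5 in the listing above appears to compute. It is only a model under stated assumptions, not the compiler's actual logic: the names dot_unrolled, a_row, b and j are hypothetical, the operand read through rax (step of 4 bytes) is assumed to be a row of one input matrix, and the operand read through r15 + 4*r11 (step of 4000 bytes, one row of 1000 i32 values) is assumed to be a column of the other. The loop is unrolled by five, which matches the five mov/imul pairs and the "add rax, 20" / "add r14, 20000" increments. The sketch also assumes the trip count is a multiple of five and ignores 32-bit wraparound.</div>
<pre>
```python
# Hypothetical model of the scalar inner loop at .LBB0_5 (all names are illustrative).
N = 1000     # row length in elements: 4000 bytes / 4 bytes per i32
UNROLL = 5   # five mov/imul pairs per loop iteration in the listing


def dot_unrolled(a_row, b, j, n=N):
    """Return the sum of a_row[k] * b[k][j] for k in range(n), five terms per step.

    a_row models the contiguous operand addressed through rax.
    b[k][j] models the strided operand addressed through r15 + 4*r11.
    Assumes n is a multiple of UNROLL and that no 32-bit overflow occurs.
    """
    acc = 0  # plays the role of ebx, the running sum carried across iterations
    for k in range(0, n, UNROLL):
        t0 = a_row[k] * b[k][j] + acc          # imul r12d, [rax - 16]; add r12d, ebx
        t1 = a_row[k + 1] * b[k + 1][j] + t0   # imul ecx, [rax - 12];  add ecx, r12d
        t2 = a_row[k + 2] * b[k + 2][j] + t1   # imul ebp, [rax - 8];   add ebp, ecx
        t3 = a_row[k + 3] * b[k + 3][j] + t2   # imul ecx, [rax - 4];   add ecx, ebp
        acc = a_row[k + 4] * b[k + 4][j] + t3  # imul ebx, [rax];       add ebx, ecx
    return acc  # stored afterwards by: mov dword ptr [r10 + 4*r11], ebx
```
</pre>
<div>In this model, one call to dot_unrolled corresponds to one pass through the inner loop, and the value it returns is what the listing stores to c[][] before incrementing r11.</div>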
<br></div></div>______________________________<wbr>_________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>
<br></blockquote></div><br></div></div>
</blockquote></div><br></div>
</div></div></div><br></div>
</div></div></blockquote></div><br></div></div></div>