<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - Poor vectorization with -march=skylake compared to -march=haswell"
   href="https://bugs.llvm.org/show_bug.cgi?id=37819">37819</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Poor vectorization with -march=skylake compared to -march=haswell
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>clang
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>6.0
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>-New Bugs
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedclangbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>jed@59a2.org
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=20432" name="attach_20432" title="Source exhibiting optimizer oddity.">attachment 20432</a> <a href="attachment.cgi?id=20432&action=edit" title="Source exhibiting optimizer oddity.">[details]</a></span>
Source exhibiting optimizer oddity.

When compiled with -march=haswell, the attached code optimizes well, and the
resulting binary runs nearly optimally on both Haswell and Skylake.

$ clang -Wall -O3 -march=haswell -ffast-math -c mm-clang.c

00000000000000e0 <mult+0xe0> vmovapd ymm9,ymm6
00000000000000e4 <mult+0xe4> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8-0x800]
00000000000000ee <mult+0xee> vmovupd ymm6,YMMWORD PTR [rax-0x20]
00000000000000f3 <mult+0xf3> vmovupd ymm11,YMMWORD PTR [rax]
00000000000000f7 <mult+0xf7> vfmadd231pd ymm1,ymm6,ymm10
00000000000000fc <mult+0xfc> vfmadd231pd ymm7,ymm11,ymm10
0000000000000101 <mult+0x101> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8-0x400]
000000000000010b <mult+0x10b> vfmadd231pd ymm8,ymm6,ymm10
0000000000000110 <mult+0x110> vfmadd231pd ymm5,ymm11,ymm10
0000000000000115 <mult+0x115> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8]
000000000000011b <mult+0x11b> vfmadd231pd ymm2,ymm6,ymm10
0000000000000120 <mult+0x120> vfmadd231pd ymm3,ymm11,ymm10
0000000000000125 <mult+0x125> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8+0x400]
000000000000012f <mult+0x12f> vfmadd213pd ymm6,ymm10,ymm9
0000000000000134 <mult+0x134> vfmadd231pd ymm4,ymm11,ymm10
0000000000000139 <mult+0x139> add    rax,0x400
000000000000013f <mult+0x13f> add    rbx,0x1
0000000000000143 <mult+0x143> jne    00000000000000e0 <mult+0xe0>
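
The loop above has the shape of a register-blocked matrix-multiply kernel: per
iteration it does four broadcasts, two 4-wide loads, and eight FMAs into eight
accumulators. A hypothetical C kernel with that shape is sketched below. It is
not the attached mm-clang.c; the name block_4x8 and the leading dimension of
128 are assumptions inferred from the 0x400-byte strides in the listing.

```c
#define N 128   /* assumed leading dimension: 0x400 bytes / 8 bytes per double */

/* Hypothetical 4x8 block update (not taken from the attachment):
   C[r][c] += sum over k of A[r][k] * B[k][c], for r in 0..3 and c in 0..7. */
void block_4x8(const double *A, const double *B, double *C)
{
    double acc[4][8] = {{0}};

    for (int k = 0; k < N; k++)             /* one pass of the listed loop */
        for (int r = 0; r < 4; r++)         /* four broadcasts of A[r][k] */
            for (int c = 0; c < 8; c++)     /* two 4-wide vectors of B[k][0..7] */
                acc[r][c] += A[r * N + k] * B[k * N + c];

    /* Write the accumulated block back to C. */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 8; c++)
            C[r * N + c] += acc[r][c];
}
```

Keeping all 32 accumulator values in eight ymm registers for the whole k loop
is what makes the Haswell code fast.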

The generated code is much worse when optimizing for Skylake: the accumulators
are stored to the stack and reloaded around each FMA.

$ clang -Wall -O3 -march=skylake -ffast-math -c mm-clang.c

0000000000000caf <mult+0xcaf> vmovapd YMMWORD PTR [rsp],ymm2
0000000000000cb4 <mult+0xcb4> vmovapd ymm2,YMMWORD PTR [rsp+0x20]
0000000000000cba <mult+0xcba> vmovapd ymm3,YMMWORD PTR [rsp+0x400]
0000000000000cc3 <mult+0xcc3> vfmadd231pd ymm2,ymm3,ymm0
0000000000000cc8 <mult+0xcc8> vmovapd YMMWORD PTR [rsp+0x20],ymm2
0000000000000cce <mult+0xcce> vmovapd ymm2,YMMWORD PTR [rsp+0x40]
0000000000000cd4 <mult+0xcd4> vfmadd231pd ymm2,ymm7,ymm0
0000000000000cd9 <mult+0xcd9> vmovapd YMMWORD PTR [rsp+0x40],ymm2
0000000000000cdf <mult+0xcdf> vmovapd ymm2,YMMWORD PTR [rsp+0x60]
0000000000000ce5 <mult+0xce5> vfmadd231pd ymm2,ymm5,ymm0
0000000000000cea <mult+0xcea> vmovapd YMMWORD PTR [rsp+0x60],ymm2
0000000000000cf0 <mult+0xcf0> vmovapd ymm2,YMMWORD PTR [rsp+0x80]
0000000000000cf9 <mult+0xcf9> vfmadd231pd ymm2,ymm4,ymm0
0000000000000cfe <mult+0xcfe> vmovapd YMMWORD PTR [rsp+0x80],ymm2
0000000000000d07 <mult+0xd07> vmovapd ymm2,YMMWORD PTR [rsp+0xa0]
0000000000000d10 <mult+0xd10> vfmadd231pd ymm2,ymm15,ymm0


If we drop -ffast-math, FMA instructions are no longer used (for either
-march=haswell or -march=skylake); separate multiplies and adds are emitted
instead.

0000000000000107 <mult+0x107> vbroadcastsd ymm9,QWORD PTR [rdi+rbx*8-0x400]
0000000000000111 <mult+0x111> vmulpd ymm12,ymm9,ymm10
0000000000000116 <mult+0x116> vaddpd ymm8,ymm8,ymm12
000000000000011b <mult+0x11b> vmulpd ymm9,ymm9,ymm11
0000000000000120 <mult+0x120> vaddpd ymm6,ymm6,ymm9

I don't think -ffast-math should be needed to use FMA instructions here. It
certainly isn't needed for this code with GCC or Intel compilers.</pre>
        </div>
      </p>


    </body>
</html>