<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - [X86][AVX512] 128-bit shuffles of 256-bit vectors prefer AVX2 instructions instead of AVX512 preventing combining masks into the instruction"
   href="https://bugs.llvm.org/show_bug.cgi?id=34359">34359</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>[X86][AVX512] 128-bit shuffles of 256-bit vectors prefer AVX2 instructions instead of AVX512 preventing combining masks into the instruction
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: X86
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>ayman.musa@intel.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>128-bit-lane shuffles of 256-bit vectors are currently lowered to the AVX/AVX2
instructions vperm2f128 and vperm2i128, which cannot take a merge mask.

Selecting AVX512 instructions such as vshufi64x2, vshufi32x4 or vshuff64x2
instead would let the mask be folded into the shuffle itself, replacing the
separate blend instruction that the current sequence needs.

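Reproducer (for the &lt;4 x double&gt; type):

define &lt;4 x double&gt; @test_4xdouble_masked_shuff_mask0(&lt;4 x double&gt; %vec1, &lt;4 x double&gt; %vec2, &lt;4 x double&gt; %vec3) {
  ; Take the high 128-bit lane of %vec1 and the low 128-bit lane of %vec2.
  %shuf = shufflevector &lt;4 x double&gt; %vec1, &lt;4 x double&gt; %vec2, &lt;4 x i32&gt; &lt;i32 2, i32 3, i32 4, i32 5&gt;
  ; Keep element 0 from %vec3; take elements 1-3 from the shuffle result.
  %res = select &lt;4 x i1&gt; &lt;i1 0, i1 1, i1 1, i1 1&gt;, &lt;4 x double&gt; %shuf, &lt;4 x double&gt; %vec3
  ret &lt;4 x double&gt; %res
}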

<span class="quote">&gt;&gt; llc -mcpu=skx &lt;file-name&gt; -o out.s</span>

LLVM currently emits (the IACA tool reports a throughput of 3.81):
  vperm2f128  $33, %ymm1, %ymm0, %ymm0
  movb $14, %al
  kmovd %eax, %k1
  vblendmpd %ymm0, %ymm2, %ymm0 {%k1}
  retq 

This could be replaced with (the IACA tool reports a throughput of 2.86):
  movb $126, %al
  kmovd %eax, %k1
  vshuff64x2 $1, %ymm1, %ymm0, %ymm2 {%k1}
  vmovapd %ymm2, %ymm0

** Throughput figures are from the IACA tool; lower is better.</pre>
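        <pre>For reference, the constants in the two sequences can be re-derived from the
reproducer. The following is a minimal Python sketch, not LLVM code: the helper
names are hypothetical, and the immediate encodings are assumptions based on
the documented behaviour of vperm2f128 and the 256-bit form of vshuff64x2, so
they should be checked against the ISA reference.

```python
# Hypothetical helpers for illustration only; none of these names exist in LLVM.

def lane_selectors(shuffle_indices, lane_elts=2):
    """Return one selector per 128-bit result lane.

    Selectors 0/1 mean the low/high lane of the first source,
    2/3 the low/high lane of the second source.
    """
    selectors = []
    for start in range(0, len(shuffle_indices), lane_elts):
        chunk = list(shuffle_indices[start:start + lane_elts])
        first = chunk[0]
        whole_lane = list(range(first, first + lane_elts))
        if first % lane_elts != 0 or chunk != whole_lane:
            raise ValueError("shuffle does not move whole 128-bit lanes")
        selectors.append(first // lane_elts)
    return selectors


def vperm2f128_imm(selectors):
    """Assumed encoding: imm[1:0] picks the low lane, imm[5:4] the high lane."""
    low, high = selectors
    return low + high * 16


def vshuff64x2_ymm_imm(selectors):
    """Assumed encoding: bit 0 picks a lane of source 1, bit 1 a lane of source 2."""
    low, high = selectors
    if low not in (0, 1) or high not in (2, 3):
        raise ValueError("low lane must come from source 1, high lane from source 2")
    return low + (high - 2) * 2


def k_mask(select_bits):
    """Element i of the select condition becomes bit i of the mask register."""
    return sum(bit * 2 ** i for i, bit in enumerate(select_bits))


selectors = lane_selectors([2, 3, 4, 5])   # [1, 2]
print(vperm2f128_imm(selectors))           # 33
print(vshuff64x2_ymm_imm(selectors))       # 1
print(k_mask([0, 1, 1, 1]))                # 14
print(126 % 16)                            # 14
```

The last line shows that 14 and 126 agree in their low four bits. Assuming a
four-element vector only consumes those four mask bits, the $126 in the
proposed sequence selects the same elements as the $14 in the current one.</pre>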
        </div>
      </p>


    </body>
</html>