<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [X86][AVX512] 128-bit shuffles of 256-bit vectors prefer AVX2 instructions instead of AVX512 preventing combining masks into the instruction"

   href="https://bugs.llvm.org/show_bug.cgi?id=34359">34359</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[X86][AVX512] 128-bit shuffles of 256-bit vectors prefer AVX2 instructions instead of AVX512 preventing combining masks into the instruction

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>ayman.musa@intel.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

<pre>For 128-bit shuffles of 256-bit vectors, LLVM currently selects the AVX/AVX2
instructions vperm2f128 and vperm2i128, which cannot take a merge-mask.

Selecting AVX512 instructions such as vshufi64x2, vshufi32x4, or vshuff64x2
instead would allow the mask to be folded into the shuffle, removing the extra
blend instruction that the AVX2 sequence needs.

Reproducer (for <4 x double> type):

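define &lt;4 x double&gt; @test_4xdouble_masked_shuff_mask0(&lt;4 x double&gt; %vec1, &lt;4 x double&gt; %vec2, &lt;4 x double&gt; %vec3) {
  ; Take the high half of %vec1 and the low half of %vec2.
  %shuf = shufflevector &lt;4 x double&gt; %vec1, &lt;4 x double&gt; %vec2, &lt;4 x i32&gt; &lt;i32 2, i32 3, i32 4, i32 5&gt;
  ; Keep element 0 of %vec3; take elements 1-3 from the shuffle.
  %res = select &lt;4 x i1&gt; &lt;i1 0, i1 1, i1 1, i1 1&gt;, &lt;4 x double&gt; %shuf, &lt;4 x double&gt; %vec3
  ret &lt;4 x double&gt; %res
}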

<span class="quote">&gt;&gt; llc -mcpu=skx &lt;file-name&gt; -o out.s</span>

LLVM emits (IACA reports a throughput of 3.81):

  vperm2f128 $33, %ymm1, %ymm0, %ymm0
  movb $14, %al
  kmovd %eax, %k1
  vblendmpd %ymm0, %ymm2, %ymm0 {%k1}
  retq

This could be replaced with (IACA reports a throughput of 2.86):

  movb $126, %al
  kmovd %eax, %k1
  vshuff64x2 $1, %ymm1, %ymm0, %ymm2 {%k1}
  vmovapd %ymm2, %ymm0

** Throughput figures are from the IACA tool; lower is better.</pre>
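<pre>As a sanity check on the proposed replacement, here is a minimal Python sketch
that models both instruction sequences on 4-element lists and compares them
with the semantics of the IR reproducer. The helper names (vperm2f128,
vblendmpd, vshuff64x2_masked, avx2_sequence, avx512_sequence, reference) are
hypothetical; they are assumptions that model the documented 256-bit
instruction behavior and are not LLVM code. Only the low 4 bits of the mask
register are used for a 4 x double operation, which is why $126 and $14 should
select the same elements here.

```python
# Software model of the two sequences; a ymm register is a list of 4 doubles.
# All names are illustrative assumptions, not part of LLVM or any library.

def vperm2f128(imm, src2, src1):
    """Model of: vperm2f128 $imm, src2, src1, dst (AT&T operand order)."""
    halves = [src1[0:2], src1[2:4], src2[0:2], src2[2:4]]

    def pick(ctrl):
        # Bit 3 of a control nibble zeroes the half; bits 1:0 select it.
        return [0.0, 0.0] if ctrl & 8 else halves[ctrl & 3]

    return pick(imm & 15) + pick(imm >> 4 & 15)


def vblendmpd(k, src2, src1):
    """Model of: vblendmpd src2, src1, dst {k}. Set mask bits take src2."""
    return [src2[i] if k >> i & 1 else src1[i] for i in range(4)]


def vshuff64x2_masked(imm, src2, src1, dst, k):
    """Model of: vshuff64x2 $imm, src2, src1, dst {k} with merge-masking."""
    lo = src1[2:4] if imm & 1 else src1[0:2]  # imm bit 0 picks a half of src1
    hi = src2[2:4] if imm & 2 else src2[0:2]  # imm bit 1 picks a half of src2
    tmp = lo + hi
    # Clear mask bits keep the old destination element (merge-masking).
    return [tmp[i] if k >> i & 1 else dst[i] for i in range(4)]


def avx2_sequence(vec1, vec2, vec3):
    """The sequence LLVM emits today: shuffle, then a separate masked blend."""
    shuf = vperm2f128(33, vec2, vec1)
    return vblendmpd(14, shuf, vec3)


def avx512_sequence(vec1, vec2, vec3):
    """The proposed sequence: one masked shuffle that merges into vec3."""
    return vshuff64x2_masked(1, vec2, vec1, vec3, 126)


def reference(vec1, vec2, vec3):
    """Direct model of the shufflevector + select in the reproducer."""
    both = vec1 + vec2
    shuf = [both[i] for i in (2, 3, 4, 5)]
    cond = (False, True, True, True)
    return [s if c else v for s, c, v in zip(shuf, cond, vec3)]


if __name__ == "__main__":
    v1 = [0.0, 1.0, 2.0, 3.0]
    v2 = [4.0, 5.0, 6.0, 7.0]
    v3 = [8.0, 9.0, 10.0, 11.0]
    print(reference(v1, v2, v3))
    print(avx2_sequence(v1, v2, v3))
    print(avx512_sequence(v1, v2, v3))
```

With the inputs in the main block, all three functions should return
[8.0, 3.0, 4.0, 5.0]. This only checks the lane and mask selection logic; it
says nothing about throughput.</pre>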

        </div>

      </p>


    </body>

</html>