<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - Conversion from int to XMM is handled inefficiently on SSE4"
   href="https://bugs.llvm.org/show_bug.cgi?id=41512">41512</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Conversion from int to XMM is handled inefficiently on SSE4
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Windows NT
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: X86
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>spreis@yandex-team.ru
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=21786" name="attach_21786" title="Proposed fix">attachment 21786</a> <a href="attachment.cgi?id=21786&action=edit" title="Proposed fix">[details]</a></span>
Proposed fix

In an attempt to switch all our builds from SSSE3 to SSE4, we found that code
as simple as

    const __m128i lo = _mm_cvtsi32_si128(d0[value]);
    const __m128i hi = _mm_cvtsi32_si128(d0[value+1024]);
    val = _mm_add_epi64(val, _mm_unpacklo_epi64(lo, hi));

or

    const __m128i all = _mm_set_epi32(0, d0[value], 0, d0[value+1024]);
    val = _mm_add_epi64(val, all);


performs worse when inlined into a loop and compiled with -sse4.1 than with just
-ssse3.
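
For reference, a minimal self-contained sketch of the first variant inlined into
a loop could look like the following. The size of d0, the helper names
accumulate() and sum_pairs(), and the driver loop are assumptions made for
illustration; only the three intrinsic calls come from the snippet above.

```c
#include <emmintrin.h> /* SSE2: _mm_cvtsi32_si128, _mm_unpacklo_epi64, ... */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup table; the report only shows d0[value] and
   d0[value+1024], so the size here is an assumption. */
static int d0[2048];

/* Kernel from the report: move two table entries into the low 32 bits of
   two XMM registers, combine them into two 64-bit lanes, and accumulate. */
static inline __m128i accumulate(__m128i val, int value) {
    const __m128i lo = _mm_cvtsi32_si128(d0[value]);
    const __m128i hi = _mm_cvtsi32_si128(d0[value + 1024]);
    return _mm_add_epi64(val, _mm_unpacklo_epi64(lo, hi));
}

/* Hypothetical driver loop: the kernel is inlined here. */
static void sum_pairs(const int *values, size_t n, uint64_t out[2]) {
    __m128i val = _mm_setzero_si128();
    for (size_t i = 0; i < n; ++i)
        val = accumulate(val, values[i]);
    _mm_storeu_si128((__m128i *)out, val); /* out[0] = low lane, out[1] = high */
}
```

Building this once with -O2 -mssse3 and once with -O2 -msse4.1 and comparing the
loop body should show the movd versus xor+pinsrd difference, although the exact
output depends on the compiler version.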

The problem is that _mm_cvtsi32_si128() and _mm_set_epi32() are both modeled
via INSERT_VECTOR_ELT, and

  %13 = insertelement <4 x i32> <i32 undef, i32 0, i32 undef, i32 0>, i32 %12, i32 0, !dbg !287

is lowered to a single movd instruction prior to SSE4, but to xor+pinsrd on
SSE4.
<a href="https://gcc.godbolt.org/z/qY8nkO">https://gcc.godbolt.org/z/qY8nkO</a>

* Note that in the kernel function for the 2nd case there are a couple of
movd's, but when it is used in a loop it results in a pair of pinsrd from
memory into the same register.

This seems to me like poor instruction selection from both performance and code
size standpoints.

I suggest steering instruction selection for this idiomatic case of
INSERT_VECTOR_ELT to SCALAR_TO_VECTOR. This will directly lead to movd
emission.

Proposed change to lib/Target/X86/X86ISelLowering.cpp is attached.</pre>
        </div>
      </p>


    </body>
</html>