<html>

    <head>

      <base href="http://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - inefficient code generation for 128-bit->256-bit typecast intrinsics"

   href="http://llvm.org/bugs/show_bug.cgi?id=15712">15712</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>inefficient code generation for 128-bit->256-bit typecast intrinsics

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>katya_romanova@playstation.sony.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvmbugs@cs.uiuc.edu

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>LLVM generates two additional instructions for 128->256 bit typecasts 

(e.g. _mm256_castsi128_si256, _mm256_castps128_ps256, _mm256_castpd128_pd256)

to clear out the upper 128 bits of the YMM register corresponding to the source

XMM register:

    vxorps xmm2,xmm2,xmm2

    vinsertf128 ymm0,ymm2,xmm0,0x0
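
A small reproducer along these lines should show the sequence. This is only a
sketch, not the original test case: the function names widen_ps and
low_lanes_preserved are made up for illustration, and target("avx") is assumed
here just so the file builds without -mavx on GCC and Clang.

```c
/* Hypothetical reproducer; names are illustrative, not from the report. */
/* Quoted include form is used so the snippet stays valid inside this page. */
#include "immintrin.h"

/* The function whose generated code is of interest. Per Intel's
   documentation, the upper 128 bits of the result are undefined. */
__attribute__((target("avx")))
__m256 widen_ps(__m128 lo)
{
    return _mm256_castps128_ps256(lo);
}

/* Sanity check: widening and then narrowing again must preserve the
   lower four floats, whatever the compiler does with the upper half.
   Returns 1 on success, 0 on mismatch. Requires an AVX-capable CPU. */
__attribute__((target("avx")))
int low_lanes_preserved(void)
{
    float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4];
    int i;

    __m256 wide = widen_ps(_mm_loadu_ps(in));
    _mm_storeu_ps(out, _mm256_castps256_ps128(wide));

    for (i = 0; i != 4; ++i)
        if (out[i] != in[i])
            return 0;
    return 1;
}
```

Compiling this with something like clang -O2 -S and reading the body of
widen_ps should show whether the extra vxorps/vinsertf128 pair is emitted.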

Most of the industry-standard C/C++ compilers (GCC, Intel’s compiler, Visual

Studio compiler) don’t generate any extra moves for 128-bit->256-bit typecast

intrinsics. None of these compilers zero-extend the upper 128 bits of the

256-bit YMM register. Intel’s documentation for the _mm256_castsi128_si256

intrinsic explicitly states that “the upper bits of the resulting vector are

undefined” and that “this intrinsic does not introduce extra moves to the

generated code”. 

<a href="http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx_castsi128_si256.htm">http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx_castsi128_si256.htm</a>

Clang implements these typecast intrinsics differently. I suspect that this was

done on purpose to avoid a hardware penalty caused by partial register writes. 

I think that the overall cost of 2 additional instructions (vxorps + vinsertf128)

for *every* 128-bit->256-bit typecast intrinsic is much higher than the hardware

penalty caused by partial register writes in the *rare* cases when the upper part

of the YMM register corresponding to a source XMM register is not already cleared.

Katya.</pre>

        </div>

      </p>


    </body>

</html>