<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"MS Mincho";
panose-1:2 2 6 9 4 2 5 8 3 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@MS Mincho";
panose-1:2 2 6 9 4 2 5 8 3 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:"Courier New";}
span.yshortcuts
{mso-style-name:yshortcuts;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<pre style="background:white"><span style="color:black">Hello,<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black"><o:p> </o:p></span></pre>
<pre style="margin-bottom:12.0pt;background:white"><span style="color:black">LLVM generates two additional instructions for 128-bit->256-bit typecast intrinsics <br>(e.g. _mm256_castsi128_si256()) to clear the upper 128 bits of the YMM register corresponding to the source XMM register:<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black"> vxorps xmm2,xmm2,xmm2<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black"> vinsertf128 ymm0,ymm2,xmm0,0x0<br><br>Most of the industry-standard C/C++ compilers (GCC, <span class="yshortcuts">Intel</span>’s compiler, Visual Studio compiler) don’t<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black">generate any extra moves for 128-bit->256-bit typecast intrinsics.<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black">None of these compilers zero-extend the upper 128 bits of the 256-bit YMM register. Intel’s<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black">documentation for the _mm256_castsi128_si256 intrinsic explicitly states that “the upper bits of the<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black">resulting vector are undefined” and that “this intrinsic does not introduce extra moves to the<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black">generated code”. <o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black"><o:p> </o:p></span></pre>
<pre style="background:white"><span style="color:black"><a href="http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx_castsi128_si256.htm" target="_blank"><span class="yshortcuts"><span style="color:blue">http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx_castsi128_si256.htm</span></span></a><o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black"><o:p> </o:p></span></pre>
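<pre style="background:white"><span style="color:black">For illustration, here is a minimal sketch of code that exercises this cast. The helper name low_half_survives_cast is hypothetical, and the target("avx") attribute assumes GCC or Clang (it stands in for compiling with -mavx). Only the low 128 bits are read back, since the upper bits are documented as undefined:<o:p></o:p></span></pre>
<pre style="background:white"><span style="color:black">
```c
#include "immintrin.h"  /* AVX intrinsics */

/* Hypothetical helper, not part of any library: widen an XMM value to YMM
   with the cast intrinsic, then read back only the low 128 bits.
   The upper 128 bits are undefined per Intel's documentation, so they are
   deliberately never inspected. Returns 1 if the low half is unchanged. */
__attribute__((target("avx")))  /* assumption: GCC or Clang */
int low_half_survives_cast(long long lo, long long hi)
{
    __m128i x = _mm_set_epi64x(hi, lo);        /* x = [hi : lo] */
    __m256i y = _mm256_castsi128_si256(x);     /* the cast in question */
    __m128i back = _mm256_castsi256_si128(y);  /* keep the low 128 bits */

    long long out[2];
    _mm_storeu_si128((__m128i *)out, back);    /* unaligned 128-bit store */

    if (out[0] != lo) return 0;
    if (out[1] != hi) return 0;
    return 1;
}
```
<o:p></o:p></span></pre>
<pre style="margin-bottom:12.0pt;background:white"><span style="color:black">Compiling a function like this and inspecting the generated assembly should show whether a given compiler emits the extra vxorps/vinsertf128 pair for the cast.<o:p></o:p></span></pre>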
<pre style="margin-bottom:12.0pt;background:white"><span style="color:black">Clang implements these typecast intrinsics differently. Is this intentional? I suspect this was done to avoid a hardware penalty caused by partial register writes. But isn’t the overall cost of two additional instructions (vxorps + vinsertf128) for *<b>every</b>* 128-bit->256-bit typecast intrinsic higher than the hardware penalty caused by partial register writes in the *<b>rare</b>* cases when the upper part of the YMM register corresponding to a source XMM register has not already been cleared?<o:p></o:p></span></pre>
<pre style="margin-bottom:12.0pt;background:white"><span style="color:black">Thanks!<o:p></o:p></span></pre>
<pre style="margin-bottom:12.0pt;background:white"><span style="color:black">Katya.<o:p></o:p></span></pre>
</div>
</body>
</html>