<div dir="ltr"><div>Hi Nadav,</div><div><br></div><div>I don't know if I'm understanding what you're asking, so I'm just going</div><div>to dump a bunch of information to make sure we're on the same page. It</div>

<div>might get a bit long.</div><div><br></div><div>The _mm{256,}_blend* intrinsics in clang are emitting llvm x86 intrinsics</div><div>directly. That might make us not optimize a bunch of cases, especially</div><div>when functions get inlined.</div>

<div><br></div><div>I looked at lowervector_shuffle on the x86 backend and we, in fact, lower</div><div>the appropriate vectorshuffles to blend instructions: LowerVECTOR_SHUFFLEtoBlend</div><div>@ X86ISelLowering.cpp:6307</div>

<div><br></div><div>Since we know we lower the shufflevectors to an appropriate blend</div><div>instruction (we explicitly check for this exact pattern), I figured the patch</div><div>is safe to be applied to clang. But to be absolutely sure we don't regress</div>

<div>in the future (in either llvm or clang), I decided to also write tests to verify that</div><div>we actually emit the blend instructions. These tests should have already</div><div>been in place, since it's a special case of a shuffle vector and we want to</div>

<div>be sure we emit blends when appropriate, and not a bunch of mov + pshuf,</div><div>but they aren't there yet.</div><div><br></div><div>It gets even worse when you realize that the lowering of a select-based</div>

<div>blend operation is worse than the equivalent vectorshuffle. Take the</div><div>following code:</div><div><br></div><div>define <4 x float> @aaa(<4 x float> %a, <4 x float> %b) {</div><div>  %1 = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 5, i32 6, i32 3></div>

<div>  ret <4 x float> %1</div><div>}</div><div><br></div><div>define <4 x float> @bbb(<4 x float> %a, <4 x float> %b) {</div><div>  %1 = select <4 x i1> <i1 false, i1 true, i1 true, i1 false>, <4 x float> %a, <4 x float> %b</div>

<div>  ret <4 x float> %1</div><div>}</div><div><br></div><div>;; Compile with: llc -O3 -mattr=avx</div><div><br></div><div>The @aaa function gets compiled to</div><div>_aaa:                                   ## @aaa</div>

<div><span class="" style="white-space:pre">    </span>vblendps<span class="" style="white-space:pre">  </span>$6, %xmm1, %xmm0, %xmm0</div><div><span class="" style="white-space:pre">    </span>retq</div><div><br></div><div>While the @bbb function generates the following:</div>

<div><br></div><div>LCPI1_0:</div><div><span class="" style="white-space:pre">    </span>.long<span class="" style="white-space:pre">     </span>0                       ## 0x0</div><div><span class="" style="white-space:pre">  </span>.long<span class="" style="white-space:pre">     </span>4294967295              ## 0xffffffff</div>

<div><span class="" style="white-space:pre">    </span>.long<span class="" style="white-space:pre">     </span>4294967295              ## 0xffffffff</div><div><span class="" style="white-space:pre">       </span>.long<span class="" style="white-space:pre">     </span>0                       ## 0x0</div>

<div>...</div><div>_bbb:                                   ## @bbb</div><div><span class="" style="white-space:pre">   </span>vmovaps<span class="" style="white-space:pre">   </span>LCPI1_0(%rip), %xmm2</div><div><span class="" style="white-space:pre">       </span>vblendvps<span class="" style="white-space:pre"> </span>%xmm2, %xmm0, %xmm1, %xmm0</div>

<div><span class="" style="white-space:pre">    </span>retq</div><div><br></div><div>This happens because the vselect DAG node is set to Expand, which will end</div><div>up making it generate the 128bit constant, and ends up using the</div>

<div>VBLENDVPSrr instruction. The shufflevector code, by contrast, goes through</div><div>LowerVECTOR_SHUFFLEtoBlend and generates the mask for the immediate,</div><div>picking the VBLENDPSrri version and not touching any memory.</div>
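<div><br></div><div>To make the immediate concrete, here is a minimal sketch in C of how a blend immediate can be derived from a shuffle mask. It is not the code in X86ISelLowering.cpp: blend_immediate is a hypothetical helper, and it assumes there are no undef lanes. Bit i is set when lane i comes from the second operand, and any lane that changes position means the shuffle isn't a blend.</div>

```c
#include <assert.h>

/* Hypothetical helper, not LLVM's implementation: given a shufflevector
 * mask over two n-element vectors, return the blend immediate, or -1 if
 * the shuffle cannot be expressed as a blend. Assumes no undef lanes. */
int blend_immediate(const int *mask, int n) {
    assert(n > 0 && n <= 8);      /* the immediate only has 8 bits */
    int imm = 0;
    for (int i = 0; i < n; i++) {
        if (mask[i] == i)         /* lane i stays from the first operand */
            continue;
        if (mask[i] == i + n)     /* lane i comes from the second operand */
            imm |= 1 << i;
        else
            return -1;            /* lane changes position: not a blend */
    }
    return imm;
}
```

<div>With the mask from @aaa, <0, 5, 6, 3>, lanes 1 and 2 come from %b, so the result is 6, which matches the $6 in the vblendps above.</div>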

<div><br></div><div>The <8 x float> case is similar. The non-avx, non-sse4 version is also much</div><div>worse in the select case.</div><div><br></div><div>As for adding the builtin to clang, I have no idea how receptive</div>

<div>they will be to it; I think we should discuss that possibility on the</div><div>clang part of the patch.</div><div><br></div><div>But adding the __builtin_select to clang seems to me like it's the wrong</div><div>

way to go. As far as optimizations go, it seems like it would be much</div><div>easier to turn a select with a ConstantInt vector as a mask into a</div><div>shufflevector than the other way around.</div><div><br></div>
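<div>As a rough sketch of why that direction is the easy one (illustrative C only, not an existing LLVM transform; select_to_shuffle_mask is a hypothetical name): a true lane takes element i of the first operand, and a false lane takes element i of the second operand, which the shuffle mask numbers as i + n.</div>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: turn the constant condition of a vector select
 * into the equivalent shufflevector mask over the same two operands.
 * out must have room for n entries. */
void select_to_shuffle_mask(const bool *cond, int n, int *out) {
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        /* true: lane i of the first operand; false: lane i of the second,
         * which shufflevector numbers as i + n */
        out[i] = cond[i] ? i : i + n;
    }
}
```

<div>For the condition used in @bbb, <false, true, true, false>, this gives the mask <4, 1, 2, 7>. Going the other way first requires checking that every lane stays in place.</div><div><br></div>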

<div>If you'd still prefer to make clang emit select instructions and make</div><div>__builtin_select (or similar) available to programs, please reply to the</div><div>clang part of the patch too: <a href="http://reviews.llvm.org/D3601">http://reviews.llvm.org/D3601</a></div>

<div><br></div><div>Sorry about the long text,</div><div><br></div><div>  Filipe</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, May 2, 2014 at 9:52 PM, Nadav Rotem <span dir="ltr"><<a href="mailto:nrotem@apple.com" target="_blank">nrotem@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div class=""><br><div><div>On May 2, 2014, at 6:35 PM, Filipe Cabecinhas <<a href="mailto:filcab+llvm.phabricator@gmail.com" target="_blank">filcab+llvm.phabricator@gmail.com</a>> wrote:</div>

<br><blockquote type="cite"><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important">I can't find any __builtin_select that I can use in clang's intrinsics headers.</span><br>

</blockquote></div><br></div><div>Can you add a new builtin?</div></div>

</blockquote></div><br></div>