<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>Hi Dmitry, </div><div><br></div><div>Yes, this is a known problem with legalizing vector masks. On SSE, the type <8 x i1> is legalized to <8 x i16>, but your operands are legalized to <4 x i32>. Type legalization is performed per node, and we don’t have a good way to support instructions that mix the mask type and the operand type. Why does ISPC generate illegal vector types? Does ISPC rely on the LLVM codegen to split the vectors to increase ILP? In that case ISPC should generate two vector operations itself.</div><div> </div><div>Thanks,</div><div>Nadav</div><div><br></div><br><div><div>On Oct 25, 2013, at 2:16 PM, Dmitry Babokin <<a href="mailto:babokin@gmail.com">babokin@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="ltr">Nadav,<div><br></div><div>The problem appears only for vectors longer than the available hardware register (in doubleword elements, i.e. more than 4 on SSE4 and more than 8 on AVX). Select does a weird thing: the <8 x i1> mask arrives in two XMM registers, select converts them to a single XMM register (i.e. 8 x 16 bit), and immediately afterwards converts it back to two XMM registers and does the blend. Converting back and forth has a huge overhead.</div>
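<div><br></div><div>For illustration, here is a minimal sketch of the kind of IR that would take this path on SSE4. It assumes 32-bit integer elements, and the function and value names are hypothetical; it is not copied from the attached files.</div>
<pre>
; Hypothetical example: an 8-wide compare feeding an 8-wide select.
; As described in this thread, on SSE4 the &lt;8 x i32&gt; operands are split
; into &lt;4 x i32&gt; halves (two XMM registers each), while the &lt;8 x i1&gt;
; mask is legalized to &lt;8 x i16&gt; (a single XMM register).
define &lt;8 x i32&gt; @select8(&lt;8 x i32&gt; %a, &lt;8 x i32&gt; %b,
                          &lt;8 x i32&gt; %x, &lt;8 x i32&gt; %y) {
  %mask = icmp slt &lt;8 x i32&gt; %a, %b
  %r = select &lt;8 x i1&gt; %mask, &lt;8 x i32&gt; %x, &lt;8 x i32&gt; %y
  ret &lt;8 x i32&gt; %r
}
</pre>
<div>With 4-wide vectors in place of the 8-wide ones, the mask and the operands each fit one XMM register, which would match the case that works well.</div>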

<div><br></div><div>I'm attaching 3 files with vectors of length 4, 8 and 16. Try 4 on SSE4 and you'll see that both cases work well; 8 demonstrates the difference on SSE4. The same holds on AVX (8 vs 16).</div><div><br>

</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Oct 23, 2013 at 1:41 AM, Nadav Rotem <span dir="ltr"><<a href="mailto:nrotem@apple.com" target="_blank">nrotem@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div class="im"><br><div><div>On Oct 21, 2013, at 12:09 PM, Dmitry Babokin <<a href="mailto:babokin@gmail.com" target="_blank">babokin@gmail.com</a>> wrote:</div>

<br><blockquote type="cite"><div style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">

By the way, I'm curious, is there any reason why you focus on SSE4 and not AVX? It seems that the vectorizer should care most about the latest silicon.</div><br></blockquote></div><br></div><div>I am interested in looking at the SSE4 code because lowering of AVX code is more complicated, especially for masks. The problem is that <8 x i1> can be legalized to <8 x i32> for YMM, or to <8 x i16> for XMM. ISPC worked around this limitation by explicitly extending the mask. The SEXT canonicalization reverted the code pattern that ISPC generated. </div>

<div><br></div><div>Thanks,</div><div>Nadav   </div></div></blockquote></div><br></div>

<span>&lt;v4.ll&gt;</span><span>&lt;v8.ll&gt;</span><span>&lt;v16.ll&gt;</span></blockquote></div><br></body></html>