<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Dec 7, 2014 at 6:34 PM, Simon Pilgrim <span dir="ltr"><<a href="mailto:llvm-dev@redking.me.uk" target="_blank">llvm-dev@redking.me.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><div>On 7 Dec 2014, at 16:50, Chandler Carruth <<a href="mailto:chandlerc@gmail.com" target="_blank">chandlerc@gmail.com</a>> wrote:</div><blockquote type="cite"><div dir="ltr"><div class="gmail_extra">On Sun, Dec 7, 2014 at 5:35 PM, Simon Pilgrim <span dir="ltr"><<a href="mailto:llvm-dev@redking.me.uk" target="_blank">llvm-dev@redking.me.uk</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="overflow:hidden">Added X86vzmovl folded loads tests.<br>

<br>

I looked at using a pand with a constant mask as an alternative and saw a minimal regression (nearly in the noise) compared to the movq/movss versions I was already testing against. I'm worried about pursuing that route though - it adds additional memory access</div></blockquote><div><br></div><div>I would be *really* surprised if pand is actually slower, especially if we get a chance to hoist the memory access into a variable within a loop.</div><div><br></div><div>pand should be 2 uops, and according to Agner's tables has a throughput of 2/cycle. The pshufd alone is 1 uop and 2/cycle. The movq is also 1 uop with 2/cycle. But because they're in a chain, the critical path is 2 cycles instead of 1 cycle here (assuming the load doesn't stall, but I think that's usually a safe assumption in real world code).</div><div><br></div><div>On x86, loads (especially of constants) are *crazy* fast in my experience.</div></div></div></div></blockquote><div><br></div></span><div>Yes, if I hoist the mask load it's definitely faster - but if I leave it folded in the pand I don’t see any difference.</div></blockquote><div><br></div><div>Sure. But by lowering with pand, it should allow coalescing the constant load and hoisting it out of a loop, no? And if it doesn't happen to get hoisted, as you say, it's no different. That's why I would prefer the pand lowering, I think.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br><blockquote type="cite"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="overflow:hidden"> and the mask approach might make it more difficult for future optimizations of the multiple pshufd ops that are still in the vector-shuffle-128-v4.ll tests.</div></blockquote><div><br></div></span><div>? 
If we want to fold things, it should happen before we're doing ISel pattern expansion… </div></div></div></div>

</blockquote><div><br></div>Yes, you’re right - it’s just that the last shuffle/byteshift in lowerVectorShuffleAsElementInsertion isn’t doing us any favours, as we’re making no attempt to fold it with the VZEXT_MOVL that we’ve just generated. I could just modify lowerVectorShuffleAsElementInsertion to try to have it create something more suitable - overriding most of the patterns (doing much of the VZEXT_MOVL work in code). Any better ideas? I wasn’t intending to spend so long on this pre-SSE4.1 code…</blockquote></div><div class="gmail_extra"><br></div>Sure, I'm not asking you to fix this in this patch. I'm just saying I don't think the pand lowering really makes this better or worse -- if we want to improve it, it'll have to happen in the lowering code.<br><br></div></div>
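For readers following the pand discussion above, here is a minimal scalar sketch of why a constant mask can stand in for the "keep the low element, zero the rest" operation, and what hoisting the mask load out of a loop looks like. It is only a portable model of the lane semantics, not LLVM code or real SSE intrinsics; the names V4, zextMovLow, andWithMask, kLowLaneMask and zeroUpperLanes are all hypothetical and chosen for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 128-bit vector of four 32-bit lanes.
using V4 = std::array<std::uint32_t, 4>;

// Models the "move low element, zero the upper lanes" result that the
// movss/movq-style lowering is assumed to produce for a v4i32.
V4 zextMovLow(const V4 &v) { return V4{v[0], 0u, 0u, 0u}; }

// Models a lane-wise bitwise AND, i.e. what pand computes.
V4 andWithMask(const V4 &v, const V4 &mask) {
  V4 r{};
  for (std::size_t i = 0; i < 4; ++i)
    r[i] = v[i] & mask[i];
  return r;
}

// Constant mask: all ones in lane 0, zero elsewhere.
const V4 kLowLaneMask = {0xFFFFFFFFu, 0u, 0u, 0u};

// The mask is read once before the loop (the "hoisted" constant load),
// so each iteration only performs the AND.
void zeroUpperLanes(V4 *data, std::size_t n) {
  const V4 mask = kLowLaneMask;
  for (std::size_t i = 0; i < n; ++i)
    data[i] = andWithMask(data[i], mask);
}
```

In this model andWithMask(v, kLowLaneMask) and zextMovLow(v) give the same lanes for any v. Whether the real pand form is faster depends on the uop counts and load behaviour discussed in the thread, which this sketch does not measure.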