<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 4, 2014 at 8:50 PM, Bruno Cardoso Lopes <span dir="ltr"><<a href="mailto:bruno.cardoso@gmail.com" target="_blank">bruno.cardoso@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div id=":2jy" class="" style="overflow:hidden">Chandler,<br>
<br>
Thanks for the help. The assembly for v8i32, new vs. old:<br>
<a href="http://pastebin.com/4gnd41Je" target="_blank">http://pastebin.com/4gnd41Je</a><br>
<br>
About the principled split: I'd rather go the other way around, i.e., since SelectionDAGLegalize::ExpandBitCount already emits the bit-math for the scalarized versions, it makes more sense to custom-split to other known vector types only when we already know it's profitable.<br></div></blockquote><div><br></div><div>I'm fine with that, but we should know *why* those vector types are profitable.</div><div><br></div><div>From looking at the assembly, it is almost certainly the extract/insert pattern that is so terribly slow here. Those instructions cause really frustrating stalls in dense code.</div><div><br></div><div>From your timings, it looks like it at least makes sense to do this for *all* vectors with 8 or more elements (including v8i16, v16i16, etc.). The timings on 4 elements are really close, though.</div><div><br></div><div>I think you can improve the timings for the AVX1 code by interleaving the operations so that they exploit ILP.</div><div><br></div><div>However, you can probably improve them much more by using the 16-entry lookup-table-in-a-register trick with pshufb outlined here: <a href="http://wm.ite.pl/articles/sse-popcount.html">http://wm.ite.pl/articles/sse-popcount.html</a></div><div>For i64 element types, the psadbw accumulation trick works perfectly. For smaller element types, you can shift, mask, and sum to widen from a byte popcount to the desired width. For i32, that's only 2 iterations. All combined, v8i32 will actually be the worst case, but it still looks to be an instruction shorter (not sure whether it will be faster, of course). For i16 and i8 vectors it should certainly be a huge win, and i64 also looks promising thanks to psadbw.</div><div><br></div><div>I'm curious if that trick makes even the v4 cases profitable? Maybe even the v2?</div>
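<div><br></div><div>To make that concrete, here is a minimal intrinsics sketch of the nibble-LUT popcount with psadbw accumulation for 64-bit elements, adapted from the article above (the function name is made up, and this is just the shape of the sequence, not the proposed DAG lowering):</div><div><pre>
#include &lt;tmmintrin.h&gt;  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Per-element popcount for v2i64 via a 16-entry nibble LUT held in a register. */
static __m128i popcnt_v2i64_pshufb(__m128i v) {
  const __m128i lut  = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                     1, 2, 2, 3, 2, 3, 3, 4);
  const __m128i mask = _mm_set1_epi8(0x0f);
  __m128i lo  = _mm_and_si128(v, mask);                    /* low nibbles     */
  __m128i hi  = _mm_and_si128(_mm_srli_epi16(v, 4), mask); /* high nibbles    */
  __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),    /* per-byte counts */
                             _mm_shuffle_epi8(lut, hi));
  /* psadbw against zero sums the eight byte counts in each 64-bit half. */
  return _mm_sad_epu8(cnt, _mm_setzero_si128());
}
</pre></div><div>For i32 or i16 elements you would replace the psadbw step with the shift/mask/add widening rounds mentioned above (bytes to 16-bit counts, then 16-bit to 32-bit for i32, i.e. the 2 iterations).</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div id=":2jy" class="" style="overflow:hidden">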
<br>
Nadav and Hal,<br>
<br>
There are potential benefits for other targets, I believe, but this customisation generates a bunch of vector instructions, and I'm afraid that if one or another vector instruction isn't well supported on a target, that could lead to a lot of scalarized instructions and thus to worse code than before. I might be wrong, though. I'd just rather go in the direction that, if other targets implement it and succeed, we then move it to target-independent code. Additional thoughts?<br></div></blockquote><div><br></div><div>I think the ideal strategy is for targets to select which types we scalarize and handle with a scalar popcnt instruction vs. which we legalize to vector bit math. I'm not sure what the best strategy is for exposing such a hook. Note that you will probably need additional hooks to use the pshufb trick in the lowering -- implementing a 16-way lookup table via a shuffle may not work out well on all targets. =] psadbw is also terribly specific.</div>
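<div><br></div><div>For reference, the existing per-target knob is the per-type operation action set in the target's lowering constructor; a purely hypothetical x86 sketch of that choice (not the actual patch) could look like:</div><div><pre>
// Hypothetical sketch, not the actual patch: per-type choices made in a
// target's TargetLowering constructor.
if (Subtarget->hasSSSE3()) {
  // pshufb is available: custom-lower vector CTPOP to the LUT sequence.
  setOperationAction(ISD::CTPOP, MVT::v16i8, Custom);
  setOperationAction(ISD::CTPOP, MVT::v8i16, Custom);
  setOperationAction(ISD::CTPOP, MVT::v4i32, Custom);
  setOperationAction(ISD::CTPOP, MVT::v2i64, Custom);
} else {
  // Otherwise let the generic legalizer expand (scalarize + bit-math).
  setOperationAction(ISD::CTPOP, MVT::v4i32, Expand);
  setOperationAction(ISD::CTPOP, MVT::v2i64, Expand);
}
</pre></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div id=":2jy" class="" style="overflow:hidden">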
<br>
Actually, back to x86: if popcnt isn't supported by some x86 target, it currently leads to this bit-math scalarized code for each element, and it is always profitable to emit the vectorized code instead - I tested it for v4i32, v2i64, v4i64 and v8i32, and it performs better. I'll update the patch to reflect that. For instance, "-arch x86_64" doesn't assume popcnt by default, since it is a separate feature, so in cases like this we would always win.</div></blockquote></div><br>Yes, for *any* case where the scalar op isn't legal, this is vastly superior lowering.
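<br><br>For what it's worth, the per-element expansion that kicks in when popcnt isn't available is roughly the classic bit-math sequence below (a C sketch of the shape of what gets emitted for each extracted element, on top of the extract/insert traffic around it):<br><pre>
/* Roughly the bit-math popcount expansion for one 32-bit element (the classic
 * parallel-sum sequence); without a popcnt instruction something equivalent
 * runs once per vector element. */
static unsigned popcount32_bitmath(unsigned v) {
  v = v - ((v >> 1) & 0x55555555u);
  v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
  v = (v + (v >> 4)) & 0x0f0f0f0fu;
  return (v * 0x01010101u) >> 24;
}
</pre></div></div>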