<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 4, 2014 at 8:23 AM, Bruno Cardoso Lopes <span dir="ltr"><<a href="mailto:bruno.cardoso@gmail.com" target="_blank">bruno.cardoso@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":ced" class="a3s" style="overflow:hidden">Hi nadav, chandlerc, andreadb, delena,<br>

<br>

This patch adds x86 custom lowering for the @llvm.ctpop.v8i32 intrinsic.<br>

<br>

Currently, the expansion of @llvm.ctpop.v8i32 uses vector element extractions,<br>

insertions and individual calls to @llvm.ctpop.i32. Local Haswell measurements<br>

show that lowering @llvm.ctpop.v8i32 with a vector-parallel bit-twiddling sequence is faster<br>

than calling @llvm.ctpop.i32 on each element. The sequence is based on:<br>

<br>

v = v - ((v >> 1) & 0x55555555);<br>

v = (v & 0x33333333) + ((v >> 2) & 0x33333333);<br>

v = (v + (v >> 4)) & 0xF0F0F0F<br>

v = v + (v >> 8)<br>

v = v + (v >> 16)<br>

v = v & 0x0000003F<br>

(from <a href="http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel" target="_blank">http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel</a>)<br>

<br>

A toy microbenchmark showed a ~2x speedup for v8i32, whereas vector types with fewer elements<br>

are still faster with the old approach (see results below). Hence this<br>

patch only implements it for the v8i32 type. The results suggest it might also be profitable<br>

to implement this approach for v32i8 and v16i16, but I haven't measured that yet.<br>

<br>

With AVX1, ctpop.v8i32 is split into two ctpop.v4i32, which is only slightly better than the old expansion. However,<br>

this patch does not implement custom lowering for the general ctpop.v4i32 type, since it's not profitable.</div></blockquote></div><br>These timings are pretty strange.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Can you post the code produced for the old lowering and the new lowering? I'm wondering if there is something about the old lowering that makes it unreasonably slow.</div><div class="gmail_extra"><br></div><div class="gmail_extra">It would be really nice to have a more principled split here such as using the bit-math version when a scalarized form would require extracting from multiple 128-bit lanes, or when there are more than N vector elements. </div></div>
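<div dir="ltr"><div class="gmail_extra"><br></div><div class="gmail_extra">For reference, the bit-twiddling sequence quoted above can be sketched in plain C roughly as follows. This is only an illustration: the function names (popcount32_parallel, popcount32_naive, popcount_v8i32) are made up for this example, and the eight-element loop merely stands in for the v8i32 lanes. It is not the x86 lowering from the patch and says nothing about the generated code or its timing.</div></div>

```c
#include <stdint.h>

/* Parallel bit count on one 32-bit value, following the quoted sequence.
 * Each step sums adjacent bit fields of doubling width. */
static uint32_t popcount32_parallel(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit field sums */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit field sums */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* per-byte sums    */
    v = v + (v >> 8);                                 /* fold byte pairs  */
    v = v + (v >> 16);                                /* fold the halves  */
    return v & 0x0000003Fu;                           /* result is 0..32  */
}

/* Reference implementation: test one bit at a time. */
static uint32_t popcount32_naive(uint32_t v) {
    uint32_t n = 0;
    while (v) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* Hypothetical stand-in for the v8i32 case: the same operation applied
 * independently to eight 32-bit lanes. */
static void popcount_v8i32(const uint32_t in[8], uint32_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = popcount32_parallel(in[i]);
}
```

<div dir="ltr"><div class="gmail_extra">Every step is a shift, mask, add or subtract applied identically to each lane, which is why the sequence maps onto whole-vector integer instructions without any element extraction or insertion.</div></div>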