<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Hi Chandler,<div><br></div><div><div style="margin: 0px;">I don’t see any scary performance changes outside of the noise (1%) when I set "all shuffles as legal" to true. I haven’t tested the vector widening yet.</div><div style="margin: 0px; min-height: 13px;"><br></div><div style="margin: 0px;">A minor concern: in several situations I’m seeing that vpermilps has been replaced with shufps with repeated inputs (see below). I’m also seeing cases where two shufps are now used where a single shufps/vpermilps was generated before. I’ll try to reduce some examples for bugzilla.</div><div style="margin: 0px; min-height: 13px;"><br></div><div style="margin: 0px; min-height: 13px;">old:</div><div style="margin: 0px;"> vshufps $19, %xmm3, %xmm2, %xmm2 # xmm2 = xmm2[3,0],xmm3[1,0]</div><div style="margin: 0px;"> vpermilps $45, %xmm2, %xmm2 # xmm2 = xmm2[1,3,2,0]</div><div style="margin: 0px; min-height: 13px;"><br></div><div style="margin: 0px; min-height: 13px;">new:</div><div style="margin: 0px;"> vshufps $76, %xmm3, %xmm2, %xmm2 # xmm2 = xmm2[0,3],xmm3[0,1]</div><div style="margin: 0px;"> vshufps $120, %xmm2, %xmm2, %xmm2 # xmm2 = xmm2[0,2,3,1]</div><div style="margin: 0px; min-height: 13px;"><br></div><div style="margin: 0px;">This may be fixable with improvements to DAGCombiner::visitVECTOR_SHUFFLE: it uses shuffle mask legality to decide whether to commute candidate shuffle masks, which we never do with always-legal shuffles. 
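</div><div style="margin: 0px; min-height: 13px;"><br></div><div style="margin: 0px;">For anyone cross-checking the immediates in the old/new sequences above, here is a minimal Python sketch of how an 8-bit shufps/vpermilps immediate splits into four 2-bit lane selectors. The helper names are hypothetical and this is only an illustration of the encoding, not anything taken from the LLVM sources.</div>
<pre>
```python
def decode_shuffle_imm(imm):
    """Split an 8-bit shufps/vpermilps immediate into four 2-bit lane selectors.

    Selector i occupies bits [2*i+1 : 2*i] and gives the source lane that
    ends up in result lane i.
    """
    if imm not in range(256):
        raise ValueError("immediate must fit in 8 bits")
    return [(imm >> (2 * i)) & 3 for i in range(4)]


def describe_shufps(imm, src1, src2):
    """Format a shufps immediate the way the asm comments above do.

    Result lanes 0-1 are picked from src1, lanes 2-3 from src2.
    """
    s = decode_shuffle_imm(imm)
    return f"{src1}[{s[0]},{s[1]}],{src2}[{s[2]},{s[3]}]"


if __name__ == "__main__":
    print(describe_shufps(19, "xmm2", "xmm3"))  # old: first shuffle
    print(decode_shuffle_imm(45))               # old: vpermilps mask
    print(describe_shufps(76, "xmm2", "xmm3"))  # new: first shuffle
    print(decode_shuffle_imm(120))              # new: second shuffle mask
```
</pre>
<div style="margin: 0px;">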
Better optimisation may be possible if we can optimise more shuffle(shuffle(A, B, M0), shuffle(C, D, M1), M2) patterns as well.</div></div><div><br></div><div>Cheers, Simon.</div><div><br><div><div>On 25 Jan 2015, at 22:15, Sanjay Patel <<a href="mailto:spatel@rotateright.com">spatel@rotateright.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="ltr">I ran the benchmarking subset of test-suite on a btver2 machine and optimizing for btver2 (so enabling AVX codegen).<br><br>I don't see anything outside of the noise with x86-experimental-vector-shuffle-legality=1.</div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jan 23, 2015 at 5:19 AM, Andrea Di Biagio <span dir="ltr"><<a href="mailto:andrea.dibiagio@gmail.com" target="_blank">andrea.dibiagio@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto;">Hi Chandler,<br>
<span class=""><br>
On Fri, Jan 23, 2015 at 8:15 AM, Chandler Carruth <<a href="mailto:chandlerc@gmail.com">chandlerc@gmail.com</a>> wrote:<br>
> Greetings LLVM hackers and x86 vector shufflers!<br>
><br>
> I would like to flip on another chunk of the new vector shuffling,<br>
> specifically the logic to mark ~all shuffles as "legal".<br>
><br>
> This can be tested today with the flag<br>
> "-x86-experimental-vector-shuffle-legality". I would essentially like to<br>
> make this the default (by removing the "false" path). Doing this will allow<br>
> me to completely delete the old vector shuffle lowering.<br>
><br>
> I've got the patches prepped and ready to go, but it will likely have a<br>
> significant impact on performance. Notably, a bunch of the remaining domain<br>
> crossing bugs I'm seeing are due to this. The key thing to realize is that<br>
> vector shuffle combining is *much* more powerful when we think all of these<br>
> are legal, and so we combine away bad shuffles that would trigger domain<br>
> crosses.<br>
<br>
</span>That's good news!<br>
Also, I really like your idea of making all shuffles legal by default.<br>
I remember I did some experiments disabling the checks for legal<br>
shuffles in the DAGCombiner to see how well the new shuffle lowering<br>
coped with 'overly' aggressive shuffle combining. I was surprised to<br>
see that from eyeballing the generated code it looked much cleaner<br>
(although I didn't test it extensively). Our target is btver2, so I<br>
also didn't look at the codegen for targets with no<br>
AVX/SSE4.1, where there might be fewer opportunities to match a shuffle<br>
with a single target instruction during legalization.<br>
<span class=""><br>
><br>
> All of my benchmarks have come back performance neutral overall with a few<br>
> benchmarks improving. However, there may be some regressions that folks want<br>
> to track down first. I'd really like to get those reported and prioritize<br>
> among the vector shuffle work so we can nuke several *thousand* lines of<br>
> code from X86ISelLowering.cpp. =D<br>
<br>
</span>I'll see if I can get some numbers from our internal codebase and help<br>
with reporting potential regressions.<br>
<br>
Thanks,<br>
Andrea<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
> Thanks!<br>
> -Chandler<br>
><br>
><br>
> PS: If you're feeling adventurous, the next big mode flip flag I want to see<br>
> changed is -x86-experimental-vector-widening-legalization, but this is a<br>
> much deeper change to the entire vector legalization strategy, so I want<br>
> to do it second and separately.<br>
</div></div></blockquote></div><br></div>
</blockquote></div><br></div></body></html>