<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Sun, Feb 15, 2015 at 5:19 AM, Simon Pilgrim <span dir="ltr"><<a href="mailto:llvm-dev@redking.me.uk" target="_blank">llvm-dev@redking.me.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Author: rksimon<br>
Date: Sun Feb 15 07:19:52 2015<br>
New Revision: 229311<br>
<br>
URL: <a href="http://llvm.org/viewvc/llvm-project?rev=229311&view=rev" target="_blank">http://llvm.org/viewvc/llvm-project?rev=229311&view=rev</a><br>
Log:<br>
[X86][AVX2] vpslldq/vpsrldq byte shifts for AVX2<br>
<br>
This patch refactors the existing lowerVectorShuffleAsByteShift function to add support for 256-bit vectors on AVX2 targets.<br></blockquote><div><br></div><div>Unfortunately, this patch introduces a nasty miscompile. See below.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
It also fixes a tablegen issue that prevented the lowering of vpslldq/vpsrldq vec256 instructions.<br></blockquote><div><br></div><div>And because of this, it's a bit hard to disentangle. It would be better to keep these separate in the future (and yea, i know, i should have said this in the review...)</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">+ int NumElts = VT.getVectorNumElements();<br>
+ int NumLanes = VT.getSizeInBits() / 128;<br>
+ int NumLaneElts = NumElts / NumLanes;<br>
+ int Scale = 16 / NumLaneElts;<br>
+ MVT ShiftVT = MVT::getVectorVT(MVT::i64, 2 * NumLanes);<br>
+<br>
+ // PSLLDQ : (little-endian) left byte shift<br>
+ // [ zz, 0, 1, 2, 3, 4, 5, 6]<br>
+ // [ zz, zz, -1, -1, 2, 3, 4, -1]<br>
+ // [ zz, zz, zz, zz, zz, zz, -1, 1]<br>
+ // PSRLDQ : (little-endian) right byte shift<br>
+ // [ 5, 6, 7, zz, zz, zz, zz, zz]<br>
+ // [ -1, 5, 6, 7, zz, zz, zz, zz]<br>
+ // [ 1, 2, -1, -1, -1, -1, zz, zz]<br>
+ auto MatchByteShift = [&](int Shift) -> SDValue {<br>
+ bool MatchLeft = true, MatchRight = true;<br>
+ for (int l = 0; l < NumElts; l += NumLaneElts) {<br>
+ for (int i = 0; i < Shift; ++i)<br>
+ MatchLeft &= Zeroable[l + i];<br>
+ for (int i = NumLaneElts - Shift; i < NumLaneElts; ++i)<br>
+ MatchRight &= Zeroable[l + i];<br>
}<br>
+ if (!(MatchLeft || MatchRight))<br>
+ return SDValue();<br></blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">+ bool MatchV1 = true, MatchV2 = true;<br>
+ for (int l = 0; l < NumElts; l += NumLaneElts) {<br>
+ unsigned Pos = MatchLeft ? Shift + l : l;<br>
+ unsigned Low = MatchLeft ? l : Shift + l;<br></blockquote><div><br></div><div>What happens if both MatchLeft and MatchRight are true? You don't assert that this can't happen, and in fact, it can. We only test one of them here.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
+ unsigned Len = NumLaneElts - Shift;<br>
+ MatchV1 &= isSequentialOrUndefInRange(Mask, Pos, Len, Low);<br>
+ MatchV2 &= isSequentialOrUndefInRange(Mask, Pos, Len, Low + NumElts);<br>
}<br>
- }<br>
+ if (!(MatchV1 || MatchV2))<br>
+ return SDValue();<br>
+<br>
+ int ByteShift = Shift * Scale;<br>
+ unsigned Op = MatchRight ? X86ISD::VSRLDQ : X86ISD::VSHLDQ;<br></blockquote><div><br></div><div>And then here, we use MatchRight. Boom.</div><div><br></div><div>Here is a test case:</div><div><br></div><div><div>define internal fastcc <8 x i16> @test(<8 x i16> %s.0.0, <8 x i16> %s.0.1) noinline nounwind {</div><div>entry:</div><div> %s.1.0 = shufflevector <8 x i16> %s.0.0, <8 x i16> %s.0.1, <8 x i32> <i32 15, i32 undef, i32 3, i32 undef, i32 12, i32 12, i32 5, i32 undef></div><div><br></div><div> ret <8 x i16> %s.1.0</div><div>}</div></div><div><br></div><div>This gets miscompiled at top-of-tree.</div><div><br></div><div>I tried reverting this patch, and it's a mess. Doesn't revert cleanly. =[ When I manually tried to fix the revert, the above test case got miscompiled *differently* by shifting every byte out. =[</div><div><br></div><div>I'm probably just going to rip this function out until you or I have time to look at re-adding it and ensuring it doesn't have any holes. I don't see any other clean way of "reverting to green" here because the patches don't really back out cleanly.</div><div><br></div><div>Sorry for the trouble, but miscompiled shuffles are *really* hard to track down so I want to get back to a green baseline.</div><div><br></div><div>If you want to test stuff yourself, my shuffle fuzz tester is commited to LLVM. Here is how I run it:</div><div><br></div><div>% seq 1 48 | xargs -P 48 -n 1 zsh -c 'repeat 10000 (python2.7 utils/shuffle_fuzz.py -v --fixed-bit-width 128 --fixed-element-type i16 --triple x86_64-unknown-unknown --max-shuffle-height 1 | llc -mcpu=x86-64 -mattr=+sse2 | clang -x assembler - -o fuzz.$0; ./fuzz.$0 && echo "Success!") >>fuzz.$0.log 2>&1'<br></div><div><br></div><div>You can then 'grep FAIL fuzz.*.log' to find failures. You can reproduce them by taking the UUID in the FAIL line and passing it as the --seed option to shuffle_fuzz.py (along with the exact options you pass it above).</div><div><br></div><div>You can change "48" in the two places above to the number of cores on your machine. And you can increase the total number of fuzz tests by growing the argument to "repeat". The above 480000 runs found the test case I gave in this email and one other failure that I haven't investigated. I'm going to back out the code for byte shift lowering now and then start fuzz testing again.</div><div><br></div><div>-Chandler</div></div></div></div>