[LLVMdev] branch on vector compare?

Wed Sep 5 09:07:48 PDT 2012

On 05.09.2012 00:24, Stephen wrote:
> Roland Scheidegger <sroland <at> vmware.com> writes:
>> This looks quite similar to something I filed a bug on (12312). Michael
>> Liao submitted fixes for this, so I think
>> if you change it to
>>   %16 = fcmp ogt <4 x float> %15, %cr
>>   %17 = sext <4 x i1> %16 to <4 x i32>
>>   %18 = bitcast <4 x i32> %17 to i128
>>   %19 = icmp ne i128 %18, 0
>>   br i1 %19, label %true1, label %false2
>>
>> should do the trick (one cmpps + one ptest + one br instruction).
>> This, however, requires sse41, which I don't know if you have. You say
>> the extractelements go through memory, which I've never seen, but then
>> again our code didn't try to extract the i1 directly. (Even without the
>> ptest fixes, the above sequence will result in only 2 extraction steps
>> instead of 4 if you're on x64 and the cpu supports sse41, but I guess
>> without sse41, and hence no pextrd/q, it will probably also go through
>> memory.)
>> On altivec this sequence might not produce anything good, though. The
>> free sext requires llvm 2.7 on x86 to work at all (certainly not a
>> problem nowadays, but on other backends it might be different), and
>> the ptest sequence requires very recent svn.
>> I don't think the current code can generate movmskps + test (probably
>> the next best thing without sse41) instead of ptest if you only have
>> sse, though.
> 
> 
> Thanks Roland, sign extending gets me part of the way at least.
> I'm on version 3.1 and, as you say in the bug report, there are a
> few extraneous instructions. For the record, casting to a <4 x i8>
> seems to do a better job for x86 (shuffle, movd, test, jump). Using
> <4 x i32> seems to issue a pextrd for each element. For x64, it seems
> to be the same for either. I suppose it's all academic seeing as the
> ptest patch looks good.

Yes, the <4 x i8> cast looks like a good idea. Just be careful if you
also need to target cpus without ssse3: IIRC, without pshufb this will
create some horrible code (that could have been with an older llvm
version, though). Then again, if you don't have ssse3 you also won't
have pextrd, which means more shuffling to extract the values if you
sign-extend them to <4 x i32> as well. (If you're targeting altivec there
is probably no such issue, as I think it doesn't have such blatantly
missing shuffle instructions.)
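
For reference, the <4 x i8> variant would look roughly like the sketch
below. It mirrors the quoted <4 x i32> sequence; the function name and
argument names are made up for illustration, and what the backend
actually emits for it depends on the llvm version and the enabled
cpu features.

```llvm
; Hypothetical helper: returns true if any lane of %a is greater than %b.
define i1 @any_gt_v4i8(<4 x float> %a, <4 x float> %b) {
  ; Lane-wise compare gives a <4 x i1> mask.
  %cmp = fcmp ogt <4 x float> %a, %b
  ; Sign-extend each i1 to an i8, so every lane is 0x00 or 0xFF.
  %ext = sext <4 x i1> %cmp to <4 x i8>
  ; View the four bytes as one i32; it is nonzero iff any lane was true.
  %bits = bitcast <4 x i8> %ext to i32
  %any = icmp ne i32 %bits, 0
  ret i1 %any
}
```
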
But yes, ptest looks like the obvious winner. For cpus without sse41
(and there are tons of them still in use, not to mention still sold) it
would be nice if llvm could come up with pmovmskb/movmskps/movmskpd +
test (these instructions look like they were intended for exactly that
use case, after all). But the <4 x i8> sign-extend solution shouldn't
hurt performance too much either, if you've got ssse3.
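
Until the backend selects movmskps on its own, one way to get it on
sse-only cpus is to ask for it explicitly through the x86 intrinsic.
This is only a sketch: the function name is made up, and it ties the IR
to x86, so it is not a substitute for a proper fix in instruction
selection.

```llvm
; x86-specific intrinsic: collects the sign bit of each float lane
; into the low 4 bits of an i32.
declare i32 @llvm.x86.sse.movmsk.ps(<4 x float>)

; Hypothetical helper: returns true if any lane of %a is greater than %b.
define i1 @any_gt_movmsk(<4 x float> %a, <4 x float> %b) {
  %cmp = fcmp ogt <4 x float> %a, %b
  ; All-ones lanes have their sign bit set, all-zero lanes do not.
  %ext = sext <4 x i1> %cmp to <4 x i32>
  %asfloat = bitcast <4 x i32> %ext to <4 x float>
  %mask = call i32 @llvm.x86.sse.movmsk.ps(<4 x float> %asfloat)
  ; Nonzero mask means at least one lane compared true.
  %any = icmp ne i32 %mask, 0
  ret i1 %any
}
```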

Roland