[LLVMdev] branch on vector compare?

Michael LIAO michael.hliao at gmail.com
Wed Sep 5 22:17:47 PDT 2012


On Wed, Sep 5, 2012 at 9:07 AM, Roland Scheidegger <sroland at vmware.com> wrote:
> On 05.09.2012 00:24, Stephen wrote:
>> Roland Scheidegger <sroland <at> vmware.com> writes:
>>> This looks quite similar to something I filed a bug on (12312). Michael
>>> Liao submitted fixes for this, so I think
>>> if you change it to
>>>   %16 = fcmp ogt <4 x float> %15, %cr
>>>   %17 = sext <4 x i1> %16 to <4 x i32>
>>>   %18 = bitcast <4 x i32> %17 to i128
>>>   %19 = icmp ne i128 %18, 0
>>>   br i1 %19, label %true1, label %false2
>>>
>>> should do the trick (one cmpps + one ptest + one br instruction).
>>> This, however, requires sse41, which I don't know if you have - you say
>>> the extractelements go through memory, which I've never seen; then again,
>>> our code didn't try to extract the i1 directly. (Even without fixes for
>>> ptest, the above sequence results in only 2 extraction steps instead of 4
>>> if you're on x64 and the cpu supports sse41, but I guess without sse41,
>>> and hence no pextrd/q, it will probably also go through memory.)
>>> Though on altivec this sequence might not produce anything good; the
>>> free sext requires llvm 2.7 on x86 to work at all (certainly shouldn't
>>> be a problem nowadays, but on other backends it might be different), and
>>> the ptest sequence requires very recent svn.
>>> I don't think the current code can generate movmskps + test (probably
>>> the next best thing without sse41) instead of ptest, though, if you've
>>> only got sse.
>>
>>
>> Thanks Roland, sign extending gets me part of the way at least.
>> I'm on version 3.1 and, as you say in the bug report, there are a
>> few extraneous instructions. For the record, casting to a <4 x i8>
>> seems to do a better job for x86 (shuffle, movd, test, jump). Using
>> <4 x i32> seems to issue a pextrd for each element. For x64, it seems
>> to be the same for either. I suppose it's all academic seeing as the
>> ptest patch looks good.
>
> Yes, the <4 x i8> cast looks like a good idea. Just be careful if you
> also need to target cpus without ssse3: IIRC, without pshufb this will
> create some horrible code (though that may have been with an older llvm
> version). Then again, if you don't have ssse3 you also won't have pextrd,
> which means more shuffling to extract the values if you sign-extend them
> to <4 x i32> too (if you're targeting altivec, there's probably no such
> issue, as I think it doesn't have such blatantly missing shuffle
> instructions).
> But yes, ptest looks like the obvious winner. For cpus that don't have
> sse41 (and there are tons of them still in use, not to mention still
> being sold), it would be nice if llvm could come up with
> pmovmskb/movmskps/movmskpd + test (these instructions look like they were
> intended for exactly that use case, after all). But the <4 x i8>
> sign-extend solution shouldn't hurt performance too much either, if
> you've got ssse3.

If all you need is to test whether all elements' flags are clear (or any
is set), we could add pseudo-PTEST support on CPUs without SSE4.1, i.e.

we could replace

cmpltps  %xmm0, %xmm1
ptest    %xmm1, %xmm1
jz       LABEL

with

cmpltps  %xmm0, %xmm1
movmskps %xmm1, %r8d
test     %r8d, %r8d
jz       LABEL

This looks much more efficient to me and relies only on SSE. But we have
to ensure that the two operands to PTEST are the same and that the value
was generated by a packed CMP.

I am still figuring out how to simplify the checking of these preconditions.
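
In the meantime, a frontend that only needs SSE can ask for movmskps
explicitly through the existing x86 intrinsic. A minimal sketch (the
function name and comparison direction are illustrative):

  declare i32 @llvm.x86.sse.movmsk.ps(<4 x float>)

  define i1 @any_lane_lt(<4 x float> %a, <4 x float> %b) {
    %cmp = fcmp olt <4 x float> %a, %b                       ; cmpltps
    %ext = sext <4 x i1> %cmp to <4 x i32>                   ; all-ones / all-zeros lanes
    %f   = bitcast <4 x i32> %ext to <4 x float>
    %msk = call i32 @llvm.x86.sse.movmsk.ps(<4 x float> %f)  ; movmskps
    %any = icmp ne i32 %msk, 0                               ; test + setne/jnz
    ret i1 %any
  }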

On a somewhat off-topic note: most vector IR so far operates element-wise,
i.e. vertically. The generalized issue, from both this thread and PR12312,
is that we have no simple way to express horizontal operations, e.g.
primitives like

float %s = reduce fadd <N x float> %x
i32 %m = reduce max <N x i32> %x
i1 %c = any <N x i1> %x   (or: i1 %c = reduce or <N x i1> %x)
i1 %c = all <N x i1> %x   (or: i1 %c = reduce and <N x i1> %x)
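
Today a frontend has to open-code these. The fadd reduction, for example,
becomes a log2 shuffle ladder; a sketch for <4 x float> (function name
illustrative - note the pairwise order reassociates the adds, so it only
matches a strict in-order reduction under relaxed FP semantics):

  define float @reduce_fadd4(<4 x float> %x) {
    ; fold the upper half onto the lower half
    %hi = shufflevector <4 x float> %x, <4 x float> undef,
                        <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
    %s1 = fadd <4 x float> %x, %hi
    ; fold lane 1 onto lane 0
    %lo = shufflevector <4 x float> %s1, <4 x float> undef,
                        <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
    %s2 = fadd <4 x float> %s1, %lo
    %r  = extractelement <4 x float> %s2, i32 0
    ret float %r
  }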

One more interesting example would be scan - a horizontal operation that
still produces a vector result:

<N x i32> %s = scan add <N x i32> %x, 0  ; exclusive scan
<N x i32> %s = scan add <N x i32> %x, 1  ; inclusive scan
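
For %x = <1, 2, 3, 4>, the inclusive scan add yields <1, 3, 6, 10> and the
exclusive one <0, 1, 3, 6>. Spelled out by hand today, the inclusive form
is a shift-and-add ladder over shuffles (a sketch; function name
illustrative):

  define <4 x i32> @scan_add_incl(<4 x i32> %x) {
    ; add the neighbour one lane to the left (zero shifted in at lane 0)
    %t1 = shufflevector <4 x i32> %x, <4 x i32> zeroinitializer,
                        <4 x i32> <i32 4, i32 0, i32 1, i32 2>
    %s1 = add <4 x i32> %x, %t1
    ; then the neighbour two lanes to the left
    %t2 = shufflevector <4 x i32> %s1, <4 x i32> zeroinitializer,
                        <4 x i32> <i32 4, i32 5, i32 0, i32 1>
    %s2 = add <4 x i32> %s1, %t2
    ret <4 x i32> %s2
  }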

With these primitives, some workloads could be expressed more simply in IR,
and backends (like X86) could support some of them directly.

- michael


>
> Roland


