[cfe-dev] [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Fri Feb 17 11:03:53 PST 2017

I think this is caused by a front-end change (cc'ing clang-dev) because the
IR with "-Xclang -disable-llvm-optzns" shows the difference.

But independently of that, there's a missing IR canonicalization -
instcombine doesn't currently do anything with either version.

And the version where we trunc later survives through the backend and
produces worse code even for x86 with AVX2:
before:
    vmovd    %edi, %xmm1
    vpmovzxwq    %xmm1, %xmm1
    vpsraw    %xmm1, %xmm0, %xmm0
    retq

after:
    vmovd    %edi, %xmm1
    vpbroadcastd    %xmm1, %ymm1
    vmovdqa    LCPI1_0(%rip), %ymm2
    vpshufb    %ymm2, %ymm1, %ymm1
    vpermq    $232, %ymm1, %ymm1
    vpmovzxwd    %xmm1, %ymm1
    vpmovsxwd    %xmm0, %ymm0
    vpsravd    %ymm1, %ymm0, %ymm0
    vpshufb    %ymm2, %ymm0, %ymm0
    vpermq    $232, %ymm0, %ymm0
    vzeroupper

So this example may have won the bug lottery by exposing front-end,
middle-end, and back-end bugs all at once. :)
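
To make the missing canonicalization concrete, here is a rough scalar model
in plain C of the two IR shapes from the original report. It is only a
sketch: the function names and the LANES constant are made up for
illustration, and unsigned 16-bit lanes are used so the narrowing conversion
is well defined in C. It shows that truncating before or after the splat
produces the same shift-amount vector, which is why the cheaper
trunc-then-splat form looks like a reasonable canonical choice.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8 /* hypothetical constant: lane count of <8 x i16> */

/* llvm-3.9 shape: truncate the scalar once, then splat the 16-bit value. */
static void splat_after_trunc(int32_t n, uint16_t out[LANES]) {
    uint16_t t = (uint16_t)n; /* one scalar trunc, keeps the low 16 bits */
    for (int i = 0; i < LANES; ++i)
        out[i] = t;
}

/* llvm-4.0 shape: splat the 32-bit value, then truncate every lane. */
static void trunc_after_splat(int32_t n, uint16_t out[LANES]) {
    int32_t wide[LANES];
    for (int i = 0; i < LANES; ++i)
        wide[i] = n; /* splat into the wider vector */
    for (int i = 0; i < LANES; ++i)
        out[i] = (uint16_t)wide[i]; /* per-lane trunc */
}

/* Returns 1 when both shapes yield the same shift-amount vector. */
static int shapes_agree(int32_t n) {
    uint16_t a[LANES], b[LANES];
    splat_after_trunc(n, a);
    trunc_after_splat(n, b);
    return memcmp(a, b, sizeof a) == 0;
}

/* Illustrative spot check over a few inputs, including out-of-range ones. */
static void check_some_inputs(void) {
    assert(shapes_agree(0));
    assert(shapes_agree(3));
    assert(shapes_agree(0x12345));
    assert(shapes_agree(-1));
}
```

Calling check_some_inputs() should pass silently. This only models the
values involved; it says nothing about how any particular backend lowers
either form.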

On Fri, Feb 17, 2017 at 9:38 AM, Saurabh Verma via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Correction in the C snippet:
>
> typedef signed short v8i16_t   __attribute__((ext_vector_type(8)));
>
> v8i16_t foo (v8i16_t a, int n)
> {
>    return a >> n;
> }
>
> Best regards
> Saurabh
>
>
>
> On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
> wrote:
>
>> Hello,
>>
>> We are investigating a difference in code generation for vector splat
>> instructions between llvm-3.9 and llvm-4.0, which could lead to a
>> performance regression for our target. Here is the C snippet
>>
>> typedef signed v8i16_t __attribute__((ext_vector_type(8)))
>>
>> v8i16_t foo (v8i16 a, int n)
>> {
>>    return result = a >> n;
>> }
>>
>> With llvm-3.9, the generated sequence does a trunc followed by splat, but
>> with llvm-4.0 it is reversed to a splat to a bigger vector followed by a
>> v8i32->v8i16 trunc. Is this by design? The earlier code sequence is
>> definitely better for our target, but are there known scenarios where the
>> new sequence would lead to better code?
>>
>> Here are the instruction sequences generated in the two cases:
>>
>> With llvm 3.9:
>>
>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>   %3 = trunc i32 %1 to i16
>>   %4 = insertelement <8 x i16> undef, i16 %3, i32 0
>>   %5 = shufflevector <8 x i16> %4, <8 x i16> undef, <8 x i32> zeroinitializer
>>   %6 = ashr <8 x i16> %0, %5
>>   ret <8 x i16> %6
>> }
>>
>>
>> With llvm 4.0:
>>
>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>   %3 = insertelement <8 x i32> undef, i32 %1, i32 0
>>   %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> zeroinitializer
>>   %5 = trunc <8 x i32> %4 to <8 x i16>
>>   %6 = ashr <8 x i16> %0, %5
>>   ret <8 x i16> %6
>> }
>>
>> Best regards
>> Saurabh Verma
>>
>
>