[cfe-dev] [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Sat Feb 18 08:11:29 PST 2017

Yes, there is an IR difference between clang 3.9.1 and clang trunk before
any IR transforms are done:
https://godbolt.org/g/FuBqIb

We can't solve this problem (moving a trunc ahead of other vector ops) in
general in IR because we take a conservative approach to vector transforms
in IR. That means the burden for solving the general problem falls on the
front-end or the back-end. If you can bisect to find the clang commit where
this changed, that would be very helpful.

However, I think we can handle one very specific case in instcombine: a
splat that is wider than it needs to be (a "too fat" splat). That will
resolve this exact example, but it will take a couple of patches to get
your example back to the earlier code. Here's a proposal for the first one:
https://reviews.llvm.org/D30123
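
To make the splat case described above concrete, here is a minimal C sketch
of why the two orderings are interchangeable. It is only a scalar model of
the two IR sequences quoted below, not code from D30123, and the helper
names (splat_of_trunc, trunc_of_splat, orderings_agree, check_orderings) are
made up for illustration. Lanes are modeled as uint16_t so that the
narrowing conversion is well-defined C and keeps the low 16 bits, like
trunc does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8

/* Model of the clang 3.9 IR: truncate the scalar once, then splat it. */
static void splat_of_trunc(int32_t n, uint16_t out[LANES]) {
    uint16_t narrow = (uint16_t)n;      /* trunc i32 -> i16 (keeps low 16 bits) */
    for (int i = 0; i < LANES; ++i)
        out[i] = narrow;                /* insertelement + shufflevector splat */
}

/* Model of the clang 4.0 IR: splat the wide scalar, then truncate each lane. */
static void trunc_of_splat(int32_t n, uint16_t out[LANES]) {
    int32_t wide[LANES];
    for (int i = 0; i < LANES; ++i)
        wide[i] = n;                    /* splat into an 8 x i32 vector */
    for (int i = 0; i < LANES; ++i)
        out[i] = (uint16_t)wide[i];     /* trunc <8 x i32> -> <8 x i16> */
}

/* Returns 1 when both orderings produce identical lanes for n. */
int orderings_agree(int32_t n) {
    uint16_t a[LANES], b[LANES];
    splat_of_trunc(n, a);
    trunc_of_splat(n, b);
    return memcmp(a, b, sizeof a) == 0;
}

/* Spot-check a few shift amounts, including ones that do not fit in 16 bits. */
void check_orderings(void) {
    const int32_t samples[] = { 0, 1, 15, -1, 65536, 0x12345, INT32_MIN, INT32_MAX };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(orderings_agree(samples[i]));
}
```

Because every lane of a splat holds the same value, truncating all eight
lanes gives the same result as truncating the scalar once. That is why
moving the trunc ahead of the splat is safe in this narrow case, even though
moving a trunc ahead of arbitrary vector ops is not something we do in
general in IR.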

On Sat, Feb 18, 2017 at 12:33 AM, Saurabh Verma <saurabh.verma at movidius.com>
wrote:

> Thanks Sanjay. Interestingly for me, -disable-llvm-optzns did not make a
> difference in the way the shift was handled. Does the initial IR generated
> for you show this difference when the option is passed?
>
> Best regards
> Saurabh
>
>
> On 17 February 2017 at 19:03, Sanjay Patel <spatel at rotateright.com> wrote:
>
>> I think this is caused by a front-end change (cc'ing clang-dev) because
>> the IR with "-Xclang -disable-llvm-optzns" shows the difference.
>>
>> But independently of that, there's a missing IR canonicalization -
>> instcombine doesn't currently do anything with either version.
>>
>> And the version where we trunc later survives through the backend and
>> produces worse code even for x86 with AVX2:
>> before:
>>     vmovd    %edi, %xmm1
>>     vpmovzxwq    %xmm1, %xmm1
>>     vpsraw    %xmm1, %xmm0, %xmm0
>>     retq
>>
>> after:
>>     vmovd    %edi, %xmm1
>>     vpbroadcastd    %xmm1, %ymm1
>>     vmovdqa    LCPI1_0(%rip), %ymm2
>>     vpshufb    %ymm2, %ymm1, %ymm1
>>     vpermq    $232, %ymm1, %ymm1
>>     vpmovzxwd    %xmm1, %ymm1
>>     vpmovsxwd    %xmm0, %ymm0
>>     vpsravd    %ymm1, %ymm0, %ymm0
>>     vpshufb    %ymm2, %ymm0, %ymm0
>>     vpermq    $232, %ymm0, %ymm0
>>     vzeroupper
>>
>>
>> So this example may have won the bug lottery by exposing front-,
>> middle-, and back-end bugs all at once. :)
>>
>>
>>
>> On Fri, Feb 17, 2017 at 9:38 AM, Saurabh Verma via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Correction in the C snippet:
>>>
>>> typedef signed short v8i16_t   __attribute__((ext_vector_type(8)));
>>>
>>> v8i16_t foo (v8i16_t a, int n)
>>> {
>>>    return a >> n;
>>> }
>>>
>>> Best regards
>>> Saurabh
>>>
>>>
>>>
>>> On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are investigating a difference in code generation for vector splat
>>>> instructions between llvm-3.9 and llvm-4.0, which could lead to a
>>>> performance regression for our target. Here is the C snippet
>>>>
>>>> typedef signed v8i16_t __attribute__((ext_vector_type(8)))
>>>>
>>>> v8i16_t foo (v8i16 a, int n)
>>>> {
>>>>    return result = a >> n;
>>>> }
>>>>
>>>> With llvm-3.9, the generated sequence does a trunc followed by splat,
>>>> but with llvm-4.0 it is reversed to a splat to a bigger vector followed by
>>>> a v8i32->v8i16 trunc. Is this by design? The earlier code sequence is
>>>> definitely better for our target, but are there known scenarios where the
>>>> new sequence would lead to better code?
>>>>
>>>> Here are the instruction sequences generated in the two cases:
>>>>
>>>> With llvm 3.9:
>>>>
>>>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>>>   %3 = trunc i32 %1 to i16
>>>>   %4 = insertelement <8 x i16> undef, i16 %3, i32 0
>>>>   %5 = shufflevector <8 x i16> %4, <8 x i16> undef, <8 x i32> zeroinitializer
>>>>   %6 = ashr <8 x i16> %0, %5
>>>>   ret <8 x i16> %6
>>>> }
>>>>
>>>>
>>>> With llvm 4.0:
>>>>
>>>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>>>   %3 = insertelement <8 x i32> undef, i32 %1, i32 0
>>>>   %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> zeroinitializer
>>>>   %5 = trunc <8 x i32> %4 to <8 x i16>
>>>>   %6 = ashr <8 x i16> %0, %5
>>>>   ret <8 x i16> %6
>>>> }
>>>>
>>>> Best regards
>>>> Saurabh Verma
>>>>