[PATCH] Teach the DAGCombiner how to fold a OR of two shufflevector into a single shufflevector node

Thu Mar 6 10:55:57 PST 2014

Hi Hal,
thanks for your review!

Nadav, are you happy with this too?

On Thu, Mar 6, 2014 at 5:50 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Andrea Di Biagio" <andrea.dibiagio at gmail.com>
>> To: "Tim Northover" <t.p.northover at gmail.com>
>> Cc: "llvm-commits at cs.uiuc.edu for LLVM" <llvm-commits at cs.uiuc.edu>
>> Sent: Thursday, March 6, 2014 8:25:22 AM
>> Subject: Re: [PATCH] Teach the DAGCombiner how to fold a OR of two    shufflevector into a single shufflevector node
>>
>> Hi Nadav, Hal and Tim,
>>
>> I modified my original patch adding checks agains
>> 'TLI.isShuffleMaskLegal' as suggested by Hal.
>> This new patch now avoids folding an OR of two shuffles if the
>> resulting shuffle would be illegal for the target.
>>
>> The rules for this new combine are:
>>    // fold (or (shuf A, V_0, MaskA), (shuf B, V_0, MaskB)) -> (shuf
>>    A, B, Mask1)
>>    // fold (or (shuf A, V_0, MaskA), (shuf B, V_0, MaskB)) -> (shuf
>>    B, A, Mask2)
>>
>> Basically:
>> we can take advantage of the fact that OR is commutative and compute
>> two different shuffle masks: Mask1, Mask2.
>> If the dag node in input can be folded, then we check if Mask1 is a
>> legal shuffle mask; in case it folds according to the first rule.
>> Otherwise we checks if Mask2 is legal to see if we can fold according
>> to the second rule.
>> If both Masks are illegal then we conservatively don't fold the OR
>> into a shuffle.
>
> LGTM. Thanks!
>
>  -Hal
>
>>
>> ///
>> @Nadav
>> In my experiments on the x86 (Corei7 with and without AVX support) I
>> noticed that this new patch correctly folds all the interesting cases
>> except this one:
>>
>> define <4 x i32> @foo(<4 x i32> %a, <4 x i32> %b) {
>>   %shuf1 = shufflevector <4 x i32> %a, <4 x i32> zeroinitializer, <4
>>   x
>> i32><i32 0, i32 4, i32 4, i32 4>
>>   %shuf2 = shufflevector <4 x i32> %b, <4 x i32> zeroinitializer, <4
>>   x
>> i32><i32 4, i32 0, i32 3, i32 2>
>>   %or = or <4 x i32> %shuf1, %shuf2
>>   ret <4 x i32> %or
>> }
>>
>> The OR in function @foo is not folded because that would result in an
>> illegal shuffle.
>> However, that particular case could be still optimized into a
>> sequence
>> of two legal shuffle operations (getting rid of the OR entirely).
>> Ideally we could generate the following assembly:
>>     movlhps %xmm1, %xmm0
>>     shufps $-71 %xmm1, %xmm0
>>
>> The transformation would be valid in general if the target can emit
>> (V)MOVLHPS instructions and if each shuffle in input to the OR has
>> only one Use.
>> If ok for you, once this patch is accepted I will send another patch
>> to improve that particular case on the x86 only (adding a target
>> specific combine).
>> ///
>>
>> Please let me know if this new patch is reasonable for you.
>>
>> Thanks in advance!
>> Andrea
>>
>>
>> On Wed, Mar 5, 2014 at 9:37 PM, Tim Northover
>> <t.p.northover at gmail.com> wrote:
>> >> It is okay to generate new shuffles from a collection of
>> >> insert/extracts
>> >> because the new shuffle instructions can't be worse then a
>> >> collection
>> >> of inserts and extracts.
>> >
>> > Thanks Nadav; it sounds like I should remove the TODO and comment
>> > on
>> > the scope of that function then as part of whatever I do.
>> >
>> > Tim.
>> >
>> > _______________________________________________
>> > llvm-commits mailing list
>> > llvm-commits at cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory