[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

James Molloy via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 1 11:41:02 PDT 2015


Hi Hal,

It's still my most active patch; I'm working on it. I hoped to complete it today, but unfortunately that didn't happen.

James



> On 1 Oct 2015, at 18:29, Hal Finkel <hfinkel at anl.gov> wrote:
>
> ----- Original Message -----
>> From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: llvm-commits at lists.llvm.org, reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato golin"
>> <renato.golin at linaro.org>, "james molloy" <james.molloy at arm.com>
>> Sent: Thursday, October 1, 2015 12:16:56 PM
>> Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
>>
>> Hi Hal,
>>
>>> This pattern does not look hard to match in the backend, albeit a
>>> bit tedious.
>>> Did you try and run into problems?
>>
>> No, I did not try that.
>
> Also, as I recall, James was working on a patch (or helping to review one?) to help eliminate those extra zexts from this kind of vectorized code. James, can you comment on what happened there?
>
> Thanks again,
> Hal
>
>>
>> Regards,
>> Shahid
>>
>>> -----Original Message-----
>>> From: Hal Finkel [mailto:hfinkel at anl.gov]
>>> Sent: Thursday, October 01, 2015 5:58 PM
>>> To: Shahid, Asghar-ahmad
>>> Cc: llvm-commits at lists.llvm.org;
>>> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org;
>>> renato golin; james molloy
>>> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
>>> corresponding SDNodes for horizontal sum operation.
>>>
>>> ----- Original Message -----
>>>> From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>> Cc: llvm-commits at lists.llvm.org,
>>>> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
>>>> "renato golin" <renato.golin at linaro.org>, "james molloy"
>>>> <james.molloy at arm.com>
>>>> Sent: Thursday, October 1, 2015 3:32:30 AM
>>>> Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
>>>> corresponding SDNodes for horizontal sum operation.
>>>>
>>>> Hi Hal,
>>>>
>>>> Reply inlined.
>>>>
>>>>> -----Original Message-----
>>>>> From: Hal Finkel [mailto:hfinkel at anl.gov]
>>>>> Sent: Thursday, October 01, 2015 4:14 AM
>>>>> To: Shahid, Asghar-ahmad
>>>>> Cc: llvm-commits at lists.llvm.org;
>>>>> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org;
>>>>> renato golin; james molloy
>>>>> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
>>>>> and corresponding SDNodes for horizontal sum operation.
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
>>>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>>>> Cc: llvm-commits at lists.llvm.org,
>>>>>> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
>>>>>> "renato golin" <renato.golin at linaro.org>, "james molloy"
>>>>>> <james.molloy at arm.com>
>>>>>> Sent: Wednesday, September 30, 2015 1:19:06 AM
>>>>>> Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
>>>>>> and corresponding SDNodes for horizontal sum operation.
>>>>>>
>>>>>> Hi Hal,
>>>>>>
>>>>>> Let me present the full context. For the example below:
>>>>>> ===============================================
>>>>>> #include <stdlib.h> /* for abs() */
>>>>>>
>>>>>> unsigned sad_fun(unsigned char *a1, unsigned char *a2) {
>>>>>>   unsigned sum = 0;
>>>>>>   for (int i = 0; i < 16; i++) {
>>>>>>     sum += abs(a1[i] - a2[i]);
>>>>>>   }
>>>>>>   return sum;
>>>>>> }
>>>>>> ===============================================
>>>>>>
>>>>>> LLVM IR:
>>>>>>
>>>>>> ; ModuleID = 'sad_1.c'
>>>>>> target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
>>>>>> target triple = "x86_64-unknown-linux-gnu"
>>>>>>
>>>>>> ; Function Attrs: nounwind readonly uwtable
>>>>>> define i32 @sad_fun(i8* nocapture readonly %a1, i8* nocapture readonly %a2) #0 {
>>>>>> entry:
>>>>>>   br label %vector.body
>>>>>>
>>>>>> vector.body:                                      ; preds = %entry
>>>>>>   %0 = bitcast i8* %a1 to <4 x i8>*
>>>>>>   %wide.load = load <4 x i8>, <4 x i8>* %0, align 1, !tbaa !1
>>>>>>   %1 = zext <4 x i8> %wide.load to <4 x i32>
>>>>>>   %2 = bitcast i8* %a2 to <4 x i8>*
>>>>>>   %wide.load12 = load <4 x i8>, <4 x i8>* %2, align 1, !tbaa !1
>>>>>>   %3 = zext <4 x i8> %wide.load12 to <4 x i32>
>>>>>>   %4 = sub nsw <4 x i32> %1, %3
>>>>>>   %5 = icmp sgt <4 x i32> %4, <i32 -1, i32 -1, i32 -1, i32 -1>
>>>>>>   %6 = sub nsw <4 x i32> zeroinitializer, %4
>>>>>>   %7 = select <4 x i1> %5, <4 x i32> %4, <4 x i32> %6
>>>>>>   %8 = getelementptr inbounds i8, i8* %a1, i64 4
>>>>>>   %9 = bitcast i8* %8 to <4 x i8>*
>>>>>>   %wide.load.1 = load <4 x i8>, <4 x i8>* %9, align 1, !tbaa !1
>>>>>>   %10 = zext <4 x i8> %wide.load.1 to <4 x i32>
>>>>>>   %11 = getelementptr inbounds i8, i8* %a2, i64 4
>>>>>>   %12 = bitcast i8* %11 to <4 x i8>*
>>>>>>   %wide.load12.1 = load <4 x i8>, <4 x i8>* %12, align 1, !tbaa !1
>>>>>>   %13 = zext <4 x i8> %wide.load12.1 to <4 x i32>
>>>>>>   %14 = sub nsw <4 x i32> %10, %13
>>>>>>   %15 = icmp sgt <4 x i32> %14, <i32 -1, i32 -1, i32 -1, i32 -1>
>>>>>>   %16 = sub nsw <4 x i32> zeroinitializer, %14
>>>>>>   %17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16
>>>>>>   %18 = add nsw <4 x i32> %17, %7
>>>>>>   %19 = getelementptr inbounds i8, i8* %a1, i64 8
>>>>>>   %20 = bitcast i8* %19 to <4 x i8>*
>>>>>>   %wide.load.2 = load <4 x i8>, <4 x i8>* %20, align 1, !tbaa !1
>>>>>>   %21 = zext <4 x i8> %wide.load.2 to <4 x i32>
>>>>>>   %22 = getelementptr inbounds i8, i8* %a2, i64 8
>>>>>>   %23 = bitcast i8* %22 to <4 x i8>*
>>>>>>   %wide.load12.2 = load <4 x i8>, <4 x i8>* %23, align 1, !tbaa !1
>>>>>>   %24 = zext <4 x i8> %wide.load12.2 to <4 x i32>
>>>>>>   %25 = sub nsw <4 x i32> %21, %24
>>>>>>   %26 = icmp sgt <4 x i32> %25, <i32 -1, i32 -1, i32 -1, i32 -1>
>>>>>>   %27 = sub nsw <4 x i32> zeroinitializer, %25
>>>>>>   %28 = select <4 x i1> %26, <4 x i32> %25, <4 x i32> %27
>>>>>>   %29 = add nsw <4 x i32> %28, %18
>>>>>>   %30 = getelementptr inbounds i8, i8* %a1, i64 12
>>>>>>   %31 = bitcast i8* %30 to <4 x i8>*
>>>>>>   %wide.load.3 = load <4 x i8>, <4 x i8>* %31, align 1, !tbaa !1
>>>>>>   %32 = zext <4 x i8> %wide.load.3 to <4 x i32>
>>>>>>   %33 = getelementptr inbounds i8, i8* %a2, i64 12
>>>>>>   %34 = bitcast i8* %33 to <4 x i8>*
>>>>>>   %wide.load12.3 = load <4 x i8>, <4 x i8>* %34, align 1, !tbaa !1
>>>>>>   %35 = zext <4 x i8> %wide.load12.3 to <4 x i32>
>>>>>>   %36 = sub nsw <4 x i32> %32, %35
>>>>>>   %37 = icmp sgt <4 x i32> %36, <i32 -1, i32 -1, i32 -1, i32 -1>
>>>>>>   %38 = sub nsw <4 x i32> zeroinitializer, %36
>>>>>>   %39 = select <4 x i1> %37, <4 x i32> %36, <4 x i32> %38
>>>>>>   %40 = add nsw <4 x i32> %39, %29
>>>>>>   %rdx.shuf = shufflevector <4 x i32> %40, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
>>>>>>   %bin.rdx = add <4 x i32> %40, %rdx.shuf
>>>>>>   %rdx.shuf13 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
>>>>>>   %bin.rdx14 = add <4 x i32> %bin.rdx, %rdx.shuf13
>>>>>>   %41 = extractelement <4 x i32> %bin.rdx14, i32 0
>>>>>>   ret i32 %41
>>>>>> }
>>>>>>
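>>>>>> In scalar terms, the shufflevector/add tail above is a two-step
>>>>>> tree reduction, roughly equivalent to this illustrative C sketch
>>>>>> (the helper name is hypothetical):
>>>>>>
>>>>>> unsigned reduce4(const unsigned v[4]) {
>>>>>>   /* %rdx.shuf + %bin.rdx: lanes {0,1} get lanes {2,3} added in */
>>>>>>   unsigned t0 = v[0] + v[2];
>>>>>>   unsigned t1 = v[1] + v[3];
>>>>>>   /* %rdx.shuf13 + %bin.rdx14: lane 0 gets lane 1, then extract */
>>>>>>   return t0 + t1;
>>>>>> }
>>>>>>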
>>>>>> In the above LLVM IR, the vectorizer has vectorized the SAD
>>>>>> pattern with vector type v4xi8. On X86, for example, we want to
>>>>>> generate the "v16xi8" version of the *PSAD* instruction. To do
>>>>>> this, however, we need to catch the full chain of the vectorized
>>>>>> absolute difference and the reduction sum. We found this
>>>>>> difficult to pattern match in codegen and thought to go the
>>>>>> intrinsic way.
>>>>>
>>>>> But you need the vectorizer to vectorize in terms of v16xi8
>>>>> regardless of how you match the sum. You say that the pattern is
>>>>> difficult to match, but do you mean matching it with the right or
>>>>> the wrong vectorization factor?
>>>>
>>>> I mean matching with the right vectorization factor for a given
>>>> target.
>>>
>>> Good.
>>>
>>> This pattern does not look hard to match in the backend, albeit a
>>> bit tedious.
>>> Did you try and run into problems?
>>>
>>>>
>>>>>
>>>>>> We also realized
>>>>>> that with an intrinsic we can also use SLP to vectorize the
>>>>>> unrolled pattern if required. We then split the SAD intrinsic
>>>>>> operation into two intrinsics, which can be used to generate the
>>>>>> "SAD" operation as well as in different contexts. For example
>>>>>> "hsum", as mentioned by Bruno, can be used in the context of a
>>>>>> POPCNT operation (see the sketch below).
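>>>>>>
>>>>>> A minimal C sketch of that POPCNT use (illustrative only; the
>>>>>> per-byte popcount uses the usual SWAR bit tricks, and the final
>>>>>> accumulation is the horizontal-sum step an hsum intrinsic would
>>>>>> cover):
>>>>>>
>>>>>> #include <stdint.h>
>>>>>>
>>>>>> unsigned popcnt16(const uint8_t *p) {
>>>>>>   unsigned sum = 0;
>>>>>>   for (int i = 0; i < 16; i++) {
>>>>>>     uint8_t x = p[i];
>>>>>>     x = x - ((x >> 1) & 0x55);          /* per-byte popcount */
>>>>>>     x = (x & 0x33) + ((x >> 2) & 0x33);
>>>>>>     x = (x + (x >> 4)) & 0x0F;
>>>>>>     sum += x;                           /* horizontal sum of bytes */
>>>>>>   }
>>>>>>   return sum;
>>>>>> }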
>>>>>>
>>>>>>> No; vectorization is not special in that regard, and needs to
>>>>>>> be semantics preserving. We have 'fast math' flags for
>>>>>>> floating-point operations, which help there, but there are no
>>>>>>> such flags for integer operations.
>>>>>> Do we need an ordering definition for the integer ADD operation?
>>>>>>
>>>>>>> Do the underlying hardware operations on x86 (etc.) really not
>>>>>>> define an ordering for the horizontal adds?
>>>>>> X86 *PSAD*, which operates on integer types and does the
>>>>>> horizontal add on the intermediate result, does not define an
>>>>>> ordering.
>>>>>
>>>>> Do you mean that the intermediate accumulator has a larger
>>>>> number of bits than each individual vector element?
>>>>
>>>> No, by intermediate result I meant the result of the "Absolute
>>>> Difference" operation, whose width is the same as that of an
>>>> individual vector element.
>>>
>>> Alright, looking at the documentation for PSADBW, I believe I see
>>> the source of our confusion. PSADBW is defined to "Computes the
>>> absolute value of the difference of 8 unsigned byte integers from
>>> the source operand (first operand) and from the destination operand
>>> (second operand). These 8 differences are then summed to produce an
>>> unsigned word integer result that is stored in the destination
>>> operand."
>>>
>>> So, this, fundamentally, is an *unsigned* i8 operation. If this is
>>> what you're trying to model, we should remove from our discussion
>>> the topic of nsw-related semantics here. nuw also does not apply.
>>> Wrapping of all kinds is perfectly well defined. Given that
>>> wrapping is well defined, and that unsigned integer addition is
>>> associative (when we don't have wrapping restrictions), we also
>>> don't need to discuss the order of the additions either (they're
>>> irrelevant; all orderings would be permitted under 'as if'
>>> regardless of what we specified).
>>>
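>>> As a scalar model, one 8-byte PSADBW lane then behaves like the
>>> following (an illustrative C sketch, not part of the patch):
>>>
>>> #include <stdint.h>
>>>
>>> uint16_t psadbw_lane(const uint8_t a[8], const uint8_t b[8]) {
>>>   uint16_t sum = 0; /* the unsigned word result */
>>>   for (int i = 0; i < 8; i++)
>>>     /* unsigned byte absolute difference; the total is at most
>>>        8 * 255, so the 16-bit word cannot wrap, and since unsigned
>>>        addition is associative, any summation order gives the same
>>>        result */
>>>     sum += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
>>>   return sum;
>>> }
>>>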
>>> -Hal
>>>
>>>>
>>>>>
>>>>> -Hal
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Shahid
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Hal Finkel [mailto:hfinkel at anl.gov]
>>>>>>> Sent: Wednesday, September 30, 2015 3:42 AM
>>>>>>> To: Shahid, Asghar-ahmad
>>>>>>> Cc: llvm-commits at lists.llvm.org;
>>>>>>> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org;
>>>>>>> renato golin; james molloy
>>>>>>> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
>>>>>>> and corresponding SDNodes for horizontal sum operation.
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Asghar-ahmad via llvm-commits Shahid"
>>>>>>>> <llvm-commits at lists.llvm.org>
>>>>>>>> To:
>>>>>>>> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
>>>>>>>> "renato
>>>>>>> golin" <renato.golin at linaro.org>, "james molloy"
>>>>>>>> <james.molloy at arm.com>
>>>>>>>> Cc: llvm-commits at lists.llvm.org
>>>>>>>> Sent: Sunday, September 27, 2015 11:42:46 PM
>>>>>>>> Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics
>>>>>>>> 'hsum*'
>>>>>>>> and
>>>>>>> corresponding SDNodes for horizontal sum operation.
>>>>>>>>
>>>>>>>> Hi Hal,
>>>>>>>>
>>>>>>>> Response inlined.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Shahid
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
>>>>>>>>> Sent: Saturday, September 26, 2015 2:58 AM
>>>>>>>>> To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
>>>>>>>>> hfinkel at anl.gov; james.molloy at arm.com
>>>>>>>>> Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com;
>>>>>>>>> llvm-commits at lists.llvm.org
>>>>>>>>> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
>>>>>>>>> and corresponding SDNodes for horizontal sum operation.
>>>>>>>>>
>>>>>>>>> hfinkel added a comment.
>>>>>>>>>
>>>>>>>>> I need to take a step back here; why are we doing this
>>>>>>>>> again?
>>>>>>>>>
>>>>>>>>> I read again the RFC thread
>>>>>>>>> (http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html),
>>>>>>>>> and it ended with the following (from Renato):
>>>>>>>>>
>>>>>>>>>>> BTW, now my plan is to just add the two intrinsics for
>>>>>>>>>>> 'absolute difference' and 'horizontal add'.
>>>>>>>>>
>>>>>>>>>> That's ok, as long as they're impossible to represent in
>>>>>>>>>> plain IR.
>>>>>>>>>
>>>>>>>>> and I think that we were all in agreement on this point. But
>>>>>>>>> now I'm not sure you've demonstrated the prerequisite.
>>>>>>>> Prior to this RFC there was a discussion around a patch
>>>>>>>> (http://reviews.llvm.org/D9029) regarding an llvm.sad
>>>>>>>> intrinsic, which highlighted the need to have llvm.hsum and
>>>>>>>> friends as intrinsics. This RFC was a result of that
>>>>>>>> discussion.
>>>>>>>>
>>>>>>>>> The underlying operations here (and in
>>>>>>>>> http://reviews.llvm.org/D10867) seem like they are
>>>>>>>>> representable using IR (as demonstrated by the fact that you
>>>>>>>>> provide potentially-equivalent IR sequences in the
>>>>>>>>> documentation), except for the ordering freedom here.
>>>>>>>>
>>>>>>>> Of course the logical expansion of the underlying operation is
>>>>>>>> trivially representable, but the pattern matching in codegen
>>>>>>>> is not. The pattern matching is complex and may well be
>>>>>>>> fragile in codegen due to the long chain of generic IR to be
>>>>>>>> matched. Moreover, movement of truncate/extend ops in the DAG
>>>>>>>> further increases the complexity of matching.
>>>>>>>
>>>>>>> It might, but it is not clearly bad enough to require an
>>>>>>> alternative solution.
>>>>>>>
>>>>>>> In my experience, the patterns that get difficult to match at
>>>>>>> the SDAG level are those that consist, in whole or in part, of
>>>>>>> operations that are illegal (either involving illegal types or
>>>>>>> illegal operations). This becomes difficult because you either
>>>>>>> need to match the post-legalized nodes (which might involve
>>>>>>> undoing scalarization, etc.) or you need to match the nodes
>>>>>>> pre-legalization (where you cannot depend on the
>>>>>>> canonicalization that restricts the state space later). Here,
>>>>>>> however, we seem to be dealing with patterns of otherwise legal
>>>>>>> nodes, on otherwise legal vector types, and while the matching
>>>>>>> might not be trivial, it is not obviously difficult in
>>>>>>> *ISelLowering.cpp either.
>>>>>>>
>>>>>>> The other issue that I've run into with SDAG pattern matching
>>>>>>> is the fact that it is basic-block local, and some of the
>>>>>>> operations you want to match may have been CSE'd or otherwise
>>>>>>> hoisted into dominating blocks. We generally put code in CGP to
>>>>>>> deal with these kinds of things.
>>>>>>>
>>>>>>> One advantage of putting the code into SDAG pattern matching is
>>>>>>> that the patterns get used regardless of where the code
>>>>>>> originates (vectorizers or manually-written vector code). This
>>>>>>> is important.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> And this, I fear, is where we run into trouble. The thing
>>>>>>>>> that is not representable in the IR is that the order of
>>>>>>>>> operations in the horizontal sum is undefined, and at the
>>>>>>>>> same time, signed overflow is undefined. This cannot be
>>>>>>>>> represented in the IR because nsw adds don't reassociate,
>>>>>>>>> and thus, there's no way to represent the sequence of nsw
>>>>>>>>> adds such that they can be reassociated while retaining
>>>>>>>>> their nsw property. But the problem is that, because this
>>>>>>>>> freedom cannot be represented in the IR, we also can't
>>>>>>>>> generate it from IR in a semantics-preserving way; and,
>>>>>>>>> thus, it would not be legal to generate it in the
>>>>>>>>> vectorizers.
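>>>>>>>>>
>>>>>>>>> For instance (an illustrative C sketch, not from the patch):
>>>>>>>>>
>>>>>>>>> #include <limits.h>
>>>>>>>>>
>>>>>>>>> /* The same three operands overflow under one association but
>>>>>>>>>    not under the other, so a chain of nsw adds cannot be
>>>>>>>>>    reassociated while keeping its nsw flags. The arithmetic
>>>>>>>>>    below is done in long long only so the demo itself stays
>>>>>>>>>    well defined. */
>>>>>>>>> int nsw_reassoc_differs(void) {
>>>>>>>>>   long long a = INT_MAX, b = 1, c = -1;
>>>>>>>>>   int left = (a + b) > INT_MAX;  /* (a+b)+c: inner add overflows int */
>>>>>>>>>   int right = (b + c) > INT_MAX; /* a+(b+c): no overflow anywhere */
>>>>>>>>>   return left != right;          /* returns 1 */
>>>>>>>>> }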
>>>>>>>>
>>>>>>>> As proposed, we are not going to generate it unconditionally
>>>>>>>> in the vectorizer but only based on target checks; can't we
>>>>>>>> also check for ordering relaxation there, or fix the ordering
>>>>>>>> constraint by some other means?
>>>>>>>
>>>>>>> No; vectorization is not special in that regard, and needs to
>>>>>>> be semantics preserving. We have 'fast math' flags for
>>>>>>> floating-point operations, which help there, but there are no
>>>>>>> such flags for integer operations.
>>>>>>>
>>>>>>> Do the underlying hardware operations on x86 (etc.) really not
>>>>>>> define an ordering for the horizontal adds?
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thus, this change does not seem right, and approving
>>>>>>>>> http://reviews.llvm.org/D10867 seems like a mistake as well.
>>>>>>>>> We could certainly fix the definition here to make it exactly
>>>>>>>>> representable in the IR, but then what's the point of doing
>>>>>>>>> so?
>>>>>>>>>
>>>>>>>>> In the RFC, you mentioned cost modeling as a major motivation
>>>>>>>>> for adding intrinsics, but that seems like an unrelated
>>>>>>>>> problem (at least in part).
>>>>>>>>
>>>>>>>> In the RFC, I had also proposed an llvm.sad intrinsic, which I
>>>>>>>> thought would make cost modeling easier for those targets that
>>>>>>>> support the SAD operation natively; hence the motivation. But
>>>>>>>> Renato's explanation clarified the things related to cost
>>>>>>>> modeling. However, the complexity of pattern matching was
>>>>>>>> always the bigger issue.
>>>>>>>
>>>>>>> Perhaps, but it is localized. And matching adds and subtracts
>>>>>>> plus shuffles does not seem like it should be unusually
>>>>>>> painful.
>>>>>>>
>>>>>>>>
>>>>>>>>> During vectorization, we can use special interfaces to
>>>>>>>>> estimate the cost of complex patterns. In fact, we already
>>>>>>>>> have an interface for reductions: TTI.getReductionCost. There
>>>>>>>>> is a second relevant cost model: that used by the unroller
>>>>>>>>> and inliner. Vectorization happens after inlining, so that
>>>>>>>>> interaction is not really relevant, but partial unrolling
>>>>>>>>> happens after vectorization, and so the cost model there
>>>>>>>>> might want to understand that a complex sequence of shuffles,
>>>>>>>>> extracts and adds has a disproportionately low cost. The same
>>>>>>>>> is true of the inliner if the input IR uses vector types and
>>>>>>>>> initially contains these operations, but even in that case,
>>>>>>>>> you'd not want to canonicalize on the intrinsics too early in
>>>>>>>>> case other optimizations remove the need for most of the
>>>>>>>>> operations. Thus, in the end, you need pattern-matching code
>>>>>>>>> near the end of the pipeline anyway to account for input IR
>>>>>>>>> directly containing relevant operations on vector types.
>>>>>>>>>
>>>>>>>>> In short, I don't understand why we're going in this
>>>>>>>>> direction. You can match these operations in the backend, and
>>>>>>>>> you can match them in the cost-model code. If we can't do the
>>>>>>>>> latter, then we should fix that. And we already have special
>>>>>>>>> interfaces for vectorization costs for complex operations
>>>>>>>>> such as this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://reviews.llvm.org/D10964
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> llvm-commits mailing list
>>>>>>>> llvm-commits at lists.llvm.org
>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>>>>>
>>>>>>> --
>>>>>>> Hal Finkel
>>>>>>> Assistant Computational Scientist
>>>>>>> Leadership Computing Facility
>>>>>>> Argonne National Laboratory
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Assistant Computational Scientist
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
