[llvm-dev] llvm-dev Digest, Vol 144, Issue 93

huyite via llvm-dev llvm-dev at lists.llvm.org
Sat Jun 18 10:18:03 PDT 2016


Dear all,
      Please help me generate a DFG (data flow graph) in LLVM.
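[Editorial note: as far as I know, LLVM has no built-in DFG printer analogous to `-dot-cfg`; the usual approach is to walk each instruction's operands (or `users()`) and emit the def-use edges yourself. Below is a self-contained toy sketch of that traversal, emitting Graphviz DOT. `Inst` is a hypothetical stand-in for `llvm::Instruction`, not the real class.]

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for LLVM instructions: each value names its operands.
// In a real pass you would iterate over every llvm::Instruction and,
// for each operand that is itself an Instruction, emit an edge
// operand -> instruction; that edge set *is* the data-flow graph.
struct Inst {
    std::string name;                  // e.g. "%add"
    std::vector<std::string> operands; // names of the values it reads
};

// Emit the DFG in Graphviz DOT form, one edge per def->use pair.
std::string dumpDFG(const std::vector<Inst> &body) {
    std::ostringstream os;
    os << "digraph DFG {\n";
    for (const Inst &I : body)
        for (const std::string &Op : I.operands)
            os << "  \"" << Op << "\" -> \"" << I.name << "\";\n";
    os << "}\n";
    return os.str();
}
```

For `%add = add %a, %b` followed by `%mul = mul %add, %a`, this prints edges %a->%add, %b->%add, %add->%mul, and %a->%mul, which `dot` can render directly.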

Best regards,
    Huy

Mobile: +84942976091

-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of via llvm-dev
Sent: Friday, June 17, 2016 12:47 AM
To: llvm-dev at lists.llvm.org
Subject: llvm-dev Digest, Vol 144, Issue 93

Send llvm-dev mailing list submissions to
	llvm-dev at lists.llvm.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
or, via email, send a message with subject or body 'help' to
	llvm-dev-request at lists.llvm.org

You can reach the person managing the list at
	llvm-dev-owner at lists.llvm.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of llvm-dev digest..."


Today's Topics:

   1. Re: [RFC] Allow loop vectorizer to choose vector widths that
      generate illegal types (Nadav Rotem via llvm-dev)
   2. Re: Intended behavior of CGSCC pass manager.
      (Xinliang David Li via llvm-dev)
   3. Re: [RFC] Allow loop vectorizer to choose vector widths that
      generate illegal types (Hal Finkel via llvm-dev)
   4. Re: parallel-lib: New LLVM Subproject (Jason Henline via llvm-dev)


----------------------------------------------------------------------

Message: 1
Date: Thu, 16 Jun 2016 17:23:41 +0000 (GMT)
From: Nadav Rotem via llvm-dev <llvm-dev at lists.llvm.org>
To: Michael Kuperstein <mkuper at google.com>
Cc: Cong Hou <congh at google.com>, Matthew Simpson
	<mssimpso at codeaurora.org>, Llvm Dev <llvm-dev at lists.llvm.org>, David
	Li <davidxl at google.com>, Wei Mi <wmi at google.com>
Subject: Re: [llvm-dev] [RFC] Allow loop vectorizer to choose vector
	widths that generate illegal types
Message-ID: <c663eb92-7b1f-448d-b9da-cc3354e504dc at me.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"



On Jun 16, 2016, at 09:09 AM, Michael Kuperstein <mkuper at google.com> wrote:

Thanks, Ayal!

On Thu, Jun 16, 2016 at 7:15 AM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
Some thoughts:
 
o To determine the VF for a loop with mixed data sizes, choosing the smallest ensures each vector register used is full, choosing the largest will minimize the number of vector registers used. Which one’s better, or some size in between, depends on the target’s costs for the vector operations, availability of registers and possibly control/memory divergence and trip count. “This is a question of cost modeling” and its associated compile-time, but in general good vectorization of loops with mixed data sizes is expected to be important, especially when larger scopes are vectorized. BTW, SLP followed this a year ago: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150706/286110.html
 

Yes, I agree completely.
The approach we have right now is that availability of registers is a hard upper bound, and I'm not planning on changing that at the moment (e.g., by modeling spill cost).

I'm not too worried about the compile time impact. I haven't measured it yet, but one thing that may mitigate this is the fact that postponing interleaving until the legalizer will result in smaller IR coming out of the vectorizer. So the increased compile-time cost of the TTI queries may be offset by the decreased amount of work for post-vectorizer IR passes and pre-legalization ISel. Anyway, this is all idle talk right now, as you and Nadav said, it needs to be measured.
 

o As for increasing VF beyond maximize-bandwidth, one could argue that a vectorizer should focus on tapping the SIMD capabilities of the target, up to maximize-bandwidth, and that its vectorized loop should later be subject to a separate independent unroller/interleaver pass. One suggestion, regardless, is to use the term “unroll-and-jam”, which traditionally applies to loops containing control-flow and nested loops but is quite clear for innermost loops too, instead of the overloaded term “interleaving”. Admittedly loop vectorization conceptually applies unroll-and-jam followed by packetization into vectors, so why unroll-and-jam twice. As noted, the considerations for best unroll factor are different from those of best VF for optimal usage of SIMD capabilities. Indeed representing in LLVM-IR a loop with vectors longer than maximize-bandwidth looks more appealing than replicating its ‘legal’ vectors, easier produced by the vectorizer than by an unroll-and-jam pass. BTW, taken to the extreme, one could vectorize to the full trip count of the loop, as in http://impact.crhc.illinois.edu/shared/Papers/tr2014.mxpa.pdf, where memory spatial locality is deemed more important to optimize than register usage.
 

Experimenting with increasing the VF search space beyond the size of the machine vector for loops with mixed types makes sense to me.

In a separate thread you mentioned that the cost model is inaccurate for some cross-vector trunc/exts. It should be easy to improve the quality of the estimation for the common operations. It should just be a matter of plugging a new value in the cost table that overrides the legalization logic in the cost model. 
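[Editorial note: the "plug a new value into the cost table" mechanism can be sketched as below. This is a toy model, not the actual `CostTblEntry`/`CostTableLookup` machinery in TTI; all opcodes and cost numbers are made up for illustration.]

```cpp
#include <cassert>

// Toy cost-table entry, loosely modeled on LLVM's CostTblEntry: an
// operation identifier mapped to a hand-tuned cost. A hit in the table
// overrides whatever the generic legalization-based logic would compute.
enum Op { TruncV8I64ToV8I32, SExtV8I32ToV8I64, ZExtV4I8ToV4I64 };

struct CostEntry { Op op; unsigned cost; };

static const CostEntry Table[] = {
    {TruncV8I64ToV8I32, 4}, // hypothetical tuned value
    {SExtV8I32ToV8I64, 3},  // hypothetical tuned value
};

// Fallback mimicking a crude legalization estimate: assume the type is
// split in half `numSplits` times, paying one op per resulting piece --
// often a wild overestimate for real shuffle-based truncates.
unsigned legalizationEstimate(unsigned numSplits) { return 1u << numSplits; }

unsigned getCastCost(Op op, unsigned numSplits) {
    for (const CostEntry &E : Table)
        if (E.op == op)
            return E.cost;                   // table hit: tuned cost wins
    return legalizationEstimate(numSplits);  // miss: generic logic
}
```

The point of the design is that improving a single common operation requires only one new table row, with no change to the generic legalization logic.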

 

"Why unroll-and-jam twice" is precisely the motivation behind increasing VF beyond maximize-bandwidth. :-) Both getting good code from the legalizer and getting good cost modeling for illegal types are required to increase the VF up to the factor implied by the smallest scalar type. And if that works out, then going beyond maximize-bandwidth seems like it should require fairly little additional work. I think once we go beyond maximize-bandwidth, and assume the legalizer will split things back up, the considerations for the best unroll factor and the best VF become essentially the same, since increasing the VF, in effect, increases the unroll factor.

It's possible that we'll need two different cost estimates, one up to max-bandwidth, and one beyond max-bandwidth - and in this case, I'm not sure the exercise is worthwhile.
In any case, I mostly see this as a bonus; what's really important to me is getting maximize-bandwidth to work well.

As to the terminology - I agree "unroll-and-jam" is the correct technical term, but it's not currently used in the vectorizer, and I wanted to keep the terminology here consistent with the code.


Ayal.
 
From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Thursday, June 16, 2016 10:42
To: Nadav Rotem <nadav.rotem at me.com>
Cc: Hal Finkel <hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>; Adam Nemet <anemet at apple.com>; Sanjoy Das <sanjoy at playingwithpointers.com>; James Molloy <james.molloy at arm.com>; Matthew Simpson <mssimpso at codeaurora.org>; Sanjay Patel <spatel at rotateright.com>; Chandler Carruth <chandlerc at google.com>; David Li <davidxl at google.com>; Wei Mi <wmi at google.com>; Dehao Chen <dehao at google.com>; Cong Hou <congh at google.com>; Llvm Dev <llvm-dev at lists.llvm.org>
Subject: Re: [RFC] Allow loop vectorizer to choose vector widths that generate illegal types
 
Hi Nadav,
Thanks a lot for the feedback!
 
Of course we need to explore this with numbers. Not just in terms of the performance vs. compile-time, but in general in terms of the performance benefit. For now, I'm just trying to get a feel for whether people think this sounds like a reasonable idea. As I wrote in the original email, we already have this under a flag (it was added by Cong last year). But it will be hard to get reliable performance numbers without first having the cost model provide better-quality answers at the higher vectorization factors. 
 
I didn't mean that we should be duplicating every optimization the SelectionDAG makes. Of course the cost model is only a rough approximation. What I do want the (generic) cost model to do, however, is provide a more-or-less precise approximation of legalization costs. To be concrete, http://reviews.llvm.org/D21251 is a first step in that direction. Do you think this is something the cost model should not be doing?
 
Regarding loop widening - see my email to Dibyendu for what I meant. For mixed-type loops, it really depends. Let's say you have a mixed-type loop, with i32 and i64, and 256-bit registers. Would the extra parallelism you get from vectorizing by 4 and interleaving be worth the throughput loss you suffer from not vectorizing the i32 operations by 8? It seems like this would depend heavily on the specific loop, and the proportion of i32 and i64 instructions. This is exactly the question I'd like to get the cost model to answer. Do you think this is not feasible? It shouldn't (I hope :-) ) require modeling every possible shuffle.
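[Editorial note: the i32/i64 trade-off above can be made concrete with a toy op-count model under stated assumptions: 256-bit registers, and a vector op on an illegal type legalizes into one machine op per 256-bit piece. All numbers are illustrative, not a real cost model.]

```cpp
#include <cassert>

// For a 256-bit target: an N-lane op on elemBits-wide elements needs
// ceil(N * elemBits / 256) machine instructions once legalized
// (at least one op even if the vector is under-full).
unsigned machineOps(unsigned vf, unsigned elemBits) {
    return (vf * elemBits + 255) / 256;
}

// Machine ops needed to retire `iters` scalar iterations of a loop body
// containing n32 i32-ops and n64 i64-ops, vectorized at factor `vf`.
unsigned opsFor(unsigned iters, unsigned vf, unsigned n32, unsigned n64) {
    unsigned perVecIter = n32 * machineOps(vf, 32) + n64 * machineOps(vf, 64);
    return (iters / vf) * perVecIter;
}
```

For 8 iterations of a loop with three i32 ops and one i64 op: VF=4 costs 2 x (3+1) = 8 machine ops, while VF=8 costs 1 x (3+2) = 5. Counting ops this way always favors the larger VF; the costs that can tip the balance back, cross-register extend/truncate shuffles and register pressure, are exactly the ones the thread notes are modeled poorly.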
 
Thanks,
  Michael
 
On Wed, Jun 15, 2016 at 11:24 PM, Nadav Rotem <nadav.rotem at me.com> wrote:
Hi Michael, 
 
Thank you for working on this. The loop vectorizer tries a bunch of different vectorization factors and stops at the widest word size mostly because of compile time concerns. For every vectorization factor that we check, we have to scan all of the instructions in the loop and make multiple calls into TTI. If you decide to increase the VF enumeration space then you will linearly increase the compile time of the loop vectorizer. I think that it would be a good idea to explore this compile-time vs performance tradeoff with numbers.
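[Editorial note: schematically, the enumeration Nadav describes looks like the sketch below. This is a toy, not the real LoopVectorizationCostModel; `ttiCost` stands in for the TTI queries. The work is (number of candidate VFs) x (loop size), hence the linear compile-time growth.]

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct Candidate { unsigned vf; float costPerLane; };

// For each candidate VF, rescan every instruction in the loop body and
// sum the cost-model answers; pick the VF with the best per-scalar-
// iteration cost. Each extra candidate VF costs one more full scan.
Candidate pickVF(const std::vector<unsigned> &instrs, // toy opcode ids
                 const std::vector<unsigned> &vfs,
                 const std::function<unsigned(unsigned, unsigned)> &ttiCost) {
    Candidate best{1, 1e9f};
    for (unsigned vf : vfs) {              // one pass per candidate VF...
        unsigned total = 0;
        for (unsigned op : instrs)         // ...over the whole loop body
            total += ttiCost(op, vf);
        float perLane = float(total) / vf; // normalize per scalar iteration
        if (perLane < best.costPerLane)
            best = {vf, perLane};
    }
    return best;
}
```

With a toy cost function that charges extra for wide (illegal) vectors, the search correctly stops at the widest cheap factor; a bad cost function, conversely, would pick a bad VF, which is why the cost model is the gating concern.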
  
The cost model is designed to be a fast approximation of SelectionDAG. We don't want to duplicate every optimization in SelectionDAG into the cost model because this would make the cost model (and the optimizer) difficult to maintain. If the cost model does not represent an operation that you care about then you should add it to the cost tables. 
 
I don't understand how selecting wide vectors would eliminate the need for loop widening. Loop widening serves to break data dependencies and allow more parallelism. If you have two independent arithmetic operations, they can go into different execution units, or into pipelined execution units. Your mixed-type loops would cause shuffles across registers (which we can't model well in the cost model, for obvious reasons) that pack multiple lanes into a smaller vector, and this would introduce a data dependency. 
 
Maybe you should start by increasing the enumeration space (by 2X, for example) under a flag and see if you get any performance gains. 
 
-Nadav

On Jun 15, 2016, at 03:48 PM, Michael Kuperstein <mkuper at google.com> wrote:

Hello,

Currently the loop vectorizer will, by default, not consider vectorization factors that would make it generate types that do not fit into the target platform's vector registers. That is, if the widest scalar type in the scalar loop is i64, and the platform's largest vector register is 256-bit wide, we will not consider a VF above 4.

We have a command line option (-mllvm -vectorizer-maximize-bandwidth), that will choose VFs for consideration based on the narrowest scalar type instead of the widest one, but I don't believe it has been widely tested. If anyone has had an opportunity to play around with it, I'd love to hear about the results.
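[Editorial note: the two VF caps being contrasted, widest-type by default versus narrowest-type under -vectorizer-maximize-bandwidth, reduce to simple arithmetic. A sketch, using the register width and type sizes from the example above:]

```cpp
#include <cassert>

// Default rule: cap the VF so that VF * widestScalarBits still fits in
// one vector register.
unsigned maxVFByWidest(unsigned regBits, unsigned widestScalarBits) {
    return regBits / widestScalarBits;
}

// -vectorizer-maximize-bandwidth: cap by the *narrowest* scalar type
// instead, letting ops on wider types legalize into multiple registers.
unsigned maxVFByNarrowest(unsigned regBits, unsigned narrowestScalarBits) {
    return regBits / narrowestScalarBits;
}
```

For a loop mixing i32 and i64 on a 256-bit target, the default cap is VF=4, while maximize-bandwidth allows VF=8, with each i64 vector op legalized into two 256-bit ops.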

What I'd like to do is:
Step 1: Make -vectorizer-maximize-bandwidth the default. This should improve the performance of loops that contain mixed-width types.
Step 2: Remove the artificial width limitation altogether, and base the vectorization factor decision purely on the cost model. This should allow us to get rid of the interleaving code in the loop vectorizer, and get interleaving for "free" from the legalizer instead.
 
There are two potential road-blocks I see - the cost-model, and the legalizer. To make this work, we need to:
a) Model the cost of operations on illegal types better. Right now, what we get is sometimes completely ridiculous (e.g. see http://reviews.llvm.org/D21251).
b) Make sure the cost model actually stops us when the VF becomes too large. This is mostly a question of correctly estimating the register pressure. In theory, that should not be an issue - we already rely on this estimate to choose the interleaving factor, so using the same logic to upper-bound the VF directly shouldn't make things worse.
c) Ensure the legalizer is up to the task of emitting good code for overly wide vectors. I've talked about this with Chandler, and his opinion (Chandler, please correct me if I'm wrong) is that on x86, the legalizer is likely to be able to handle this. This may not be true for other platforms. So, I'd like to try to make this the default on a platform-by-platform basis, starting with x86.
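[Editorial note: the register-pressure bound in point (b) can be sketched with illustrative arithmetic. All numbers below are hypothetical; the vectorizer's real estimate is more involved.]

```cpp
#include <cassert>

// If `liveValues` vector values are simultaneously live, each occupying
// ceil(vf * elemBits / regBits) physical registers once legalized, the
// VF must keep the total within the register file.
unsigned regsUsed(unsigned liveValues, unsigned vf, unsigned elemBits,
                  unsigned regBits) {
    unsigned perValue = (vf * elemBits + regBits - 1) / regBits;
    return liveValues * perValue;
}

bool vfFits(unsigned liveValues, unsigned vf, unsigned elemBits,
            unsigned regBits, unsigned numRegs) {
    return regsUsed(liveValues, vf, elemBits, regBits) <= numRegs;
}
```

With 16 registers of 256 bits and 6 live i64 values, VF=8 uses 12 registers and fits, while VF=32 would need 48 and should be rejected; this is the "cost model stops us" backstop.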
 
What do you think? Does this seem like a step in the right direction? Anything important I'm missing?
 
Thanks,
  Michael
 
---------------------------------------------------------------------
Intel Israel (74) Limited
This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.


------------------------------

Message: 2
Date: Thu, 16 Jun 2016 10:45:50 -0700
From: Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org>
To: Hal Finkel <hfinkel at anl.gov>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] Intended behavior of CGSCC pass manager.
Message-ID:
	<CAAkRFZ+Sx2amWWJDP801BGzKLuSt8Bz+K54QWmU2gx1HoZXwkg at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

> To clarify, we're trying to provide this invariant on the "ref" graph 
> or on the graph with direct calls only? I think the invariant need 
> only apply to the former
>

More clarification needed :) What do you mean by 'invariant need only apply to the former'?



> if we're relying on this for correctness (i.e. an analysis must visit 
> all callees before visiting the callers).
>


Not necessarily. Due to lost edges (from callers to indirect callees), a callee node may be visited later. The analysis will just have to punt when a special edge to the 'external' node is seen.

David



>
>  -Hal
>
> Consider the pipeline `cgscc(function(...simplifications that can 
> devirtualize...),foo-cgscc-pass)`. A possible visitation is as follows:
>
> 1. Visit SCC {S,T} and run `function(...simplifications that can 
> devirtualize...)`. This reveals the call edge T->Y.
> 2. We continue visiting SCC {S,T} and run foo-cgscc-pass on SCC {S,T}.
> 3. Visit SCC {X,Y} and run `function(...simplifications that can 
> devirtualize...)`. This reveals the call edge X->S.
> 4. ??? what do we do now.
> Alternative 4.a) Should we continue the visitation and call 
> foo-cgscc-pass on "SCC" {X,Y}?
> Alternative 4.b) Should foo-cgscc-pass now run on SCC {S,T,X,Y}?
> Alternative 4.c) Should we restart the entire outer `cgscc(...)` 
> visitation on SCC {S,T,X,Y}?
>
> (Without a cap both 4.b and 4.c could become quadratic on a graph like
> http://reviews.llvm.org/F2073607)
>
> -- Sean Silva
>
>
>>
>> thanks,
>>
>> David
>>
>>
>>
>>
>>> Sean:~/pg/llvm % git grep 'public CallGraphSCCPass'
>>> include/llvm/Transforms/IPO/InlinerPass.h:struct Inliner : public CallGraphSCCPass {
>>> lib/Transforms/IPO/ArgumentPromotion.cpp:  struct ArgPromotion : public CallGraphSCCPass {
>>> lib/Transforms/IPO/FunctionAttrs.cpp:struct PostOrderFunctionAttrsLegacyPass : public CallGraphSCCPass {
>>> lib/Transforms/IPO/PruneEH.cpp:  struct PruneEH : public CallGraphSCCPass {
>>> lib/Analysis/CallGraphSCCPass.cpp:  class PrintCallGraphPass : public CallGraphSCCPass {
>>> tools/opt/PassPrinters.cpp:struct CallGraphSCCPassPrinter : public CallGraphSCCPass {
>>>
>>> CGSCC passes seem to have been added in what is now SVN r8247 (~Aug 2003):
>>> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20030825/006619.html
>>> (LLVM appears to have been in CVS at the time).
>>>
>>> Chris, do you remember the motivation for doing the CGSCC visitation 
>>> instead of a pure post-order function visitation like David is mentioning?
>>> (or was it just an oversight / hindsight-20-20 thing?) Do you think 
>>> it would make sense to replace CGSCC visitation with post-order 
>>> function visitation in the current LLVM?
>>>
>>> -- Sean Silva
>>>
>>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>

------------------------------

Message: 3
Date: Thu, 16 Jun 2016 12:48:27 -0500
From: Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org>
To: Michael Kuperstein <mkuper at google.com>
Cc: Matthew Simpson <mssimpso at codeaurora.org>, Cong Hou
	<congh at google.com>, Llvm Dev <llvm-dev at lists.llvm.org>, David Li
	<davidxl at google.com>, Wei Mi <wmi at google.com>
Subject: Re: [llvm-dev] [RFC] Allow loop vectorizer to choose vector
	widths that generate illegal types
Message-ID:
	<4515713.853.1466099299868.JavaMail.hfinkel at sapling5.localdomain>
Content-Type: text/plain; charset="utf-8"

----- Original Message -----

> From: "Michael Kuperstein" <mkuper at google.com>
> To: "Ayal Zaks" <ayal.zaks at intel.com>
> Cc: "Nadav Rotem" <nadav.rotem at me.com>, "Hal Finkel"
> <hfinkel at anl.gov>, "Elena Demikhovsky"
> <elena.demikhovsky at intel.com>, "Adam Nemet" <anemet at apple.com>, 
> "Sanjoy Das" <sanjoy at playingwithpointers.com>, "James Molloy"
> <james.molloy at arm.com>, "Matthew Simpson" <mssimpso at codeaurora.org>, 
> "Sanjay Patel" <spatel at rotateright.com>, "Chandler Carruth"
> <chandlerc at google.com>, "David Li" <davidxl at google.com>, "Wei Mi"
> <wmi at google.com>, "Dehao Chen" <dehao at google.com>, "Cong Hou"
> <congh at google.com>, "Llvm Dev" <llvm-dev at lists.llvm.org>
> Sent: Thursday, June 16, 2016 11:09:09 AM
> Subject: Re: [RFC] Allow loop vectorizer to choose vector widths that 
> generate illegal types

> Thanks, Ayal!

> On Thu, Jun 16, 2016 at 7:15 AM, Zaks, Ayal < ayal.zaks at intel.com >
> wrote:

> > Some thoughts:
> 

> > o To determine the VF for a loop with mixed data sizes, choosing the 
> > smallest ensures each vector register used is full, choosing the 
> > largest will minimize the number of vector registers used. Which 
> > one’s better, or some size in between, depends on the target’s costs 
> > for the vector operations, availability of registers and possibly 
> > control/memory divergence and trip count. “This is a question of 
> > cost modeling” and its associated compile-time, but in general good 
> > vectorization of loops with mixed data sizes is expected to be 
> > important, especially when larger scopes are vectorized. BTW, SLP 
> > followed this a year ago:
> > http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150706/286110.html
> 

> Yes, I agree completely.
> The approach we have right now is that availability of registers is a 
> hard upper bound, and I'm not planning on changing that (e.g. by 
> modeling spill cost.) at the moment.
We chatted about this on IRC, but I'll add here that I'm in favor of this. I think that it will put (useful) pressure on improving the cost model, our register-pressure heuristics, and the quality of our legalization code. In the end, I think that the representation makes sense. 

> I'm not too worried about the compile time impact. I haven't measured 
> it yet, but one thing that may mitigate this is the fact that 
> postponing interleaving until the legalizer will result in smaller IR 
> coming out of the vectorizer. So the increased compile-time cost of 
> the TTI queries may be offset by the decreased amount of work for 
> post-vectorizer IR passes and pre-legalization ISel. Anyway, this is 
> all idle talk right now, as you and Nadav said, it needs to be 
> measured.

> > o As for increasing VF beyond maximize-bandwidth, one could argue 
> > that a vectorizer should focus on tapping the SIMD capabilities of 
> > the target, up to maximize-bandwidth, and that its vectorized loop 
> > should later be subject to a separate independent 
> > unroller/interleaver pass. One suggestion, regardless, is to use the 
> > term “unroll-and-jam”, which traditionally applies to loops 
> > containing control-flow and nested loops but is quite clear for 
> > innermost loops too, instead of the overloaded term “interleaving”.
> > Admittedly loop vectorization conceptually applies unroll-and-jam 
> > followed by packetization into vectors, so why unroll-and-jam twice.
> > As noted, the considerations for best unroll factor are different 
> > from those of best VF for optimal usage of SIMD capabilities.
> > Indeed
> > representing in LLVM-IR a loop with vectors longer than 
> > maximize-bandwidth looks more appealing than replicating its ‘legal’
> > vectors, easier produced by the vectorizer than by an unroll-and-jam 
> > pass. BTW, taken to the extreme, one could vectorize to the full 
> > trip count of the loop, as in 
> > http://impact.crhc.illinois.edu/shared/Papers/tr2014.mxpa.pdf , 
> > where memory spatial locality is deemed more important to optimize 
> > than register usage.
> 

> "Why unroll-and-jam twice" is precisely the motivation behind
> increasing VF beyond maximize-bandwidth. :-)
> Both getting good code from the legalizer and getting good cost
> modeling for illegal types are required to increase VF up to
> choosing the smallest scalar type. And if that works out, then going
> beyond maximize-bandwidth seems like it should require fairly little
> additional work. I think once we go beyond maximize-bandwidth, and
> assume the legalizer will split things back up, the consideration
> for the best unroll factor and the best VF becomes essentially the
> same, since increasing the VF, in effect, increases the unroll
> factor.

> It's possible that we'll need two different cost estimates, one up to
> max-bandwidth, and one beyond max-bandwidth - and in this case, I'm
> not sure the exercise is worthwhile.
> In any case, I mostly see this is as a bonus, what's really important
> to me is getting maximize-bandwidth to work well.

> As to the terminology - I agree "unroll-and-jam" is the correct
> technical term, but it's not currently used in the vectorizer, and I
> wanted to keep the terminology here consistent with the code.
Yea, we came up with interleaving to differentiate it from the concatenation unrolling that the unrolling pass performs. If we'd like to rename this to be akin to a jamming factor, for consistency with the literature, I don't object. As I recall, interleaving was the least bad option we discussed at the time ;) 

-Hal 

-- 

Hal Finkel 
Assistant Computational Scientist 
Leadership Computing Facility 
Argonne National Laboratory 

------------------------------

Message: 4
Date: Thu, 16 Jun 2016 17:50:04 +0000
From: Jason Henline via llvm-dev <llvm-dev at lists.llvm.org>
To: Tanya Lattner <tanyalattner at llvm.org>
Cc: llvm-dev <llvm-dev at lists.llvm.org>, "openmp-dev at lists.llvm.org"
	<openmp-dev at lists.llvm.org>
Subject: Re: [llvm-dev] parallel-lib: New LLVM Subproject
Message-ID:
	<CAE0US6+tmkoTT+86hHKCT9ONV0WwyV9vyBPDXrKi3CgUpzcCLw at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Thanks for your help, Tanya!

I haven't created the project in SVN yet. Am I able to set it up myself on
the LLVM servers, or does someone else need to do that part?

I'll be glad to volunteer to moderate the new mailing lists.

We will want a website. I think there will be a top-level docs directory
for the project and a docs directory for each subproject. To begin with,
StreamExecutor will be the only subproject so the structure will look
something like this:

parallel-libs/
  docs/
  stream_executor/
    docs/

Does that seem like a reasonable way to set it up?

On Wed, Jun 15, 2016 at 8:44 PM Tanya Lattner <tanyalattner at llvm.org> wrote:

> Jason,
>
> This sounds good. Have you created the project in SVN yet?
>
> I’ll need to set up the post-commit hook to the mailing lists, plus set up
> the mailing lists themselves. Would anyone want to volunteer to moderate the
> new mailing lists? Will you have docs that need updating, or a website?
>
> -Tanya
>
> On Jun 13, 2016, at 11:01 AM, Jason Henline <jhen at google.com> wrote:
>
> Hi Tanya,
>
> As discussed in the past few weeks in the llvm-dev thread “RFC: Proposing
> an LLVM subproject for parallelism runtime and support libraries”, we would
> like to start a new LLVM subproject called parallel-libs (a kind of
> parallel cousin to compiler-rt), and I was told you were the one to contact
> in order to get it created. The charter for the project is included below.
> Are you able to get this subproject set up?
>
> Thanks for your help,
> -Jason
>
>
> Charter:
> =====================================================
> LLVM parallel-libs Subproject Charter
> =====================================================
>
> ----------------------------------------------
> Description
> ----------------------------------------------
> The LLVM open source project will contain a subproject named
> `parallel-libs` which will host the development of libraries which are
> aimed at enabling parallelism in code and which are also closely tied to
> compiler technology.  Examples of libraries suitable for hosting within the
> `parallel-libs` subproject are runtime libraries and parallel math
> libraries. The initial candidates for inclusion in this subproject are
> StreamExecutor and libomptarget which would live in the `streamexecutor`
> and `libomptarget` subdirectories of `parallel-libs`, respectively.
>
> The `parallel-libs` project will host a collection of libraries where each
> library may be dependent on other libraries from the project or may be
> completely independent of any other libraries in the project. The rationale
> for hosting independent libraries within the same subproject is that all
> libraries in the project are providing related functionality that lives at
> the intersection of parallelism and compiler technology. It is expected
> that some libraries which initially began as independent will develop
> dependencies over time either between existing libraries or by extracting
> common code that can be used by each. One of the purposes of this
> subproject is to provide a working space where such refactoring and code
> sharing can take place.
>
> Libraries in the `parallel-libs` subproject may also depend on the LLVM
> core libraries. This will be useful for avoiding duplication of code within
> the LLVM project for common utilities such as those found in the LLVM
> support library.
>
>
> ----------------------------------------------
> Requirements
> ----------------------------------------------
> Libraries included in the `parallel-libs` subproject must strive to
> achieve the following requirements:
>
> 1. Adhere to the LLVM coding standards.
> 2. Use the LLVM build and test infrastructure.
> 3. Be released under LLVM's license.
>
>
> Coding standards
> ----------------
> Libraries in `parallel-libs` will match the LLVM coding standards. For
> existing projects being checked into the subproject as-is, an exception
> will be made during the initial check-in, with the understanding that the
> code will be promptly updated to follow the standards. Accordingly, a
> three-month grace period will be allowed for new libraries to meet the
> LLVM coding standards.
>
> Additional exceptions to strict adherence to the LLVM coding standards may
> be allowed in certain other cases, but the reasons for such exceptions must
> be discussed and documented on a case-by-case basis.
>
>
> LLVM build and test infrastructure
> ----------------------------------
> Using the LLVM build and test infrastructure currently means using `cmake`
> for building, `lit` for testing, and `buildbot` for automating build and
> testing. This project will follow the main LLVM project conventions here
> and track them as they evolve.
>
> Each subproject library will be able to build separately without a single,
> unified cmake file, but each subproject library will also be integrated
> into the LLVM build so it can be built directly from the top level of the
> LLVM cmake infrastructure.
>
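[Editor's illustration] The two build modes described above might look roughly like the following sketch. The paths, the `llvm/projects/` checkout location, and the `LLVM_DIR` cache variable are illustrative assumptions in the style of other LLVM runtime projects of the time, not settled conventions of this subproject.

```shell
# Standalone build: configure the library directly against an
# installed LLVM, located via its CMake package files.
mkdir build-standalone && cd build-standalone
cmake -G Ninja -DLLVM_DIR=/usr/lib/llvm/lib/cmake/llvm \
      /path/to/parallel-libs/streamexecutor
ninja

# In-tree build: with the subproject checked out under llvm/projects/,
# configuring the top-level LLVM tree picks it up automatically and it
# is built alongside everything else.
mkdir ../build-llvm && cd ../build-llvm
cmake -G Ninja /path/to/llvm
ninja
```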
>
> LLVM license
> ------------
> For simplicity, the `parallel-libs` project will use the normal LLVM
> license. While some runtime libraries in LLVM use a dual license scheme, we
> anticipate that the project will eventually remove the need for this and
> will, in the interim, follow the simpler but still permissive license.
> Among other things, this makes it straightforward for these libraries to
> re-use core LLVM libraries where appropriate.
>
>
> ----------------------------------------------
> Mailing List and Bugs
> ----------------------------------------------
> Two mailing lists will be set up for the project:
>
> 1. parallel_libs-dev at lists.llvm.org for discussions among project
> developers, and
> 2. parallel_libs-commits at lists.llvm.org for patches and commits to the
> project.
>
> Each subproject library will manage its own components in Bugzilla. So,
> for example, there can be several Bugzilla components for different parts
> of StreamExecutor, etc.
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160616/31d88979/attachment.html>

------------------------------

Subject: Digest Footer

_______________________________________________
llvm-dev mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


------------------------------

End of llvm-dev Digest, Vol 144, Issue 93
*****************************************


