[llvm] r236031 - transform fadd chains to increase parallelism
Jonathan Roelofs
jonathan at codesourcery.com
Thu Apr 30 14:46:27 PDT 2015
On 4/30/15 3:34 PM, Sanjay Patel wrote:
> Thanks all. Between Owen's GPU description and Mehdi's test cases, I can
> see how this patch went off the rails.
>
> I'm back to wondering if we can still do this as a DAG combine with the
> help of a target hook:
>
> TLI.getReassociationLimit(Opcode, EVT)
>
> For some operation on some data type, does it make sense to attempt to
> extract some ILP? By default, we'd make this 0. For a machine that has
> no exposed superscalar / pipelining ILP opportunities, it would always
> return 0. If non-zero, the number would be a value that's based on the
> number of registers and/or issue width and/or pipe stages for the given
> operation. Something like the 'vectorization factor' or 'interleave
> factor' used by the vectorizers?
>
>     unsigned CombineCount = 0;
>     while (CombineCount < TLI.getReassociationLimit(Opcode, EVT))
>       if (tryTheCombine(Opcode, EVT))
>         CombineCount++;
>       else
>         break;
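For concreteness, a target's override of the proposed hook might look
something like this (the hook itself and the tuning numbers here are
hypothetical, not existing TargetLowering API):

    // Hypothetical override in a target's TargetLowering subclass.
    // The limit reflects how much independent FP work the pipeline
    // can actually keep in flight per thread.
    unsigned MyTargetLowering::getReassociationLimit(unsigned Opcode,
                                                     EVT VT) const {
      if (Opcode != ISD::FADD && Opcode != ISD::FMUL)
        return 0; // Don't expose ILP for anything else.
      // Example numbers: two FP pipes, four stages each -> up to
      // eight independent chains in flight.
      return VT.isVector() ? 4 : 8;
    }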
How about TLI.canReassociate(), TLI.shouldReassociate(), and
TLI.doReassociate()? Then the target could make an even more educated
decision than this heuristic count allows.
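Something along these lines, where the target owns both the decision
and the rewrite (all three hooks and the surrounding driver code are
hypothetical sketches, not existing TLI API):

    // Hypothetical use in the DAG combiner: the target first gates
    // the transform, then judges the candidate, then rewrites it.
    if (TLI.canReassociate(Opcode, VT) &&   // ever profitable here?
        TLI.shouldReassociate(N))           // profitable for this node?
      return TLI.doReassociate(N, DAG);     // target-directed rewrite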
Jon
>
>
> On Thu, Apr 30, 2015 at 1:25 PM, Eric Christopher
> <echristo at gmail.com> wrote:
>
>
>
> On Thu, Apr 30, 2015 at 12:24 PM Mehdi Amini
> <mehdi.amini at apple.com> wrote:
>
>> On Apr 30, 2015, at 12:04 PM, Owen Anderson
>> <resistor at mac.com> wrote:
>>
>>
>>> On Apr 30, 2015, at 8:41 AM, Sanjay Patel
>>> <spatel at rotateright.com> wrote:
>>>
>>> So to me, an in-order machine is still superscalar and
>>> pipelined. You have to expose ILP or you die a high-frequency
>>> death.
>>
>> Many (most?) GPUs hide latencies via massive hyper threading
>> rather than exploiting per-thread ILP. The hardware presents
>> a model where every instruction has unit latency, because the
>> real latency is entirely hidden by hyper threading. Using
>> more registers eats up the finite pool of storage in the chip,
>> limiting the number of threads that can run concurrently, and
>> ultimately reducing the hardware’s ability to hyper thread,
>> killing performance.
>>
>> This isn’t just a concern for GPUs, though. Even superscalar
>> CPUs are not necessarily uniformly superscalar. I’m aware of
>> plenty of lower power designs that can multi-issue integer
>> instructions but not floating point, for instance.
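To make the register-pressure trade-off concrete: reassociating a
serial fadd chain into a tree shortens the dependence chain, but it
keeps more partial sums live at once (scalar pseudocode, not the
patch's actual output):

    // Serial chain: depth 3, only one partial sum live at a time.
    t0 = a + b;
    t1 = t0 + c;
    t2 = t1 + d;

    // Reassociated: depth 2, but t0 and t1 are now live at the same
    // time, costing an extra register -- on a GPU, fewer resident
    // threads and less latency hiding.
    t0 = a + b;
    t1 = c + d;
    t2 = t0 + t1;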
>
> How would OOO change anything with respect to this transformation?
>
>
> Basically, the simplifying assumption is that OoO behaves like
> "really large multiple issue".
>
> -eric
>
> —
> Mehdi
>
--
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded