[llvm] r236031 - transform fadd chains to increase parallelism
Jonathan Roelofs
jonathan at codesourcery.com
Thu Apr 30 14:46:27 PDT 2015
On 4/30/15 3:34 PM, Sanjay Patel wrote:
> Thanks all. Between Owen's GPU description and Mehdi's test cases, I can
> see how this patch went off the rails.
>
> I'm back to wondering if we can still do this as a DAG combine with the
> help of a target hook:
>
> TLI.getReassociationLimit(Opcode, EVT)
>
> For some operation on some data type, does it make sense to attempt to
> extract some ILP? By default, we'd make this 0. For a machine that has
> no exposed superscalar / pipelining ILP opportunities, it would always
> return 0. If non-zero, the number would be a value that's based on the
> number of registers and/or issue width and/or pipe stages for the given
> operation. Something like the 'vectorization factor' or 'interleave
> factor' used by the vectorizers?
>
>     unsigned CombineCount = 0;
>     while (CombineCount < TLI.getReassociationLimit(Opcode, EVT))
>       if (tryTheCombine(Opcode, EVT))
>         CombineCount++;
>       else
>         break;
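For concreteness, a target's override of the proposed hook might look
something like this (the hook itself and the tuning numbers here are
hypothetical, not existing TargetLowering API):

    // Hypothetical override in a target's TargetLowering subclass.
    // The limit reflects how much independent FP work the pipeline
    // can actually keep in flight per thread.
    unsigned MyTargetLowering::getReassociationLimit(unsigned Opcode,
                                                     EVT VT) const {
      if (Opcode != ISD::FADD && Opcode != ISD::FMUL)
        return 0; // Don't expose ILP for anything else.
      // Example numbers: two FP pipes, four stages each -> up to
      // eight independent chains in flight.
      return VT.isVector() ? 4 : 8;
    }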
How about TLI.canReassociate(), TLI.shouldReassociate(), and
TLI.doReassociate()? Then the target could make an even more educated
decision than this heuristic count allows.
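Something along these lines, where the target owns both the decision
and the rewrite (all three hooks and the surrounding driver code are
hypothetical sketches, not existing TLI API):

    // Hypothetical use in the DAG combiner: the target first gates
    // the transform, then judges the candidate, then rewrites it.
    if (TLI.canReassociate(Opcode, VT) &&   // ever profitable here?
        TLI.shouldReassociate(N))           // profitable for this node?
      return TLI.doReassociate(N, DAG);     // target-directed rewrite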
Jon
>
>
> On Thu, Apr 30, 2015 at 1:25 PM, Eric Christopher
> <echristo at gmail.com> wrote:
>
>
>
> On Thu, Apr 30, 2015 at 12:24 PM Mehdi Amini
> <mehdi.amini at apple.com> wrote:
>
>> On Apr 30, 2015, at 12:04 PM, Owen Anderson
>> <resistor at mac.com> wrote:
>>
>>
>>> On Apr 30, 2015, at 8:41 AM, Sanjay Patel
>>> <spatel at rotateright.com> wrote:
>>>
>>> So to me, an in-order machine is still superscalar and
>>> pipelined. You have to expose ILP or you die a high-frequency
>>> death.
>>
>> Many (most?) GPUs hide latencies via massive hyper threading
>> rather than exploiting per-thread ILP. The hardware presents
>> a model where every instruction has unit latency, because the
>> real latency is entirely hidden by hyper threading. Using
>> more registers eats up the finite pool of storage in the chip,
>> limiting the number of threads that can run concurrently, and
>> ultimately reducing the hardware’s ability to hyper thread,
>> killing performance.
>>
>> This isn’t just a concern for GPUs, though. Even superscalar
>> CPUs are not necessarily uniformly superscalar. I’m aware of
>> plenty of lower power designs that can multi-issue integer
>> instructions but not floating point, for instance.
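To make the register-pressure trade-off concrete: reassociating a
serial fadd chain into a tree shortens the dependence chain, but it
keeps more partial sums live at once (scalar pseudocode, not the
patch's actual output):

    // Serial chain: depth 3, only one partial sum live at a time.
    t0 = a + b;
    t1 = t0 + c;
    t2 = t1 + d;

    // Reassociated: depth 2, but t0 and t1 are now live at the same
    // time, costing an extra register -- on a GPU, fewer resident
    // threads and less latency hiding.
    t0 = a + b;
    t1 = c + d;
    t2 = t0 + t1;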
>
> How would OOO change anything with respect to this transformation?
>
>
> Basically, the simplifying assumption is that OoO behaves like
> "really large multiple issue".
>
> -eric
>
> —
> Mehdi
>
--
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded