[llvm-dev] [llvm-mca] What's the difference between Rthroughput and "total cycles" in llvm-mca

Fri Jun 7 06:33:19 PDT 2019

On Fri, Jun 7, 2019 at 2:30 PM Andrea Di Biagio <andrea.dibiagio at gmail.com>
wrote:

> In the absence of data dependencies, the throughput of a block of code is
> bounded from above by the dispatch rate (i.e. our DispatchWidth) and by the
> availability of hardware resources.
>
> DispatchWidth is the maximum number of micro-opcodes that can be
> dispatched to the out-of-order backend every cycle. That value inevitably
> affects the block throughput. Example: if an input block decodes to 4
> micro-opcodes in total, and the processor can only dispatch up to 2
> micro-opcodes per cycle, then the block throughput cannot exceed 0.5 (i.e.
> one block every two cycles).
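As a minimal sketch in plain Python (the function name is hypothetical and not part of llvm-mca), the dispatch-width bound from this example is the dispatch width divided by the number of uOps in the block:

```python
def dispatch_bound_throughput(num_uops: int, dispatch_width: int) -> float:
    """Upper bound on blocks per cycle imposed by the dispatch width alone."""
    # At most `dispatch_width` uOps are dispatched per cycle, so one block
    # needs at least num_uops / dispatch_width cycles; invert to get blocks/cycle.
    return dispatch_width / num_uops


# The example above: a block of 4 uOps on a machine that dispatches 2 per cycle.
print(dispatch_bound_throughput(4, 2))  # 0.5
```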
>
> Block throughput is also constrained by the availability of hardware
> resources.
> Example: if we have 4 ADD micro-opcodes, and each opcode consumes 1 cycle
> of an ALU pipeline, then the block throughput is bounded from above by N/4,
> where N is the number of ALU pipelines available on the target, and 4 is
> the number of ALU cycles consumed. So, if there is only 1 ALU pipeline, then
> the block throughput cannot exceed 1/4 = 0.25 (blocks per cycle).
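The N/4 bound generalizes to any resource kind. A sketch, assuming any of the N units of a kind can serve each uOp (the function name is hypothetical):

```python
def resource_bound_throughput(consumed_cycles: int, num_units: int) -> float:
    """Upper bound on blocks per cycle imposed by a single resource kind."""
    # `num_units` pipelines supply num_units resource-cycles every clock;
    # each block consumes `consumed_cycles` of them.
    return num_units / consumed_cycles


# 4 ADD uOps, each using 1 cycle of an ALU pipeline (4 ALU cycles per block).
print(resource_bound_throughput(4, 1))  # 0.25 with a single ALU pipeline
print(resource_bound_throughput(4, 2))  # 0.5 with two ALU pipelines
```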
>
> Back to the computation of the "Block Throughput".
>

Sorry, I should have written "Block RThroughput" here.

> It is statically computed as the reciprocal of the block throughput. As with
> the per-instruction throughput, the computation doesn't take operand
> dependencies into account. Therefore, we could say that it is computed as
> the MAX of:
>  - #MicroOpcodes of a block / DispatchWidth
>  - #Consumed resource cycles / #Resources   [ for every resource kind ].
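The MAX formula above can be sketched as a small Python function. This is a simplified model under stated assumptions, not llvm-mca's actual implementation: the function name and the `resources` dictionary layout are hypothetical, each resource kind is treated as a pool of interchangeable units, and operand dependencies are ignored as described.

```python
from typing import Dict, Tuple


def block_rthroughput(num_uops: int,
                      dispatch_width: int,
                      resources: Dict[str, Tuple[int, int]]) -> float:
    """Simplified static estimate of Block RThroughput (cycles per iteration).

    `resources` (hypothetical layout) maps a resource kind to a pair:
    (resource cycles consumed by the whole block, number of units of that kind).
    Operand dependencies are deliberately ignored.
    """
    # Dispatch bound: #MicroOpcodes / DispatchWidth.
    bounds = [num_uops / dispatch_width]
    # One bound per resource kind: #Consumed resource cycles / #Resources.
    for consumed_cycles, num_units in resources.values():
        bounds.append(consumed_cycles / num_units)
    # The most constraining term (largest cycle count) determines the result.
    return max(bounds)


# Hypothetical block of 4 ADD uOps, each using 1 cycle of an ALU pipeline,
# on a 2-wide machine.
print(block_rthroughput(4, 2, {"ALU": (4, 2)}))  # 2.0 (both bounds agree)
print(block_rthroughput(4, 2, {"ALU": (4, 1)}))  # 4.0 (ALU-bound)
# 7 uOps on a hypothetical 3-wide machine with no resource pressure.
print(round(block_rthroughput(7, 3, {}), 2))     # 2.33 (dispatch-bound)
```

The last call shows how a non-integer value such as the 2.3 Tom asks about can arise: 7 uOps on a 3-wide dispatch gives 7/3, about 2.33 cycles per iteration, i.e. on average three iterations every seven cycles.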
>
> In the absence of loop-carried dependencies between different iterations,
> the observed ‘uOps Per Cycle’ tends to a theoretical maximum throughput
> which can be computed by dividing the total number of uOps of a block by
> the Block RThroughput.
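In numbers, reusing the 4-uOp block from the examples above on a hypothetical 2-wide machine (plain Python; the helper name is made up):

```python
def max_uops_per_cycle(num_uops: int, block_rthroughput: float) -> float:
    """Theoretical ceiling for 'uOps Per Cycle' (no loop-carried dependencies)."""
    # One iteration issues `num_uops` uOps every `block_rthroughput` cycles
    # at best, so the sustained rate cannot exceed their ratio.
    return num_uops / block_rthroughput


# Hypothetical 4-uOp block on a 2-wide machine:
print(max_uops_per_cycle(4, 2.0))  # 2.0, i.e. the dispatch width (dispatch-bound)
print(max_uops_per_cycle(4, 4.0))  # 1.0 when a single ALU pipeline is the bottleneck
```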
>
> You can find more information about it in the llvm-mca docs under section
> "How LLVM-MCA works".
>
> I hope it helps!
> -Andrea
>
> On Fri, Jun 7, 2019 at 12:43 PM Tom Chen <cyt046 at gmail.com> wrote:
>
>> Hi Andrea,
>> So does this definition make sense for basic blocks with more than one
>> instruction? E.g. how should one interpret a basic block with RThroughput
>> of 2.3?
>>
>> On Fri, Jun 7, 2019 at 7:39 AM Andrea Di Biagio <
>> andrea.dibiagio at gmail.com> wrote:
>>
>>> Hi Tom,
>>>
>>> Field 'Total Cycles' from the summary view simply reports the elapsed
>>> number of cycles for the entire simulation.
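Because "Total Cycles" covers the whole simulation, it has to be divided by the number of simulated iterations before it can be compared with a per-iteration figure such as Block RThroughput. A sketch with made-up numbers (the helper and its inputs are hypothetical, not real llvm-mca output):

```python
def per_iteration_figures(iterations: int, total_cycles: int,
                          uops_per_iteration: int) -> tuple:
    """Derive average per-iteration figures from summary-view style totals."""
    # Average elapsed cycles per iteration over the whole simulation.
    cycles_per_iteration = total_cycles / iterations
    # Total uOps divided by Total Cycles (the 'uOps Per Cycle' ratio).
    uops_per_cycle = uops_per_iteration * iterations / total_cycles
    return cycles_per_iteration, uops_per_cycle


# Hypothetical totals, NOT real llvm-mca output: 100 iterations of a
# 4-uOp block that finish after 203 cycles.
cpi, upc = per_iteration_figures(100, 203, 4)
print(round(cpi, 2), round(upc, 2))  # 2.03 1.97
```

Part of the gap between such an average and a static Block RThroughput can come from the cycles spent filling and draining the pipeline at the start and end of the simulation, which matter less as the iteration count grows.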
>>>
>>> Rthroughput (from the "Instruction Info" view) is the reciprocal of the
>>> instruction throughput.
>>> Throughput is computed as the maximum number of instructions of the same
>>> type that can be executed per clock cycle in the absence of operand
>>> dependencies.
>>>
>>> Example (x86 - AMD Jaguar):
>>>    ADD EAX, ESI
>>>
>>> The integer unit in Jaguar has two ALU pipelines. An ADD instruction can
>>> issue to either of those pipelines. That means two independent ADDs can be
>>> issued during the same cycle. Therefore, throughput is 2 (instructions per
>>> cycle), and RThroughput (1/throughput) is 0.5.
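The Jaguar arithmetic can be written out as follows. This is a sketch only: the function and its `cycles_per_use` parameter are hypothetical, and real llvm-mca scheduling models describe resource usage in more detail than a single pipeline count.

```python
def instruction_rthroughput(num_pipelines: int, cycles_per_use: int = 1) -> float:
    """Reciprocal throughput of one instruction type, ignoring dependencies."""
    # Throughput: independent instructions of this type that can issue per
    # cycle when each occupies one of `num_pipelines` for `cycles_per_use` cycles.
    throughput = num_pipelines / cycles_per_use
    # RThroughput is the reciprocal: cycles per instruction.
    return 1 / throughput


# ADD on Jaguar per the text: two ALU pipelines, one cycle each.
print(instruction_rthroughput(2))  # 0.5
```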
>>>
>>> I hope it helps,
>>> -Andrea
>>>
>>> On Thu, Jun 6, 2019 at 10:11 PM Tom Chen via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> What is the difference between the two? I thought "Rthroughput" is
>>>> basically the number of cycles required to execute a single iteration at
>>>> steady state, but this does not seem to match with the schedule/timeline
>>>> generated by llvm-mca.
>>>> Thanks in advance,
>>>> Tom
>>>