[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
Clement Courbet via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 15 08:52:52 PDT 2018
On Thu, Mar 15, 2018 at 4:49 PM, Clement Courbet <courbet at google.com> wrote:
>
>
> On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>> On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
>>
>> [You can find an easier to read and more complete version of this RFC
>> here
>> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>
>> .]
>>
>> Knowing instruction scheduling properties (latency, uops) is the basis
>> for all scheduling work done by LLVM.
>>
>> Unfortunately, vendors usually release only partial (and sometimes
>> incorrect) information. Updating the information is painful and requires
>> careful guesswork and analysis. As a result, scheduling information is
>> incomplete for most X86 models (this bug
>> <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of these
>> issues). The goal of the tool presented here is to automatically
>> (in)validate the TableGen scheduling models. In the long run we envision
>> automatic generation of the models.
>>
>> At Google, we have developed a tool that, given an instruction mnemonic,
>> uses the data in `MCInstrInfo` to generate a code snippet that makes
>> execution as serial (resp. as parallel) as possible so that we can measure
>> the latency (resp. uop decomposition) of the instruction. The code snippet
>> is jitted and executed on the host subtarget. The time taken (resp.
>> resource usage) is measured using hardware performance counters. More
>> details can be found in the ‘implementation’ section of the RFC.
>>
>> For people familiar with the work of Agner Fog, this is essentially an
>> automation of the process of building the code snippets using instruction
>> descriptions from LLVM.
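
To make the serial versus parallel idea concrete, here is a minimal Python sketch. It is only an illustration, not the tool's code: the tool described above builds snippets from `MCInstrInfo` and JITs them, whereas the helper names (`serial_snippet`, `parallel_snippet`, `per_instruction`), the textual instruction templates and the register choices below are all hypothetical.

```python
from itertools import cycle, islice


def serial_snippet(template, reg, repetitions):
    """Latency mode: every instruction reads the register the previous one wrote."""
    return [template.format(dst=reg, src=reg) for _ in range(repetitions)]


def parallel_snippet(template, dst_regs, src_reg, repetitions):
    """Uops mode: rotate over destination registers and read a register that is
    never written, so consecutive instructions do not depend on each other."""
    dsts = islice(cycle(dst_regs), repetitions)
    return [template.format(dst=dst, src=src_reg) for dst in dsts]


def per_instruction(counter_total, repetitions):
    """Normalize a raw counter reading (cycles, or uops on one port)."""
    return counter_total / repetitions
```

With the hypothetical template `"imul {dst}, {src}, 3"`, `serial_snippet(template, "rax", 3)` gives three copies of `imul rax, rax, 3`, each waiting on the previous result, while `parallel_snippet(template, ["rax", "rcx", "rdx"], "rbx", 3)` gives three multiplies that share no written register.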
>> Results
>>
>> - Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084>
>>   (sandybridge):
>>
>>   > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
>>   ---
>>   asm_template:
>>     name: latency IMUL16rri8
>>   cpu_name: sandybridge
>>   llvm_triple: x86_64-grtev4-linux-gnu
>>   num_repetitions: 10000
>>   measurements:
>>     - { key: latency, value: 4.0115, debug_string: '' }
>>   error: ''
>>   ...
>>
>>   > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
>>   ---
>>   asm_template:
>>     name: uops IMUL16rri8
>>   cpu_name: sandybridge
>>   llvm_triple: x86_64-grtev4-linux-gnu
>>   num_repetitions: 10000
>>   measurements:
>>     - { key: '2', value: 0.5232, debug_string: SBPort0 }
>>     - { key: '3', value: 1.0039, debug_string: SBPort1 }
>>     - { key: '4', value: 0.0024, debug_string: SBPort4 }
>>     - { key: '5', value: 0.3693, debug_string: SBPort5 }
>>   error: ''
>>   ...
>>
>> Running both of these commands took ~0.2 seconds, including printing.
>>
>>
>>
>> - List of measured latencies
>>   <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing>
>>   for sandybridge, haswell and skylake processors, including diffs with LLVM
>>   latencies. Excerpt:
>>
>>              sandybridge             haswell                 skylake
>>   mnemonic   llvm-exegesis  TD file  llvm-exegesis  TD file  llvm-exegesis  TD file
>>   SHR32r1    1.01           1.00     1.00           1.00     1.01           1.00
>>   IMUL16rri  4.02           3.00     4.01           3.00     4.01           3.00
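
The comparison in the excerpt can be automated along these lines. This is a rough sketch rather than anything the tool ships: the rows are copied from the excerpt, while the function name `flag_mismatches` and the 0.5-cycle tolerance are assumptions chosen for illustration.

```python
# Rows copied from the excerpt: (mnemonic, cpu, measured latency, TD file latency).
EXCERPT = [
    ("SHR32r1", "sandybridge", 1.01, 1.00),
    ("SHR32r1", "haswell", 1.00, 1.00),
    ("SHR32r1", "skylake", 1.01, 1.00),
    ("IMUL16rri", "sandybridge", 4.02, 3.00),
    ("IMUL16rri", "haswell", 4.01, 3.00),
    ("IMUL16rri", "skylake", 4.01, 3.00),
]


def flag_mismatches(rows, tolerance=0.5):
    """Return rows whose measured latency differs from the TD file value
    by more than `tolerance` cycles (the threshold is an assumption)."""
    return [row for row in rows if abs(row[2] - row[3]) > tolerance]


for mnemonic, cpu, measured, td_file in flag_mismatches(EXCERPT):
    print(f"{mnemonic} on {cpu}: measured {measured:.2f}, TD file {td_file:.2f}")
```

With these rows, only the three IMUL16rri entries are flagged; the SHR32r1 measurements stay within measurement noise of the TD file values.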
>>
>>
>> - Some instructions have different implementations depending on which
>>   registers are assigned. This is well known for cases like
>>   `xor eax, eax` versus `xor eax, ebx`: the former emits no uops (it is
>>   handled during register renaming, see Agner Fog’s “Register Allocation
>>   and Renaming”, in microarchitecture.pdf
>>   <http://www.agner.org/optimize/microarchitecture.pdf>). But we found
>>   out that this can go further. For example, SHLD64rri8 takes one cycle
>>   and runs on P06 in the `shld rax, rax, 0x1` case, but takes 3 cycles
>>   and runs on P1 in the `shld rbx, rax, 0x1` case. To the best of our
>>   knowledge, this has not yet been described.
>>
>>
>> This is great!
>>
>> Future Work
>>
>> - [easy] Fix Intel Scheduling Models.
>> - [easy] Extend to memory operands.
>> - [easy] Make the tool work reliably for x87 instructions.
>> - [medium] A tool that automatically creates patches to TD files.
>> - [medium] Measure the effect of immediate/register values: some
>>   instructions have performance characteristics that depend on the values
>>   they operate on. We should explore the value space (0, 1, ~1,
>>   2^{8,16,32,64}, inf, nan, denorm...).
>> - [medium] Measure the effect of changing registers on instruction
>>   implementation (see results section
>>   <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n>
>>   above). Model this in LLVM TD schema.
>> - [hard] Make the tool work for instructions that have side effects
>>   (e.g. PUSH/POP, JMP, ...). This might involve extending the TD schema with
>>   information on how to set up measurements for specific instructions.
>> - [??] Make the tool work for other CPUs. This mainly depends on the
>>   presence of performance counters.
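
For the value-space item in the list above, the candidate operands could be enumerated roughly as follows. This is a sketch under assumptions: it reads `2^{8,16,32,64}` as the powers 2^8, 2^16, 2^32 and 2^64, keeps only the values that fit the operand width, and uses hypothetical function names that do not come from the tool.

```python
import math
import sys


def integer_candidates(width):
    """Integer values from the list above that fit in `width` bits."""
    mask = (1 << width) - 1
    values = [0, 1, ~1 & mask]  # ~1 is truncated to the operand width
    values += [1 << n for n in (8, 16, 32, 64) if n < width]
    return values


def float_candidates():
    """Special double-precision values: inf, nan and a denormal."""
    denorm = sys.float_info.min / 2  # below the smallest normal double
    return [math.inf, math.nan, denorm]
```

For a 16-bit operand, `integer_candidates(16)` yields 0, 1, 0xFFFE and 2^8; the larger powers are dropped because they do not fit.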
>>
>> Open Questions
>>
>> We depend on libpfm
>> <http://perfmon2.sourceforge.net/docs_v4.html>. How do we handle the
>> dependency?
>>
>>
>> Are there options that you have in mind? It's an external MIT-licensed
>> dependency. Wouldn't CMake just detect it when it's available?
>>
>
> That's what we've done for now (see code here
> <https://reviews.llvm.org/differential/changeset/?ref=1002469&whitespace=ignore-most>).
> We're not sure what the policy is wrt external deps. Right now if the tool
> is enabled and libpfm is not on the system, we die with an error message.
> The other option would be to disable the tool in that case (I'm not sure
> how to do that). Opinions?
>
There's also the option of still compiling the tool when libpfm is missing,
but having it return dummy measurements (through #ifdefs). This has the
advantage that everybody can compile the tool (e.g. to check against API
changes in LLVM).
>
>
>>
>> -Hal
>>
>> --
>> Guillaume Chatelet (gchatelet at google.com), Clement Courbet (
>> courbet at google.com) for the Google Compiler Research Team
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
>>
>