[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

Thu Mar 15 08:30:27 PDT 2018

Patch for this RFC is available at https://reviews.llvm.org/D44519.

On Thu, Mar 15, 2018 at 4:04 PM Guillaume Chatelet <gchatelet at google.com>
wrote:

> [You can find an easier to read and more complete version of this RFC here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>
> .]
>
>
> Knowing instruction scheduling properties (latency, uops) is the basis for
> all scheduling work done by LLVM.
>
> Unfortunately, vendors usually release only partial (and sometimes
> incorrect) information. Updating the information is painful and requires
> careful guesswork and analysis. As a result, scheduling information is
> incomplete for most X86 models (this bug
> <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of these
> issues). The goal of the tool presented here is to automatically
> (in)validate the TableGen scheduling models. In the long run we envision
> automatic generation of the models.
>
> At Google, we have developed a tool that, given an instruction mnemonic,
> uses the data in `MCInstrInfo` to generate a code snippet that makes
> execution as serial (resp. as parallel) as possible so that we can measure
> the latency (resp. uop decomposition) of the instruction. The code snippet
> is jitted and executed on the host subtarget. The time taken (resp.
> resource usage) is measured using hardware performance counters. More
> details can be found in the ‘implementation’ section of the RFC.
>
> For people familiar with the work of Agner Fog, this is essentially an
> automation of the process of building the code snippets using instruction
> descriptions from LLVM.
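
To illustrate the serial/parallel idea described above, here is a minimal Python sketch. It is not the tool's implementation (the RFC says the tool works from `MCInstrInfo` data, not from text). The helper names `serial_snippet` and `parallel_snippet`, the `dst, src, imm` operand template, and the register names are assumptions made for this example only.

```python
def serial_snippet(mnemonic, reg, imm, repetitions):
    """Build a dependency chain: every instruction reads the register
    written by the previous one, so the instructions cannot overlap."""
    return [f"{mnemonic} {reg}, {reg}, {imm}" for _ in range(repetitions)]


def parallel_snippet(mnemonic, dst_regs, src_reg, imm, repetitions):
    """Rotate over distinct destination registers and read a source that is
    never written, so no instruction consumes another one's result."""
    if src_reg in dst_regs:
        raise ValueError("source register must not also be a destination")
    return [
        f"{mnemonic} {dst_regs[i % len(dst_regs)]}, {src_reg}, {imm}"
        for i in range(repetitions)
    ]


if __name__ == "__main__":
    print("\n".join(serial_snippet("imul", "ax", 2, 3)))
    print("\n".join(parallel_snippet("imul", ["cx", "dx", "si"], "bx", 2, 3)))
```

The serial form is the one suited to a latency measurement and the parallel form to a uops measurement. A real generator would also have to account for implicit operands such as flags, which this sketch ignores.
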
>
> Results
>
>    - Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084>
>      (sandybridge):
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
> ---
> asm_template:
>   name:            latency IMUL16rri8
> cpu_name:        sandybridge
> llvm_triple:     x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>   - { key: latency, value: 4.0115, debug_string: '' }
> error:           ''
> ...
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
> ---
> asm_template:
>   name:            uops IMUL16rri8
> cpu_name:        sandybridge
> llvm_triple:     x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>   - { key: '2', value: 0.5232, debug_string: SBPort0 }
>   - { key: '3', value: 1.0039, debug_string: SBPort1 }
>   - { key: '4', value: 0.0024, debug_string: SBPort4 }
>   - { key: '5', value: 0.3693, debug_string: SBPort5 }
> error:           ''
> ...
>
> Running both of these commands took ~0.2 seconds, including printing.
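
One possible way to read the uops output above is sketched below in Python. The per-port values are copied from that output. The helper names and the `threshold` cut-off are assumptions made for this example, and the sum only approximates a uop count if the measured ports cover every port the instruction can issue on.

```python
# Per-port values copied from the uops output above (IMUL16rri8, sandybridge).
measurements = {
    "SBPort0": 0.5232,
    "SBPort1": 1.0039,
    "SBPort4": 0.0024,
    "SBPort5": 0.3693,
}


def total_uops(per_port):
    """Sum the per-port values to get a rough uops-per-instruction estimate."""
    return sum(per_port.values())


def busy_ports(per_port, threshold=0.05):
    """Return ports whose value is at least `threshold`.
    The threshold is an arbitrary cut-off chosen for this example."""
    return sorted(port for port, value in per_port.items() if value >= threshold)


if __name__ == "__main__":
    print(f"estimated uops: {total_uops(measurements):.4f}")
    print("ports in use:", ", ".join(busy_ports(measurements)))
```
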
>
>
>
>
>    - List of measured latencies
>      <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing>
>      for sandybridge, haswell and skylake processors, including diffs with
>      LLVM latencies. Excerpt:
>
>
>                 sandybridge             haswell                 skylake
>    mnemonic     llvm-exegesis  TD file  llvm-exegesis  TD file  llvm-exegesis  TD file
>    SHR32r1      1.01           1.00     1.00           1.00     1.01           1.00
>    IMUL16rri    4.02           3.00     4.01           3.00     4.01           3.00
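
The comparison in the excerpt above can be automated along these lines. This is a minimal Python sketch, not part of the tool: the numbers are copied from the excerpt, while the `mismatches` helper and its `tolerance` value are assumptions made for illustration.

```python
# (measured by llvm-exegesis, value in the TD file), copied from the excerpt.
latencies = {
    "SHR32r1": {
        "sandybridge": (1.01, 1.00),
        "haswell": (1.00, 1.00),
        "skylake": (1.01, 1.00),
    },
    "IMUL16rri": {
        "sandybridge": (4.02, 3.00),
        "haswell": (4.01, 3.00),
        "skylake": (4.01, 3.00),
    },
}


def mismatches(table, tolerance=0.25):
    """List entries whose measured latency differs from the modeled one by
    more than `tolerance` (an arbitrary cut-off for this example)."""
    found = []
    for mnemonic, per_cpu in table.items():
        for cpu, (measured, modeled) in per_cpu.items():
            if abs(measured - modeled) > tolerance:
                found.append((mnemonic, cpu, measured, modeled))
    return found


if __name__ == "__main__":
    for mnemonic, cpu, measured, modeled in mismatches(latencies):
        print(f"{mnemonic} on {cpu}: measured {measured}, TD file {modeled}")
```

On this excerpt only the IMUL16rri rows are reported; the SHR32r1 rows agree with the TD files within the tolerance.
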
>
>
>    - Some instructions have different implementations depending on which
>      registers are assigned. This is well known for cases like
>      `xor eax, eax` versus `xor eax, ebx`: the former emits no uops because
>      it is handled during register renaming (see Agner Fog’s “Register
>      Allocation and Renaming”, in microarchitecture.pdf
>      <http://www.agner.org/optimize/microarchitecture.pdf>). But we found
>      out that this can go further. For example, SHLD64rri8 takes one cycle
>      and runs on P06 in the `shld rax, rax, 0x1` case, but takes 3 cycles
>      and runs on P1 in the `shld rbx, rax, 0x1` case. To the best of our
>      knowledge, this has not yet been described.
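
Finding cases like the SHLD one systematically requires trying each way the register operands can alias. A minimal Python sketch of such an enumeration follows. The `aliasing_patterns` helper is hypothetical and only produces operand strings; it does not measure anything and is not how the tool assigns registers.

```python
from itertools import product


def aliasing_patterns(num_operands, registers):
    """Yield one register assignment per distinct aliasing pattern.

    Two assignments share a pattern when the same operand positions use the
    same register, e.g. (rax, rbx) and (rbx, rax) are both 'all distinct'."""
    seen = set()
    for combo in product(registers, repeat=num_operands):
        # Canonical form: position of the first occurrence of each register.
        pattern = tuple(combo.index(reg) for reg in combo)
        if pattern not in seen:
            seen.add(pattern)
            yield combo


def shld_variants(registers):
    """Format the two-register SHLD forms discussed above."""
    return [f"shld {dst}, {src}, 0x1" for dst, src in aliasing_patterns(2, registers)]


if __name__ == "__main__":
    for line in shld_variants(["rax", "rbx"]):
        print(line)
```

Each variant would then be benchmarked separately, since the results above show that the aliased and non-aliased forms can differ in both latency and port usage.
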
>
> Future Work
>
>    - [easy] Fix Intel Scheduling Models.
>    - [easy] Extend to memory operands.
>    - [easy] Make the tool work reliably for x87 instructions.
>    - [medium] A tool that automatically creates patches to TD files.
>    - [medium] Measure the effect of immediate/register values: Some
>      instructions have performance characteristics that depend on the values
>      they operate on. We should explore the value space (0, 1, ~1,
>      2^{8,16,32,64}, inf, nan, denorm...).
>    - [medium] Measure the effect of changing registers on instruction
>      implementation (see results section
>      <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n>
>      above). Model this in LLVM TD schema.
>    - [hard] Make the tool work for instructions that have side effects (e.g.
>      PUSH/POP, JMP, ...). This might involve extending the TD schema with
>      information on how to set up measurements for specific instructions.
>    - [??] Make the tool work for other CPUs. This mainly depends on the
>      presence of performance counters.
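
For the value-space item in the list above, the candidate operand values could be enumerated roughly as follows. This is a Python sketch under stated assumptions: the function names are hypothetical, integer candidates that do not fit the operand width are dropped, and the floating-point values are IEEE doubles as provided by Python.

```python
import math
import sys


def integer_values(width):
    """Candidate integer operands (0, 1, ~1, 2^8, 2^16, 2^32, 2^64),
    keeping only those representable in `width` bits."""
    mask = (1 << width) - 1
    candidates = [0, 1, ~1 & mask] + [1 << k for k in (8, 16, 32, 64)]
    values = []
    for value in candidates:
        if value <= mask and value not in values:
            values.append(value)
    return values


def float_values():
    """Candidate floating-point operands: zero, one, inf, nan, a denormal."""
    # Half of the smallest normal double is a subnormal (denormal) number.
    denorm = sys.float_info.min / 2
    return [0.0, 1.0, math.inf, math.nan, denorm]


if __name__ == "__main__":
    print([hex(v) for v in integer_values(64)])
    print(float_values())
```
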
>
> Open Questions
>
> We depend on libpfm
> <http://perfmon2.sourceforge.net/docs_v4.html>. How do we handle the
> dependency?
>
> --
> Guillaume Chatelet (gchatelet at google.com), Clement Courbet (
> courbet at google.com) for the Google Compiler Research Team
>
>