[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
Guillaume Chatelet via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 15 08:30:27 PDT 2018
Patch for this RFC is available at https://reviews.llvm.org/D44519.
On Thu, Mar 15, 2018 at 4:04 PM Guillaume Chatelet <gchatelet at google.com>
wrote:
> [You can find an easier to read and more complete version of this RFC here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>
> .]
>
>
> Knowing instruction scheduling properties (latency, uops) is the basis for
> all scheduling work done by LLVM.
>
> Unfortunately, vendors usually release only partial (and sometimes
> incorrect) information. Updating the information is painful and requires
> careful guesswork and analysis. As a result, scheduling information is
> incomplete for most X86 models (this bug
> <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of these
> issues). The goal of the tool presented here is to automatically
> (in)validate the TableGen scheduling models. In the long run we envision
> automatic generation of the models.
>
> At Google, we have developed a tool that, given an instruction mnemonic,
> uses the data in `MCInstrInfo` to generate a code snippet that makes
> execution as serial (resp. as parallel) as possible so that we can measure
> the latency (resp. uop decomposition) of the instruction. The code snippet
> is jitted and executed on the host subtarget. The time taken (resp.
> resource usage) is measured using hardware performance counters. More
> details can be found in the ‘implementation’ section of the RFC.
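
As a rough illustration of the serial versus parallel idea described above, the sketch below builds the two kinds of snippets as assembly text. It is not the tool's implementation: the tool derives operands from `MCInstrInfo`, whereas the template string, the register names and both function names here are made up for the example, and implicit operands such as flags are ignored.

```python
# Illustrative sketch only; not llvm-exegesis code. The template string,
# register names and function names are assumptions made for this example.


def serial_snippet(template, reg, count):
    """Chain instructions through one register so each waits for the previous."""
    return [template.format(dst=reg, src=reg) for _ in range(count)]


def parallel_snippet(template, src, dsts, count):
    """Rotate destinations and never write `src`, so no instruction reads
    a value produced by another instruction in the snippet."""
    if src in dsts:
        raise ValueError("src must not be one of the destination registers")
    return [template.format(dst=dsts[i % len(dsts)], src=src) for i in range(count)]


if __name__ == "__main__":
    template = "imul {dst}, {src}, 1"
    print("\n".join(serial_snippet(template, "ax", 4)))
    print("\n".join(parallel_snippet(template, "ax", ["bx", "cx", "dx"], 4)))
```

The serial form is what a latency measurement would time; the parallel form is what a uop measurement would run while reading the port counters.
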
>
> For people familiar with the work of Agner Fog, this is essentially an
> automation of the process of building the code snippets using instruction
> descriptions from LLVM.
>
> Results
>
> - Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084>
> (sandybridge):
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
>
> ---
> asm_template:
>   name: latency IMUL16rri8
> cpu_name: sandybridge
> llvm_triple: x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>   - { key: latency, value: 4.0115, debug_string: '' }
> error: ''
> ...
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
>
> ---
> asm_template:
>   name: uops IMUL16rri8
> cpu_name: sandybridge
> llvm_triple: x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>   - { key: '2', value: 0.5232, debug_string: SBPort0 }
>   - { key: '3', value: 1.0039, debug_string: SBPort1 }
>   - { key: '4', value: 0.0024, debug_string: SBPort4 }
>   - { key: '5', value: 0.3693, debug_string: SBPort5 }
> error: ''
> ...
>
> Running both of these commands took about 0.2 seconds, including printing.
>
> - List of measured latencies
> <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing>
> for sandybridge, haswell and skylake processors, including diffs with LLVM
> latencies. Excerpt:
>
>            |      sandybridge       |        haswell         |        skylake
> mnemonic   | llvm-exegesis  TD file | llvm-exegesis  TD file | llvm-exegesis  TD file
> -----------+------------------------+------------------------+-----------------------
> SHR32r1    |          1.01     1.00 |          1.00     1.00 |          1.01     1.00
> IMUL16rri  |          4.02     3.00 |          4.01     3.00 |          4.01     3.00
>
>
> - Some instructions have different implementations depending on which
> registers are assigned. This is well known for cases like
> `xor eax, eax` versus `xor eax, ebx`: the former emits no uops because
> it is handled during register renaming (see Agner Fog’s “Register
> Allocation and Renaming” in microarchitecture.pdf
> <http://www.agner.org/optimize/microarchitecture.pdf>). But we found
> that this can go further. For example, SHLD64rri8 takes one cycle
> and runs on P06 in the `shld rax, rax, 0x1` case, but takes 3 cycles
> and runs on P1 in the `shld rbx, rax, 0x1` case. To the best of our
> knowledge, this has not yet been described.
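
To make the "(in)validate" step concrete, here is a small sketch that compares the measured latencies from the excerpt above with the TD file values and reports the disagreements. The numbers are copied from the excerpt; the rounding-to-whole-cycles rule and the function name are assumptions for the example, not something the tool is stated to do.

```python
# Values copied from the excerpt above: (llvm-exegesis, TD file) per CPU.
# The rounding rule and the function name are assumptions for this sketch.
EXCERPT = {
    "SHR32r1": {
        "sandybridge": (1.01, 1.00),
        "haswell": (1.00, 1.00),
        "skylake": (1.01, 1.00),
    },
    "IMUL16rri": {
        "sandybridge": (4.02, 3.00),
        "haswell": (4.01, 3.00),
        "skylake": (4.01, 3.00),
    },
}


def mismatches(table):
    """Return (mnemonic, cpu, measured_cycles, td_cycles) where they differ."""
    found = []
    for mnemonic, per_cpu in table.items():
        for cpu, (measured, td) in per_cpu.items():
            # Compare whole cycles so measurement noise like 1.01 vs 1.00 is ignored.
            if round(measured) != round(td):
                found.append((mnemonic, cpu, round(measured), round(td)))
    return found


for row in mismatches(EXCERPT):
    print(row)
```

On the excerpt this flags only the IMUL16rri rows, where the measurement says 4 cycles and the TD file says 3.
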
>
> Future Work
>
> - [easy] Fix Intel Scheduling Models.
> - [easy] Extend to memory operands.
> - [easy] Make the tool work reliably for x87 instructions.
> - [medium] A tool that automatically creates patches to TD files.
> - [medium] Measure the effect of immediate/register values: some
> instructions have performance characteristics that depend on the values
> they operate on. We should explore the value space (0, 1, ~1,
> 2^{8,16,32,64}, inf, nan, denorm...).
> - [medium] Measure the effect of changing registers on instruction
> implementation (see results section
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n>
> above). Model this in LLVM TD schema.
> - [hard] Make the tool work for instructions that have side effects (e.g.
> PUSH/POP, JMP, ...). This might involve extending the TD schema with
> information on how to set up measurements for specific instructions.
> - [??] Make the tool work for other CPUs. This mainly depends on the
> presence of performance counters.
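
For the value-space item in the list above, one possible way to enumerate candidate operands is sketched below. This is only a reading of "(0, 1, ~1, 2^{8,16,32,64}, inf, nan, denorm...)": treating the 2^k entries as powers of two, truncating integers to the operand width, and the function names are all assumptions, and nothing here comes from the tool itself.

```python
import math
import struct

# Hypothetical candidate operands for the "value space" item. The selection
# and the function names are assumptions, not a list taken from the tool.


def integer_candidates(bits):
    """0, 1, ~1 and powers of two, truncated to a `bits`-wide operand."""
    mask = (1 << bits) - 1
    raw = [0, 1, ~1] + [1 << k for k in (8, 16, 32, 64)]
    seen, out = set(), []
    for value in raw:
        value &= mask  # 2^k wraps to 0 when k >= bits, so duplicates are dropped
        if value not in seen:
            seen.add(value)
            out.append(value)
    return out


def float_candidates():
    """A few special double-precision values, including the smallest denormal."""
    smallest_denormal = struct.unpack("<d", struct.pack("<Q", 1))[0]
    return [0.0, 1.0, math.inf, math.nan, smallest_denormal]


if __name__ == "__main__":
    print([hex(v) for v in integer_candidates(16)])
    print(float_candidates())
```

Each candidate would then be loaded into the instruction's input register or immediate before running the same latency or uops measurement.
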
>
> Open Questions
>
> We depend on libpfm
> <http://perfmon2.sourceforge.net/docs_v4.html>. How do we handle the
> dependency?
>
> --
> Guillaume Chatelet (gchatelet at google.com), Clement Courbet (
> courbet at google.com) for the Google Compiler Research Team
>
>