[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

Thu Mar 15 09:30:24 PDT 2018

Sounds like a very useful tool.  Thank you for contributing.

Taking a step back and looking at the big picture, combining this with
the recently contributed llvm-mca dramatically improves our scheduling
and performance analysis story.  Being able to take a snippet of code on
a particular machine, measure latency/throughput/ports for each
instruction (this tool), and then analyze the entire code sequence in an
actionable way using the measured information (llvm-mca) makes for a
very powerful performance analysis workflow.

On 03/15/2018 08:04 AM, Guillaume Chatelet via llvm-dev wrote:
> [You can find an easier to read and more complete version of this RFC 
> here 
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing instruction scheduling properties (latency, uops) is the basis 
> for all scheduling work done by LLVM.
>
>
> Unfortunately, vendors usually release only partial (and sometimes 
> incorrect) information.  Updating the information is painful and 
> requires careful guesswork and analysis. As a result, scheduling 
> information is incomplete for most X86 models (this bug 
> <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of these 
> issues). The goal of the tool presented here is to automatically 
> (in)validate the TableGen scheduling models. In the long run we 
> envision automatic generation of the models.
>
>
> At Google, we have developed a tool that, given an instruction 
> mnemonic, uses the data in `MCInstrInfo` to generate a code snippet 
> that makes execution as serial (resp. as parallel) as possible so that 
> we can measure the latency (resp. uop decomposition) of the 
> instruction. The code snippet is jitted and executed on the host 
> subtarget. The time taken (resp. resource usage) is measured using 
> hardware performance counters. More details can be found in the 
> ‘implementation’ section of the RFC.
>
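To make the serial vs. parallel distinction concrete, here is a minimal sketch in Python. It is not llvm-exegesis code: the helper names, the Intel-syntax template, the register choices and the immediate are all assumptions made for illustration. The serial form feeds each result into the next instruction, so the total time reflects latency. The parallel form writes to registers that are never read, so the instructions are independent and the port counters reflect resource usage.

```python
from itertools import cycle, islice

# Hypothetical three-operand template (Intel syntax): dst = src * 3.
IMUL_TEMPLATE = "imul {dst}, {src}, 3"


def serial_snippet(template: str, reg: str, count: int) -> list[str]:
    """Dependency chain: every instruction reads the register the
    previous one wrote, so they must execute one after another."""
    return [template.format(dst=reg, src=reg) for _ in range(count)]


def parallel_snippet(template: str, src: str, dsts: list[str], count: int) -> list[str]:
    """Independent stream: destinations rotate over registers that are
    never used as a source, so no instruction waits on another."""
    if src in dsts:
        raise ValueError("source register must not be a destination")
    return [template.format(dst=dst, src=src) for dst in islice(cycle(dsts), count)]


latency_code = serial_snippet(IMUL_TEMPLATE, "ax", 4)
uops_code = parallel_snippet(IMUL_TEMPLATE, "ax", ["bx", "cx", "dx"], 4)

print("\n".join(latency_code))
print("\n".join(uops_code))
```

The real tool derives operand constraints from `MCInstrInfo` rather than from a hand-written template, so treat this only as a picture of the two snippet shapes.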
>
> For people familiar with the work of Agner Fog, this is essentially an 
> automation of the process of building the code snippets using 
> instruction descriptions from LLVM.
>
>
>   Results
>
>  *
>
>     Solving this bug
>     <https://bugs.llvm.org/show_bug.cgi?id=36084> (sandybridge):
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
> ---
> asm_template:
>   name:            latency IMUL16rri8
> cpu_name:        sandybridge
> llvm_triple:     x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>   - { key: latency, value: 4.0115, debug_string: '' }
> error:           ''
> ...
>
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
> ---
> asm_template:
>   name:            uops IMUL16rri8
> cpu_name:        sandybridge
> llvm_triple:     x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>   - { key: '2', value: 0.5232, debug_string: SBPort0 }
>   - { key: '3', value: 1.0039, debug_string: SBPort1 }
>   - { key: '4', value: 0.0024, debug_string: SBPort4 }
>   - { key: '5', value: 0.3693, debug_string: SBPort5 }
> error:           ''
> ...
>
> Running both these commands took ~0.2 seconds including printing.
>
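One way to read the two outputs above, as a small Python sketch. The numbers are copied from the quoted results; the noise threshold and the idea of rounding the summed port pressure to a uop count are assumptions for illustration, not something the RFC specifies.

```python
# Values copied from the two llvm-exegesis outputs quoted above.
measured_latency = 4.0115
port_pressure = {
    "SBPort0": 0.5232,
    "SBPort1": 1.0039,
    "SBPort4": 0.0024,
    "SBPort5": 0.3693,
}

# Assumption: readings below this are counter noise, not a real uop.
NOISE_FLOOR = 0.05

latency_cycles = round(measured_latency)
total_uops = sum(port_pressure.values())
active_ports = sorted(port for port, value in port_pressure.items() if value > NOISE_FLOOR)

print(f"latency ~ {latency_cycles} cycles")
print(f"uops    ~ {total_uops:.2f} (rounds to {round(total_uops)})")
print("ports   :", ", ".join(active_ports))
```

Under these assumptions the measurement suggests a 4-cycle latency and roughly two uops, with port 1 always busy and the remainder spread over ports 0 and 5.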
>
>  *
>
>     List of measured latencies
>     <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing> for
>     sandybridge, haswell and skylake processors including diffs with
>     LLVM latencies. Excerpt:
>
>                sandybridge             haswell                 skylake
>     mnemonic   llvm-exegesis  TD file  llvm-exegesis  TD file  llvm-exegesis  TD file
>     SHR32r1    1.01           1.00     1.00           1.00     1.01           1.00
>     IMUL16rri  4.02           3.00     4.01           3.00     4.01           3.00
>
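The "diffs with LLVM latencies" idea from the excerpt above can be sketched like this in Python. The data is copied from the excerpt; the tolerance value and the function name are assumptions chosen for illustration.

```python
# (measured by llvm-exegesis, value in the TD file), copied from the excerpt.
LATENCIES = {
    "SHR32r1": {
        "sandybridge": (1.01, 1.00),
        "haswell": (1.00, 1.00),
        "skylake": (1.01, 1.00),
    },
    "IMUL16rri": {
        "sandybridge": (4.02, 3.00),
        "haswell": (4.01, 3.00),
        "skylake": (4.01, 3.00),
    },
}

# Assumption: differences under a quarter cycle count as measurement noise.
TOLERANCE = 0.25


def find_mismatches(table, tolerance):
    """Return (mnemonic, cpu, measured - modeled) where the model is off."""
    mismatches = []
    for mnemonic, per_cpu in table.items():
        for cpu, (measured, modeled) in per_cpu.items():
            diff = measured - modeled
            if abs(diff) > tolerance:
                mismatches.append((mnemonic, cpu, round(diff, 2)))
    return mismatches


for mnemonic, cpu, diff in find_mismatches(LATENCIES, TOLERANCE):
    print(f"{mnemonic} on {cpu}: model is off by {diff:+.2f} cycles")
```

With this tolerance only the IMUL16rri rows are flagged, on all three CPUs.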
>
>  *
>
>     Some instructions have different implementations depending on which
>     registers are assigned. This is well known for cases like `xor
>     eax, eax` and `xor eax, ebx`, where the first emits no uops
>     (this happens during register renaming, see Agner Fog’s “Register
>     Allocation and Renaming”, in microarchitecture.pdf
>     <http://www.agner.org/optimize/microarchitecture.pdf>). But we
>     found out that this can go further. For example, SHLD64rri8 takes
>     one cycle and runs on P06 in the `shld rax, rax, 0x1` case, but
>     takes 3 cycles and runs on P1 in the `shld rbx, rax, 0x1` case. To
>     the best of our knowledge, this has not yet been described.
>
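To look for this kind of register-dependent behavior systematically, one could enumerate operand assignments and tag each by whether the operands alias. A minimal Python sketch follows; the function name and the two-register pool are assumptions for illustration, and the measurement step itself is left out.

```python
from itertools import product


def operand_assignments(registers):
    """Yield (dst, src, aliased) for every ordered pair of registers."""
    for dst, src in product(registers, repeat=2):
        yield dst, src, dst == src


# Hypothetical register pool; a real run would cover the whole register class.
variants = [
    (f"shld {dst}, {src}, 0x1", "same register" if aliased else "distinct registers")
    for dst, src, aliased in operand_assignments(["rax", "rbx"])
]

for asm, kind in variants:
    print(f"{asm:<22} {kind}")
```

Each variant would then be benchmarked separately, and the results grouped by the aliasing tag to see whether latency or port usage changes.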
>
>   Future Work
>
>  *
>
>     [easy] Fix Intel Scheduling Models.
>
>  *
>
>     [easy] Extend to memory operands.
>
>  *
>
>     [easy] Make the tool work reliably for x87 instructions.
>
>  *
>
>     [medium] A tool that automatically creates patches to TD files.
>
>  *
>
>     [medium] Measure the effect of immediate/register values: Some
>     instructions have performance characteristics that depend on the
>     values they operate on. We should explore the value space (0, 1,
>     ~1, 2^{8,16,32,64}, inf, nan, denorm...).
>
>  *
>
>     [medium] Measure the effect of changing registers on instruction
>     implementation (see results section
>     <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n> above).
>     Model this in LLVM TD schema.
>
>  *
>
>     [hard] Make the tool work for instructions that have side effects
>     (e.g. PUSH/POP, JMP, ...). This might involve extending the TD
>     schema with information on how to set up measurements for specific
>     instructions.
>
>  *
>
>     [??] Make the tool work for other CPUs. This mainly depends on the
>     presence of performance counters.
>
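For the value-space item in the list above, the probe values could be generated along these lines. This is only a sketch: the function names are made up, 2^64 is skipped for 64-bit operands because it does not fit, and the choice of which denormal to use is an assumption.

```python
import math
import sys


def integer_probes(width: int = 64) -> list[int]:
    """0, 1, ~1 and the powers of two from the list that fit in `width` bits."""
    mask = (1 << width) - 1
    probes = [0, 1, ~1 & mask]
    probes += [1 << bits for bits in (8, 16, 32, 64) if bits < width]
    return probes


def float_probes() -> list[float]:
    """Zero, one, infinity, NaN and one denormal (half the smallest normal)."""
    denormal = sys.float_info.min / 2
    return [0.0, 1.0, math.inf, math.nan, denormal]


for value in integer_probes():
    print(hex(value))
for value in float_probes():
    print(value)
```

Each probe would be loaded into the instruction's input register or immediate before running the usual latency and uops benchmarks.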
>
>   Open Questions
>
> We depend on libpfm <http://perfmon2.sourceforge.net/docs_v4.html>. 
> How do we handle the dependency?
> --
> Guillaume Chatelet (gchatelet at google.com 
> <mailto:gchatelet at google.com>), Clement Courbet (courbet at google.com 
> <mailto:courbet at google.com>) for the Google Compiler Research Team
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
