[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

Thu Mar 15 08:41:12 PDT 2018

On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
> [You can find an easier to read and more complete version of this RFC
> here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing instruction scheduling properties (latency, uops) is the basis
> for all scheduling work done by LLVM.
>
>
> Unfortunately, vendors usually release only partial (and sometimes
> incorrect) information.  Updating the information is painful and
> requires careful guesswork and analysis. As a result, scheduling
> information is incomplete for most X86 models (this bug
> <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of these
> issues). The goal of the tool presented here is to automatically
> (in)validate the TableGen scheduling models. In the long run we
> envision automatic generation of the models.
>
>
> At Google, we have developed a tool that, given an instruction
> mnemonic, uses the data in `MCInstrInfo` to generate a code snippet
> that makes execution as serial (resp. as parallel) as possible so that
> we can measure the latency (resp. uop decomposition) of the
> instruction. The code snippet is jitted and executed on the host
> subtarget. The time taken (resp. resource usage) is measured using
> hardware performance counters. More details can be found in the
> ‘implementation’ section of the RFC.
>
>
> For people familiar with the work of Agner Fog, this is essentially an
> automation of the process of building the code snippets using
> instruction descriptions from LLVM.
>
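To make the measurement idea concrete, here is a minimal Python sketch of the arithmetic only, not the tool's actual implementation. It assumes the snippet repeats the instruction `num_repetitions` times and that raw counter values are read around the run; the function names and the raw counter readings below are hypothetical, chosen only so the results resemble the figures in the Results section.

```python
def per_instruction_latency(cycles_elapsed, num_repetitions):
    """Estimate latency from a serial chain: each instruction waits for the
    previous one, so total cycles divided by repetitions approximates latency."""
    if num_repetitions <= 0:
        raise ValueError("num_repetitions must be positive")
    return cycles_elapsed / num_repetitions


def per_port_uops(port_counts, num_repetitions):
    """Estimate uops issued per instruction on each port from a parallel run."""
    if num_repetitions <= 0:
        raise ValueError("num_repetitions must be positive")
    return {port: count / num_repetitions for port, count in port_counts.items()}


# Hypothetical raw counter readings (assumptions, not measured data).
latency = per_instruction_latency(cycles_elapsed=40_115, num_repetitions=10_000)
pressure = per_port_uops({"SBPort0": 5_232, "SBPort1": 10_039}, 10_000)

print(f"latency ~ {latency:.4f} cycles")
for port, uops in pressure.items():
    print(f"{port}: {uops:.4f} uops/instruction")
```

Dividing by the repetition count is what lets a long run average out measurement overhead; the real tool may normalize differently.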
>
>   Results
>
>  * Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084>
>    (sandybridge):
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
>
> ---
> asm_template:
>  name:            latency IMUL16rri8
> cpu_name:        sandybridge
> llvm_triple:     x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>  - { key: latency, value: 4.0115, debug_string: '' }
> error:           ''
> ...
>
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
>
> ---
> asm_template:
>  name:            uops IMUL16rri8
> cpu_name:        sandybridge
> llvm_triple:     x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>  - { key: '2', value: 0.5232, debug_string: SBPort0 }
>  - { key: '3', value: 1.0039, debug_string: SBPort1 }
>  - { key: '4', value: 0.0024, debug_string: SBPort4 }
>  - { key: '5', value: 0.3693, debug_string: SBPort5 }
> error:           ''
> ...
>
> Running both of these commands took ~0.2 seconds, including printing.
>
>
>  * List of measured latencies
>    <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing>
>    for sandybridge, haswell and skylake processors, including diffs
>    with LLVM latencies. Excerpt:
>
>              | sandybridge            | haswell                | skylake
>    mnemonic  | llvm-exegesis  TD file | llvm-exegesis  TD file | llvm-exegesis  TD file
>    ----------+------------------------+------------------------+-----------------------
>    SHR32r1   |          1.01     1.00 |          1.00     1.00 |          1.01     1.00
>    IMUL16rri |          4.02     3.00 |          4.01     3.00 |          4.01     3.00
>
>
>  * Some instructions have different implementations depending on which
>    registers are assigned. This is well known for cases like `xor
>    eax, eax` and `xor eax, ebx`, where the first emits no uops
>    (this happens during register renaming, see Agner Fog’s “Register
>    Allocation and Renaming”, in microarchitecture.pdf
>    <http://www.agner.org/optimize/microarchitecture.pdf>). But we
>    found that this can go further. For example, SHLD64rri8 takes
>    one cycle and runs on P06 in the `shld rax, rax, 0x1` case, but
>    takes 3 cycles and runs on P1 in the `shld rbx, rax, 0x1` case. To
>    the best of our knowledge, this has not yet been described.
>
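As an illustration of how the excerpt above could be used to (in)validate a model, here is a small Python sketch that flags entries where the measured latency and the TD file value disagree. The numbers are copied from the excerpt table; the `find_mismatches` helper and the 0.5-cycle tolerance are hypothetical choices for this sketch, not part of llvm-exegesis.

```python
# Latencies in cycles from the excerpt: cpu -> (llvm-exegesis, TD file).
EXCERPT = {
    "SHR32r1": {
        "sandybridge": (1.01, 1.00),
        "haswell": (1.00, 1.00),
        "skylake": (1.01, 1.00),
    },
    "IMUL16rri": {
        "sandybridge": (4.02, 3.00),
        "haswell": (4.01, 3.00),
        "skylake": (4.01, 3.00),
    },
}


def find_mismatches(table, tolerance=0.5):
    """Return (mnemonic, cpu, measured, modeled) where the values differ
    by more than `tolerance` cycles (the tolerance is an assumed threshold)."""
    mismatches = []
    for mnemonic, per_cpu in table.items():
        for cpu, (measured, modeled) in per_cpu.items():
            if abs(measured - modeled) > tolerance:
                mismatches.append((mnemonic, cpu, measured, modeled))
    return mismatches


mismatches = find_mismatches(EXCERPT)
for mnemonic, cpu, measured, modeled in mismatches:
    print(f"{mnemonic} on {cpu}: measured {measured:.2f}, TD file {modeled:.2f}")
```

With the excerpt data, only the IMUL16rri rows are flagged; SHR32r1 stays within the tolerance on all three CPUs.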

This is great!

>
>   Future Work
>
>  * [easy] Fix Intel Scheduling Models.
>
>  * [easy] Extend to memory operands.
>
>  * [easy] Make the tool work reliably for x87 instructions.
>
>  * [medium] A tool that automatically creates patches to TD files.
>
>  * [medium] Measure the effect of immediate/register values: Some
>    instructions have performance characteristics that depend on the
>    values they operate on. We should explore the value space (0, 1,
>    ~1, 2^{8,16,32,64}, inf, nan, denorm...).
>
>  * [medium] Measure the effect of changing registers on instruction
>    implementation (see results section
>    <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n>
>    above). Model this in LLVM TD schema.
>
>  * [hard] Make the tool work for instructions that have side effects
>    (e.g. PUSH/POP, JMP, ...). This might involve extending the TD
>    schema with information on how to set up measurements for specific
>    instructions.
>
>  * [??] Make the tool work for other CPUs. This mainly depends on the
>    presence of performance counters.
>
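For the value-space item in the list above, a minimal Python sketch of how the candidate operand values might be enumerated. It reads "2^{8,16,32,64}" as powers of two, which is an interpretation; the function names are hypothetical, and a real harness would still need to truncate or encode each value for the operand width being tested.

```python
import math
import sys


def integer_candidates():
    """Integer operand values from the list: 0, 1, ~1 and 2^{8,16,32,64}."""
    # ~1 is -2 as a Python int; a real harness would mask it to the operand width.
    return [0, 1, ~1] + [2 ** bits for bits in (8, 16, 32, 64)]


def float_candidates():
    """Floating-point special values from the list: inf, nan and a denormal."""
    # Half of the smallest normal double is a subnormal (denormal) value.
    denormal = sys.float_info.min / 2
    return [math.inf, math.nan, denormal]


for value in integer_candidates():
    print(f"int operand:   {value:#x}")
for value in float_candidates():
    print(f"float operand: {value!r}")
```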
>
>   Open Questions
>
> We depend on libpfm <http://perfmon2.sourceforge.net/docs_v4.html>.
> How do we handle the dependency?

Are there options that you have in mind? It's an external MIT-licensed
dependency. Wouldn't CMake just detect it when it's available?

 -Hal

> --
> Guillaume Chatelet (gchatelet at google.com
> <mailto:gchatelet at google.com>), Clement Courbet (courbet at google.com
> <mailto:courbet at google.com>) for the Google Compiler Research Team
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
