[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Thu Mar 15 08:56:09 PDT 2018


On 03/15/2018 10:49 AM, Clement Courbet wrote:
>
>
> On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
>     On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
>>     [You can find an easier to read and more complete version of this
>>     RFC here
>>     <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>>
>>     Knowing instruction scheduling properties (latency, uops) is the
>>     basis for all scheduling work done by LLVM.
>>
>>
>>     Unfortunately, vendors usually release only partial (and
>>     sometimes incorrect) information. Updating the information is
>>     painful and requires careful guesswork and analysis. As a result,
>>     scheduling information is incomplete for most X86 models (this
>>     bug <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of
>>     these issues). The goal of the tool presented here is to
>>     automatically (in)validate the TableGen scheduling models. In the
>>     long run we envision automatic generation of the models.
>>
>>
>>     At Google, we have developed a tool that, given an instruction
>>     mnemonic, uses the data in `MCInstrInfo` to generate a code
>>     snippet that makes execution as serial (resp. as parallel) as
>>     possible so that we can measure the latency (resp. uop
>>     decomposition) of the instruction. The code snippet is jitted and
>>     executed on the host subtarget. The time taken (resp. resource
>>     usage) is measured using hardware performance counters. More
>>     details can be found in the ‘implementation’ section of the RFC.
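A minimal sketch of the serial/parallel idea described above, in Python and purely illustrative. It is not how llvm-exegesis builds its snippets (the tool works from `MCInstrInfo`); the helper names `serial_snippet` and `parallel_snippet` and the two-operand `add reg, reg` form are assumptions made for the example. The serial variant reuses one register so each instruction depends on the previous one, and the parallel variant rotates through several registers so neighbouring instructions are independent.

```python
from itertools import cycle, islice


def serial_snippet(mnemonic, reg, repetitions):
    """Build a dependency chain: every instruction reads the register
    written by the previous one, so executions cannot overlap."""
    return [f"{mnemonic} {reg}, {reg}"] * repetitions


def parallel_snippet(mnemonic, regs, repetitions):
    """Rotate through distinct registers so that neighbouring
    instructions are independent and may execute in parallel."""
    return [f"{mnemonic} {r}, {r}" for r in islice(cycle(regs), repetitions)]


if __name__ == "__main__":
    # Hypothetical usage: four copies of each variant.
    print("\n".join(serial_snippet("add", "rax", 4)))
    print("\n".join(parallel_snippet("add", ["rax", "rbx", "rcx", "rdx"], 4)))
```

Timing the serial variant and dividing by the repetition count would approximate latency, while the parallel variant is the kind of input one would pair with per-port counters.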
>>
>>
>>     For people familiar with the work of Agner Fog, this is
>>     essentially an automation of the process of building the code
>>     snippets using instruction descriptions from LLVM.
>>
>>
>>       Results
>>
>>      *
>>
>>         Solving this bug
>>         <https://bugs.llvm.org/show_bug.cgi?id=36084> (sandybridge):
>>
>>     > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
>>     ---
>>     asm_template:
>>      name:            latency IMUL16rri8
>>     cpu_name:        sandybridge
>>     llvm_triple:     x86_64-grtev4-linux-gnu
>>     num_repetitions: 10000
>>     measurements:
>>      - { key: latency, value: 4.0115, debug_string: '' }
>>     error:           ''
>>     ...
>>
>>     > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
>>     ---
>>     asm_template:
>>      name:            uops IMUL16rri8
>>     cpu_name:        sandybridge
>>     llvm_triple:     x86_64-grtev4-linux-gnu
>>     num_repetitions: 10000
>>     measurements:
>>      - { key: '2', value: 0.5232, debug_string: SBPort0 }
>>      - { key: '3', value: 1.0039, debug_string: SBPort1 }
>>      - { key: '4', value: 0.0024, debug_string: SBPort4 }
>>      - { key: '5', value: 0.3693, debug_string: SBPort5 }
>>     error:           ''
>>     ...
>>
>>     Running both these commands took ~0.2 seconds including printing.
>>
>>
>>      *
>>
>>         List of measured latencies
>>         <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing> for
>>         sandybridge, haswell and skylake processors including diffs
>>         with LLVM latencies. Excerpt:
>>
>>                   | sandybridge             | haswell                 | skylake
>>         mnemonic  | llvm-exegesis | TD file | llvm-exegesis | TD file | llvm-exegesis | TD file
>>         ----------+---------------+---------+---------------+---------+---------------+--------
>>         SHR32r1   | 1.01          | 1.00    | 1.00          | 1.00    | 1.01          | 1.00
>>         IMUL16rri | 4.02          | 3.00    | 4.01          | 3.00    | 4.01          | 3.00
>>
>>
>>      *
>>
>>         Some instructions have different implementations depending on
>>         which registers are assigned. This is well known for cases
>>         like `xor eax, eax` and `xor eax, ebx`, the first of which
>>         emits no uops (this happens during register renaming, see
>>         Agner Fog’s “Register Allocation and Renaming”, in
>>         microarchitecture.pdf
>>         <http://www.agner.org/optimize/microarchitecture.pdf>). But
>>         we found out that this can go further. For example,
>>         SHLD64rri8 takes one cycle and runs on P06 in the `shld rax,
>>         rax, 0x1` case, but takes 3 cycles and runs on P1 in the `shld
>>         rbx, rax, 0x1` case. To the best of our knowledge, this has
>>         not yet been described.
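As a rough illustration of how the results above could be checked mechanically, here is a small Python sketch. It parses the `measurements` lines exactly as quoted in the sandybridge output and compares the excerpted latencies against the TD-file values. The regular expression, the function names and the 0.5-cycle tolerance are assumptions for this example, not part of llvm-exegesis.

```python
import re

# Measurement lines copied from the sandybridge `uops` output quoted above.
UOPS_OUTPUT = """\
measurements:
 - { key: '2', value: 0.5232, debug_string: SBPort0 }
 - { key: '3', value: 1.0039, debug_string: SBPort1 }
 - { key: '4', value: 0.0024, debug_string: SBPort4 }
 - { key: '5', value: 0.3693, debug_string: SBPort5 }
"""

# (mnemonic, cpu, measured latency, TD-file latency) from the excerpt above.
LATENCY_ROWS = [
    ("SHR32r1", "sandybridge", 1.01, 1.00),
    ("SHR32r1", "haswell", 1.00, 1.00),
    ("SHR32r1", "skylake", 1.01, 1.00),
    ("IMUL16rri", "sandybridge", 4.02, 3.00),
    ("IMUL16rri", "haswell", 4.01, 3.00),
    ("IMUL16rri", "skylake", 4.01, 3.00),
]

MEASUREMENT_RE = re.compile(
    r"-\s*\{\s*key:\s*'?(?P<key>[^',]+)'?\s*,"
    r"\s*value:\s*(?P<value>[-+0-9.eE]+)\s*,"
    r"\s*debug_string:\s*(?P<debug>[^}]*?)\s*\}"
)


def parse_measurements(text):
    """Return (key, value, debug_string) tuples for each measurement line."""
    return [
        (m.group("key").strip(), float(m.group("value")), m.group("debug").strip("'"))
        for m in MEASUREMENT_RE.finditer(text)
    ]


def latency_mismatches(rows, tolerance=0.5):
    """Keep rows whose measured latency differs from the TD value by more
    than `tolerance` cycles (the threshold is an arbitrary assumption)."""
    return [row for row in rows if abs(row[2] - row[3]) > tolerance]


if __name__ == "__main__":
    for key, value, port in parse_measurements(UOPS_OUTPUT):
        print(f"{port}: {value:.2f} uops per instruction")
    for mnemonic, cpu, measured, td in latency_mismatches(LATENCY_ROWS):
        print(f"{mnemonic} on {cpu}: measured {measured}, TD file says {td}")
```

With the excerpted data, only the IMUL16rri rows exceed the tolerance, which matches the discrepancy reported in the bug above.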
>>
>
>     This is great!
>
>>
>>       Future Work
>>
>>      *
>>
>>         [easy] Fix Intel Scheduling Models.
>>
>>      *
>>
>>         [easy] Extend to memory operands.
>>
>>      *
>>
>>         [easy] Make the tool work reliably for x87 instructions.
>>
>>      *
>>
>>         [medium] A tool that automatically creates patches to TD files.
>>
>>      *
>>
>>         [medium] Measure the effect of immediate/register values:
>>         Some instructions have performance characteristics that
>>         depend on the values they operate on. We should explore the
>>         value space (0, 1, ~1, 2^{8,16,32,64}, inf, nan, denorm...).
>>
>>      *
>>
>>         [medium] Measure the effect of changing registers on
>>         instruction implementation (see results section
>>         <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n> above).
>>         Model this in the LLVM TD schema.
>>
>>      *
>>
>>         [hard] Make the tool work for instructions that have side
>>         effects (e.g. PUSH/POP, JMP, ...). This might involve
>>         extending the TD schema with information on how to set up
>>         measurements for specific instructions.
>>
>>      *
>>
>>         [??] Make the tool work for other CPUs. This mainly depends
>>         on the presence of performance counters.
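The value space mentioned in the immediate/register item above (0, 1, ~1, 2^{8,16,32,64}, inf, nan, denorm) could be enumerated along these lines. This is only a sketch in Python: the function names are hypothetical, and the choices to mask integers to the operand width, to drop powers of two that do not fit, and to use half of the smallest normal double as the denormal are assumptions, not something the RFC specifies.

```python
import math
import sys


def integer_probe_values(bits):
    """Candidate integer operands for a `bits`-wide register or immediate.
    ~1 is masked to the operand width; powers of two that do not fit
    in `bits` bits are skipped."""
    mask = (1 << bits) - 1
    values = [0, 1, ~1 & mask]
    values += [1 << p for p in (8, 16, 32, 64) if p < bits]
    return values


def float_probe_values():
    """Candidate double-precision operands, including the special values."""
    denorm = sys.float_info.min / 2  # below the smallest normal double
    powers = [2.0 ** p for p in (8, 16, 32, 64)]
    return [0.0, 1.0] + powers + [math.inf, math.nan, denorm]


if __name__ == "__main__":
    print([hex(v) for v in integer_probe_values(64)])
    print(float_probe_values())
```

Each candidate would then be fed to the same benchmark snippet to see whether the measured latency or port usage changes with the operand value.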
>>
>>
>>       Open Questions
>>
>>     We depend on libpfm
>>     <http://perfmon2.sourceforge.net/docs_v4.html>. How do we handle
>>     the dependency?
>
>     Are there options that you have in mind? It's an external
>     MIT-licensed dependency. Wouldn't CMake just detect it when it's
>     available?
>
>
> That's what we've done for now (see code here
> <https://reviews.llvm.org/differential/changeset/?ref=1002469&whitespace=ignore-most>).
> We're not sure what the policy is wrt external deps. Right now, if the
> tool is enabled and libpfm is not on the system, we die with an error
> message. The other option would be to disable the tool in that case
> (I'm not sure how to do that). Opinions?

Sounds good (we can discuss this further, if necessary, in the code review).

 -Hal

>  
>
>
>      -Hal
>
>>     --
>>     Guillaume Chatelet (gchatelet at google.com), Clement Courbet
>>     (courbet at google.com) for the Google Compiler Research Team
>>
>>
>>
>>     _______________________________________________
>>     LLVM Developers mailing list
>>     llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>     <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>
>
>
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
