[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 15 08:41:12 PDT 2018
On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
> [You can find an easier to read and more complete version of this RFC
> here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing instruction scheduling properties (latency, uops) is the basis
> for all scheduling work done by LLVM.
>
>
> Unfortunately, vendors usually release only partial (and sometimes
> incorrect) information. Updating the information is painful and
> requires careful guesswork and analysis. As a result, scheduling
> information is incomplete for most X86 models (this bug
> <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of these
> issues). The goal of the tool presented here is to automatically
> (in)validate the TableGen scheduling models. In the long run we
> envision automatic generation of the models.
>
>
> At Google, we have developed a tool that, given an instruction
> mnemonic, uses the data in `MCInstrInfo` to generate a code snippet
> that makes execution as serial (resp. as parallel) as possible so that
> we can measure the latency (resp. uop decomposition) of the
> instruction. The code snippet is jitted and executed on the host
> subtarget. The time taken (resp. resource usage) is measured using
> hardware performance counters. More details can be found in the
> ‘implementation’ section of the RFC.
>
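The serial/parallel idea described above can be sketched roughly as follows. This is an illustration only, not the tool's implementation: the helper names `serial_snippet` and `parallel_snippet`, the textual instruction template and the scratch register list are all hypothetical, whereas the tool described in the RFC derives operands from `MCInstrInfo`.

```python
# Hypothetical sketch: the helper names, the string template and the
# register list are assumptions made for illustration only.
SCRATCH_REGS = ["rax", "rbx", "rcx", "rdx"]


def serial_snippet(template: str, n: int) -> list[str]:
    """Every instruction reads and writes the same register, so each one
    must wait for the previous result: run time ~ n * latency."""
    reg = SCRATCH_REGS[0]
    return [template.format(dst=reg, src=reg) for _ in range(n)]


def parallel_snippet(template: str, n: int) -> list[str]:
    """Rotate through the scratch registers so that neighbouring
    instructions do not depend on each other and can issue together."""
    regs = [SCRATCH_REGS[i % len(SCRATCH_REGS)] for i in range(n)]
    return [template.format(dst=r, src=r) for r in regs]


if __name__ == "__main__":
    template = "imul {dst}, {src}, 3"
    print("\n".join(serial_snippet(template, 4)))
    print("\n".join(parallel_snippet(template, 4)))
```

In the serial variant the cycle count divided by the number of instructions approximates the latency. In the parallel variant the per-port counters are the interesting output.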
>
> For people familiar with the work of Agner Fog, this is essentially an
> automation of the process of building the code snippets using
> instruction descriptions from LLVM.
>
>
> Results
>
> * Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084>
>   (sandybridge):
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
> ---
> asm_template:
>   name: latency IMUL16rri8
> cpu_name: sandybridge
> llvm_triple: x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>   - { key: latency, value: 4.0115, debug_string: '' }
> error: ''
> ...
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
> ---
> asm_template:
>   name: uops IMUL16rri8
> cpu_name: sandybridge
> llvm_triple: x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>   - { key: '2', value: 0.5232, debug_string: SBPort0 }
>   - { key: '3', value: 1.0039, debug_string: SBPort1 }
>   - { key: '4', value: 0.0024, debug_string: SBPort4 }
>   - { key: '5', value: 0.3693, debug_string: SBPort5 }
> error: ''
> ...
>
> Running both of these commands took about 0.2 seconds, including printing.
>
>
> * List of measured latencies
>   <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing>
>   for sandybridge, haswell and skylake processors, including diffs
>   with LLVM latencies. Excerpt:
>
>               | sandybridge             | haswell                 | skylake
>   mnemonic    | llvm-exegesis | TD file | llvm-exegesis | TD file | llvm-exegesis | TD file
>   ------------+---------------+---------+---------------+---------+---------------+--------
>   SHR32r1     | 1.01          | 1.00    | 1.00          | 1.00    | 1.01          | 1.00
>   IMUL16rri   | 4.02          | 3.00    | 4.01          | 3.00    | 4.01          | 3.00
>
>
> * Some instructions have different implementations depending on which
>   registers are assigned. This is well known for cases like
>   `xor eax, eax` versus `xor eax, ebx`: the first emits no uops (it
>   is handled during register renaming, see Agner Fog’s “Register
>   Allocation and Renaming”, in microarchitecture.pdf
>   <http://www.agner.org/optimize/microarchitecture.pdf>). But we
>   found out that this can go further. For example, SHLD64rri8 takes
>   one cycle and runs on P06 in the `shld rax, rax, 0x1` case, but
>   takes 3 cycles and runs on P1 in the `shld rbx, rax, 0x1` case. To
>   the best of our knowledge, this has not yet been described.
>
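As a small worked example on the uops output quoted above, the per-port values can be summed to estimate the total number of uops per instruction. This is a sketch under assumptions: reading each `value` as the average number of uops dispatched on that port is my interpretation, and the regex-based parsing and the name `port_pressure` are hypothetical conveniences, not part of the tool.

```python
import re

# Measurement lines copied from the quoted `-benchmark-mode uops` run.
UOPS_OUTPUT = """\
- { key: '2', value: 0.5232, debug_string: SBPort0 }
- { key: '3', value: 1.0039, debug_string: SBPort1 }
- { key: '4', value: 0.0024, debug_string: SBPort4 }
- { key: '5', value: 0.3693, debug_string: SBPort5 }
"""

# Capture the numeric value and the port name from each measurement line.
MEASUREMENT = re.compile(r"value:\s*([0-9.]+),\s*debug_string:\s*(\w+)")


def port_pressure(text: str) -> dict[str, float]:
    """Map each port name to its measured value (assumed: uops per instruction)."""
    return {port: float(value) for value, port in MEASUREMENT.findall(text)}


pressure = port_pressure(UOPS_OUTPUT)
total_uops = sum(pressure.values())       # roughly 1.9 for this run
busiest_port = max(pressure, key=pressure.get)

if __name__ == "__main__":
    print(f"total uops ~ {total_uops:.2f}, busiest port: {busiest_port}")
```

Under that reading, the run suggests about two uops per IMUL16rri8, one of which always goes to SBPort1 while the other is split between SBPort0 and SBPort5.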
This is great!
>
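The "(in)validate the scheduling models" goal can be illustrated with the latency excerpt quoted above: compare each measured latency with the TD file value and flag the ones that disagree. This is only a sketch. The function name `mismatches` and the 0.5 cycle tolerance are assumptions chosen for illustration; the numbers are copied from the excerpt.

```python
# (mnemonic, cpu) -> (llvm-exegesis latency, TD file latency),
# copied from the excerpt in the RFC.
EXCERPT = {
    ("SHR32r1", "sandybridge"): (1.01, 1.00),
    ("SHR32r1", "haswell"): (1.00, 1.00),
    ("SHR32r1", "skylake"): (1.01, 1.00),
    ("IMUL16rri", "sandybridge"): (4.02, 3.00),
    ("IMUL16rri", "haswell"): (4.01, 3.00),
    ("IMUL16rri", "skylake"): (4.01, 3.00),
}


def mismatches(rows, tolerance=0.5):
    """Return the (mnemonic, cpu) keys whose measured latency differs from
    the TD file value by more than `tolerance` cycles (assumed threshold)."""
    return sorted(
        key
        for key, (measured, td_file) in rows.items()
        if abs(measured - td_file) > tolerance
    )


if __name__ == "__main__":
    for mnemonic, cpu in mismatches(EXCERPT):
        measured, td_file = EXCERPT[(mnemonic, cpu)]
        print(f"{mnemonic} on {cpu}: measured {measured}, TD file {td_file}")
```

With these numbers, only the three IMUL16rri entries are flagged. The SHR32r1 differences of 0.01 cycles fall within measurement noise.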
> Future Work
>
> * [easy] Fix Intel Scheduling Models.
>
> * [easy] Extend to memory operands.
>
> * [easy] Make the tool work reliably for x87 instructions.
>
> * [medium] A tool that automatically creates patches to TD files.
>
> * [medium] Measure the effect of immediate/register values: some
>   instructions have performance characteristics that depend on the
>   values they operate on. We should explore the value space (0, 1,
>   ~1, 2^{8,16,32,64}, inf, nan, denorm...).
>
> * [medium] Measure the effect of changing registers on instruction
>   implementation (see results section
>   <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n>
>   above). Model this in LLVM TD schema.
>
> * [hard] Make the tool work for instructions that have side effects
>   (e.g. PUSH/POP, JMP, ...). This might involve extending the TD
>   schema with information on how to set up measurements for specific
>   instructions.
>
> * [??] Make the tool work for other CPUs. This mainly depends on the
>   presence of performance counters.
>
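The value space mentioned in the Future Work list (0, 1, ~1, 2^{8,16,32,64}, inf, nan, denorm) could be enumerated along these lines. This is a sketch under assumptions: reading 2^{8,16,32,64} as powers of two, masking integers to the operand width, and the function names are my own choices, not something the RFC specifies.

```python
import math
import sys


def integer_candidates(width: int = 64) -> list[int]:
    """Interesting integer operands, truncated to `width` bits.

    Values that collapse to an earlier candidate after truncation
    (e.g. 2**64 in a 64-bit register becomes 0) are dropped.
    """
    mask = (1 << width) - 1
    raw = [0, 1, ~1] + [1 << k for k in (8, 16, 32, 64)]
    candidates: list[int] = []
    for value in raw:
        value &= mask
        if value not in candidates:
            candidates.append(value)
    return candidates


def float_candidates() -> list[float]:
    """Interesting double-precision operands, including a denormal."""
    smallest_denormal = math.ldexp(1.0, -1074)  # smallest positive subnormal double
    return [0.0, 1.0, math.inf, math.nan, smallest_denormal]


if __name__ == "__main__":
    print([hex(v) for v in integer_candidates(64)])
    print(float_candidates())
```

Each candidate would then be loaded into the instruction's source operands before running the same latency or uops measurement.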
>
> Open Questions
>
> We depend on libpfm <http://perfmon2.sourceforge.net/docs_v4.html>.
> How do we handle the dependency?
Are there options that you have in mind? It's an external MIT-licensed
dependency. Wouldn't CMake just detect it when it's available?
-Hal
> --
> Guillaume Chatelet (gchatelet at google.com
> <mailto:gchatelet at google.com>), Clement Courbet (courbet at google.com
> <mailto:courbet at google.com>) for the Google Compiler Research Team
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory