[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 15 08:56:09 PDT 2018
On 03/15/2018 10:49 AM, Clement Courbet wrote:
>
>
> On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>
>
> On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev wrote:
>> [You can find an easier to read and more complete version of this
>> RFC here
>> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>>
>> Knowing instruction scheduling properties (latency, uops) is the
>> basis for all scheduling work done by LLVM.
>>
>>
>> Unfortunately, vendors usually release only partial (and
>> sometimes incorrect) information. Updating the information is
>> painful and requires careful guesswork and analysis. As a result,
>> scheduling information is incomplete for most X86 models (this
>> bug <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of
>> these issues). The goal of the tool presented here is to
>> automatically (in)validate the TableGen scheduling models. In the
>> long run we envision automatic generation of the models.
>>
>>
>> At Google, we have developed a tool that, given an instruction
>> mnemonic, uses the data in `MCInstrInfo` to generate a code
>> snippet that makes execution as serial (resp. as parallel) as
>> possible so that we can measure the latency (resp. uop
>> decomposition) of the instruction. The code snippet is jitted and
>> executed on the host subtarget. The time taken (resp. resource
>> usage) is measured using hardware performance counters. More
>> details can be found in the ‘implementation’ section of the RFC.
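
As a rough illustration of the serial vs. parallel idea described above, here is a text-only sketch in Python. It is not how the tool is implemented (the RFC says the snippets are built from `MCInstrInfo` data and jitted); the function names, the hard-coded register pool and the two-operand assembly format are all assumptions made for the example.

```python
from itertools import cycle, islice

# Hypothetical register pool; the real tool derives usable registers from
# LLVM's instruction descriptions rather than from a hard-coded list.
REGS = ["rax", "rbx", "rcx", "rdx", "rsi", "rdi"]


def serial_snippet(mnemonic, n, reg="rax"):
    """Every instruction reads and writes the same register, so each one
    has to wait for the previous result: the chain exposes latency."""
    return [f"{mnemonic} {reg}, {reg}" for _ in range(n)]


def parallel_snippet(mnemonic, n, regs=REGS):
    """Neighbouring instructions use different registers, so they do not
    depend on each other and can execute in parallel."""
    return [f"{mnemonic} {r}, {r}" for r in islice(cycle(regs), n)]


def per_instruction(counter_value, num_repetitions):
    """Normalize a raw performance-counter reading by the repetition count."""
    return counter_value / num_repetitions
```

With this normalization, a hypothetical reading of 40115 cycles over 10000 repetitions would correspond to about 4.01 cycles per instruction.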
>>
>>
>> For people familiar with the work of Agner Fog, this is
>> essentially an automation of the process of building the code
>> snippets using instruction descriptions from LLVM.
>>
>>
>> Results
>>
>> * Solving this bug
>>   <https://bugs.llvm.org/show_bug.cgi?id=36084> (sandybridge):
>>
>>   > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
>>   ---
>>   asm_template:
>>     name: latency IMUL16rri8
>>   cpu_name: sandybridge
>>   llvm_triple: x86_64-grtev4-linux-gnu
>>   num_repetitions: 10000
>>   measurements:
>>     - { key: latency, value: 4.0115, debug_string: '' }
>>   error: ''
>>   ...
>>
>>   > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
>>   ---
>>   asm_template:
>>     name: uops IMUL16rri8
>>   cpu_name: sandybridge
>>   llvm_triple: x86_64-grtev4-linux-gnu
>>   num_repetitions: 10000
>>   measurements:
>>     - { key: '2', value: 0.5232, debug_string: SBPort0 }
>>     - { key: '3', value: 1.0039, debug_string: SBPort1 }
>>     - { key: '4', value: 0.0024, debug_string: SBPort4 }
>>     - { key: '5', value: 0.3693, debug_string: SBPort5 }
>>   error: ''
>>   ...
>>
>>   Running both these commands took ~0.2 seconds including printing.
>>
>>
>> * List of measured latencies
>>   <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing>
>>   for sandybridge, haswell and skylake processors including diffs
>>   with LLVM latencies. Excerpt:
>>
>>             | sandybridge             | haswell                 | skylake
>>   mnemonic  | llvm-exegesis | TD file | llvm-exegesis | TD file | llvm-exegesis | TD file
>>   ----------+---------------+---------+---------------+---------+---------------+--------
>>   SHR32r1   | 1.01          | 1.00    | 1.00          | 1.00    | 1.01          | 1.00
>>   IMUL16rri | 4.02          | 3.00    | 4.01          | 3.00    | 4.01          | 3.00
>>
>>
>> * Some instructions have different implementations depending on
>>   which registers are assigned. This is well known for cases
>>   like `xor eax, eax` and `xor eax, ebx`: the first emits no uops
>>   (it is handled during register renaming, see Agner Fog’s
>>   “Register Allocation and Renaming”, in microarchitecture.pdf
>>   <http://www.agner.org/optimize/microarchitecture.pdf>). But
>>   we found out that this can go further. For example,
>>   SHLD64rri8 takes one cycle and runs on P06 in the `shld rax,
>>   rax, 0x1` case, but takes 3 cycles and runs on P1 in the `shld
>>   rbx, rax, 0x1` case. To the best of our knowledge, this has
>>   not yet been described.
>>
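
The numbers quoted in the results above lend themselves to a small post-processing step. The sketch below is an illustration only and not part of the proposed tool: it sums the per-port values from the uops run and flags table rows where the measured latency and the TD file disagree. The 0.1-cycle tolerance and all function names are assumptions chosen for the example.

```python
# Per-port values copied from the uops run for IMUL16rri8 quoted above.
UOPS_IMUL16RRI8 = {
    "SBPort0": 0.5232,
    "SBPort1": 1.0039,
    "SBPort4": 0.0024,
    "SBPort5": 0.3693,
}

# (measured by llvm-exegesis, value in the TD file), copied from the excerpt.
LATENCIES = {
    "SHR32r1": {
        "sandybridge": (1.01, 1.00),
        "haswell": (1.00, 1.00),
        "skylake": (1.01, 1.00),
    },
    "IMUL16rri": {
        "sandybridge": (4.02, 3.00),
        "haswell": (4.01, 3.00),
        "skylake": (4.01, 3.00),
    },
}


def total_uops(port_pressure):
    """Sum the per-port values to estimate uops issued per instruction."""
    return sum(port_pressure.values())


def mismatches(latencies, tolerance=0.1):
    """Return (mnemonic, cpu, measured, modeled) tuples whose measured and
    modeled latencies differ by more than `tolerance` cycles.
    The tolerance is an arbitrary value picked for this sketch."""
    found = []
    for mnemonic, per_cpu in latencies.items():
        for cpu, (measured, modeled) in per_cpu.items():
            if abs(measured - modeled) > tolerance:
                found.append((mnemonic, cpu, measured, modeled))
    return found


if __name__ == "__main__":
    print(f"uops per instruction: {total_uops(UOPS_IMUL16RRI8):.2f}")
    for mnemonic, cpu, measured, modeled in mismatches(LATENCIES):
        print(f"{mnemonic} on {cpu}: measured {measured}, TD file {modeled}")
```

On this data the port values add up to a little under 2, and only the IMUL16rri rows are flagged, on all three CPUs.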
>
> This is great!
>
>>
>> Future Work
>>
>> * [easy] Fix Intel scheduling models.
>> * [easy] Extend to memory operands.
>> * [easy] Make the tool work reliably for x87 instructions.
>> * [medium] A tool that automatically creates patches to TD files.
>> * [medium] Measure the effect of immediate/register values:
>>   some instructions have performance characteristics that
>>   depend on the values they operate on. We should explore the
>>   value space (0, 1, ~1, 2^{8,16,32,64}, inf, nan, denorm...).
>> * [medium] Measure the effect of changing registers on
>>   instruction implementation (see results section
>>   <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n> above).
>>   Model this in LLVM TD schema.
>> * [hard] Make the tool work for instructions that have side
>>   effects (e.g. PUSH/POP, JMP, ...). This might involve
>>   extending the TD schema with information on how to set up
>>   measurements for specific instructions.
>> * [??] Make the tool work for other CPUs. This mainly depends
>>   on the presence of performance counters.
>>
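
For the value-space item in the future-work list, one way to picture the set of probe values is the sketch below. It simply enumerates the values named in the list; reading `~1` as bitwise NOT of 1, wrapping integers to a register width, and the helper names are assumptions for illustration, not something the RFC specifies.

```python
import math
import sys


def integer_probe_values(widths=(8, 16, 32, 64)):
    """0, 1, ~1 and the powers of two named in the list, as Python ints."""
    return [0, 1, ~1] + [2 ** w for w in widths]


def wrap_to_register(value, bits=64):
    """Reduce a Python int to the unsigned bit pattern of a `bits`-wide register."""
    return value & ((1 << bits) - 1)


def float_probe_values():
    """inf, nan and the smallest positive denormal double."""
    smallest_denormal = sys.float_info.min * sys.float_info.epsilon
    return [math.inf, math.nan, smallest_denormal]
```

Note that 2^64 does not fit in a 64-bit register, so `wrap_to_register(2 ** 64)` yields 0; whether the list intends that wrap-around or the boundary value just below it is not stated.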
>>
>> Open Questions
>>
>> We depend on libpfm
>> <http://perfmon2.sourceforge.net/docs_v4.html>. How do we handle
>> the dependency?
>
> Are there options that you have in mind? It's an external
> MIT-licensed dependency. Wouldn't CMake just detect it when it's
> available?
>
>
> That's what we've done for now (see code here
> <https://reviews.llvm.org/differential/changeset/?ref=1002469&whitespace=ignore-most>).
> We're not sure what the policy is wrt external deps. Right now if the
> tool is enabled and libpfm is not on the system, we die with an error
> message. The other option would be to disable the tool in that case
> (I'm not sure how to do that). Opinions?
Sounds good (we can discuss this further, if necessary, in the code review).
-Hal
>
>
>
> -Hal
>
>> --
>> Guillaume Chatelet (gchatelet at google.com
>> <mailto:gchatelet at google.com>), Clement Courbet
>> (courbet at google.com <mailto:courbet at google.com>) for the Google
>> Compiler Research Team
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory