<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>This looks absolutely awesome.  Thank you for sharing your work;

      I can see many ways this will be useful.</p>

    <p>Philip<br>

    </p>

    <br>

    <div class="moz-cite-prefix">On 03/01/2018 09:22 AM, Andrea Di

      Biagio via llvm-dev wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAMw=4mL+bANeYnUifa3f39o4mvY9bdPfdqXr6HJSXkO0KUB5Vw@mail.gmail.com">

      <div dir="ltr">Hi all,<br>

        <br>

        At Sony we developed an LLVM based performance analysis tool

        named llvm-mca. We<br>

        currently use it internally to statically measure the

        performance of code, and<br>

        to help triage potential problems with target scheduling

        models.  We decided to<br>

        post this RFC because we are interested in the feedback from the

        community, and<br>

        we also believe that other people might be interested in a tool

        like this.<br>

        <br>

        llvm-mca uses information which is already available in LLVM

        (e.g. scheduling<br>

        models) to statically measure the performance of machine code in

        a specific cpu.<br>

        Performance is measured in terms of throughput as well as

        processor resource<br>

        consumption.  The tool currently works for processors with an

        out-of-order<br>

        backend, for which there is a scheduling model available in

        LLVM.<br>

        <br>

        The main goal of this tool is not just to predict the

        performance of the code<br>

        when run on the target, but also help with diagnosing potential

        performance<br>

        issues.<br>

        <br>

        Given an assembly code sequence, llvm-mca estimates the IPC

        (instructions per<br>

        cycle), as well as hardware resources pressure. The analysis and

        reporting style<br>

        were inspired by the IACA tool from Intel.<br>

        <br>

        The presence of long data dependency chains, as well as poor

        usage of hardware<br>

        resources may lead to bottlenecks in the back-end.  The tool is

        able to generate<br>

        a detailed report which should help with identifying and

        analyzing sources of<br>

        bottlenecks.<br>

        <br>

        Scheduling models are mostly used to compute instruction

        latencies, to obtain<br>

        read-advance information, and understand how processor resources

        are used by<br>

        instructions.  By design, the quality of the performance

        analysis conducted by<br>

        the tool is inevitably affected by the quality of the target

        scheduling models. <br>

        <br>

        However, scheduling models intentionally do not describe all

        processors details,<br>

        since the goal is just to enable the scheduling of machine

        instructions during<br>

        compilation. That means, there are processor details which are

        not important for<br>

        the purpose of scheduling instructions (and therefore not

        described by the<br>

        scheduling model), but are very important for this tool.<br>

        <br>

        A few examples of details that are missing in scheduling models

        are:<br>

         - Maximum number of instructions retired per cycle.<br>

         - Actual dispatch width (it often differs from the issue

        width).<br>

         - Number of temporary registers available for renaming.<br>

         - Number of read/write ports in the register file(s).<br>

         - Length of the load/store queue in the LSUnit.<br>

        <br>

        It is also very difficult to find a "good" abstract model to

        describe the<br>

        behavior of out-of-order processors. So, we have to keep in mind

        that all of<br>

        these aspects are going to affect the quality of the static

        analysis performed<br>

        by the tool.<br>

        <br>

        An extensive list of known limitations is reported in one of the

        last sections<br>

        of this document. There is also a section related to design

        problems which must<br>

        be addressed (hopefully with the help of the community).  At the

        moment, the<br>

        tool has been mostly tested for x86 targets, but there are still

        several<br>

        limitations, some of which could be overcome by integrating

        extra information<br>

        into the scheduling models.<br>

        <br>

        As was mentioned before, this tool has been (and is still being)

        used internally<br>

        in Sony to debug/triage issues in the btver2 scheduling model.

        We have also<br>

        tested it on other targets to check how generic the tool is. In

        our experience,<br>

        the tool makes it easy to identify simple mistakes like "wrong

        number of micro<br>

        opcodes specified for an instruction", or "wrong set of hardware

        resources".<br>

        Some of these mistakes are quite common (sometimes on mature

        models too), and<br>

        often difficult to catch.  Reports generated by this tool are

        simple to analyze,<br>

        and contain enough details to help triage most performance

        problems.<br>

        <br>

        1. How the tool works<br>

        ---------------------<br>

        <br>

        The tool takes assembly code as input. Assembly code is parsed

        into a sequence<br>

        of MCInst with the help of the existing LLVM target assembly

        parsers. The parsed<br>

        sequence of MCInst is then analyzed by a 'Backend' module to

        generate a<br>

        performance report.<br>

        <br>

        The Backend module internally emulates the execution of the

        machine code<br>

        sequence in a loop of iterations (which by default is 70). At

        the end of this<br>

        process, the backend collects a number of statistics which are

        then printed out<br>

        in the form of a report.<br>

        <br>

        Here is an example of performance report generated by the tool

        for a dot-product<br>

        of two packed float vectors of four elements. The analysis is

        conducted for<br>

        target x86, cpu btver2:<br>

        <br>

        ///////////////////<br>

        <br>

        Iterations:     300<br>

        Instructions:   900<br>

        Total Cycles:   610<br>

        Dispatch Width: 2<br>

        IPC:            1.48<br>

        <br>

        <br>

        Resources:<br>

        [0] - JALU0<br>

        [1] - JALU1<br>

        [2] - JDiv<br>

        [3] - JFPM<br>

        [4] - JFPU0<br>

        [5] - JFPU1<br>

        [6] - JLAGU<br>

        [7] - JSAGU<br>

        [8] - JSTC<br>

        [9] - JVIMUL<br>

        <br>

        <br>

        Resource pressure per iteration:<br>

        [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]   

        [9]    <br>

         -      -      -      -     2.00   1.00    -      -      -     

        -     <br>

        <br>

        Resource pressure by instruction:<br>

        [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]   

        [9]        Instructions:<br>

         -      -      -      -      -     1.00    -      -      -     

        -         vmulps    %xmm0, %xmm1, %xmm2<br>

         -      -      -      -     1.00    -      -      -      -     

        -         vhaddps    %xmm2, %xmm2, %xmm3<br>

         -      -      -      -     1.00    -      -      -      -     

        -         vhaddps    %xmm3, %xmm3, %xmm4<br>

        <br>

        <br>

        Instruction Info:<br>

        [1]: #uOps<br>

        [2]: Latency<br>

        [3]: RThroughput<br>

        [4]: MayLoad<br>

        [5]: MayStore<br>

        [6]: HasSideEffects<br>

        <br>

        [1]    [2]    [3]    [4]    [5]    [6]    Instructions:<br>

         1      2     1.00                        vmulps    %xmm0,

        %xmm1, %xmm2<br>

         1      3     1.00                        vhaddps    %xmm2,

        %xmm2, %xmm3<br>

         1      3     1.00                        vhaddps    %xmm3,

        %xmm3, %xmm4<br>

        <br>

        ///////////////////<br>

        <br>

        According to this report, the dot-product kernel has been

        executed 300 times,<br>

        for a total of 900 instructions dynamically executed.<br>

        <br>

        The report is structured in three main sections.  A first

        section collects a few<br>

        performance numbers; the goal of this section is to give a very

        quick overview<br>

        of the performance throughput. In this example, the two

        important perforamce<br>

        indicators are a) the predicted total number of cycles, and b)

        the IPC. <br>

        IPC is probably the most important throughput indicator. A big

        delta between the<br>

        Dispatch Width and the computed IPC is an indicator of potential

        performance<br>

        issues.<br>

        <br>

        The second section is the so-called "resource pressure view". 

        This view reports<br>

        the average number of resource cycles consumed every iteration

        by instructions<br>

        for every processor resource unit available on the target. 

        Information is<br>

        structured in two tables. The first table reports the number of

        resource cycles<br>

        spent on average every iteration. The second table correlates

        the resource<br>

        cycles to the machine instruction in the sequence. For example,

        every iteration<br>

        of the dot-product, instruction 'vmulps' always executes on

        resource unit [5]<br>

        (JFPU1 - floating point pipeline #1), consuming an average of 1

        resource cycle<br>

        per iteration.  Note that on Jaguar, vector FP multiply can only

        be issued to<br>

        pipeline JFPU1, while horizontal FP adds can only be issued to

        pipeline JFPU0.<br>

        <br>

        The third (and last) section of the report shows the latency and

        reciprocal<br>

        throughput of every instruction in the sequence. That section

        also reports extra<br>

        information related to the number of micro opcodes, and opcode

        properties (i.e.<br>

        'MayLoad', 'MayStore' and 'UnmodeledSideEffects').<br>

        <br>

        The resource pressure view helps with identifying bottlenecks

        caused by high<br>

        usage of specific hardware resources.  Situations with resource

        pressure mainly<br>

        concentrated on a few resources should, in general, be avoided. 

        Ideally,<br>

        pressure should be uniformly distributed between multiple

        resources.<br>

        <br>

        Timeline View<br>

        -------------<br>

        <br>

        A detailed report of each instruction's state transitions over

        time can be<br>

        enabled using the command line flag '-timeline'.  This prints an

        extra section<br>

        in the report which contains the so-called "timeline view". 

        Below is the<br>

        timeline view for the dot-product example from the previous

        section.<br>

        <br>

        ///////////////<br>

        Timeline view:<br>

                           012345<br>

        Index    0123456789      <br>

        <br>

        [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2<br>

        [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3<br>

        [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4<br>

        <br>

        [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2<br>

        [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3<br>

        [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4<br>

        <br>

        [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2<br>

        [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3<br>

        [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4<br>

        <br>

        <br>

        Average Wait times (based on the timeline view):<br>

        [0]: Executions<br>

        [1]: Average time spent waiting in a scheduler's queue<br>

        [2]: Average time spent waiting in a scheduler's queue while

        ready<br>

        [3]: Average time elapsed from WB until retire stage<br>

        <br>

              [0]    [1]    [2]    [3]<br>

        0.     3     1.0    1.0    3.3      vmulps    %xmm0, %xmm1,

        %xmm2<br>

        1.     3     3.3    0.7    1.0      vhaddps    %xmm2, %xmm2,

        %xmm3<br>

        2.     3     5.7    0.0    0.0      vhaddps    %xmm3, %xmm3,

        %xmm4<br>

        ///////////////<br>

        <br>

        The timeline view is very interesting because it shows how

        instructions changed<br>

        in state during execution.  It also gives an idea of how the

        tool "sees"<br>

        instructions executed on the target.<br>

        <br>

        The timeline view is structured in two tables.  The first table

        shows how<br>

        instructions change in state over time (measured in cycles); the

        second table<br>

        (named "Average Wait times") reports useful timing statistics

        which should help<br>

        diagnose performance bottlenecks caused by long data

        dependencies and sub-optimal<br>

        usage of hardware resources.<br>

        <br>

        An instruction in the timeline view is identified by a pair of

        indices, where<br>

        the 'first' index identifies an iteration, and the 'second'

        index is the actual<br>

        instruction index (i.e. where it appears in the code sequence).<br>

        <br>

        Excluding the first and last column, the remaining columns are

        in cycles.  Cycles<br>

        are numbered sequentially starting from 0.  The following

        characters are used to<br>

        describe the state of an instruction:<br>

        <br>

         D : Instruction dispatched.<br>

         e : Instruction executing.<br>

         E : Instruction executed.<br>

         R : Instruction retired.<br>

         = : Instruction already dispatched, waiting to be executed.<br>

         - : Instruction executed, waiting to be retired.<br>

        <br>

        Based on the timeline view from the example, we know that:<br>

          - Instruction [1, 0] was dispatched at cycle 1.<br>

          - Instruction [1, 0] started executing at cycle 2.<br>

          - Instruction [1, 0] reached the write back stage at cycle 4.<br>

          - Instruction [1, 0] was retired at cycle 10.<br>

        <br>

        Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't

        have to wait in<br>

        the scheduler's queue for the operands to become available. By

        the time the<br>

        vmulps is dispatched, operands are already available, and

        pipeline JFPU1 is<br>

        ready to serve another instruction.  So the instruction can be

        immediately<br>

        issued on the JFPU1 pipeline. That is demonstrated by the fact

        that the<br>

        instruction only spent 1cy in the scheduler's queue.<br>

        <br>

        There is a gap of 5 cycles between the write-back stage and the

        retire event.<br>

        That is because instructions must retire in program order, so

        [1,0] has to wait<br>

        for [0, 2] to be retired first (i.e it has to wait unti cycle

        10).<br>

        <br>

        In the dot-product example, all instructions are in a RAW (Read

        After Write)<br>

        dependency chain.  Register %xmm2 written by the vmulps is

        immediately used by<br>

        the first vhaddps, and register %xmm3 written by the first

        vhaddps is used by<br>

        the second vhaddps.  Long data dependencies negatively affect

        the ILP<br>

        (Instruction Level Parallelism).<br>

        <br>

        In the dot-product example, there are anti-dependencies

        introduced by<br>

        instructions from different iterations.  However, those

        dependencies can be<br>

        removed at register renaming stage (at the cost of allocating

        register aliases,<br>

        and therefore consuming temporary registers).<br>

        <br>

        Table "Average Wait times" helps diagnose performance issues

        that are caused by<br>

        the presence of long latency instructions and potentially long

        data dependencies<br>

        which may limit the ILP.  Note that the tool by default assumes

        at least 1cy<br>

        between the dispatch event and the issue event.<br>

        <br>

        When the performance is limited by data dependencies and/or long

        latency<br>

        instructions, the number of cycles spent while in the "ready"

        state is expected<br>

        to be very small when compared with the total number of cycles

        spent in the<br>

        scheduler's queue.  So the difference between the two counters

        is a good<br>

        indicator of how big of an impact data dependencies had on the

        execution of<br>

        instructions.  When performance is mostly limited by the lack of

        hardware<br>

        resources, the delta between the two counters is small. 

        However, the number of<br>

        cycles spent in the queue tends to be bigger (i.e. more than

        1-3cy) especially<br>

        when compared with other low latency instructions.<br>

        <br>

        Extra statistics to further diagnose performance issues.<br>

        --------------------------------------------------------<br>

        <br>

        Flag '-verbose' enables extra statistics and performance

        counters for the<br>

        dispatch logic, the reorder buffer, the retire control unit and

        the register<br>

        file.<br>

        <br>

        Below is an example of verbose output generated by the tool for

        the dot-product<br>

        example discussed in the previous sections.<br>

        <br>

        ///////////////////<br>

        Iterations:     300<br>

        Instructions:   900<br>

        Total Cycles:   610<br>

        Dispatch Width: 2<br>

        IPC:            1.48<br>

        <br>

        <br>

        Dynamic Dispatch Stall Cycles:<br>

        RAT     - Register unavailable:                      0<br>

        RCU     - Retire tokens unavailable:                 0<br>

        SCHEDQ  - Scheduler full:                            272<br>

        LQ      - Load queue full:                           0<br>

        SQ      - Store queue full:                          0<br>

        GROUP   - Static restrictions on the dispatch group: 0<br>

        <br>

        <br>

        Register Alias Table:<br>

        Total number of mappings created: 900<br>

        Max number of mappings used:      35<br>

        <br>

        <br>

        Dispatch Logic - number of cycles where we saw N instructions

        dispatched:<br>

        [# dispatched], [# cycles]<br>

         0,              24  (3.9%)<br>

         1,              272  (44.6%)<br>

         2,              314  (51.5%)<br>

        <br>

        <br>

        Schedulers - number of cycles where we saw N instructions

        issued:<br>

        [# issued], [# cycles]<br>

         0,          7  (1.1%)<br>

         1,          306  (50.2%)<br>

         2,          297  (48.7%)<br>

        <br>

        <br>

        Retire Control Unit - number of cycles where we saw N

        instructions retired:<br>

        [# retired], [# cycles]<br>

         0,           109  (17.9%)<br>

         1,           102  (16.7%)<br>

         2,           399  (65.4%)<br>

        <br>

        <br>

        Scheduler's queue usage:<br>

        JALU01,  0/20<br>

        JFPU01,  18/18<br>

        JLSAGU,  0/12<br>

        ///////////////////<br>

        <br>

        Based on the verbose report, the backend was only able to

        dispatch two<br>

        instructions 51.5% of the time.  The dispatch group was limited

        to one<br>

        instruction 44.6% of the cycles, which corresponds to 272

        cycles.<br>

        <br>

        If we look at section "Dynamic Dispatch Stall Cycles", we can

        see how counter<br>

        SCHEDQ reports 272 cycles.  Counter SCHEDQ is incremented every

        time the dispatch<br>

        logic is unable to dispatch a full group of two instructions

        because the<br>

        scheduler's queue is full.<br>

        <br>

        Section "Scheduler's queue usage" shows how the maximum number

        of buffer entries<br>

        (i.e. scheduler's queue entries) used at runtime for resource

        JFPU01 reached its<br>

        maximum. Note that AMD Jaguar implements three schedulers:<br>

          * JALU01 - A scheduler for ALU instructions<br>

          * JLSAGU - A scheduler for address generation<br>

          * JFPU01 - A scheduler floating point operations.<br>

        <br>

        The dot-product is a kernel of three floating point instructions

        (a vector<br>

        multiply followed by two horizontal adds).  That explains why

        only the floating<br>

        point scheduler appears to be used according to section

        "Scheduler's queue<br>

        usage".<br>

        <br>

        A full scheduler's queue is either caused by data dependency

        chains, or by a<br>

        sub-optimal usage of hardware resources.  Sometimes, resource

        pressure can be<br>

        mitigated by rewriting the kernel using different instructions

        that consume<br>

        different scheduler resources.  Schedulers with a small queue

        are less resilient<br>

        to bottlenecks caused by the presence of long data dependencies.<br>

        <br>

        In this example, we can conclude that the IPC is mostly limited

        by data<br>

        dependencies, and not by resource pressure.<br>

        <br>

        LLVM-MCA instruction flow<br>

        -------------------------<br>

        <br>

        This section describes the instruction flow through the

        out-of-order backend, as<br>

        well as the functional units involved in the process. <br>

        <br>

        An instruction goes through a default sequence of stages:<br>

            - Dispatch (Instruction is dispatched to the schedulers).<br>

            - Issue (Instruction is issued to the processor pipelines).<br>

            - Write Back (Instruction is executed, and results are

        written back).<br>

            - Retire (Instruction is retired; writes are architecturally

        committed).<br>

        <br>

        The tool only models the out-of-order portion of a processor.

        Therefore, the<br>

        instruction fetch and decode stages are not modeled. Performance

        bottlenecks in<br>

        the frontend are not diagnosed by this tool.  The tool assumes

        that instructions<br>

        have all been decoded and placed in a queue.  Also, the tool

        doesn't know<br>

        anything about branch prediction.<br>

        <br>

        Instruction Dispatch<br>

        --------------------<br>

        <br>

        During the Dispatch stage, instructions are picked in program

        order from a queue<br>

        of already decoded instructions, and dispatched in groups to the

        hardware<br>

        schedulers.  The dispatch logic is implemented by class

        DispatchUnit in file<br>

        Dispatch.h.<br>

        <br>

        The size of a dispatch group depends on the availability of

        hardware resources,<br>

        and it cannot exceed the value of field 'DispatchWidth' in class

        DispatchUnit.<br>

        Note that field DispatchWidth defaults to the value of field

        'IssueWidth' from<br>

        the scheduling model.<br>

        <br>

        Users can override the DispatchWidth value with flag

        "-dispatch=<N>" (where 'N'<br>

        is an unsigned quantity).<br>

        <br>

        An instruction can be dispatched if:<br>

         - The size of the dispatch group is smaller than DispatchWidth<br>

         - There are enough entries in the reorder buffer<br>

         - There are enough temporary registers to do register renaming<br>

         - Schedulers are not full.<br>

        <br>

        Scheduling models don't describe register files, and therefore

        the tool doesn't<br>

        know if there is more than one register file, and how many

        temporaries are<br>

        available for register renaming.<br>

        <br>

        By default, the tool (optimistically) assumes a single register

        file with an<br>

        unbounded number of temporary registers.  Users can limit the

        number of<br>

        temporary registers available for register renaming using flag<br>

        `-register-file-size=<N>`, where N is the number of

        temporaries.  A value of<br>

        zero for N means 'unbounded'.  Knowing how many temporaries are

        available for<br>

        register renaming, the tool can predict dispatch stalls caused

        by the lack of<br>

        temporaries.<br>

        <br>

        The number of reorder buffer entries consumed by an instruction

        depends on the<br>

        number of micro-opcodes it specifies in the target scheduling

        model (see field<br>

        'NumMicroOpcodes' of tablegen class ProcWriteResources and its

        derived classes;<br>

        TargetSchedule.td).<br>

        <br>

        The reorder buffer is implemented by class RetireControlUnit

        (see Dispatch.h).<br>

        Its goal is to track the progress of instructions that are

        "in-flight", and<br>

        retire instructions in program order.  The number of entries in

        the reorder<br>

        buffer defaults to the value of field 'MicroOpBufferSize' from

        the target<br>

        scheduling model.<br>

        <br>

        Instructions that are dispatched to the schedulers consume

        scheduler buffer<br>

        entries.  The tool queries the scheduling model to figure out

        the set of<br>

        buffered resources consumed by an instruction.  Buffered

        resources are treated<br>

        like "scheduler" resources, and the field 'BufferSize' (from the

        processor<br>

        resource tablegen definition) defines the size of the

        scheduler's queue.<br>

        <br>

        Zero latency instructions (for example NOP instructions) don't

        consume scheduler<br>

        resources.  However, those instructions still reserve a number

        of slots in the<br>

        reorder buffer.<br>

        <br>

        Instruction Issue<br>

        -----------------<br>

        <br>

        As mentioned in the previous section, each scheduler resource

        implements a queue<br>

        of instructions.  An instruction has to wait in the scheduler's

        queue until<br>

        input register operands become available.  Only at that point,

        does the<br>

        instruction becomes eligible for execution and may be issued

        (potentially<br>

        out-of-order) to a pipeline for execution.<br>

        <br>

        Instruction latencies can be computed by the tool with the help

        of the<br>

        scheduling model; latency values are defined by the scheduling

        model through<br>

        ProcWriteResources objects.<br>

        <br>

        Class Scheduler (see file Scheduler.h) knows how to emulate

        multiple processor<br>

        schedulers.  A Scheduler is responsible for tracking data

        dependencies, and<br>

        dynamically select which processor resources are consumed/used

        by instructions.<br>

        <br>

        Internally, the Scheduler class delegates the management of

        processor resource<br>

        units and resource groups to the ResourceManager class. 

        ResourceManager is also<br>

        responsible for selecting resource units that are effectively

        consumed by<br>

        instructions.  For example, if an instruction consumes 1cy of a

        resource group,<br>

        the ResourceManager object selects one of the available units

        from the group; by<br>

        default, it uses a round-robin selector to guarantee that

        resource usage is<br>

        uniformly distributed between all units of a group.<br>

        <br>

        Internally, class Scheduler implements three instruction queues:<br>

          - WaitQueue: a queue of instructions whose operands are not

        ready yet.<br>

          - ReadyQueue: a queue of instructions ready to execute.<br>

          - IssuedQueue: a queue of instructions executing.<br>

        <br>

        Depending on the operands availability, instructions that are

        dispatched to the<br>

        Scheduler are either placed into the WaitQueue or into the

        ReadyQueue.<br>

        <br>

        Every cycle, class Scheduler checks if instructions can be moved

        from the<br>

        WaitQueue to the ReadyQueue, and if instructions from the

        ReadyQueue can be<br>

        issued to the underlying pipelines.  The algorithm prioritizes

        older<br>

        instructions over younger instructions.<br>

        <br>

        Objects of class ResourceState (see Scheduler.h) describe

        processor resources.<br>

        There is an instance of class ResourceState for each single

        processor resource<br>

        specified by the scheduling model.  A ResourceState object for a

        processor<br>

        resource with multiple units dynamically tracks the availability

        of every single<br>

        unit.  For example, the ResourceState of a resource group tracks

        the<br>

        availability of every resource in that group.  Internally,

        ResourceState<br>

        implements a round-robin selector to dynamically pick the next

        unit to use from<br>

        the group.<br>

        <br>

        Write-Back and Retire Stage<br>

        ---------------------------<br>

        <br>

        Issued instructions are moved from the ReadyQueue to the

        IssuedQueue.  There,<br>

        instructions wait until they reach the write-back stage.  At

        that point, they<br>

        get removed from the queue and the retire control unit is

        notified.<br>

        <br>

        On the event of "instruction executed", the retire control unit

        flags the<br>

        instruction as "ready to retire".<br>

        <br>

        Instruction are retired in program order; an "instruction

        retired" event is sent<br>

        to the register file which frees the temporary registers

        allocated for the<br>

        instruction at register renaming stage.<br>

        <br>

        Load/Store Unit and Memory Consistency Model<br>

        --------------------------------------------<br>

        <br>

        The tool attempts to emulate out-of-order execution of memory

        operations.  Class<br>

        LSUnit (see file LSUnit.h) emulates a load/store unit

        implementing queues for<br>

        speculative execution of loads and stores.<br>

         <br>

        Each load (or store) consumes an entry in the load (or store)

        queue.  The number<br>

        of slots in the load/store queues is unknown by the tool, since

        there is no<br>

        mention of it in the scheduling model.  In practice, users can

        specify flag<br>

        `-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in

        the queue to be<br>

        equal to exactly N (an unsigned value). If N is zero, then the

        tool assumes an<br>

        unbounded queue (this is the default).<br>

        <br>

        LSUnit implements a relaxed consistency model for memory loads

        and stores. The<br>

        rules are:<br>

        1) A younger load is allowed to pass an older load only if there

        is no<br>

           intervening store in between the two loads.<br>

        2) An younger store is not allowed to pass an older store.<br>

        3) A younger store is not allowed to pass an older load.<br>

        4) A younger load is allowed to pass an older store provided

        that the load does<br>

           not alias with the store.<br>

        <br>

        By default, this class conservatively (i.e. pessimistically)

        assumes that loads<br>

        always may-alias store operations.  Essentially, this LSUnit

        doesn't perform any<br>

        sort of alias analysis to rule out cases where loads and stores

        don't overlap<br>

        with each other.  The downside of this approach however is that

        younger loads are<br>

        never allowed to pass older stores.  To make it possible for a

        younger load to<br>

        pass an older store, users can use the command line flag

        -noalias.  Under<br>

        'noalias', a younger load is always allowed to pass an older

        store.<br>

        <br>

        Note that, in the case of write-combining memory, rule 2. could

        be relaxed a bit<br>

        to allow reordering of non-aliasing store operations.  That

        being said, at the<br>

        moment, there is no way to further relax the memory model (flag

        -noalias is the<br>

        only option).  Essentially, there is no option to specify a

        different memory<br>

        type (for example: write-back, write-combining, write-through;

        etc.) and<br>

        consequently to weaken or strengthen the memory model.<br>

        <br>

        Other limitations are:<br>

         * LSUnit doesn't know when store-to-load forwarding may occur.<br>

         * LSUnit doesn't know anything about the cache hierarchy and

        memory types.<br>

         * LSUnit doesn't know how to identify serializing operations

        and memory fences.<br>

        <br>

        No assumption is made on the store buffer size.  As mentioned

        before, LSUnit<br>

        conservatively assumes a may-alias relation between loads and

        stores, and it<br>

        doesn't attempt to identify cases where store-to-load forwarding

        would occur in<br>

        practice.<br>

        <br>

        LSUnit doesn't attempt to predict whether a load or store hits

        or misses the L1<br>

        cache.  It only knows if an instruction "MayLoad" and/or

        "MayStore".  For loads,<br>

        the scheduling model provides an "optimistic" load-to-use

        latency (which usually<br>

        matches the load-to-use latency for when there is a hit in the

        L1D).<br>

        <br>

        Class MCInstrDesc in LLVM doesn't know about serializing

        operations, nor<br>

        memory-barrier like instructions.  LSUnit conservatively assumes

        that an<br>

        instruction which has both 'MayLoad' and 'UnmodeledSideEffects'

        behaves like a<br>

        "soft" load-barrier.  That means, it serializes loads without

        forcing a flush of<br>

        the load queue.  Similarly, instructions flagged with both

        'MayStore' and<br>

        'UnmodeledSideEffects' are treated like store barriers.  A full

        memory barrier<br>

        is a 'MayLoad' and 'MayStore' instruction with

        'UnmodeledSideEffects'.  This is<br>

        inaccurate, but it is the best that we can do at the moment with

        the current<br>

        information available in LLVM.<br>

        <br>

        A load/store barrier consumes one entry of the load/store

        queue.  A load/store<br>

        barrier enforces ordering of loads/stores.  A younger load

        cannot pass a load<br>

        barrier.  Also, a younger store cannot pass a store barrier.  A

        younger load has<br>

        to wait for the memory/load barrier to execute.  A load/store

        barrier is<br>

        "executed" when it becomes the oldest entry in the load/store

        queue(s). That<br>

        also means, by construction, all the older loads/stores have

        been executed.<br>

        <br>

        In conclusion the full set of rules is:<br>

          1. A store may not pass a previous store.<br>

          2. A load may not pass a previous store unless flag 'NoAlias'

        is set.<br>

          3. A load may pass a previous load.<br>

          4. A store may not pass a previous load (regardless of flag

        'NoAlias').<br>

          5. A load has to wait until an older load barrier is fully

        executed.<br>

          6. A store has to wait until an older store barrier is fully

        executed.<br>

        <br>

        Known limitations<br>

        -----------------<br>

        Previous sections described cases where the tool is missing

        information to give<br>

        an accurate report.  For example, the first sections of this

        document explained<br>

        how the lack of knowledge about the processor negatively affects

        the performance<br>

        analysis.  The lack of knowledge is often a consequence of how

        scheduling models<br>

        are defined; as mentioned before, scheduling models

        intentionally don't describe<br>

        processors in fine details.<br>

        <br>

        The accuracy of the performance analysis is also affected by

        assumptions made by<br>

        the processor model used by the tool.<br>

        <br>

        Most recent Intel and AMD processors implement dedicated

        LoopBuffer/OpCache in<br>

        the hardware frontend to speedup the throughput in the presence

        of tight loops.<br>

        The presence of these buffers complicates the decoding logic,

        and requires<br>

        knowledge on the branch predictor too.  Class

        'SchedMachineModel' in tablegen<br>

        provides a field named 'LoopMicroOpBufferSize' which is used to

        describe loop<br>

        buffers.  However, the purpose of that field is to enable loop

        unrolling of<br>

        tight loops; essentially, it affects the cost model used by pass

        loop-unroll.<br>

        <br>

        By design, the tool only cares about the out-of-order portion of

        a processor,<br>

        and consequently doesn't try to predict the frontend

        throughput.  Processors may<br>

        implement complex decoding schemes; statically predicting the

        frontend<br>

        throughput is in general beyond the scope of this tool.  For the

        same reasons,<br>

        this tool intentionally doesn't model branch prediction.  That

        being said, this<br>

        tool could be definitely extended in future to also account for

        the hardware<br>

        frontend when doing performance analysis.  This would inevitably

        require extra<br>

        (extensive) processor knowledge related to all the available

        decoding paths in<br>

        the hardware frontend.<br>

        <br>

        When computing the IPC, the tool assumes a zero-latency

        "perfect" fetch&decode<br>

        stage; the full sequence of decoded instructions is immediately

        visible to the<br>

        dispatch logic from the start.<br>

        <br>

        The tool doesn't know about simultaneous mutithreading. 

        According to the tool,<br>

        processor resources are not statically/dynamically partitioned. 

        Processor<br>

        resources are fully available to the hardware thread executing

        the<br>

        microbenchmark.<br>

        <br>

        The execution model implemented by this tool assumes that

        instructions are<br>

        firstly dispatched in groups to hardware schedulers, and then

        issued to<br>

        pipelines for execution.  The model assumes dynamic scheduling

        of instructions.<br>

        Instructions are placed in a queue and potentially executed

        out-of-order (based<br>

        on the operand availability). The dispatch stage is definitely

        distinct from the<br>

        issue stage.<br>

        <br>

        This model doesn't correctly describe processors where the

        dispatch/issue is a<br>

        single stage. This is what happens for example in VLIW

        processors, where<br>

        instructions are packaged and statically scheduled at compile

        time; it is up to<br>

        the compiler to predict the latency of instructions and package

        issue groups<br>

        accordingly. For such targets, there is no dynamic scheduling

        done by the<br>

        hardware.<br>

        <br>

        Existing classes (DispatchUnit, Scheduler, etc.) could be

        extended/adapted to<br>

        support processors with a single dispatch/issue stage. The

        execution flow would<br>

        require some changes in the way how existing components (i.e. 

        DispatchUnit,<br>

        Scheduler, etc.) interact. This can be a future development.<br>

        <br>

        The following sections describes other known limitations.  The

        goal is not to<br>

        provide an extensive list of limitations; we want to report what

        we believe are<br>

        the most important limitations, and suggest possible methods to

        overcome them.<br>

        <br>

        Load/Store barrier instructions and serializing operations<br>

        ----------------------------------------------------------<br>

        Section "Load/Store Unit and Memory Consistency Model" already

        mentioned how<br>

        LLVM doesn't know about serializing operations and memory

        barriers.  Most of it<br>

        boils down to the fact that class MCInstrDesc (intentionally)

        doesn't expose<br>

        those properties.  Instead, both serializing operations and

        memory barriers<br>

        "have side-effects" according to MCInstrDesc.  That is because,

        at least for<br>

        scheduling purposes, knowing that an instruction has unmodeled

        side effects is<br>

        often enough to treat the instruction like a compiler scheduling

        barrier.<br>

        <br>

        A performance analysis tool could use the extra knowledge on

        barriers and<br>

        serializing operations to generate a more accurate performance

        report.  One way<br>

        to improve this is by reserving a couple of bits in field

        'Flags' from class<br>

        MCInstrDesc: one bit for barrier operations, and another bit to

        mark<br>

        instructions as serializing operations.<br>

        <br>

        Lack of support for instruction itineraries<br>

        -------------------------------------------<br>

        The current version of the tool doesn't know how to process

        instruction<br>

        itineraries.  This is probably one of the most important

        limitations, since it<br>

        affects a few out-of-order processors in LLVM.<br>

        <br>

        As mentioned in section 'Instruction Issue', class Scheduler

        delegates to an<br>

        instance of class ResourceManager the handling of processor

        resources.<br>

        ResourceManager is where most of the scheduling logic is

        implemented.<br>

        <br>

        Adding support for instruction itineraries requires that we

        teach<br>

        ResourceManager how to handle functional units and instruction

        stages.  This<br>

        development can be a future extension, and it would probably

        require a few<br>

        changes to the ResourceManager interface.<br>

        <br>

        Instructions that affect control flow are not correctly modeled<br>

        ---------------------------------------------------------------<br>

        Examples of instructions that affect the control flow are:

        return, indirect<br>

        branches, calls, etc.  The tool doesn't try to predict/evaluate

        branch targets.<br>

        In particular, the tool doesn't model any sort of branch

        prediction, nor does it<br>

        attempt to track changes to the program counter.  The tool

        always assumes that<br>

        the input assembly sequence is the body of a microbenchmark (a

        simple loop<br>

        executed for a number of iterations). The "next" instruction in

        sequence is<br>

        always the next instruction to dispatch.<br>

        <br>

        Call instructions default to an arbitrary high latency of 100cy.

        A warning is<br>

        generated if the tool encounters a call instruction in the

        sequence.  Return<br>

        instructions are not evaluated, and therefore control flow is

        not affected.<br>

        However, the tool still queries the processor scheduling model

        to obtain latency<br>

        information for instructions that affect the control flow.<br>

        <br>

        Possible extensions to the scheduling model<br>

        -------------------------------------------<br>

        Section "Instruction Dispatch" explained how the tool doesn't

        know about the<br>

        register files, and temporaries available in each register file

        for register<br>

        renaming purposes.<br>

        <br>

        The LLVM scheduling model could be extended to better describe

        register files.<br>

        Ideally, scheduling model should be able to define:<br>

         - The size of each register file<br>

         - How many temporary registers are available for register

        renaming<br>

         - How register classes map to register files<br>

        <br>

        The scheduling model doesn't specify the retire throughput (i.e.

        how many<br>

        instructions can be retired every cycle).  Users can specify

        flag<br>

        `-max-retire-per-cycle=<uint>` to limit how many

        instructions the retire control<br>

        unit can retire every cycle.  Ideally, every processor should be

        able to specify<br>

        the retire throughput (for example, by adding an extra field to

        the scheduling<br>

        model tablegen class).<br>

        <br>

        Known limitations on X86 processors<br>

        -----------------------------------<br>

        <br>

        1) Partial register updates versus full register updates.<br>

        <br>

        On x86-64, a 32-bit GPR write fully updates the super-register.

        Example:<br>

              add %edi %eax    ## eax += edi<br>

        <br>

        Here, register %eax aliases the lower half of 64-bit register

        %rax. On x86-64,<br>

        register %rax is fully updated by the 'add' (the upper half of

        %rax is zeroed).<br>

        Essentially, it "kills" any previous definition of (the upper

        half of) register<br>

        %rax.<br>

        <br>

        On the other hand, 8/16 bit register writes only perform a

        so-called "partial<br>

        register update". Example:<br>

              add %di, %ax     ## ax += di<br>

        <br>

        Here, register %eax is only partially updated. To be more

        specific, the lower<br>

        half of %eax is set, and the upper half is left unchanged. There

        is also no<br>

        change in the upper 48 bits of register %rax.<br>

        <br>

        To get accurate performance analysis, the tool has to know which

        instructions<br>

        perform a partial register update, and which instructions fully

        update the<br>

        destination's super-register.<br>

        <br>

        One way to expose this information is (again) via tablegen.  For

        example, we<br>

        could add a flag in the tablegen instruction class to tag

        instructions that<br>

        perform partial register updates. Something like this: 'bit<br>

        hasPartialRegisterUpdate = 1'. However, this would force a `let<br>

        hasPartialRegisterUpdate = 0` on several instruction

        definitions.<br>

        <br>

        Another approach is to have a MCSubtargetInfo hook similar to

        this:<br>

            virtual bool updatesSuperRegisters(unsigned short opcode) {

        return false; }<br>

        <br>

        Targets will be able to override this method if needed.  Again,

        this is just an<br>

        idea. But the plan is to have this fixed as a future

        development.<br>

        <br>

        2) Macro Op fusion.<br>

        <br>

        The tool doesn't know about macro-op fusion. On modern x86

        processors, a<br>

        'cmp/test' followed by a 'jmp' is fused into a single macro

        operation.  The<br>

        advantage is that the fused pair only consumes a single slot in

        the dispatch<br>

        group. <br>

        <br>

        As a future development, the tool should be extended to address

        macro-fusion.<br>

        Ideally, we could have LLVM generate a table enumerating all the

        opcode pairs<br>

        that can be fused together. That table could be exposed to the

        tool via the<br>

        MCSubtargetInfo interface.  This is just an idea; there may be

        better ways to<br>

        implement this.<br>

        <br>

        3) Intel processors: mixing legacy SSE with AVX instructions.<br>

        <br>

        On modern Intel processors with AVX, mixing legacy SSE code with

        AVX code<br>

        negatively impacts the performance.  The tool is not aware of

        this issue, and<br>

        the performance penalty is not accounted when doing the

        analysis. This is<br>

        something that we would like to improve in future.<br>

        <br>

        4) Zero-latency register moves and Zero-idioms.<br>

        <br>

        Most modern AMD/Intel processors know how to optimize out

        register-register<br>

        moves and zero idioms at register renaming stage. The tool

        doesn't know<br>

        about these patterns, and this may negatively impact the

        performance analysis.<br>

        <br>

        Known design problems<br>

        ---------------------<br>

        This section describes two design issues that are currently

        affecting the tool.<br>

        The long term plan is to "fix" these issues.<br>

        <br>

        1) Variant instructions not correctly modeled.<br>

        <br>

        The tool doesn't know how to analyze instructions with a

        "variant" scheduling<br>

        class descriptor. A variant scheduling class needs to be

        resolved dynamically.<br>

        The "actual" scheduling class often depends on the subtarget, as

        well as<br>

        properties of the specific MachineInstr object.<br>

        <br>

        Unfortunately, the tool manipulates MCInst, and it doesn't know

        anything about<br>

        MachineInstr. As a consequence, the tool cannot use the existing

        machine<br>

        subtarget hooks that are normally used to resolve the variant

        scheduling class.<br>

        This is a major design issue which mostly affects ARM/AArch64

        targets.  It<br>

        mostly boils down to the fact that the existing scheduling

        framework was meant<br>

        to work for MachineInstr.<br>

        <br>

        When the tool encounters a "variant" instruction, it assumes a

        generic 1cy<br>

        latency. However, the tool would not be able to tell which

        processor resources<br>

        are effectively consumed by the variant instruction.<br>

        <br>

        2) MCInst and MCInstrDesc.<br>

        <br>

        Performance analysis tools require data dependency information

        to correctly<br>

        predict the runtime performance of the code. This tool must

        always be able to<br>

        obtain the set of implicit/explicit register defs/uses for every

        instruction of<br>

        the input assembly sequence.<br>

        <br>

        In the first section of this document, it was mentioned how the

        tool takes as<br>

        input an assembly sequence. That sequence is parsed into a

        MCInst sequence with<br>

        the help of assembly parsers available from the targets.<br>

        <br>

        A MCInst is a very low-level instruction representation. The

        tool can inspect<br>

        the MCOperand sequence of an MCInst to identify register

        operands. However,<br>

        there is no way to tell register operands that are definitions

        from register<br>

        operands that are uses.<br>

        <br>

        In LLVM, class MCInstrDesc is used to fully describe target

        instructions and<br>

        their operands. The opcode of a machine instruction (a

        MachineInstr object) can<br>

        be used to query the instruction set through method

        `MCInstrInfo::get' to obtain<br>

        the associated MCInstrDesc object.<br>

        <br>

        However class MCInstrDesc describes properties and operands of

        MachineInstr<br>

        objects. Essentially, MCInstrDesc is not meant to be used to

        describe MCInst<br>

        objects.  To be more specific, MCInstrDesc objects are

        automatically generated<br>

        via tablegen from the instruction set description in the target

        .td files.  For<br>

        example, field `MCInstrDesc::NumDefs' is always equal to the

        cardinality of the<br>

        `(outs)` set from the tablegen instruction definition.<br>

        <br>

        By construction, register definitions always appear at the

        beginning of the<br>

        MachineOperands list in MachineInstr. Basically, the (outs) are

        the first<br>

        operands of a MachineInstr, and the (ins) will come after in the

        machine operand<br>

        list. Knowing the number of register definitions is enough to

        identify<br>

        all the register operands that are definitions.<br>

        <br>

        In a normal compilation process, MCInst objects are generated

        from MachineInstr<br>

        objects through a lowering step. By default the lowering logic

        simply iterates<br>

        over the machine operands of a MachineInstr, and

        converts/expands them into<br>

        equivalent MCOperand objects.<br>

        <br>

        The default lowering strategy has the advantage of preserving

        all of the<br>

        above mentioned assumptions on the machine operand sequence.

        That means, register<br>

        definitions would still be at the beginning of the MCOperand

        sequence, and<br>

        register uses would come after.<br>

        <br>

        Targets may still define custom lowering routines for specific

        opcodes. Some of<br>

        these routines may lower operands in a way that potentially

        breaks (some of) the<br>

        assumptions on the machine operand sequence which were valid for

        MachineInstr.<br>

        Luckily, this is not the most common form of lowering done by

        the targets, and<br>

        the vast majority of the MachineInstr are lowered based on the

        default strategy<br>

        which preserves the original machine operand sequence.  This is

        especially true<br>

        for x86, where the custom lowering logic always preserves the

        original (i.e.<br>

        from the MachineInstr) operand sequence.<br>

        <br>

        This tool currently works under the strong (and potentially

        incorrect)<br>

        assumption that register def/uses in a MCInst can always be

        identified by<br>

        querying the machine instruction descriptor for the opcode. This

        assumption made<br>

        it possible to develop this tool and get good numbers at least

        for the<br>

        processors available in the x86 backend.<br>

        <br>

        That being said, the analysis is still potentially incorrect for

        other targets.<br>

        So we plan (with the help of the community) to find a proper

        mechanism to map<br>

        when possible MCOperand indices back to MachineOperand indices

        of the equivalent<br>

        MachineInstr.  This would be equivalent to describing changes

        made by the<br>

        lowering step which affected the operand sequence. For example,

        we could have an<br>

        index for every register MCOperand (or -1, if the operand didn't

        exist in the<br>

        original MachineInstr). The mapping could look like this

        <0,1,3,2>.  Here,<br>

        MCOperand #2 was obtained from the lowering of MachineOperand

        #3. etc.<br>

        <br>

        This information could be automatically generated via tablegen

        for all the<br>

        instructions whose custom lowering step breaks assumptions made

        by the tool on<br>

        the register operand sequence (In general, these instructions

        should be the<br>

        minority of a target's instruction set). Unfortunately, we don't

        have that<br>

        information now.  As a consequence, we assume that the number of

        explicit<br>

        register definitions is the same number specified in

        MCInstrDesc.  We also<br>

        assume that register definitions always come first in the

        operand sequence.<br>

        <br>

        In conclusion: these are for now the strong assumptions made by

        the tool:<br>

          * The number of explicit and implicit register definitions in

        a MCInst<br>

            matches the number of explicit and implicit definitions

        specified by the<br>

            MCInstrDesc object.<br>

          * Register uses always come after register definitions.<br>

          * If an opcode specifies an optional definition, then the

        optional<br>

            definition is always the last register operand in the

        sequence.<br>

        <br>

        Note that some of the information accessible from the

        MCInstrDesc is always<br>

        valid for MCInst. For example: implicit register defs, implicit

        register uses<br>

        and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties

        still apply to<br>

        MCInst. The tool knows about this, and uses that information

        during its<br>

        analysis.<br>

        <br>

        What to do next<br>

        ---------------<br>

        The source code has been uploaded for review on phabricator at

        this link: <a href="https://reviews.llvm.org/D43951"

          moz-do-not-send="true">https://reviews.llvm.org/D43951</a>.<br>

        <br>

        The review covers two patches:<br>

        A first (very small) patch that always enables the generation of

        processor<br>

        resource names in the SubtargetEmitter. Currently, the processor

        resource names<br>

        are only generated for debugging purposes, but are needed by the

        tool to<br>

        generate user friendly reports, so we would like to always

        generate them.<br>

        A second patch with the actual static analysis tool (in

        llvm/tools).<br>

        <br>

        Once these first two patches are committed, the plan is to keep

        working on the<br>

        tool with the help of the community to address all of the

        limitations described<br>

        by the previous sections, and find good solutions/fixes for the

        design issues<br>

        described by section "Known design problems".<br>

        <br>

        We hope the community will find this tool useful like we have.<br>

        <br>

        Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg

        Bedwell who really<br>

        helped me a lot by suggesting improvements and testing the tool.<br>

        <br>

        Thanks for your time.<br>

        -Andrea<br>

        <br>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

    </blockquote>

    <br>

  </body>

</html>