<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Reading through this last night got me thinking about how to model
control flow. Given that most of my source code tends to be very
branchy, being limited to straight-line code is quite restrictive. The
main thing is that it requires a lot of hand simplification, which
can be rather error-prone at times. <br>
<br>
It occurs to me that we could remove the restriction around branches
without necessarily fully modeling fetch, decode, or prediction. If,
in addition to an assembly file, we also feed the tool a trace of
branch executions, the tool could essentially unroll all loops
dynamically. The type of thing I'm thinking of would look
something like this: (ENTRY, branch_dest, branch_dest, branch_dest),
where each "branch_dest" is the taken successor of the next
encountered branch instruction. <br>
<br>
As an example, consider a simple loop which executes 3 times:<br>
ENTRY:<br>
# fallthrough<br>
bb1:<br>
# various instructions<br>
jne bb1<br>
bb2:<br>
ret<br>
<br>
The trace for this would look like: (ENTRY, bb1, bb1,
FALLTHROUGH), with the jne taken twice before falling through to bb2. <br>
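<br>
To make that concrete, a consumer of such a trace might look roughly
like the following (pseudo-C++; every type and helper here is invented
purely for illustration, not part of any existing tool):<br>
<br>
  // Replay a branch trace to dynamically "unroll" the instruction<br>
  // stream, feeding each executed instruction to the analysis.<br>
  void replayTrace(const InstStream &Asm, const std::vector&lt;Label&gt; &Trace) {<br>
    auto I = Asm.begin(Trace.front());        // start at ENTRY<br>
    for (size_t N = 1; N < Trace.size(); ++N) {<br>
      while (!I->isBranch())                  // straight-line code<br>
        analyze(*I++);<br>
      analyze(*I);                            // the branch itself<br>
      // The trace supplies the taken successor of this branch.<br>
      I = (Trace[N] == FALLTHROUGH) ? std::next(I) : Asm.begin(Trace[N]);<br>
    }<br>
  }<br>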
<br>
The nice thing about such a branch trace is that it separates the
source of the trace from the analysis tool. You could provide an
actual trace collected through instrumentation, or a predicted trace
generated from sample profiling information. <br>
<br>
(There are obvious extensions to handle indirect branches, calls,
and such here. I'm leaving that as an exercise for the reader at
the moment.)<br>
<br>
By itself, this seems like a really useful usability improvement.<br>
<br>
You could also build on top of this notion to model both branch
mispredictions and resource limitations. (Or only one or
the other.) From the perspective of the predictor, the trace would
be ground truth, so you could analyze how well a particular
prediction strategy works on that particular trace. Distinct from
the prediction, you could model the delay between branch execution and
condition satisfaction to identify resource limits in the actual
speculative execution. (I'm very deliberately being vague here;
it's been a long time since I looked into this aspect of things and
I don't remember the appropriate terminology. Figured it was better
to be vague than potentially misleading by getting it wrong.)<br>
<br>
Philip<br>
<br>
<div class="moz-cite-prefix">On 03/01/2018 09:22 AM, Andrea Di
Biagio via llvm-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAMw=4mL+bANeYnUifa3f39o4mvY9bdPfdqXr6HJSXkO0KUB5Vw@mail.gmail.com">
<div dir="ltr">Hi all,<br>
<br>
At Sony we developed an LLVM based performance analysis tool
named llvm-mca. We<br>
currently use it internally to statically measure the
performance of code, and<br>
to help triage potential problems with target scheduling
models. We decided to<br>
post this RFC because we are interested in the feedback from the
community, and<br>
we also believe that other people might be interested in a tool
like this.<br>
<br>
llvm-mca uses information which is already available in LLVM
(e.g. scheduling<br>
models) to statically measure the performance of machine code on
a specific cpu.<br>
Performance is measured in terms of throughput as well as
processor resource<br>
consumption. The tool currently works for processors with an
out-of-order<br>
backend, for which there is a scheduling model available in
LLVM.<br>
<br>
The main goal of this tool is not just to predict the
performance of the code<br>
when run on the target, but also help with diagnosing potential
performance<br>
issues.<br>
<br>
Given an assembly code sequence, llvm-mca estimates the IPC
(instructions per<br>
cycle), as well as hardware resource pressure. The analysis and
reporting style<br>
were inspired by the IACA tool from Intel.<br>
<br>
The presence of long data dependency chains, as well as poor
usage of hardware<br>
resources, may lead to bottlenecks in the backend. The tool is
able to generate<br>
a detailed report which should help with identifying and
analyzing sources of<br>
bottlenecks.<br>
<br>
Scheduling models are mostly used to compute instruction
latencies, to obtain<br>
read-advance information, and to understand how processor resources
are used by<br>
instructions. By design, the quality of the performance
analysis conducted by<br>
the tool is inevitably affected by the quality of the target
scheduling models. <br>
<br>
However, scheduling models intentionally do not describe all
processor details,<br>
since the goal is just to enable the scheduling of machine
instructions during<br>
compilation. That means there are processor details which are
not important for<br>
the purpose of scheduling instructions (and therefore not
described by the<br>
scheduling model), but are very important for this tool.<br>
<br>
A few examples of details that are missing in scheduling models
are:<br>
- Maximum number of instructions retired per cycle.<br>
- Actual dispatch width (it often differs from the issue
width).<br>
- Number of temporary registers available for renaming.<br>
- Number of read/write ports in the register file(s).<br>
- Length of the load/store queue in the LSUnit.<br>
<br>
It is also very difficult to find a "good" abstract model to
describe the<br>
behavior of out-of-order processors. So, we have to keep in mind
that all of<br>
these aspects are going to affect the quality of the static
analysis performed<br>
by the tool.<br>
<br>
An extensive list of known limitations is reported in one of the
last sections<br>
of this document. There is also a section related to design
problems which must<br>
be addressed (hopefully with the help of the community). At the
moment, the<br>
tool has been mostly tested for x86 targets, but there are still
several<br>
limitations, some of which could be overcome by integrating
extra information<br>
into the scheduling models.<br>
<br>
As was mentioned before, this tool has been (and is still being)
used internally<br>
in Sony to debug/triage issues in the btver2 scheduling model.
We have also<br>
tested it on other targets to check how generic the tool is. In
our experience,<br>
the tool makes it easy to identify simple mistakes like "wrong
number of micro<br>
opcodes specified for an instruction", or "wrong set of hardware
resources".<br>
Some of these mistakes are quite common (sometimes on mature
models too), and<br>
often difficult to catch. Reports generated by this tool are
simple to analyze,<br>
and contain enough details to help triage most performance
problems.<br>
<br>
1. How the tool works<br>
---------------------<br>
<br>
The tool takes assembly code as input. Assembly code is parsed
into a sequence<br>
of MCInst with the help of the existing LLVM target assembly
parsers. The parsed<br>
sequence of MCInst is then analyzed by a 'Backend' module to
generate a<br>
performance report.<br>
<br>
The Backend module internally emulates the execution of the
machine code<br>
sequence in a loop of iterations (70 iterations by default). At
the end of this<br>
process, the backend collects a number of statistics which are
then printed out<br>
in the form of a report.<br>
<br>
Here is an example of a performance report generated by the tool
for a dot-product<br>
of two packed float vectors of four elements. The analysis is
conducted for<br>
target x86, cpu btver2:<br>
<br>
///////////////////<br>
<br>
Iterations: 300<br>
Instructions: 900<br>
Total Cycles: 610<br>
Dispatch Width: 2<br>
IPC: 1.48<br>
<br>
<br>
Resources:<br>
[0] - JALU0<br>
[1] - JALU1<br>
[2] - JDiv<br>
[3] - JFPM<br>
[4] - JFPU0<br>
[5] - JFPU1<br>
[6] - JLAGU<br>
[7] - JSAGU<br>
[8] - JSTC<br>
[9] - JVIMUL<br>
<br>
<br>
Resource pressure per iteration:<br>
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]   <br>
 -      -      -      -     2.00   1.00    -      -      -      -    <br>
<br>
Resource pressure by instruction:<br>
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:<br>
 -      -      -      -      -     1.00    -      -      -      -     vmulps %xmm0, %xmm1, %xmm2<br>
 -      -      -      -     1.00    -      -      -      -      -     vhaddps %xmm2, %xmm2, %xmm3<br>
 -      -      -      -     1.00    -      -      -      -      -     vhaddps %xmm3, %xmm3, %xmm4<br>
<br>
<br>
Instruction Info:<br>
[1]: #uOps<br>
[2]: Latency<br>
[3]: RThroughput<br>
[4]: MayLoad<br>
[5]: MayStore<br>
[6]: HasSideEffects<br>
<br>
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:<br>
 1      2     1.00                        vmulps %xmm0, %xmm1, %xmm2<br>
 1      3     1.00                        vhaddps %xmm2, %xmm2, %xmm3<br>
 1      3     1.00                        vhaddps %xmm3, %xmm3, %xmm4<br>
<br>
///////////////////<br>
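<br>
For reference, the input assembly parsed by the tool in this example
is just the three-instruction kernel listed in the report:<br>
<br>
  vmulps %xmm0, %xmm1, %xmm2<br>
  vhaddps %xmm2, %xmm2, %xmm3<br>
  vhaddps %xmm3, %xmm3, %xmm4<br>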
<br>
According to this report, the dot-product kernel has been
executed 300 times,<br>
for a total of 900 instructions dynamically executed.<br>
<br>
The report is structured in three main sections. A first
section collects a few<br>
performance numbers; the goal of this section is to give a very
quick overview<br>
of the performance throughput. In this example, the two
important performance<br>
indicators are a) the predicted total number of cycles, and b)
the IPC. <br>
IPC is probably the most important throughput indicator. A big
delta between the<br>
Dispatch Width and the computed IPC is an indicator of potential
performance<br>
issues.<br>
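In this example, the predicted IPC is 900 instructions / 610 cycles,
i.e. 1.48, against a Dispatch Width of 2; roughly a quarter of the
available dispatch bandwidth goes unused.<br>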
<br>
The second section is the so-called "resource pressure view".
This view reports<br>
the average number of resource cycles consumed every iteration
by instructions<br>
for every processor resource unit available on the target.
Information is<br>
structured in two tables. The first table reports the number of
resource cycles<br>
spent on average every iteration. The second table correlates
the resource<br>
cycles to the machine instructions in the sequence. For example, in
every iteration<br>
of the dot-product, instruction 'vmulps' always executes on
resource unit [5]<br>
(JFPU1 - floating point pipeline #1), consuming an average of 1
resource cycle<br>
per iteration. Note that on Jaguar, vector FP multiply can only
be issued to<br>
pipeline JFPU1, while horizontal FP adds can only be issued to
pipeline JFPU0.<br>
<br>
The third (and last) section of the report shows the latency and
reciprocal<br>
throughput of every instruction in the sequence. That section
also reports extra<br>
information related to the number of micro opcodes, and opcode
properties (i.e.<br>
'MayLoad', 'MayStore' and 'UnmodeledSideEffects').<br>
<br>
The resource pressure view helps with identifying bottlenecks
caused by high<br>
usage of specific hardware resources. Situations with resource
pressure mainly<br>
concentrated on a few resources should, in general, be avoided.
Ideally,<br>
pressure should be uniformly distributed between multiple
resources.<br>
<br>
Timeline View<br>
-------------<br>
<br>
A detailed report of each instruction's state transitions over
time can be<br>
enabled using the command line flag '-timeline'. This prints an
extra section<br>
in the report which contains the so-called "timeline view".
Below is the<br>
timeline view for the dot-product example from the previous
section.<br>
<br>
///////////////<br>
Timeline view:<br>
                012345<br>
Index 0123456789<br>
<br>
[0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2<br>
[0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3<br>
[0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4<br>
<br>
[1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2<br>
[1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3<br>
[1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4<br>
<br>
[2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2<br>
[2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3<br>
[2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4<br>
<br>
<br>
Average Wait times (based on the timeline view):<br>
[0]: Executions<br>
[1]: Average time spent waiting in a scheduler's queue<br>
[2]: Average time spent waiting in a scheduler's queue while
ready<br>
[3]: Average time elapsed from WB until retire stage<br>
<br>
       [0]    [1]    [2]    [3]<br>
0.      3     1.0    1.0    3.3   vmulps %xmm0, %xmm1, %xmm2<br>
1.      3     3.3    0.7    1.0   vhaddps %xmm2, %xmm2, %xmm3<br>
2.      3     5.7    0.0    0.0   vhaddps %xmm3, %xmm3, %xmm4<br>
///////////////<br>
<br>
The timeline view is very interesting because it shows how
instructions change<br>
state during execution. It also gives an idea of how the
tool "sees"<br>
instructions executed on the target.<br>
<br>
The timeline view is structured in two tables. The first table
shows how<br>
instructions change state over time (measured in cycles); the
second table<br>
(named "Average Wait times") reports useful timing statistics
which should help<br>
diagnose performance bottlenecks caused by long data
dependencies and sub-optimal<br>
usage of hardware resources.<br>
<br>
An instruction in the timeline view is identified by a pair of
indices, where<br>
the 'first' index identifies an iteration, and the 'second'
index is the actual<br>
instruction index (i.e. where it appears in the code sequence).<br>
<br>
Excluding the first and last column, the remaining columns are
in cycles. Cycles<br>
are numbered sequentially starting from 0. The following
characters are used to<br>
describe the state of an instruction:<br>
<br>
D : Instruction dispatched.<br>
e : Instruction executing.<br>
E : Instruction executed.<br>
R : Instruction retired.<br>
= : Instruction already dispatched, waiting to be executed.<br>
- : Instruction executed, waiting to be retired.<br>
<br>
Based on the timeline view from the example, we know that:<br>
- Instruction [1, 0] was dispatched at cycle 1.<br>
- Instruction [1, 0] started executing at cycle 2.<br>
- Instruction [1, 0] reached the write back stage at cycle 4.<br>
- Instruction [1, 0] was retired at cycle 10.<br>
<br>
Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't
have to wait in<br>
the scheduler's queue for the operands to become available. By
the time the<br>
vmulps is dispatched, operands are already available, and
pipeline JFPU1 is<br>
ready to serve another instruction. So the instruction can be
immediately<br>
issued on the JFPU1 pipeline. That is demonstrated by the fact
that the<br>
instruction only spent 1cy in the scheduler's queue.<br>
<br>
There is a gap of 5 cycles between the write-back stage and the
retire event.<br>
That is because instructions must retire in program order, so
[1,0] has to wait<br>
for [0, 2] to be retired first (i.e. it has to wait until cycle
10).<br>
<br>
In the dot-product example, all instructions are in a RAW (Read
After Write)<br>
dependency chain. Register %xmm2 written by the vmulps is
immediately used by<br>
the first vhaddps, and register %xmm3 written by the first
vhaddps is used by<br>
the second vhaddps. Long data dependencies negatively affect
the ILP<br>
(Instruction Level Parallelism).<br>
<br>
In the dot-product example, there are anti-dependencies
introduced by<br>
instructions from different iterations. However, those
dependencies can be<br>
removed at the register renaming stage (at the cost of allocating
register aliases,<br>
and therefore consuming temporary registers).<br>
<br>
Table "Average Wait times" helps diagnose performance issues
that are caused by<br>
the presence of long latency instructions and potentially long
data dependencies<br>
which may limit the ILP. Note that the tool by default assumes
at least 1cy<br>
between the dispatch event and the issue event.<br>
<br>
When the performance is limited by data dependencies and/or long
latency<br>
instructions, the number of cycles spent while in the "ready"
state is expected<br>
to be very small when compared with the total number of cycles
spent in the<br>
scheduler's queue. So the difference between the two counters
is a good<br>
indicator of how big of an impact data dependencies had on the
execution of<br>
instructions. When performance is mostly limited by the lack of
hardware<br>
resources, the delta between the two counters is small.
However, the number of<br>
cycles spent in the queue tends to be bigger (i.e. more than
1-3cy) especially<br>
when compared with other low latency instructions.<br>
<br>
Extra statistics to further diagnose performance issues.<br>
--------------------------------------------------------<br>
<br>
Flag '-verbose' enables extra statistics and performance
counters for the<br>
dispatch logic, the reorder buffer, the retire control unit and
the register<br>
file.<br>
<br>
Below is an example of verbose output generated by the tool for
the dot-product<br>
example discussed in the previous sections.<br>
<br>
///////////////////<br>
Iterations: 300<br>
Instructions: 900<br>
Total Cycles: 610<br>
Dispatch Width: 2<br>
IPC: 1.48<br>
<br>
<br>
Dynamic Dispatch Stall Cycles:<br>
RAT - Register unavailable: 0<br>
RCU - Retire tokens unavailable: 0<br>
SCHEDQ - Scheduler full: 272<br>
LQ - Load queue full: 0<br>
SQ - Store queue full: 0<br>
GROUP - Static restrictions on the dispatch group: 0<br>
<br>
<br>
Register Alias Table:<br>
Total number of mappings created: 900<br>
Max number of mappings used: 35<br>
<br>
<br>
Dispatch Logic - number of cycles where we saw N instructions
dispatched:<br>
[# dispatched], [# cycles]<br>
0, 24 (3.9%)<br>
1, 272 (44.6%)<br>
2, 314 (51.5%)<br>
<br>
<br>
Schedulers - number of cycles where we saw N instructions
issued:<br>
[# issued], [# cycles]<br>
0, 7 (1.1%)<br>
1, 306 (50.2%)<br>
2, 297 (48.7%)<br>
<br>
<br>
Retire Control Unit - number of cycles where we saw N
instructions retired:<br>
[# retired], [# cycles]<br>
0, 109 (17.9%)<br>
1, 102 (16.7%)<br>
2, 399 (65.4%)<br>
<br>
<br>
Scheduler's queue usage:<br>
JALU01, 0/20<br>
JFPU01, 18/18<br>
JLSAGU, 0/12<br>
///////////////////<br>
<br>
Based on the verbose report, the backend was only able to
dispatch two<br>
instructions 51.5% of the time. The dispatch group was limited
to one<br>
instruction 44.6% of the cycles, which corresponds to 272
cycles.<br>
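(272 out of 610 total cycles is, indeed, 44.6%.)<br>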
<br>
If we look at section "Dynamic Dispatch Stall Cycles", we can
see how counter<br>
SCHEDQ reports 272 cycles. Counter SCHEDQ is incremented every
time the dispatch<br>
logic is unable to dispatch a full group of two instructions
because the<br>
scheduler's queue is full.<br>
<br>
Section "Scheduler's queue usage" shows how the maximum number
of buffer entries<br>
(i.e. scheduler's queue entries) used at runtime for resource
JFPU01 reached its<br>
maximum. Note that AMD Jaguar implements three schedulers:<br>
* JALU01 - A scheduler for ALU instructions<br>
* JLSAGU - A scheduler for address generation<br>
* JFPU01 - A scheduler floating point operations.<br>
<br>
The dot-product is a kernel of three floating point instructions
(a vector<br>
multiply followed by two horizontal adds). That explains why
only the floating<br>
point scheduler appears to be used according to section
"Scheduler's queue<br>
usage".<br>
<br>
A full scheduler's queue is either caused by data dependency
chains, or by a<br>
sub-optimal usage of hardware resources. Sometimes, resource
pressure can be<br>
mitigated by rewriting the kernel using different instructions
that consume<br>
different scheduler resources. Schedulers with a small queue
are less resilient<br>
to bottlenecks caused by the presence of long data dependencies.<br>
<br>
In this example, we can conclude that the IPC is mostly limited
by data<br>
dependencies, and not by resource pressure.<br>
<br>
LLVM-MCA instruction flow<br>
-------------------------<br>
<br>
This section describes the instruction flow through the
out-of-order backend, as<br>
well as the functional units involved in the process. <br>
<br>
An instruction goes through a default sequence of stages:<br>
- Dispatch (Instruction is dispatched to the schedulers).<br>
- Issue (Instruction is issued to the processor pipelines).<br>
- Write Back (Instruction is executed, and results are
written back).<br>
- Retire (Instruction is retired; writes are architecturally
committed).<br>
<br>
The tool only models the out-of-order portion of a processor.
Therefore, the<br>
instruction fetch and decode stages are not modeled. Performance
bottlenecks in<br>
the frontend are not diagnosed by this tool. The tool assumes
that instructions<br>
have all been decoded and placed in a queue. Also, the tool
doesn't know<br>
anything about branch prediction.<br>
<br>
Instruction Dispatch<br>
--------------------<br>
<br>
During the Dispatch stage, instructions are picked in program
order from a queue<br>
of already decoded instructions, and dispatched in groups to the
hardware<br>
schedulers. The dispatch logic is implemented by class
DispatchUnit in file<br>
Dispatch.h.<br>
<br>
The size of a dispatch group depends on the availability of
hardware resources,<br>
and it cannot exceed the value of field 'DispatchWidth' in class
DispatchUnit.<br>
Note that field DispatchWidth defaults to the value of field
'IssueWidth' from<br>
the scheduling model.<br>
<br>
Users can override the DispatchWidth value with flag
"-dispatch=<N>" (where 'N'<br>
is an unsigned quantity).<br>
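<br>
For example, assuming the usual -mtriple/-mcpu options used by other
LLVM tools (the exact invocation may differ), a run with a dispatch
width of 4 would look something like:<br>
<br>
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -dispatch=4 dot-product.s<br>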
<br>
An instruction can be dispatched if:<br>
- The size of the dispatch group is smaller than DispatchWidth<br>
- There are enough entries in the reorder buffer<br>
- There are enough temporary registers to do register renaming<br>
- Schedulers are not full.<br>
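<br>
Condensed into pseudo-C++, the checks above amount to something like
this (just a sketch of the rules listed above; the member names are
invented and do not match the actual DispatchUnit interface):<br>
<br>
  bool DispatchUnit::canDispatch(const InstrDesc &Desc,<br>
                                 unsigned GroupSize) const {<br>
    return GroupSize < DispatchWidth          // room in the dispatch group<br>
        && RCU.isAvailable(Desc.NumMicroOps)  // enough reorder buffer entries<br>
        && RAT.isAvailable(Desc.NumRegDefs)   // enough renaming temporaries<br>
        && !Scheduler.isFull(Desc);           // scheduler queue has space<br>
  }<br>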
<br>
Scheduling models don't describe register files, and therefore
the tool doesn't<br>
know if there is more than one register file, and how many
temporaries are<br>
available for register renaming.<br>
<br>
By default, the tool (optimistically) assumes a single register
file with an<br>
unbounded number of temporary registers. Users can limit the
number of<br>
temporary registers available for register renaming using flag<br>
`-register-file-size=<N>`, where N is the number of
temporaries. A value of<br>
zero for N means 'unbounded'. Knowing how many temporaries are
available for<br>
register renaming, the tool can predict dispatch stalls caused
by the lack of<br>
temporaries.<br>
<br>
The number of reorder buffer entries consumed by an instruction
depends on the<br>
number of micro-opcodes it specifies in the target scheduling
model (see field<br>
'NumMicroOpcodes' of tablegen class ProcWriteResources and its
derived classes;<br>
TargetSchedule.td).<br>
<br>
The reorder buffer is implemented by class RetireControlUnit
(see Dispatch.h).<br>
Its goal is to track the progress of instructions that are
"in-flight", and<br>
retire instructions in program order. The number of entries in
the reorder<br>
buffer defaults to the value of field 'MicroOpBufferSize' from
the target<br>
scheduling model.<br>
<br>
Instructions that are dispatched to the schedulers consume
scheduler buffer<br>
entries. The tool queries the scheduling model to figure out
the set of<br>
buffered resources consumed by an instruction. Buffered
resources are treated<br>
like "scheduler" resources, and the field 'BufferSize' (from the
processor<br>
resource tablegen definition) defines the size of the
scheduler's queue.<br>
<br>
Zero-latency instructions (for example, NOP instructions) don't
consume scheduler<br>
resources. However, those instructions still reserve a number
of slots in the<br>
reorder buffer.<br>
<br>
Instruction Issue<br>
-----------------<br>
<br>
As mentioned in the previous section, each scheduler resource
implements a queue<br>
of instructions. An instruction has to wait in the scheduler's
queue until<br>
input register operands become available. Only at that point does
the<br>
instruction become eligible for execution, and it may then be issued
(potentially<br>
out-of-order) to a pipeline for execution.<br>
<br>
Instruction latencies can be computed by the tool with the help
of the<br>
scheduling model; latency values are defined by the scheduling
model through<br>
ProcWriteResources objects.<br>
<br>
Class Scheduler (see file Scheduler.h) knows how to emulate
multiple processor<br>
schedulers. A Scheduler is responsible for tracking data
dependencies, and<br>
dynamically selecting which processor resources are consumed/used
by instructions.<br>
<br>
Internally, the Scheduler class delegates the management of
processor resource<br>
units and resource groups to the ResourceManager class.
ResourceManager is also<br>
responsible for selecting resource units that are effectively
consumed by<br>
instructions. For example, if an instruction consumes 1cy of a
resource group,<br>
the ResourceManager object selects one of the available units
from the group; by<br>
default, it uses a round-robin selector to guarantee that
resource usage is<br>
uniformly distributed between all units of a group.<br>
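<br>
The round-robin selection itself can be as simple as the following
sketch (for illustration only; this is not the actual ResourceManager
code, and it assumes a group is just an array of unit identifiers):<br>
<br>
  // Pick the next unit of a group, wrapping around so that usage<br>
  // stays uniformly distributed between the units over time.<br>
  unsigned selectUnit(const unsigned *Units, unsigned NumUnits,<br>
                      unsigned &NextIdx) {<br>
    unsigned Selected = Units[NextIdx];<br>
    NextIdx = (NextIdx + 1) % NumUnits;<br>
    return Selected;<br>
  }<br>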
<br>
Internally, class Scheduler implements three instruction queues:<br>
- WaitQueue: a queue of instructions whose operands are not
ready yet.<br>
- ReadyQueue: a queue of instructions ready to execute.<br>
- IssuedQueue: a queue of instructions executing.<br>
<br>
Depending on operand availability, instructions that are
dispatched to the<br>
Scheduler are either placed into the WaitQueue or into the
ReadyQueue.<br>
<br>
Every cycle, class Scheduler checks if instructions can be moved
from the<br>
WaitQueue to the ReadyQueue, and if instructions from the
ReadyQueue can be<br>
issued to the underlying pipelines. The algorithm prioritizes
older<br>
instructions over younger instructions.<br>
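<br>
In pseudo-C++, the per-cycle update just described might look
something like this (a sketch only; the method and member names are
invented, not the actual Scheduler interface):<br>
<br>
  void Scheduler::cycleEvent() {<br>
    // Wake up instructions whose input operands are now available.<br>
    for (auto It = WaitQueue.begin(); It != WaitQueue.end();) {<br>
      if ((*It)->isReady()) {<br>
        ReadyQueue.push_back(*It);<br>
        It = WaitQueue.erase(It);<br>
      } else<br>
        ++It;<br>
    }<br>
    // Issue ready instructions, oldest first, while resources allow.<br>
    std::sort(ReadyQueue.begin(), ReadyQueue.end(), byProgramOrder);<br>
    for (auto It = ReadyQueue.begin(); It != ReadyQueue.end();) {<br>
      if (Resources.canBeIssued(**It)) {<br>
        issue(**It);                  // consumes processor resources<br>
        IssuedQueue.push_back(*It);<br>
        It = ReadyQueue.erase(It);<br>
      } else<br>
        ++It;<br>
    }<br>
  }<br>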
<br>
Objects of class ResourceState (see Scheduler.h) describe
processor resources.<br>
There is an instance of class ResourceState for each single
processor resource<br>
specified by the scheduling model. A ResourceState object for a
processor<br>
resource with multiple units dynamically tracks the availability
of every single<br>
unit. For example, the ResourceState of a resource group tracks
the<br>
availability of every resource in that group. Internally,
ResourceState<br>
implements a round-robin selector to dynamically pick the next
unit to use from<br>
the group.<br>
<br>
Write-Back and Retire Stage<br>
---------------------------<br>
<br>
Issued instructions are moved from the ReadyQueue to the
IssuedQueue. There,<br>
instructions wait until they reach the write-back stage. At
that point, they<br>
get removed from the queue and the retire control unit is
notified.<br>
<br>
On the event of "instruction executed", the retire control unit
flags the<br>
instruction as "ready to retire".<br>
<br>
Instructions are retired in program order; an "instruction
retired" event is sent<br>
to the register file which frees the temporary registers
allocated for the<br>
instruction at register renaming stage.<br>
<br>
Load/Store Unit and Memory Consistency Model<br>
--------------------------------------------<br>
<br>
The tool attempts to emulate out-of-order execution of memory
operations. Class<br>
LSUnit (see file LSUnit.h) emulates a load/store unit
implementing queues for<br>
speculative execution of loads and stores.<br>
<br>
Each load (or store) consumes an entry in the load (or store)
queue. The number<br>
of slots in the load/store queues is unknown by the tool, since
there is no<br>
mention of it in the scheduling model. In practice, users can
specify flag<br>
`-lqueue=N` (resp. `-squeue=N`) to limit the number of entries in
the load (resp. store) queue<br>
to exactly N (an unsigned value). If N is zero, then the
tool assumes an<br>
unbounded queue (this is the default).<br>
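<br>
For example, a run capping the load queue at 16 entries and the store
queue at 8 (arbitrary values, purely for illustration; exact
invocation details may differ) would look something like:<br>
<br>
  $ llvm-mca -mcpu=btver2 -lqueue=16 -squeue=8 kernel.s<br>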
<br>
LSUnit implements a relaxed consistency model for memory loads
and stores. The<br>
rules are:<br>
1) A younger load is allowed to pass an older load only if there
is no<br>
intervening store in between the two loads.<br>
2) A younger store is not allowed to pass an older store.<br>
3) A younger store is not allowed to pass an older load.<br>
4) A younger load is allowed to pass an older store provided
that the load does<br>
not alias with the store.<br>
<br>
By default, this class conservatively (i.e. pessimistically)
assumes that loads<br>
always may-alias store operations. Essentially, this LSUnit
doesn't perform any<br>
sort of alias analysis to rule out cases where loads and stores
don't overlap<br>
with each other. The downside of this approach however is that
younger loads are<br>
never allowed to pass older stores. To make it possible for a
younger load to<br>
pass an older store, users can use the command line flag
-noalias. Under<br>
'noalias', a younger load is always allowed to pass an older
store.<br>
<br>
Note that, in the case of write-combining memory, rule 2 could
be relaxed a bit<br>
to allow reordering of non-aliasing store operations. That
being said, at the<br>
moment, there is no way to further relax the memory model (flag
-noalias is the<br>
only option). Essentially, there is no option to specify a
different memory<br>
type (for example: write-back, write-combining, write-through;
etc.) and<br>
consequently to weaken or strengthen the memory model.<br>
<br>
Other limitations are:<br>
* LSUnit doesn't know when store-to-load forwarding may occur.<br>
* LSUnit doesn't know anything about the cache hierarchy and
memory types.<br>
* LSUnit doesn't know how to identify serializing operations
and memory fences.<br>
<br>
No assumption is made on the store buffer size. As mentioned
before, LSUnit<br>
conservatively assumes a may-alias relation between loads and
stores, and it<br>
doesn't attempt to identify cases where store-to-load forwarding
would occur in<br>
practice.<br>
<br>
LSUnit doesn't attempt to predict whether a load or store hits
or misses the L1<br>
cache. It only knows if an instruction "MayLoad" and/or
"MayStore". For loads,<br>
the scheduling model provides an "optimistic" load-to-use
latency (which usually<br>
matches the load-to-use latency for when there is a hit in the
L1D).<br>
<br>
Class MCInstrDesc in LLVM doesn't know about serializing
operations, nor<br>
memory-barrier-like instructions. LSUnit conservatively assumes
that an<br>
instruction which has both 'MayLoad' and 'UnmodeledSideEffects'
behaves like a<br>
"soft" load-barrier. That means, it serializes loads without
forcing a flush of<br>
the load queue. Similarly, instructions flagged with both
'MayStore' and<br>
'UnmodeledSideEffects' are treated like store barriers. A full
memory barrier<br>
is a 'MayLoad' and 'MayStore' instruction with
'UnmodeledSideEffects'. This is<br>
inaccurate, but it is the best that we can do at the moment with
the current<br>
information available in LLVM.<br>
<br>
A load/store barrier consumes one entry of the load/store
queue. A load/store<br>
barrier enforces ordering of loads/stores. A younger load
cannot pass a load<br>
barrier. Also, a younger store cannot pass a store barrier. A
younger load has<br>
to wait for the memory/load barrier to execute. A load/store
barrier is<br>
"executed" when it becomes the oldest entry in the load/store
queue(s). That<br>
also means, by construction, all the older loads/stores have
been executed.<br>
<br>
In conclusion, the full set of rules is:<br>
1. A store may not pass a previous store.<br>
2. A load may not pass a previous store unless flag 'NoAlias'
is set.<br>
3. A load may pass a previous load.<br>
4. A store may not pass a previous load (regardless of flag
'NoAlias').<br>
5. A load has to wait until an older load barrier is fully
executed.<br>
6. A store has to wait until an older store barrier is fully
executed.<br>
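<br>
As a sketch, those six rules could be condensed into a single check
like the one below (pseudo-C++ with invented helper names; the real
LSUnit tracks queue entries rather than calling such helpers):<br>
<br>
  // Returns true if the new memory operation must wait for older<br>
  // memory operations to execute first.<br>
  bool LSUnit::mustWait(const MemOp &Op) const {<br>
    if (Op.isStore())<br>
      // Rules 1, 4 and 6: a store passes neither older stores (and<br>
      // store barriers) nor older loads.<br>
      return hasOlderStores() || hasOlderLoads();<br>
    // Op is a load: it may pass older loads (rule 3), but must wait<br>
    // for older load barriers (rule 5) and, unless 'NoAlias' is set,<br>
    // for older stores too (rule 2).<br>
    return hasOlderLoadBarriers() || (!AssumeNoAlias && hasOlderStores());<br>
  }<br>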
<br>
Known limitations<br>
-----------------<br>
Previous sections described cases where the tool is missing
information to give<br>
an accurate report. For example, the first sections of this
document explained<br>
how the lack of knowledge about the processor negatively affects
the performance<br>
analysis. The lack of knowledge is often a consequence of how
scheduling models<br>
are defined; as mentioned before, scheduling models
intentionally don't describe<br>
processors in fine detail.<br>
<br>
The accuracy of the performance analysis is also affected by
assumptions made by<br>
the processor model used by the tool.<br>
<br>
Most recent Intel and AMD processors implement dedicated
LoopBuffer/OpCache in<br>
the hardware frontend to speed up throughput in the presence
of tight loops.<br>
The presence of these buffers complicates the decoding logic,
and requires<br>
knowledge of the branch predictor too. Class
'SchedMachineModel' in tablegen<br>
provides a field named 'LoopMicroOpBufferSize' which is used to
describe loop<br>
buffers. However, the purpose of that field is to enable loop
unrolling of<br>
tight loops; essentially, it affects the cost model used by pass
loop-unroll.<br>
<br>
By design, the tool only cares about the out-of-order portion of
a processor,<br>
and consequently doesn't try to predict the frontend
throughput. Processors may<br>
implement complex decoding schemes; statically predicting the
frontend<br>
throughput is in general beyond the scope of this tool. For the
same reasons,<br>
this tool intentionally doesn't model branch prediction. That
being said, this<br>
tool could definitely be extended in the future to also account for
the hardware<br>
frontend when doing performance analysis. This would inevitably
require extra<br>
(extensive) processor knowledge related to all the available
decoding paths in<br>
the hardware frontend.<br>
<br>
When computing the IPC, the tool assumes a zero-latency
"perfect" fetch&decode<br>
stage; the full sequence of decoded instructions is immediately
visible to the<br>
dispatch logic from the start.<br>
<br>
The tool doesn't know about simultaneous multithreading.
According to the tool,<br>
processor resources are not statically/dynamically partitioned.
Processor<br>
resources are fully available to the hardware thread executing
the<br>
microbenchmark.<br>
<br>
The execution model implemented by this tool assumes that
instructions are<br>
firstly dispatched in groups to hardware schedulers, and then
issued to<br>
pipelines for execution. The model assumes dynamic scheduling
of instructions.<br>
Instructions are placed in a queue and potentially executed
out-of-order (based<br>
on the operand availability). The dispatch stage is definitely
distinct from the<br>
issue stage.<br>
<br>
This model doesn't correctly describe processors where the
dispatch/issue is a<br>
single stage. This is what happens for example in VLIW
processors, where<br>
instructions are packaged and statically scheduled at compile
time; it is up to<br>
the compiler to predict the latency of instructions and package
issue groups<br>
accordingly. For such targets, there is no dynamic scheduling
done by the<br>
hardware.<br>
<br>
Existing classes (DispatchUnit, Scheduler, etc.) could be
extended/adapted to<br>
support processors with a single dispatch/issue stage. The
execution flow would<br>
require some changes in the way existing components (i.e.
DispatchUnit,<br>
Scheduler, etc.) interact. This can be a future development.<br>
<br>
The following sections describe other known limitations. The
goal is not to<br>
provide an extensive list of limitations; we want to report what
we believe are<br>
the most important limitations, and suggest possible methods to
overcome them.<br>
<br>
Load/Store barrier instructions and serializing operations<br>
----------------------------------------------------------<br>
Section "Load/Store Unit and Memory Consistency Model" already
mentioned how<br>
LLVM doesn't know about serializing operations and memory
barriers. Most of it<br>
boils down to the fact that class MCInstrDesc (intentionally)
doesn't expose<br>
those properties. Instead, both serializing operations and
memory barriers<br>
"have side-effects" according to MCInstrDesc. That is because,
at least for<br>
scheduling purposes, knowing that an instruction has unmodeled
side effects is<br>
often enough to treat the instruction like a compiler scheduling
barrier.<br>
<br>
A performance analysis tool could use the extra knowledge on
barriers and<br>
serializing operations to generate a more accurate performance
report. One way<br>
to improve this is by reserving a couple of bits in field
'Flags' from class<br>
MCInstrDesc: one bit for barrier operations, and another bit to
mark<br>
instructions as serializing operations.<br>
<br>
Lack of support for instruction itineraries<br>
-------------------------------------------<br>
The current version of the tool doesn't know how to process
instruction<br>
itineraries. This is probably one of the most important
limitations, since it<br>
affects a few out-of-order processors in LLVM.<br>
<br>
As mentioned in section 'Instruction Issue', class Scheduler
delegates to an<br>
instance of class ResourceManager the handling of processor
resources.<br>
ResourceManager is where most of the scheduling logic is
implemented.<br>
<br>
Adding support for instruction itineraries requires that we
teach<br>
ResourceManager how to handle functional units and instruction
stages. This<br>
development can be a future extension, and it would probably
require a few<br>
changes to the ResourceManager interface.<br>
<br>
Instructions that affect control flow are not correctly modeled<br>
---------------------------------------------------------------<br>
Examples of instructions that affect the control flow are:
returns, indirect<br>
branches, calls, etc. The tool doesn't try to predict/evaluate
branch targets.<br>
In particular, the tool doesn't model any sort of branch
prediction, nor does it<br>
attempt to track changes to the program counter. The tool
always assumes that<br>
the input assembly sequence is the body of a microbenchmark (a
simple loop<br>
executed for a number of iterations). The "next" instruction in
sequence is<br>
always the next instruction to dispatch.<br>
<br>
Call instructions default to an arbitrarily high latency of 100cy.
A warning is<br>
generated if the tool encounters a call instruction in the
sequence. Return<br>
instructions are not evaluated, and therefore control flow is
not affected.<br>
However, the tool still queries the processor scheduling model
to obtain latency<br>
information for instructions that affect the control flow.<br>
<br>
Possible extensions to the scheduling model<br>
-------------------------------------------<br>
Section "Instruction Dispatch" explained how the tool doesn't
know about the<br>
register files, and temporaries available in each register file
for register<br>
renaming purposes.<br>
<br>
The LLVM scheduling model could be extended to better describe
register files.<br>
Ideally, the scheduling model should be able to define:<br>
- The size of each register file<br>
- How many temporary registers are available for register
renaming<br>
- How register classes map to register files<br>
<br>
The scheduling model doesn't specify the retire throughput (i.e.
how many<br>
instructions can be retired every cycle). Users can specify
flag<br>
`-max-retire-per-cycle=<uint>` to limit how many
instructions the retire control<br>
unit can retire every cycle. Ideally, every processor should be
able to specify<br>
the retire throughput (for example, by adding an extra field to
the scheduling<br>
model tablegen class).<br>
<br>
Known limitations on X86 processors<br>
-----------------------------------<br>
<br>
1) Partial register updates versus full register updates.<br>
<br>
On x86-64, a 32-bit GPR write fully updates the super-register.
Example:<br>
add %edi, %eax ## eax += edi<br>
<br>
Here, register %eax aliases the lower half of 64-bit register
%rax. On x86-64,<br>
register %rax is fully updated by the 'add' (the upper half of
%rax is zeroed).<br>
Essentially, it "kills" any previous definition of (the upper
half of) register<br>
%rax.<br>
<br>
On the other hand, 8/16 bit register writes only perform a
so-called "partial<br>
register update". Example:<br>
add %di, %ax ## ax += di<br>
<br>
Here, register %eax is only partially updated. To be more
specific, the lower<br>
half of %eax is set, and the upper half is left unchanged. There
is also no<br>
change in the upper 48 bits of register %rax.<br>
<br>
To get accurate performance analysis, the tool has to know which
instructions<br>
perform a partial register update, and which instructions fully
update the<br>
destination's super-register.<br>
<br>
One way to expose this information is (again) via tablegen. For
example, we<br>
could add a flag in the tablegen instruction class to tag
instructions that<br>
perform partial register updates. Something like this: 'bit<br>
hasPartialRegisterUpdate = 1'. However, this would force a `let<br>
hasPartialRegisterUpdate = 0` on several instruction
definitions.<br>
<br>
Another approach is to have a MCSubtargetInfo hook similar to
this:<br>
virtual bool updatesSuperRegisters(unsigned short opcode) {
return false; }<br>
<br>
Targets will be able to override this method if needed. Again,
this is just an<br>
idea. But the plan is to have this fixed as a future
development.<br>
<br>
2) Macro Op fusion.<br>
<br>
The tool doesn't know about macro-op fusion. On modern x86
processors, a<br>
'cmp/test' followed by a conditional jump is fused into a single macro
operation. The<br>
advantage is that the fused pair only consumes a single slot in
the dispatch<br>
group. <br>
<br>
As a future development, the tool should be extended to address
macro-fusion.<br>
Ideally, we could have LLVM generate a table enumerating all the
opcode pairs<br>
that can be fused together. That table could be exposed to the
tool via the<br>
MCSubtargetInfo interface. This is just an idea; there may be
better ways to<br>
implement this.<br>
<br>
3) Intel processors: mixing legacy SSE with AVX instructions.<br>
<br>
On modern Intel processors with AVX, mixing legacy SSE code with
AVX code<br>
negatively impacts the performance. The tool is not aware of
this issue, and<br>
the performance penalty is not accounted for when doing the
analysis. This is<br>
something that we would like to improve in future.<br>
<br>
4) Zero-latency register moves and Zero-idioms.<br>
<br>
Most modern AMD/Intel processors know how to optimize out
register-register<br>
moves and zero idioms at the register renaming stage. The tool
doesn't know<br>
about these patterns, and this may negatively impact the
performance analysis.<br>
<br>
Known design problems<br>
---------------------<br>
This section describes two design issues that are currently
affecting the tool.<br>
The long term plan is to "fix" these issues.<br>
<br>
1) Variant instructions not correctly modeled.<br>
<br>
The tool doesn't know how to analyze instructions with a
"variant" scheduling<br>
class descriptor. A variant scheduling class needs to be
resolved dynamically.<br>
The "actual" scheduling class often depends on the subtarget, as
well as<br>
properties of the specific MachineInstr object.<br>
<br>
Unfortunately, the tool manipulates MCInst, and it doesn't know
anything about<br>
MachineInstr. As a consequence, the tool cannot use the existing
machine<br>
subtarget hooks that are normally used to resolve the variant
scheduling class.<br>
This is a major design issue which mostly affects ARM/AArch64
targets. It<br>
mostly boils down to the fact that the existing scheduling
framework was meant<br>
to work for MachineInstr.<br>
<br>
When the tool encounters a "variant" instruction, it assumes a
generic 1cy<br>
latency. However, the tool would not be able to tell which
processor resources<br>
are effectively consumed by the variant instruction.<br>
<br>
2) MCInst and MCInstrDesc.<br>
<br>
Performance analysis tools require data dependency information
to correctly<br>
predict the runtime performance of the code. This tool must
always be able to<br>
obtain the set of implicit/explicit register defs/uses for every
instruction of<br>
the input assembly sequence.<br>
<br>
In the first section of this document, it was mentioned how the
tool takes as<br>
input an assembly sequence. That sequence is parsed into a
MCInst sequence with<br>
the help of assembly parsers available from the targets.<br>
<br>
A MCInst is a very low-level instruction representation. The
tool can inspect<br>
the MCOperand sequence of an MCInst to identify register
operands. However,<br>
there is no way to distinguish register operands that are definitions
from register<br>
operands that are uses.<br>
<br>
In LLVM, class MCInstrDesc is used to fully describe target
instructions and<br>
their operands. The opcode of a machine instruction (a
MachineInstr object) can<br>
be used to query the instruction set through method
`MCInstrInfo::get' to obtain<br>
the associated MCInstrDesc object.<br>
<br>
However class MCInstrDesc describes properties and operands of
MachineInstr<br>
objects. Essentially, MCInstrDesc is not meant to be used to
describe MCInst<br>
objects. To be more specific, MCInstrDesc objects are
automatically generated<br>
via tablegen from the instruction set description in the target
.td files. For<br>
example, field `MCInstrDesc::NumDefs' is always equal to the
cardinality of the<br>
`(outs)` set from the tablegen instruction definition.<br>
<br>
By construction, register definitions always appear at the
beginning of the<br>
MachineOperands list in MachineInstr. Basically, the (outs) are
the first<br>
operands of a MachineInstr, and the (ins) will come after in the
machine operand<br>
list. Knowing the number of register definitions is enough to
identify<br>
all the register operands that are definitions.<br>
<br>
In a normal compilation process, MCInst objects are generated
from MachineInstr<br>
objects through a lowering step. By default the lowering logic
simply iterates<br>
over the machine operands of a MachineInstr, and
converts/expands them into<br>
equivalent MCOperand objects.<br>
<br>
The default lowering strategy has the advantage of preserving
all of the<br>
above mentioned assumptions on the machine operand sequence.
That means, register<br>
definitions would still be at the beginning of the MCOperand
sequence, and<br>
register uses would come after.<br>
<br>
Targets may still define custom lowering routines for specific
opcodes. Some of<br>
these routines may lower operands in a way that potentially
breaks (some of) the<br>
assumptions on the machine operand sequence which were valid for
MachineInstr.<br>
Luckily, this is not the most common form of lowering done by
the targets, and<br>
the vast majority of the MachineInstr are lowered based on the
default strategy<br>
which preserves the original machine operand sequence. This is
especially true<br>
for x86, where the custom lowering logic always preserves the
original (i.e.<br>
from the MachineInstr) operand sequence.<br>
<br>
This tool currently works under the strong (and potentially
incorrect)<br>
assumption that register def/uses in a MCInst can always be
identified by<br>
querying the machine instruction descriptor for the opcode. This
assumption made<br>
it possible to develop this tool and get good numbers at least
for the<br>
processors available in the x86 backend.<br>
<br>
That being said, the analysis is still potentially incorrect for
other targets.<br>
So we plan (with the help of the community) to find a proper
mechanism to map,<br>
when possible, MCOperand indices back to MachineOperand indices
of the equivalent<br>
MachineInstr. This would be equivalent to describing changes
made by the<br>
lowering step which affected the operand sequence. For example,
we could have an<br>
index for every register MCOperand (or -1, if the operand didn't
exist in the<br>
original MachineInstr). The mapping could look like this:
<0,1,3,2>. Here,<br>
MCOperand #2 was obtained from the lowering of MachineOperand
#3, and so on.<br>
<br>
This information could be automatically generated via tablegen
for all the<br>
instructions whose custom lowering step breaks assumptions made
by the tool on<br>
the register operand sequence (in general, these instructions
should be the<br>
minority of a target's instruction set). Unfortunately, we don't
have that<br>
information now. As a consequence, we assume that the number of
explicit<br>
register definitions is the same number specified in
MCInstrDesc. We also<br>
assume that register definitions always come first in the
operand sequence.<br>
<br>
In conclusion, these are the strong assumptions currently made by
the tool:<br>
* The number of explicit and implicit register definitions in
a MCInst<br>
matches the number of explicit and implicit definitions
specified by the<br>
MCInstrDesc object.<br>
* Register uses always come after register definitions.<br>
* If an opcode specifies an optional definition, then the
optional<br>
definition is always the last register operand in the
sequence.<br>
<br>
Note that some of the information accessible from the
MCInstrDesc is always<br>
valid for MCInst. For example: implicit register defs, implicit
register uses<br>
and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties
still apply to<br>
MCInst. The tool knows about this, and uses that information
during its<br>
analysis.<br>
<br>
What to do next<br>
---------------<br>
The source code has been uploaded for review on phabricator at
this link: <a href="https://reviews.llvm.org/D43951">https://reviews.llvm.org/D43951</a>.<br>
<br>
The review covers two patches:<br>
 - A first (very small) patch that always enables the generation of
processor<br>
resource names in the SubtargetEmitter. Currently, the processor
resource names<br>
are only generated for debugging purposes, but are needed by the
tool to<br>
generate user friendly reports, so we would like to always
generate them.<br>
 - A second patch with the actual static analysis tool (in
llvm/tools).<br>
<br>
Once these first two patches are committed, the plan is to keep
working on the<br>
tool with the help of the community to address all of the
limitations described<br>
by the previous sections, and find good solutions/fixes for the
design issues<br>
described by section "Known design problems".<br>
<br>
We hope the community will find this tool as useful as we have.<br>
<br>
Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg
Bedwell, who really<br>
helped me a lot by suggesting improvements and testing the tool.<br>
<br>
Thanks for your time.<br>
-Andrea<br>
<br>
</div>
<br>
</blockquote>
<br>
</body>
</html>