[PATCH] D19678: Annotated-source optimization reports (a.k.a. "listing" files)

Thu Apr 28 15:19:13 PDT 2016

rcox2 added a comment.

Actually, the Intel compiler distinguishes between an optimization report (-qopt-report) and an annotated listing (-qopt-report-annotate).  The optimization report presents the optimization information hierarchically.  To use your example,

  icc -c -O3 -qopt-report=1 -qopt-report-file=stderr v.c 

yields:

  Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:

  -inline-factor: 100
  -inline-min-size: 20
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

Begin optimization report for: foo()

  Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo()) [1] v.c(2,12)

  Report from: Code generation optimizations [cg]

v.c(2,12):remark #34051: REGISTER ALLOCATION : [foo] v.c:2

  Hardware registers
      Reserved     :    1[ esp]
      Available    :   23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
      Callee-save  :    4[ ebx ebp esi edi]
      Assigned     :    0[ reg_null]

  Routine temporaries
      Total         :       4
          Global    :       0
          Local     :       4
      Regenerable   :       0
      Spilled       :       0

  Routine stack
      Variables     :       0 bytes*
          Reads     :       0 [0.00e+00 ~ 0.0%]
          Writes    :       0 [0.00e+00 ~ 0.0%]
      Spills        :       0 bytes*
          Reads     :       0 [0.00e+00 ~ 0.0%]
          Writes    :       0 [0.00e+00 ~ 0.0%]

  Notes

      *Non-overlapping variables and spills may share stack space,
       so the total stack size might be less than this.

Begin optimization report for: Test(int *, int *, int *, int *, int)

  Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (Test(int *, int *, int *, int *, int)) [2] v.c(4,52)

  -> INLINE: (16,3) foo()
  -> INLINE: (18,3) foo()
  -> INLINE: (18,17) foo()

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at v.c(8,8)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at v.c(8,8)

  remark #15301: SIMD LOOP WAS VECTORIZED

LOOP END

LOOP BEGIN at v.c(8,8)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at v.c(8,8)
<Remainder loop for vectorization>

  remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override

LOOP END

LOOP BEGIN at v.c(12,3)

  remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
  remark #15346: vector dependence: assumed FLOW dependence between res[i] (13:5) and d[i] (13:5)
  remark #25436: completely unrolled by 16

LOOP END

  Report from: Code generation optimizations [cg]

v.c(4,52):remark #34051: REGISTER ALLOCATION : [Test] v.c:4

  Hardware registers
      Reserved     :    1[ esp]
      Available    :   23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
      Callee-save  :    4[ ebx ebp esi edi]
      Assigned     :   15[ eax edx ecx ebx ebp esi edi zmm0-zmm7]

  Routine temporaries
      Total         :     123
          Global    :      47
          Local     :      76
      Regenerable   :       5
      Spilled       :       6

  Routine stack
      Variables     :       0 bytes*
          Reads     :       0 [0.00e+00 ~ 0.0%]
          Writes    :       0 [0.00e+00 ~ 0.0%]
      Spills        :       8 bytes*
          Reads     :       5 [1.41e+01 ~ 1.4%]
          Writes    :       3 [3.00e+00 ~ 0.3%]

  Notes

      *Non-overlapping variables and spills may share stack space,
       so the total stack size might be less than this.

while the annotated listing looks like:

//
// ------- Annotated listing with optimization reports for "/export/iusers/rcox2/rgHF/v.c" -------
//
//INLINING OPTION VALUES:
//  -inline-factor: 100
//  -inline-min-size: 20
//  -inline-max-size: 230
//  -inline-max-total-size: 2000
//  -inline-max-per-routine: 10000
//  -inline-max-per-compile: 500000
//
1       void bar();
2       void foo() { bar(); }
//INLINE REPORT: (foo()) [1] /export/iusers/rcox2/rgHF/v.c(2,12)
//
///export/iusers/rcox2/rgHF/v.c(2,12):remark #34051: REGISTER ALLOCATION : [foo] /export/iusers/rcox2/rgHF/v.c:2
//
//    Hardware registers
//        Reserved     :    1[ esp]
//        Available    :   23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
//        Callee-save  :    4[ ebx ebp esi edi]
//        Assigned     :    0[ reg_null]
//
//    Routine temporaries
//        Total         :       4
//            Global    :       0
//            Local     :       4
//        Regenerable   :       0
//        Spilled       :       0
//
//    Routine stack
//        Variables     :       0 bytes*
//            Reads     :       0 [0.00e+00 ~ 0.0%]
//            Writes    :       0 [0.00e+00 ~ 0.0%]
//        Spills        :       0 bytes*
//            Reads     :       0 [0.00e+00 ~ 0.0%]
//            Writes    :       0 [0.00e+00 ~ 0.0%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
3
4       void Test(int *res, int *c, int *d, int *p, int n) {
//INLINE REPORT: (Test(int *, int *, int *, int *, int)) [2] /export/iusers/rcox2/rgHF/v.c(4,52)
//  -> INLINE: (16,3) foo()
//  -> INLINE: (18,3) foo()
//  -> INLINE: (18,17) foo()
//
///export/iusers/rcox2/rgHF/v.c(4,52):remark #34051: REGISTER ALLOCATION : [Test] /export/iusers/rcox2/rgHF/v.c:4
//
//    Hardware registers
//        Reserved     :    1[ esp]
//        Available    :   23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
//        Callee-save  :    4[ ebx ebp esi edi]
//        Assigned     :   15[ eax edx ecx ebx ebp esi edi zmm0-zmm7]
//
//    Routine temporaries
//        Total         :     123
//            Global    :      47
//            Local     :      76
//        Regenerable   :       5
//        Spilled       :       6
//
//    Routine stack
//        Variables     :       0 bytes*
//            Reads     :       0 [0.00e+00 ~ 0.0%]
//            Writes    :       0 [0.00e+00 ~ 0.0%]
//        Spills        :       8 bytes*
//            Reads     :       5 [1.41e+01 ~ 1.4%]
//            Writes    :       3 [3.00e+00 ~ 0.3%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
5         int i;
6
7       #pragma simd
8         for (i = 0; i < 1600; i++) {
//
//LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
//   remark #15301: SIMD LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
//<Alternate Alignment Vectorized Loop>
//LOOP END
//
//LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
//<Remainder loop for vectorization>
//   remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
//LOOP END
9           res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
10        }
11
12        for (i = 0; i < 16; i++) {
//
//LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(12,3)
//   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
//   remark #15346: vector dependence: assumed FLOW dependence between res[i] (13:5) and d[i] (13:5)
//   remark #25436: completely unrolled by 16
//LOOP END
13          res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
14        }
15
16        foo();
17
18        foo(); bar(); foo();
19      }

Essentially, the various parts of the optimization report are inserted into the source listing at the line numbers they refer to.

(Note that this is just the default level.  More detail can be obtained with -qopt-report=X, where X > 1; levels up to 5 are supported.)
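
To make the relationship between the two outputs concrete, here is a minimal sketch of the interleaving step, written in Python purely for illustration.  It is not how the Intel compiler (or the patch under review) implements it; the function name annotate and the (line number, text) remark format are hypothetical.  It assumes each remark has already been keyed by the source line it refers to, and it emits each remark as a // comment after that line, as in the listing above.

```python
from collections import defaultdict


def annotate(source_lines, remarks):
    """Interleave report remarks with numbered source lines.

    source_lines: list of source lines without trailing newlines.
    remarks: iterable of (line_number, text) pairs, 1-based line numbers.
    Both the name and the input format are assumptions for this sketch.
    """
    # Group remarks by the source line they refer to, preserving order.
    by_line = defaultdict(list)
    for line_no, text in remarks:
        by_line[line_no].append(text)

    out = []
    for line_no, src in enumerate(source_lines, start=1):
        # Numbered source line, padded to a fixed-width number column.
        out.append(f"{line_no:<8}{src}".rstrip())
        # Remarks for this line follow it as comment lines.
        for text in by_line.get(line_no, []):
            out.append(f"//{text}")
    return out


if __name__ == "__main__":
    source = ["void bar();", "void foo() { bar(); }"]
    remarks = [(2, "INLINE REPORT: (foo()) [1] v.c(2,12)")]
    print("\n".join(annotate(source, remarks)))
```

A stand-alone report would instead print the same remarks grouped by function and optimization phase, without the source lines in between.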

I believe what Hal is proposing in this patch is a very useful lightweight annotation of the source with key information.  But I also believe that there is value in a stand-alone opt report with the kind of detailed information I presented in http://reviews.llvm.org/D19397 and the two follow-up patches.  In general, while this info can be interspersed in the source listing, I believe that for most purposes it is a bit too "busy" in text form.  (The Intel compiler also supports annotated HTML output, as well as functionality that feeds into Visual Studio and has received great reviews.)

http://reviews.llvm.org/D19678