[PATCH] D61228: [PowerPC] Set the innermost hot loop(from PGO) to align 32 bytes

Mon Apr 29 19:33:39 PDT 2019

ZhangKang added a comment.

In D61228#1482324 <https://reviews.llvm.org/D61228#1482324>, @hfinkel wrote:

> > For some special cases, the performance can improve more than 30% after adding the patch for ppc.
>
> Any significant regressions?

@hfinkel, I have run the SPEC tests with this patch applied, and the performance is the same. The old code aligns loops whose size is more than 32 bytes to 16 bytes. To get better performance from the patch, the outer loop needs to be very large and the innermost loop needs to be very small, and I think few cases meet that condition.

My statement that "for some special cases, the performance can improve more than 30%" refers to a small special case that I wrote. Below is the test case, run on P9 (POWER9):

`cat foo.c`

  cpp
  struct parm {
    int *arr;
    int m;
    int n;
  };
  void foo(struct parm *arg) {
    struct parm localArg = *arg;
    int m = localArg.m;
    int *s = localArg.arr;
    int n = localArg.n;
    do{
      int k = n;
      do{
        s[++k] = k++;
        s[k++] = k;
        s[k++] = k;
        s[k] = k;
        s[--k] = k--;
        s[k--] = k;
        s[--k] = k;
      }while(k--);
    } while(m--);

    s[n]=0;
  }

`cat main.c`

  cpp
  struct parm {
    int *arr;
    int m;
    int n;
  };
  void foo(struct parm*);
  int main() {
    int a[5000];
    struct parm arg = {a, 2000000000, 5};
    foo(&arg);
    return 0;
  }

`cat run.ksh`

  shell
  set -x
  # profile-generate
  rm -f t t.* t_* *.o *.s *.profraw *.profdata
  clang -c main.c -O -fprofile-generate
  clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-generate
  clang -o t main.o foo.s -fprofile-generate
  objdump -dr t > t.dis
  time -p ./t

  # merge
  llvm-profdata merge *.profraw -output=merge.profdata

  # profile-use
  clang -c main.c -O -fprofile-use=merge.profdata
  clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-use=merge.profdata
  clang -o t_pgo main.o foo.s -fprofile-use=merge.profdata
  objdump -dr t_pgo > t_pgo.dis
  time -p ./t_pgo

The original result (without aligning the loop to 32 bytes) is below:

  real 21.74
  user 21.74
  sys 0.00

After adding the patch, the result is below:

  real 14.37
  user 14.37
  sys 0.00

Note that the speedup varies with the parameters used to call the `foo` function. In general, the larger the outer loop and the smaller the inner loop, the more likely branch prediction is to fail, and the larger the speedup. For example, the speedup for `foo(a, 2000000000, 5)` is larger than for `foo(a, 20000000, 500)`.

https://reviews.llvm.org/D61228