[PATCH] D46283: [AArch64] Set vectorizer-maximize-bandwidth as default true

Adhemerval Zanella via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jun 1 06:25:39 PDT 2018


zatrazz added a comment.

Indeed the machine I was using for speccpu2006 was not the best suited; I have now used a different one (TX1, Cortex-A57) with extra care taken to lower variance (CPU binding, services disabled, etc.), and it indeed showed better results:

| Benchmark      | Difference (%) |
| 400.perlbench  | +1.04          |
| 401.bzip2      | -2.04          |
| 403.gcc        | +0.63          |
| 429.mcf        | +0.05          |
| 445.gobmk      | +0.18          |
| 456.hmmer      | -0.90          |
| 458.sjeng      | +0.29          |
| 462.libquantum | +0.10          |
| 464.h264ref    | +1.73          |
| 471.omnetpp    | +0.18          |
| 473.astar      | -0.21          |
| 483.xalancbmk  | +0.12          |


| Benchmark   | Difference (%) |
| 433.milc    | -0.36          |
| 444.namd    | +0.86          |
| 447.dealII  | -0.66          |
| 450.soplex  | -0.13          |
| 453.povray  | +0.37          |
| 470.lbm     | +0.08          |
| 482.sphinx3 | +0.54          |

I will check whether the slight drop in 401.bzip2 is just noise or something related to this patch, but regardless I do think this change should yield better performance in most scenarios. Take, for example, the loop where I am trying to get a better auto-vectorization pattern:

  /* Each 32-bit element is truncated to 8 bits on the store, which is
     where the narrowing (xtn/xtn2) instructions come into play.  */
  void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
  {
    for (int i = 0; i < width; i++) {
      *dst++ = *src++;
    }
  }
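
For reference, a larger vectorization factor can also be requested explicitly per loop, without relying on the command-line default; a minimal variant of the function above (the function name and the width of 16 are only illustrative choices, not tuning recommendations):

  void store_i32_to_i8_wide (const int *src, int width, unsigned char *dst)
  {
    /* Illustrative only: force a wide VF for this loop; with
       vectorizer-maximize-bandwidth enabled the cost model should be
       allowed to pick a comparable factor on its own.  */
    #pragma clang loop vectorize(enable) vectorize_width(16)
    for (int i = 0; i < width; i++) {
      *dst++ = *src++;
    }
  }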

I can only get the maximum throughput when auto-vectorization tries large vectorization factors. I did try to optimize the trunc of 4 x i32 with a custom LowerTruncStore, but AFAIU, without either an extra transformation or pass, the AArch64 backend can't really form the high-half vector instruction (xtn2 in this case) needed to reach the maximum throughput. Something I am investigating is whether selecting the largest VF for 'MaximizeVectorBandwidth' is the best strategy, or whether we should add an architecture hook to enable/disable it.
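
As a rough sketch of the second option (purely illustrative; the hook name shouldMaximizeVectorBandwidth and the class layout below are my own, not something this patch adds):

  // Hypothetical sketch only: a per-target hook that tells the loop
  // vectorizer whether it may consider VFs wider than the largest
  // vector register when computing the maximum vectorization factor.
  struct TargetVectorBandwidthInfo {
    // Conservative default: keep the behaviour gated by the flag.
    virtual bool shouldMaximizeVectorBandwidth() const { return false; }
    virtual ~TargetVectorBandwidthInfo() = default;
  };

  struct AArch64VectorBandwidthInfo : TargetVectorBandwidthInfo {
    // AArch64 opts in so loops such as store_i32_to_i8 can reach a VF
    // large enough to use the high-half narrowing forms (xtn2).
    bool shouldMaximizeVectorBandwidth() const override { return true; }
  };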

On the Geekbench side I will investigate PDFRendering, but I really think it is just missing vectorization tuning, and I am not sure we should consider it a blocker.


https://reviews.llvm.org/D46283




