[PATCH] D46283: [AArch64] Set vectorizer-maximize-bandwidth as default true
Adhemerval Zanella via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Jun 1 06:25:39 PDT 2018
zatrazz added a comment.
Indeed, the machine I was using for speccpu2006 was not the most suitable one. I have now used a different one (tx1, A57), taking extra care to lower variance (CPU binding, services disabled, etc.), and it did show a better result:
| Benchmark | Difference (%) |
| 400.perlbench | +1.04 |
| 401.bzip2 | -2.04 |
| 403.gcc | +0.63 |
| 429.mcf | +0.05 |
| 445.gobmk | +0.18 |
| 456.hmmer | -0.90 |
| 458.sjeng | +0.29 |
| 462.libquantum | +0.10 |
| 464.h264ref | +1.73 |
| 471.omnetpp | +0.18 |
| 473.astar | -0.21 |
| 483.xalancbmk | +0.12 |
| | |
| 433.milc | -0.36 |
| 444.namd | +0.86 |
| 447.dealII | -0.66 |
| 450.soplex | -0.13 |
| 453.povray | +0.37 |
| 470.lbm | +0.08 |
| 482.sphinx3 | +0.54 |
I will check whether the slight drop in 401.bzip2 is just noise or something related to this patch, but regardless I do think this change should yield better performance in most scenarios. Take the example for which I am trying to get a better auto-vectorization pattern:
void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
    *dst++ = *src++;
  }
}
I can only get the maximum throughput when the autovectorizer tries larger vectorization factors. I did try to optimize the trunc of 4 x i32 with a custom LowerTruncStore, but afaiu, without either an extra transformation or an extra pass, the aarch64 backend can't really fuse the high-half vector instruction (xtn2 in this case) to reach the maximum throughput. Something I am investigating is whether selecting the largest VF for 'MaximizeVectorBandwidth' is the best strategy, or whether we should add an architecture hook to enable/disable it.
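To illustrate the point about xtn2, here is a hand-written sketch of what a vectorization factor of 16 could look like for this loop. It is not the code LLVM generates, and the name store_i32_to_i8_vf16 is hypothetical. It assumes the NEON intrinsics from <arm_neon.h> on AArch64 (vmovn_* usually maps to xtn and vmovn_high_* to xtn2) and falls back to the plain scalar loop on other targets:

```c
#include <assert.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference: the loop from the message above. */
void store_i32_to_i8(const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++)
    *dst++ = *src++;
}

/* Hypothetical hand-vectorized variant with VF = 16 (illustration only). */
void store_i32_to_i8_vf16(const int *src, int width, unsigned char *dst)
{
  assert(width >= 0);
  int i = 0;
#if defined(__aarch64__)
  /* Each iteration narrows 16 x i32 into one full 128-bit vector of i8. */
  for (; i + 16 <= width; i += 16) {
    int32x4_t a = vld1q_s32(src + i);
    int32x4_t b = vld1q_s32(src + i + 4);
    int32x4_t c = vld1q_s32(src + i + 8);
    int32x4_t d = vld1q_s32(src + i + 12);
    /* Narrow i32 -> i16: low half first, then fill the high half. */
    int16x8_t lo = vmovn_high_s32(vmovn_s32(a), b);
    int16x8_t hi = vmovn_high_s32(vmovn_s32(c), d);
    /* Narrow i16 -> i8 the same way, giving 16 bytes for a single store. */
    int8x16_t out = vmovn_high_s16(vmovn_s16(lo), hi);
    vst1q_u8(dst + i, vreinterpretq_u8_s8(out));
  }
#endif
  /* Scalar tail (and the whole loop on non-AArch64 targets). */
  for (; i < width; i++)
    dst[i] = (unsigned char)src[i];
}
```

With VF = 4 only a single 4 x i32 vector is available per iteration, so there is nothing to place in the high half and the narrowing instructions run on partially filled registers. With 16 elements per iteration, every narrowing step can fill a full register before the store.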
On the geekbench side, I will investigate PDFRendering, but I really think it is missing vectorization tuning, and I am not sure we should consider it a blocker.
https://reviews.llvm.org/D46283