[PATCH] D40695: Improve loop unrolling performance on T99

Stefan Teleman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Nov 30 18:57:35 PST 2017


steleman added a comment.

In https://reviews.llvm.org/D40695#941589, @efriedma wrote:

> 128 is a really big number for LoopMicroOpBufferSize.


It appears to be the optimal number for T99, after having spent a **lot** of time moving it up and down in steps of 2 and testing the effects.

Values larger than 128 don't improve performance or increase unrolling, at least on SPECcpu2017, libquantum and Google Protobufs on T99. Values smaller than 128 hurt performance: 128 is always better than anything smaller.
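
One way to picture why the gains level off is to treat the buffer size as a budget on the size of the unrolled loop body. The sketch below is a deliberately simplified model and not LLVM's actual unrolling logic; `estimateUnrollCount` and its parameters are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified model: NOT LLVM's real unrolling heuristic.
// bufferSize stands in for LoopMicroOpBufferSize and is treated as an
// upper bound on the size of the unrolled loop body, in micro-ops.
unsigned estimateUnrollCount(unsigned loopBodySize, unsigned bufferSize) {
  assert(loopBodySize > 0 && "loop body must contain at least one op");
  // Largest unroll count whose body still fits in the budget, never below 1.
  return std::max(1u, bufferSize / loopBodySize);
}
```

Under this model, a 20-op loop body gets no unrolling with a budget of 32 but can be unrolled 6 times with a budget of 128. The model alone does not explain why values above 128 stop helping on T99; that part is an empirical observation.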

What would be nice is being able to set this parameter as a compile-time -mllvm -aarch64-loop-micro-ops-buffer-size option.

I have not tested other AArch64 micro-architectures, as I have no access to them.

> You might want to consider modifying AArch64TTIImpl::getUnrollingPreferences with some more tailored heuristics.

That's something I would be happy to take a look at, but I am reluctant, for now, to make changes in AArch64TTIImpl::getUnrollingPreferences. That's a more involved change.

> Also, it would be nice to see the impact across a wider set of benchmarks, like the LLVM testsuite, so it's clear what impact more aggressive unrolling has in general.

I make no claims that every single ISA or AArch64 micro-arch will benefit from increasing its LoopMicroOpBufferSize. This is a micro-arch-specific change for T99.

Also, for this T99-specific change, the LLVM testsuite probably isn't the best benchmark. The loops that benefit most from this change are loops that contain a large number of nested conditionals. There are many loops of this type in SPECcpu2017 and in libquantum (quantum_toffoli). I'm not sure this type of loop, with deeply nested conditionals, is that widespread in the LLVM testsuite.
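
For readers unfamiliar with this loop shape, here is a minimal sketch of a loop with nested conditionals, written in the spirit of a Toffoli (controlled-controlled-NOT) gate. It is an illustration only and not libquantum's source; the function name `toffoliLike` and the flat `std::vector` representation of the register are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: NOT libquantum's quantum_toffoli. For every basis
// state, flip the target bit when both control bits are set.
void toffoliLike(std::vector<std::uint64_t> &states, unsigned control1,
                 unsigned control2, unsigned target) {
  assert(control1 < 64 && control2 < 64 && target < 64);
  const std::uint64_t one = 1;
  for (std::size_t i = 0; i < states.size(); ++i) {
    // Nested conditionals inside the loop body: each level adds a branch
    // to every unrolled copy of the body.
    if (states[i] & (one << control1)) {
      if (states[i] & (one << control2)) {
        states[i] ^= one << target;
      }
    }
  }
}
```

The body itself is tiny, but each unrolled copy carries its own pair of branches, which is the kind of loop described above.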


Repository:
  rL LLVM

https://reviews.llvm.org/D40695




