[PATCH] D147336: [IVDescriptors] Add pointer InductionDescriptors with non-constant strides (try 2)
Philip Reames via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Mar 31 10:56:23 PDT 2023
reames added a comment.
In D147336#4237118 <https://reviews.llvm.org/D147336#4237118>, @dmgreen wrote:
> Hello. About the performance changes - I have been looking into them a little. A lot of it looks good, especially where gather/scatter are available which this can make use of. There are some cases where it wasn't doing quite as well under some places in MVE, but https://reviews.llvm.org/D147331 should hopefully sort out the largest of them. Sam/Nick should be back on Monday to review it. We have some other cases where gather/scatter are not available for whatever reason and those don't look as good but the changes are much smaller. Again it likely comes down to adjusting the costmodel, perhaps for the cost of the scalarized loads in getMemInstScalarizationCost needs to be a little higher under MVE. I wouldn't consider any of that blockers considering all the improvements.
This is all good news, thanks!
> The other part under AArch64 is a bit more silly I'm afraid, and falls into your Option 3. It is extracted from the code in x264, using the 16x16 version: https://github.com/mirror/x264/blob/master/common/pixel.c#L53, which I believe is also a part of Spec. I was thinking of adding a phase-ordering test but will just put it here: https://godbolt.org/z/qo54f7Pbv. It now decides to version the loop based on a stride of 1, like you mentioned. The loop-vectorized code is not as good as the slp version though, and never executed. I wouldn't expect that to cause a lot of slowdown, it only really adding an extra compare/branch and some dead code. The differences seemed higher that I expected in places though, around 25% for that function in isolation on some cpus. That might be quite noisy though as some were a lot closer to 0.
Hm, this is less good news.
I took a bit of a look here, and noticed something interesting. The LLVM IR after optimization for the path when stride!=1 is nearly identical to the prior version. However, the runtime check in the assembly appears to have tripped some kind of hoisting optimization in the backend, and as a result, the assembly path for the stride!=1 path is a bit different.
I can see a couple ways of tackling this:
Option 1 - Be more restrictive on stride==1 speculation. I'd meant to do this anyways, but was expecting that to be the piece that had interesting perf swings.
Option 2 - Investigate the hoisting bit. (I don't really have the context to do this)
> I might also be able to provide another example of an assert this is still hitting, with `Assertion `Instance.Lane.isFirstLane() && "cannot get lane > 0 for scalar"' failed`. It is reducing now.
Oops, yeah definitely looking forward to that. I'd assumed all the reports were the same issue since the assert seemed common.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D147336/new/
https://reviews.llvm.org/D147336
More information about the llvm-commits
mailing list