[PATCH] D79163: [Target][ARM] Tune getCastInstrCost for extending masked loads and truncating masked stores
Pierre van Houtryve via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Apr 30 03:01:47 PDT 2020
Pierre-vh created this revision.
Pierre-vh added reviewers: dmgreen, samparker, SjoerdMeijer.
Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls.
Herald added a project: LLVM.
Pierre-vh added a parent revision: D79162: [Analysis] TTI: Add CastContextHint for getCastInstrCost.
This patch uses the feature added in D79162 <https://reviews.llvm.org/D79162> to fix the cost of a sext/zext of a masked load, or a trunc for a masked store.
Previously, these casts were considered cheap or even free, but that is far from true when the cast's result type doesn't fit in a 128-bit register: in that case they are expensive.
Examples:
The cast's result type fits in a 128-bit register:
// LLVM
define dso_local arm_aapcs_vfpcc <8 x i16> @square(<8 x i8>*, <8 x i8>, <8 x i8>) #0 {
%mask = trunc <8 x i8> %1 to <8 x i1>
%res = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8 (<8 x i8>* %0, i32 4, <8 x i1> %mask, <8 x i8> %2)
%ext = sext <8 x i8> %res to <8 x i16>
ret <8 x i16> %ext
}
// ASM
vpt.i32 ne, q0, zr
vldrbt.s16 q0, [r0]
vmovlb.s8 q1, q1
vpsel q0, q0, q1
The cast's result type doesn't fit in a 128-bit register:
// LLVM
define dso_local arm_aapcs_vfpcc <8 x i32> @square(<8 x i8>*, <8 x i8>, <8 x i8>) #0 {
%mask = trunc <8 x i8> %1 to <8 x i1>
%res = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8 (<8 x i8>* %0, i32 4, <8 x i1> %mask, <8 x i8> %2)
%ext = sext <8 x i8> %res to <8 x i32>
ret <8 x i32> %ext
}
// ASM
vpt.i32 ne, q0, zr
vldrbt.u16 q0, [r0]
vpsel q1, q0, q1
vmov.u16 r0, q1[0]
vmov.32 q0[0], r0
vmov.u16 r0, q1[1]
vmov.32 q0[1], r0
vmov.u16 r0, q1[2]
vmov.32 q0[2], r0
vmov.u16 r0, q1[3]
vmov.32 q0[3], r0
vmov.u16 r0, q1[4]
vmov.32 q2[0], r0
vmov.u16 r0, q1[5]
vmov.32 q2[1], r0
vmov.u16 r0, q1[6]
vmov.32 q2[2], r0
vmov.u16 r0, q1[7]
vmov.32 q2[3], r0
vmovlb.s8 q0, q0
vmovlb.s8 q1, q2
vmovlb.s16 q0, q0
vmovlb.s16 q1, q1
I've updated the costs to better reflect reality, and added a test for it in `test/Analysis/CostModel/ARM/cast.ll`.
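As a rough illustration of the rule described above (this is not the code in the patch), the decision comes down to whether the extended vector still fits in one 128-bit register. The names `fitsInQReg` and `extendOfMaskedLoadCost` and the cost constants below are hypothetical; the `2 * NumElts` figure only mirrors the one vmov pair per lane seen in the second listing.

```cpp
// Illustrative sketch only: names and constants are hypothetical and are
// not the values used in ARMTargetTransformInfo.cpp.
constexpr unsigned QRegBits = 128; // width of one 128-bit vector register

// True if NumElts elements of EltBits each fit in a single 128-bit register.
constexpr bool fitsInQReg(unsigned NumElts, unsigned EltBits) {
  return NumElts * EltBits <= QRegBits;
}

// Hypothetical cost of a sext/zext whose operand is a masked load.
constexpr unsigned extendOfMaskedLoadCost(unsigned NumElts,
                                          unsigned DstEltBits) {
  return fitsInQReg(NumElts, DstEltBits)
             ? 1            // assumed cheap: the load itself can extend
             : 2 * NumElts; // assumed: about one vmov pair per lane
}

// <8 x i8> -> <8 x i16> fits in 128 bits, <8 x i8> -> <8 x i32> does not.
static_assert(extendOfMaskedLoadCost(8, 16) == 1, "result fits: cheap");
static_assert(extendOfMaskedLoadCost(8, 32) == 16, "result too wide: expensive");
```

The same check applies in the other direction to a trunc feeding a masked store, with the source type of the trunc playing the role of the wide type.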
I've also added a vectorizer test that showcases the improvement: in some cases, the vectorizer now chooses a smaller VF when tail-predication is enabled, which results in better codegen. (With a higher VF in those cases, the code shown above would be generated, and the vmovs would block tail-predication later in the pipeline, resulting in very poor codegen overall.)
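To make the VF effect concrete, here is a toy model of the choice, assuming the vectorizer compares loop-body cost per lane across candidate VFs. Everything in it is hypothetical (`chooseVF`, `OtherWork`, the cast costs and the candidate set 4/8/16); it only shows how a realistic cast cost can move the minimum to a smaller VF.

```cpp
// Toy model only: all names and constants are hypothetical, not from the patch.
constexpr unsigned QRegBits = 128; // width of one 128-bit vector register
constexpr unsigned OtherWork = 4;  // assumed cost of the rest of the loop body

// Cost of extending a masked load to VF lanes of DstEltBits each.
// Tuned == false models the old behaviour: the cast is always cheap.
constexpr unsigned castCost(unsigned VF, unsigned DstEltBits, bool Tuned) {
  return (!Tuned || VF * DstEltBits <= QRegBits) ? 1 : 2 * VF;
}

// Loop-body cost per lane, scaled by 16 to stay in integer arithmetic.
constexpr unsigned costPerLane(unsigned VF, unsigned DstEltBits, bool Tuned) {
  return (OtherWork + castCost(VF, DstEltBits, Tuned)) * 16 / VF;
}

// Keep A unless B is strictly cheaper per lane.
constexpr unsigned cheaper(unsigned A, unsigned B, unsigned DstEltBits,
                           bool Tuned) {
  return costPerLane(B, DstEltBits, Tuned) < costPerLane(A, DstEltBits, Tuned)
             ? B
             : A;
}

// Pick the cheapest VF among the candidates 4, 8 and 16.
constexpr unsigned chooseVF(unsigned DstEltBits, bool Tuned) {
  return cheaper(cheaper(4, 8, DstEltBits, Tuned), 16, DstEltBits, Tuned);
}

// Extending to i32: the old model prefers the widest VF, the tuned one VF = 4.
static_assert(chooseVF(32, false) == 16, "cheap cast: widest VF wins");
static_assert(chooseVF(32, true) == 4, "tuned cast: smaller VF wins");
```

In this model VF = 4 is the largest factor for which the i32 result still fits in one 128-bit register, which is why it wins once the cast is costed realistically.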
Please note that the contents of this patch are subject to change depending on the outcome of the review of D79162 <https://reviews.llvm.org/D79162>, but the cost calculation logic shouldn't change much (only the way this case is detected).
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D79163
Files:
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
llvm/test/Analysis/CostModel/ARM/cast.ll
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D79163.261163.patch
Type: text/x-patch
Size: 44394 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200430/e5a1d6b5/attachment-0001.bin>