[PATCH] D79163: [Target][ARM] Tune getCastInstrCost for extending masked loads and truncating masked stores

Pierre van Houtryve via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 30 03:01:47 PDT 2020


Pierre-vh created this revision.
Pierre-vh added reviewers: dmgreen, samparker, SjoerdMeijer.
Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls.
Herald added a project: LLVM.
Pierre-vh added a parent revision: D79162: [Analysis] TTI: Add CastContextHint for getCastInstrCost.

This patch uses the feature added in D79162 <https://reviews.llvm.org/D79162> to fix the cost of a sext/zext of a masked load, or a trunc of a masked store.
Previously, those casts were considered cheap or even free, but that is not the case when the cast's result type doesn't fit in a 128-bit register: they are expensive.

Examples:

The cast fits in a 128-bit register:

  // LLVM
  define dso_local arm_aapcs_vfpcc <8 x i16> @square(<8 x i8>*, <8 x i8>, <8 x i8>) #0 {
      %mask = trunc <8 x i8> %1 to <8 x i1>
      %res = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8 (<8 x i8>* %0, i32 4, <8 x i1> %mask, <8 x i8> %2)
      %ext = sext <8 x i8> %res to <8 x i16>
      ret <8 x i16> %ext
  }
  // ASM
          vpt.i32 ne, q0, zr
          vldrbt.s16      q0, [r0]
          vmovlb.s8       q1, q1
          vpsel   q0, q0, q1

The cast doesn't fit in a 128-bit register:

  // LLVM
  define dso_local arm_aapcs_vfpcc <8 x i32> @square(<8 x i8>*, <8 x i8>, <8 x i8>) #0 {
      %mask = trunc <8 x i8> %1 to <8 x i1>
      %res = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8 (<8 x i8>* %0, i32 4, <8 x i1> %mask, <8 x i8> %2)
      %ext = sext <8 x i8> %res to <8 x i32>
      ret <8 x i32> %ext
  }
  // ASM
          vpt.i32 ne, q0, zr
          vldrbt.u16      q0, [r0]
          vpsel   q1, q0, q1
          vmov.u16        r0, q1[0]
          vmov.32 q0[0], r0
          vmov.u16        r0, q1[1]
          vmov.32 q0[1], r0
          vmov.u16        r0, q1[2]
          vmov.32 q0[2], r0
          vmov.u16        r0, q1[3]
          vmov.32 q0[3], r0
          vmov.u16        r0, q1[4]
          vmov.32 q2[0], r0
          vmov.u16        r0, q1[5]
          vmov.32 q2[1], r0
          vmov.u16        r0, q1[6]
          vmov.32 q2[2], r0
          vmov.u16        r0, q1[7]
          vmov.32 q2[3], r0
          vmovlb.s8       q0, q0
          vmovlb.s8       q1, q2
          vmovlb.s16      q0, q0
          vmovlb.s16      q1, q1

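The distinction between the two examples above can be sketched as a tiny standalone cost model. This is illustrative only: `CastContextHint` is the hint added in D79162, but `maskedCastCost` and the exact cost numbers here are hypothetical simplifications, not the actual ARMTTIImpl code.

```cpp
#include <cassert>

// Hypothetical, simplified model of the cost logic this patch adds
// (names and numbers are illustrative, not the actual LLVM API).
enum class CastContextHint { None, Masked };

// Cost of a sext/zext/trunc paired with a masked load/store on MVE:
// cheap (folded into the memory op) only when the extended/truncated
// vector still fits in a single 128-bit Q register.
int maskedCastCost(int NumElts, int DstEltBits, CastContextHint Hint) {
  if (Hint != CastContextHint::Masked)
    return 1; // not the masked-memory case modeled here
  int DstBits = NumElts * DstEltBits;
  if (DstBits <= 128)
    return 0; // extension folds into the vldrbt/vstrbt-style op: free
  // Otherwise the value must be split and moved lane by lane,
  // roughly two scalar moves per element (see the vmov sequences above).
  return 2 * NumElts;
}
```

With these assumptions, the `<8 x i16>` case above costs 0 while the `<8 x i32>` case costs 16, matching the difference between the two assembly listings.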
I've updated the costs to better reflect reality, and added a test for it in `test/Analysis/CostModel/ARM/cast.ll`.

I've also added a vectorizer test that showcases the improvement: in some cases, the vectorizer will now choose a smaller VF when tail-predication is enabled, which results in better codegen. If it used a higher VF in those cases, the code shown above would be generated, and the vmovs would block tail-predication later in the process, resulting in very poor codegen overall.

Please note that the contents of this patch are subject to change depending on the outcome of the review of D79162 <https://reviews.llvm.org/D79162>, but the cost calculation logic shouldn't change much (just the way this case is detected).


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D79163

Files:
  llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
  llvm/test/Analysis/CostModel/ARM/cast.ll
  llvm/test/Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D79163.261163.patch
Type: text/x-patch
Size: 44394 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200430/e5a1d6b5/attachment-0001.bin>

