[PATCH] D65884: [ARM] MVE Tail Predication

Thu Aug 8 07:14:26 PDT 2019

dmgreen added a comment.

In D65884#1620474 <https://reviews.llvm.org/D65884#1620474>, @samparker wrote:

> > Why does the llvm_arm_vctp32 not return a <4xi1> directly?
>
> The vctp family are defined like that because the ACLE specifies that they return a mve_pred16_t and I'm assuming this is a scalar - but I can't find a definition! I think that all the user facing predicate generators will produce a scalar and we will need to do the conversion to make it nice and LLVMy.

Sure, the ACLE intrinsic needs to return an i16, but does that mean the IR intrinsic needs to? It could be expanded to two instructions, llvm_arm_vctp32 and llvm_arm_vmrs, with the i16 coming from the vmrs. This kind of thing sounds like it would already be useful for things like masked loads. i.e. I'm asking whether we can invert where the conversion happens.
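
To make the two representations concrete, here is a rough Python model of the conversion (not LLVM or ACLE code). It assumes the MVE layout in which the 16-bit predicate holds one bit per byte of the 128-bit vector, so each 32-bit lane owns four adjacent bits. The function names are hypothetical stand-ins for the intrinsics discussed here, and reading only the lowest bit of each lane in vmsr is a simplifying assumption.

```python
VECTOR_BYTES = 16  # MVE vectors are 128 bits wide


def vctp(elem_bits, n):
    """Model of a vctp-style predicate: lane i is active iff i < n."""
    lanes = VECTOR_BYTES * 8 // elem_bits
    return [i < n for i in range(lanes)]


def vmrs(mask):
    """Pack an <N x i1> mask into an i16, one bit per byte of the vector."""
    bytes_per_lane = VECTOR_BYTES // len(mask)
    lane_bits = (1 << bytes_per_lane) - 1
    pred = 0
    for lane, active in enumerate(mask):
        if active:
            pred |= lane_bits << (lane * bytes_per_lane)
    return pred


def vmsr(pred, lanes):
    """Unpack an i16 into an <N x i1> mask (assumption: test each lane's lowest bit)."""
    bytes_per_lane = VECTOR_BYTES // lanes
    return [bool((pred >> (lane * bytes_per_lane)) & 1) for lane in range(lanes)]


t1 = vctp(32, 3)   # three of the four 32-bit lanes active
pred = vmrs(t1)    # scalar form, as an ACLE mve_pred16_t would hold it
```

In this model, converting the scalar back with vmsr(pred, 4) gives the original mask, which is the round trip the question is about.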

So if we started with ACLE:

  mve_pred16_t pred = vctp8q(i)
  l = vldrbq_z_s8(a, pred)

It would get expanded to become:

  // vctp8q
  <16 x i1> t1 = llvm.arm.vctp(i)
  i16 pred = llvm.arm.vmrs(t1)
  // vldrbq_z_s8
  <16 x i1> t2 = llvm.arm.vmsr(pred)
  l = llvm.masked.load(a, t2)

You could then use instcombine to fold away the conversions (vmsr(vmrs(a)) == a), giving:

  t1 = llvm.arm.vctp(i)
  llvm.masked.load(a, t1)
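
The fold itself can be sketched on a toy expression tree. This is a minimal Python illustration, not instcombine code: the Call class and fold_converts function are hypothetical, and the model ignores types, so it assumes the vmrs and vmsr use the same lane count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """A toy IR node: an intrinsic name plus its operands."""
    name: str
    args: tuple = ()


def fold_converts(node):
    """Rewrite vmsr(vmrs(x)) to x, bottom-up; leave everything else alone."""
    if not isinstance(node, Call):
        return node  # leaf operand such as "a" or "i"
    args = tuple(fold_converts(a) for a in node.args)
    if node.name == "vmsr" and len(args) == 1:
        inner = args[0]
        if isinstance(inner, Call) and inner.name == "vmrs" and len(inner.args) == 1:
            return inner.args[0]
    return Call(node.name, args)


# Build the expanded form from the example above.
t1 = Call("vctp", ("i",))
pred = Call("vmrs", (t1,))
t2 = Call("vmsr", (pred,))
load = Call("masked.load", ("a", t2))

folded = fold_converts(load)
```

After folding, the masked load takes the vctp result directly and neither conversion remains in the tree.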

It would work even better for compares that already have predicates that LLVM knows about. The whole thing would just become LLVM IR and we can let it optimise away. This is getting a bit far into intrinsic design, though, which isn't this patch's problem!

https://reviews.llvm.org/D65884