[PATCH] D57504: RFC: Prototype & Roadmap for vector predication in LLVM

Simon Moll via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Feb 6 00:21:18 PST 2020


simoll added a comment.

In D57504#1857458 <https://reviews.llvm.org/D57504#1857458>, @andrew.w.kaylor wrote:

> OK. I was picturing MVL as some sort of maximum supported by the hardware in some sense or context. I think(?) I've got it now.
>
> So let me ask about how you're picturing this working on targets that don't support these non-fixed vector lengths. The comments from lkcl have me concerned that we're going to be asked to emulate this behavior, which is possible I suppose but probably not the best choice performance wise. Consider this call:
>
>   %sum = call <8 x double> @llvm.vp.fadd.f64(<8 x double> %x,<8 x double> %y, <8 x i1> %mask, i32 4)
>
>
> Frankly, I'd hope never to see such a thing. We talked about using -1 for the %evl argument for targets that don't support variable vector length (is that the right phrase?), but what are we supposed to do if something else is used?


Targets that do not support `%evl` can say so through TTI, and the `ExpandVectorPredicationPass` will convert the call above into:

  ; %mask.vl enables exactly the lanes below the requested vector length (4)
  %mask.vl = icmp ult <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, <i32 4, i32 4, i32 4, i32 4, i32 4, i32 4, i32 4, i32 4>
  %mask.new = and <8 x i1> %mask, %mask.vl
  %sum = call <8 x double> @llvm.vp.fadd.f64(<8 x double> %x, <8 x double> %y, <8 x i1> %mask.new, i32 -1)

Basically, `%evl` never hits the X86 backend and can be ignored there. The expansion pass implements one unified legalization strategy for all non-VL targets, achieving predictable behavior across targets.
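
For a target with plain SIMD but no native predication support, the remaining masked call can then be legalized further. As a minimal sketch (not the pass's actual output), and assuming the result on disabled lanes is left unspecified, the `%evl`-free form could end up as an ordinary `fadd` plus a `select` on the folded mask:

  %plain = fadd <8 x double> %x, %y
  ; keep only the lanes enabled by the folded mask; disabled lanes are unspecified
  %sum = select <8 x i1> %mask.new, <8 x double> %plain, <8 x double> undef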

> Disregarding the %evl argument for the moment, the x86 type legalizer might lower this as a masked <8 x double> fadd, or it might lower it as two <4 x double> fadd operations, or it might scalarize it entirely. Even if the target hardware supports 512-bit vectors we might choose to lower it as two <4 x double> fadds. Or we might not. The backend currently considers itself to have the freedom to do anything that meets the semantics of the intrinsic. So that brings up the question of whether we will be expected to honor the %evl argument. In this case, it would be fairly trivial to do so. However, the possibility raises a concern about what the code that generated this IR was trying to do and whether it is a reasonable thing to have done for x86 backends.

I see two sources for VP intrinsics in code:
1.) Hand-written intrinsic code (if we expose VP as C intrinsics in Clang, and/or somebody directly implements, say, a math library on top of VP).
We do not claim performance portability for VP code. If your actual target is AVX512 and you use VP intrinsics, do not use the `%evl` parameter (or know exactly how the expansion pass is going to lower it and exploit that).

2.) Optimization passes and (vectorizing) frontends
Vectorizers/frontends should query TTI to decide whether to use `%evl`.
For VL targets, the loop vectorizer could use `%evl` to implement tail-loop predication (as in the DAXPY example at https://www.sigarch.org/simd-instructions-considered-harmful/, linked by @lkcl).
For non-VL targets, make the iteration mask the root mask of all other predicates in the loop and set `%evl` to `-1`; both patterns are sketched below.
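
To make the two recommendations above concrete, here is a minimal fixed-width sketch in the style of the snippets above. The `.f64` mangling and the 8-element vectors follow the earlier example; the all-true mask, function names, and loop bookkeeping are illustrative only, not the output of any existing vectorizer:

  declare <8 x double> @llvm.vp.fadd.f64(<8 x double>, <8 x double>, <8 x i1>, i32)

  ; VL target: tail predication via %evl. The mask stays all-true and the
  ; trailing lanes of the last iteration are disabled by the vector length.
  define <8 x double> @vl_tail_step(<8 x double> %x, <8 x double> %y, i32 %i, i32 %n) {
    %remaining = sub i32 %n, %i
    %short = icmp ult i32 %remaining, 8
    %evl = select i1 %short, i32 %remaining, i32 8
    %sum = call <8 x double> @llvm.vp.fadd.f64(<8 x double> %x, <8 x double> %y, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, i32 %evl)
    ret <8 x double> %sum
  }

  ; Non-VL target: the iteration mask is the root of all predication and
  ; %evl is pinned to -1.
  define <8 x double> @masked_tail_step(<8 x double> %x, <8 x double> %y, <8 x i1> %iter.mask) {
    %sum = call <8 x double> @llvm.vp.fadd.f64(<8 x double> %x, <8 x double> %y, <8 x i1> %iter.mask, i32 -1)
    ret <8 x double> %sum
  }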

> Basically, I want to actively discourage front ends and optimizations from using the %evl argument in cases where it won't be optimal.

TTI would tell front ends and optimizations that `%evl` is a no-go for their target. Is this enough discouragement?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57504/new/

https://reviews.llvm.org/D57504




