[PATCH] D92086: Generalized PatternMatch & InstSimplify

Simon Moll via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jun 10 04:25:24 PDT 2022


simoll added a comment.

This is really about two separate things: first, generalized pattern matching; second, vector predication (which is only one of the intrinsic sets that benefits from this).

In D92086#3572881 <https://reviews.llvm.org/D92086#3572881>, @nikic wrote:

> This fails to build with:
>
>> lib/libLLVMInstCombine.a(InstructionCombining.cpp.o):InstructionCombining.cpp:function llvm::InstCombinerImpl::visitGEPOfGEP(llvm::GetElementPtrInst&, llvm::GEPOperator*): error: undefined reference to 'llvm::Value* llvm::simplifyAddInst(llvm::Value*, llvm::Value*, bool, bool, llvm::SimplifyQuery const&, llvm::MatcherContext&)'
>
> Probably due to illegal use of `inline`. You need to use `static inline` if you don't add `extern inline` definitions. (Or possibly a missing explicit template instantiation, but I think that wouldn't be used by InstCombine with this patch.)

It compiles on my system; I will fix this with the next patch update.
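
For context, the linkage rule behind such an error in a nutshell (a generic sketch with made-up names, not the actual code of this patch):

  // simplify.h
  // C++ requires an inline function to be defined in every translation unit
  // that uses it. Declaring it here while defining it 'inline' in only one
  // .cpp file gives other TUs nothing to link against.
  llvm::Value *simplifyFoo(llvm::Value *LHS, llvm::Value *RHS); // hypothetical

  // simplify.cpp
  inline llvm::Value *simplifyFoo(llvm::Value *LHS, llvm::Value *RHS) {
    return LHS == RHS ? LHS : nullptr; // 'inline' definition, never exported
  }

  // combine.cpp -- calls simplifyFoo() through the header and fails to link:
  //   undefined reference to 'llvm::Value* simplifyFoo(llvm::Value*, llvm::Value*)'
  // Fixes: drop 'inline' from the out-of-line definition, move the definition
  // (static inline) into the header, or add an explicit instantiation, as
  // @nikic suggests.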

> I'm generally very skeptical about this proposal. This adds a lot of additional complexity to the area of LLVM that likely already has the highest ratio between lines of code changed and review time / amount of tests. After adding tests for baseline behavior, 16 commuted variants, nowrap/FMF variations, vector splat, undef-splat and non-splat variations, and negative tests for all used conditions (including all pattern match conditions), I definitely would rather not add variation tests for VP and constrained FP intrinsics (with which I am not familiar) as well.

If we want to optimize all of those new intrinsics (matrix, constrained, saturating, ...), then we will have to add new logic to do so somewhere.
You can move the complexity elsewhere, but you cannot get rid of it. I am advocating here for reusing the existing code instead of replicating the same or similar pattern-rewriting logic elsewhere, once per intrinsic set.
Regarding coverage, we could auto-generate intrinsic tests from the instruction combiner tests. The test generators could also explicitly target issues such as the ones you describe next (masked-out zero lanes).
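
To make the reuse argument concrete, this is roughly what a single generalized fold could look like (the context-taking `match()` overload and the context type are only illustrative here, not the exact API of this patch):

  #include "llvm/IR/PatternMatch.h"
  #include "llvm/IR/Value.h"

  // One fold, written once against a matcher context. Instantiated with an
  // "empty" context it behaves exactly like today's m_*() matchers on plain
  // instructions; instantiated with a VP or constrained-FP context the same
  // code applies the fold to the corresponding intrinsics.
  template <typename ContextT>
  static llvm::Value *simplifySubThenAdd(llvm::Value *Op0, llvm::Value *Op1,
                                         ContextT &Ctx) {
    using namespace llvm::PatternMatch;
    llvm::Value *X;
    // (X - Y) + Y --> X, where "sub"/"add" may be plain instructions or the
    // matching predicated/constrained intrinsics, depending on the context.
    if (match(Op0, Ctx, m_Sub(m_Value(X), m_Specific(Op1)))) // hypothetical
      return X;                                              // context-aware match()
    return nullptr;
  }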

> I guess one could make an argument that if one uses the generalized matchers, there is nothing that could possibly go wrong, and there is no need to actually test these pattern or understand their correctness. I'm doubtful that this is actually true though. It's almost certainly not true for InstCombine, but even in InstSimplify I expect that there would be some dangerous interaction with predicated div/rem opcodes. Presumably, a zero value in a masked out lane does not render the program undefined, and as such an InstSimplify fold to a poison value would be incorrect.

That's a fair point; the traits have to be carefully crafted so as not to allow invalid rewrites. I see a clear strategy here: we start with very strict traits that bail out of pattern matching early as soon as we step outside our comfort zone. E.g., a trait could initially bail around divisions while we study the problem and find the right trait logic to make them work (see the strawman below).
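
As a strawman (the trait and its hook are made up for illustration, not part of the patch):

  #include "llvm/IR/Instruction.h"

  // Strawman trait: refuse to match across predicated division/remainder at
  // all, so no generalized fold can turn a masked-off zero lane into poison.
  // Once the semantics of masked-out lanes are pinned down, the bail-out can
  // be narrowed or removed.
  struct ConservativeVPTrait {
    // The generalized matcher would ask the trait before descending into an
    // operation; returning false makes it bail out immediately.
    static bool acceptsOpcode(unsigned Opcode) {
      switch (Opcode) {
      case llvm::Instruction::UDiv:
      case llvm::Instruction::SDiv:
      case llvm::Instruction::URem:
      case llvm::Instruction::SRem:
        return false; // stay inside the comfort zone for now
      default:
        return true;  // everything else remains fair game
      }
    }
  };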

> Another concern here is compile-time, but I won't be able to test that until there is an actually buildable patch.

Agreed, and also code size if you instantiate for too many traits. I see two approaches to mitigating these issues:

- We can use virtual dispatch instead of template instantiation (or template-instantiate only twice: once for trait-less pattern matching and once more for virtual dispatch).
- We can have a CMake variable that controls instantiation: if your distribution does not care about constrained FP or VP, say, you simply don't instantiate for it and won't see the compile-time or code-size increase. I was hinting in that direction with the `EnabledTraits.def` file (see the sketch below).
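
For the second point, the layout could be something like this (the macro name and CMake-controlled defines are hypothetical; only the `EnabledTraits.def` idea itself is in the patch):

  // EnabledTraits.def -- X-macro list in the spirit of other LLVM .def files.
  // Each entry instantiates the generalized matchers/simplifiers for one trait;
  // a distribution that never touches VP or constrained FP simply leaves the
  // corresponding entries out and pays no extra compile time or code size.
  MATCHER_TRAIT(EmptyTrait)              // plain IR instructions, always enabled
  #ifdef LLVM_ENABLE_VP_MATCHER_TRAIT    // hypothetical CMake-controlled define
  MATCHER_TRAIT(VPTrait)
  #endif
  #ifdef LLVM_ENABLE_CONSTRAINED_FP_MATCHER_TRAIT
  MATCHER_TRAIT(ConstrainedFPTrait)
  #endif
  #undef MATCHER_TRAIT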



In D92086#3572881 <https://reviews.llvm.org/D92086#3572881>, @nikic wrote:

> A possibly stupid question from someone who has not followed the VP design: Why does the predication need to happen on every single instruction? Naively, I'd assume that predication is only important for instructions that have trapping behavior, which is basically just load/store and div/rem class intrinsics. For everything else, it should be fine to represent this as a normal vector add, and then a single masking operation at the end. One could then propagate that masking backwards in the backend. That should ensure that normal instructions optimize fine, while instructions where the masking is semantically relevant (like div/rem) do not blindly participate in optimizations.
>
> So basically, `vp.add(X, Y, M)` can be decomposed into `vp.mask(add(X, Y), M)`, where the `vp.mask` intrinsic can be pushed down and combined with other mask intrinsics. So `vp.add(vp.add(X, Y, M), Z, M)` becomes `vp.mask(add(vp.mask(add(X, Y), M), Z), M)` becomes `vp.mask(vp.mask(add(add(X, Y), Z), M), M)` becomes `vp.mask(add(add(X, Y), Z), M)`, at which point `add(add(X, Y), Z)` optimizes as usual.

VP intrinsics exist primarily to provide a way to express the dynamic vector length for architectures like the RISC-V V extension and SX-Aurora (the VE target).
Every single instruction can have a different vector length, and that parameter has performance implications (latency).
There is vector code that exploits vector-length switching for better performance (to blend lanes, to compress sparse vectors and then operate on the dense data, etc.).
The mask does not strictly need to be there for every instruction; integer add/sub are examples of this. Others, such as sdiv/srem, strictly need it. However, you can also exploit the mask, e.g. by using a compressed_store/expanding_load when a sparse vector register is spilled. Where the mask is not needed, you can always pass in an all-ones mask to disable it.
The long-term strategy of the VP proposal is to push the mask and `evl` parameters into something like operand bundles that you can simply tag onto regular instructions; the intrinsics are just a bridge technology, if you will. Before we can do that, however, LLVM has to be capable of optimizing masked instructions (we cannot ignore predication, for the same reason we cannot have overly permissive traits). Making LLVM predication-ready is what prompted generalized pattern matching.
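
To illustrate the per-instruction predicate that generalized folds have to respect (the helper below is a hypothetical sketch; the `VPIntrinsic` accessors it uses do exist):

  #include "llvm/IR/IntrinsicInst.h"

  // Two VP operations can only be combined/reassociated by a generalized fold
  // if they agree on the whole predicate: both the mask and the dynamic vector
  // length (%evl), since either one may legally differ between instructions.
  static bool haveSamePredicate(const llvm::VPIntrinsic &A,
                                const llvm::VPIntrinsic &B) {
    return A.getMaskParam() == B.getMaskParam() &&
           A.getVectorLengthParam() == B.getVectorLengthParam();
  }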


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D92086/new/

https://reviews.llvm.org/D92086


