[PATCH] D143723: [RISCV] Increase default vectorizer LMUL to 2

Philip Reames via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 14 08:44:12 PST 2023


reames added a comment.

I chatted w/Luke about this offline before he posted the patch, but let me lay out the major concern and mitigation strategies here.

Moving from LMUL1 to LMUL2 increases the VF used during vectorization.  Since we don't currently tail fold by default, this means both that a) there are more loops too short to benefit from vectorization, and b) that the tail is (on average) longer.  Both of these could potentially negatively impact performance.  The former also means a higher fraction of the emitted vector code is wasted code size, since short-running loops never execute the vector body.

The major mitigation options are:

1. Make the vectorizer smarter about taking into consideration predicted loop length (from profiling), and dynamically select an LMUL to minimize chances of (a).   We could also make the vectorizer take into account known trip count modulos when selecting VF.  Both of these are tricky with scalable types, and we'd probably have to resort to heuristics using vscale-for-tuning.

2. Enable tail folding using masking.  This support exists in tree today, and with some quick testing appears to be reasonably robust.  The downside is that masking has a non-zero execution cost.  The major effect of this is to eliminate concern (b), though we're left with a profitability concern around (a).  The bypass heuristic here would probably need some careful tuning.  I don't think we need to enable tail folding *before* moving to LMUL2, but if we see regressions, this would probably be the first knob to try.
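For concreteness, a mask-based tail-folded loop body at LMUL2 would look roughly like the following.  This is a hand-written sketch, not compiler output; block and value names are illustrative:

```llvm
; Tail folded via masking: lanes past the trip count %n are disabled,
; so no scalar epilogue loop is needed.
vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %mask = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 %index, i64 %n)
  %gep = getelementptr inbounds i64, ptr %p, i64 %index
  %v = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr %gep, i32 8, <vscale x 2 x i1> %mask, <vscale x 2 x i64> poison)
  %add = add <vscale x 2 x i64> %v, %v
  call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> %add, ptr %gep, i32 8, <vscale x 2 x i1> %mask)
  ; VF = vscale * 2 for <vscale x 2 x i64> at LMUL2
  %vscale = call i64 @llvm.vscale.i64()
  %vf = shl i64 %vscale, 1
  %index.next = add i64 %index, %vf
  %done = icmp uge i64 %index.next, %n
  br i1 %done, label %exit, label %vector.body
```

The per-iteration get.active.lane.mask plus masked memory ops are where the "non-zero execution cost" above comes from.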

3. Pursue tail folding via VL (i.e. VP intrinsic based).  We've talked about this before, but it's a lot of work.  I'm hoping we don't need to have gotten all the way to VL based predication to enable LMUL2, but well, we'll see.
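For comparison, the VL-predicated form of the same loop would use VP intrinsics, with the tail handled by a final iteration that simply runs with a smaller EVL.  Again a hand-written sketch (the EVL computation here is a plain umin; the real lowering would presumably let the target pick the per-iteration VL):

```llvm
; Tail folded via VL: each iteration processes %evl elements, and the
; last iteration naturally processes the remainder.
vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %remaining = sub i64 %n, %index
  %vscale = call i64 @llvm.vscale.i64()
  %vf = shl i64 %vscale, 1
  %vl = call i64 @llvm.umin.i64(i64 %remaining, i64 %vf)
  %evl = trunc i64 %vl to i32
  %gep = getelementptr inbounds i64, ptr %p, i64 %index
  ; splat (i1 true) used as shorthand for an all-true mask constant
  %v = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr %gep, <vscale x 2 x i1> splat (i1 true), i32 %evl)
  %add = call <vscale x 2 x i64> @llvm.vp.add.nxv2i64(<vscale x 2 x i64> %v, <vscale x 2 x i64> %v, <vscale x 2 x i1> splat (i1 true), i32 %evl)
  call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> %add, ptr %gep, <vscale x 2 x i1> splat (i1 true), i32 %evl)
  %index.next = add i64 %index, %vl
  %done = icmp uge i64 %index.next, %n
  br i1 %done, label %exit, label %vector.body
```

This maps directly onto RVV's vsetvli-driven execution model, which is why it avoids the masking overhead of option 2, at the cost of the larger lowering effort noted above.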

Overall, I think it's reasonable to move forward with the change to LMUL2 (once the unrolling issue is fixed), but we need to be clear this is somewhat speculative and there's a very decent chance we may have to adjust course.



================
Comment at: llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll:32
+; DEFAULT-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[TMP10]], i32 0
+; DEFAULT-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP12]], align 4
+; DEFAULT-NEXT:    [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
----------------
craig.topper wrote:
> Why are we generating 2 loads, 2 adds, and 2 stores now? I thought this should only change the types, not the number of instructions generated
This is the same issue I noticed in the test change.  There appears to be an unexpected interaction between lmul and unrolling going on here.  


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143723/new/

https://reviews.llvm.org/D143723


