[PATCH] D129501: Redefine get.active.lane.mask to allow a more scalar lowering

Tue Jul 12 10:20:33 PDT 2022

reames added a comment.

In D129501#3644928 <https://reviews.llvm.org/D129501#3644928>, @SjoerdMeijer wrote:

> I haven't looked into details yet of the proposed change in semantics, but before I do that, my question is whether more efficient lowering is important at all (and thus if we want to put the effort in). Because the way I look at it is this: as soon as get.active.lane.mask is lowered by SelectionDAG something is "wrong". That is, it isn't picked up by a backend pass and lowered to a target intrinsic/instruction, or the intrinsic shouldn't have been emitted in the first place. The SelectionDAG lowering has always been a safety net just in case it wasn't lowered or missed. At least that was the idea when we added this. Not sure if things have changed or if there's another use-case.

I recently switched RISCV over to using get.active.lane.mask as the IR representation and letting the generic SDAG code do the lowering.  See https://reviews.llvm.org/D129221.  This was a preparation step for matching get.active.lane.mask during ISEL lowering so that some masks can be converted into VL predication.

I could reverse course here and make this change in the vectorizer, and then do pattern matching on the new compare in ISEL.  Doing the pattern match for the simple pattern (icmp ult step_vector, N) wouldn't be particularly hard.  Do you think that's a better overall approach?
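
For readers following along, here is a rough scalar model of the two forms being compared.  It is only a sketch under my reading of the LangRef wording (lane i is active iff base + i < n, evaluated in unbounded integers); the function names are hypothetical, and the poison case is modelled as an exception purely for illustration.

```python
def active_lane_mask(base, n, vf):
    """Reference model: lane i is active iff base + i < n (no integer wrap)."""
    return [base + i < n for i in range(vf)]


def scalar_lowering(base, n, vf):
    """Scalar-first form: one scalar subtract, then icmp ult step_vector, remaining."""
    if n < base:
        # Under the proposed wording this case is poison; modelled here as an error.
        raise ValueError("n < base: poison under the proposed semantics")
    remaining = n - base           # scalar subtract; cannot wrap because n >= base
    step_vector = list(range(vf))  # <0, 1, ..., vf-1>
    return [lane < remaining for lane in step_vector]
```

Whenever n >= base the two functions agree lane for lane, which is why the compare against a plain step_vector would be the only vector operation left to match.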

================
Comment at: llvm/docs/LangRef.rst:19962
+numbers and not in machine numbers.  If ``%n`` unsigned less than ``%base``, then
+the result is a poison value. The above is equivalent to:

----------------
efriedma wrote:
> A potential issue I see with this change is that it doesn't play well with unrolling in the vectorizer.  For example, if each iteration of a loop handles 8 elements at a time with vector width 4, you get two calls to llvm.get.active.lane.mask, I think.
Can you expand on why this would be a problem?  I would expect the base to be incremented by VF for the second unrolled copy, but that base still has to be less than the trip count (TC).

Hm, I'm assuming a loop with a scalar epilogue.  Is there maybe a problem here for tail folding?  I hadn't considered that, let me think a bit and get back to you.
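
A small sketch of the tail-folding scenario in question, assuming the unrolled copy `part` uses base = index + part * VF and that the tail-folded vector loop steps by VF * UF until the index reaches TC.  These assumptions and the helper names are mine, not taken from the vectorizer, and the thread does not confirm this is the exact case efriedma meant.

```python
def lane_mask_bases(index, vf, uf):
    """Base passed to each unrolled copy's lane mask (assumed: index + part * vf)."""
    return [index + part * vf for part in range(uf)]


def parts_with_base_past_trip_count(trip_count, vf, uf):
    """Return (index, part, base) for each unrolled copy whose base exceeds trip_count."""
    hits = []
    # Tail-folded loop: the last iteration may be partial, so step until index >= TC.
    for index in range(0, trip_count, vf * uf):
        for part, base in enumerate(lane_mask_bases(index, vf, uf)):
            if trip_count < base:  # n < base: poison under the proposed wording
                hits.append((index, part, base))
    return hits
```

With VF 4 and two unrolled copies, a trip count of 3 or 10 puts the second copy's base (4 or 12) past TC, while a trip count of 8 does not.  With a scalar epilogue the vector loop only runs full iterations, so every base stays below TC.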

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D129501/new/

https://reviews.llvm.org/D129501