[llvm-dev] [RFC] Simple GVN hoist

Wed Sep 15 04:16:46 PDT 2021

On Tue, Sep 14, 2021 at 7:17 PM Philip Reames via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> The vectorizer already has to do a forward walk through the function to compute predicate masks.  As we do that walk, we could keep track for each address of the union of predicate masks for each access.  In this case, the union would be all ones, meaning we could use an unconditional access.  In general, we can do a single predicated access using the union of the predicate masks.  This is a non-trivial costing question of whether explicitly forming the union is worthwhile, but I suspect we can handle obvious cases cheaply.
> Still in the vectorizer, we can do a forward walk and track accesses encountered along each path.  Any address which is accessed along each path to the latch can be done unconditionally.  (This is essentially a restricted subset of the former without the generalization to predicate masks.)
> You don't mention your target processor, but one question to ask is whether the cost model for a predicated load is reasonable.  If not, would reducing it to match the actual target cost fix your problem?  In particular, we have *multiple* memory accesses with the *same* mask here.  Does accounting for that difference in lowering cost sidestep the issue?
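
To make the first suggestion concrete, here is a minimal sketch of the
bookkeeping as I understand it. It is a toy model, not vectoriser code:
masks are concrete 8-lane bit patterns rather than the symbolic values
the vectoriser would track, addresses are plain strings, and the names
`AccessMasks`, `recordAccess`, `unionFor`, `isUnconditional` and
`exampleWalk` are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Assumption: one bit per vector lane, 8 lanes.
using Mask = std::uint8_t;
constexpr Mask AllOnes = 0xFF;

// Tracks, for each address, the union (bitwise OR) of the predicate
// masks of every access seen so far during a forward walk.
class AccessMasks {
  std::map<std::string, Mask> unionByAddr;

public:
  void recordAccess(const std::string &addr, Mask m) {
    unionByAddr[addr] |= m; // a missing entry starts out as 0
  }

  // Union of masks for addr, or 0 if the address was never accessed.
  Mask unionFor(const std::string &addr) const {
    auto it = unionByAddr.find(addr);
    return it == unionByAddr.end() ? Mask{0} : it->second;
  }

  // An all-ones union means every lane touches the address on some
  // path, so a single unconditional access would cover all of them.
  bool isUnconditional(const std::string &addr) const {
    return unionFor(addr) == AllOnes;
  }
};

// Hypothetical walk over an if/else: p[i] is accessed under cond in
// one arm and under ~cond in the other; q[i] only under cond.
inline AccessMasks exampleWalk(Mask cond) {
  AccessMasks am;
  am.recordAccess("p[i]", cond);
  am.recordAccess("p[i]", static_cast<Mask>(~cond));
  am.recordAccess("q[i]", cond);
  return am;
}
```

In this model `p[i]` ends up with an all-ones union for any value of
`cond`, while `q[i]` keeps the mask `cond` and would still need a
single predicated access.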

The target architecture is AArch64, and the cost model is indeed
entirely unreasonable: it deliberately assigns a high cost to
emulated masked loads/stores in order to disable vectorisation.
For the case at hand, though, I set out to find a way to avoid
dealing with the loads/stores in the first place, rather than handling
them better (by fixing the cost model or otherwise).

> We could extend SimplifyCFG.  You mention this, but don't get into *why* we don't handle the last load.  We really should in this case.  (Though, after CSE, there should only be two conditional loads in your inner loop?  Maybe I'm missing something?)

Two of the loads are hoisted by
`SimplifyCFGOpt::HoistThenElseCodeToIf`, which is intentionally
limited to hoisting identical instructions in identical order. In this
example, the scan of the two blocks stops at the first pair of
differing instructions, which happens to come before the third load.
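
The effect of that restriction can be shown with a small standalone
sketch. This is a simplified model, not the actual SimplifyCFG code:
instructions are plain strings, "identical" means equal text, and the
names `countHoistablePrefix`, `exampleThen` and `exampleElse` are
hypothetical. The example blocks are made up to have the same shape as
the case above (two common loads, a differing instruction, then a
third common load).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a basic block: each instruction is just its text.
using Block = std::vector<std::string>;

// Scan both blocks in lockstep and count the leading instructions that
// are pairwise identical. The scan stops at the first differing pair,
// so identical instructions after that point are never considered.
std::size_t countHoistablePrefix(const Block &thenBB, const Block &elseBB) {
  std::size_t n = 0;
  while (n < thenBB.size() && n < elseBB.size() && thenBB[n] == elseBB[n])
    ++n;
  return n;
}

// Made-up blocks: the third load is common to both arms but sits
// behind the first mismatch ("add" vs "sub").
inline Block exampleThen() {
  return {"load a", "load b", "add", "load c"};
}
inline Block exampleElse() {
  return {"load a", "load b", "sub", "load c"};
}
```

Here `countHoistablePrefix(exampleThen(), exampleElse())` is 2: the
first two loads are hoistable, and `load c` is missed even though it
appears in both blocks.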

> Your in-passing mention that -O3 and unrolling breaks vectorization also concerns me.  It really shouldn't.  That sounds like a probable issue in the SLP vectorizer, and maybe a pass order issue.

I would (maybe naively) think that the loop vectoriser should be given a
chance before loop unrolling and the SLP vectoriser.