[PATCH] D109749: Experimental Partial Mem2Reg

Thu Oct 21 02:18:39 PDT 2021

huntergr added a comment.

In D109749#2999389 <https://reviews.llvm.org/D109749#2999389>, @lebedev.ri wrote:

> I have seen cases where this would be beneficial,
> some of those are just due to lack of inlining, but not all.
>
> I strongly believe this should be part of SROA,
> it should analyze the alloca's ignoring captures,
> and if it is otherwise promoteable, it should:
>
> 1. duplicate the original alloca (only for simplicity, this is fine since we know the old alloca goes away)
> 2. before each capture, load contents of the old alloca, and store it into new alloca
> 3. after each capture, load contents of the new alloca, and store it into old alloca
> 4. change captures to refer to the new alloca
> 5. run AggLoadStoreRewriter on the new alloca - so that all the uses of old alloca we've just introduced are analyzeable by SROA
> 6. proceed with normal handling of the old alloca - mem2reg will now succeed

Hi, thanks for the suggestion (and sorry for the delay in responding).

I've implemented something similar to what you've suggested, but with a slight difference to make it fit the problem at hand -- the openmp reduction present in the loop. There's a key difference which I didn't state in my initial summary (though was present in the unit test), which is the way the alloca is captured -- it's not directly passed as an argument to the function, but the pointer is instead stored into another local memory address first and the pointer for the second memory address is then passed to __kmpc_reduce_nowait. This leads to the code being somewhat messy, as I have to check that the store of the pointer dominates the call, that there aren't other uses of the second alloca that might interfere with conversion, etc.

The way that's done makes me wonder whether libomp needs a lighter-weight interface for reductions involving a single scalar value, rather than just a single generic interface which accepts an arbitrary number of reduction variables. (For comparison, I looked into what gcc does -- it passes a pointer to a shared reduction variable into the outlined function, and it just performs the atomic operation directly instead of calling to the runtime).

So I think that I'll repurpose this patch to only cover the direct case of an alloca being used in a call and separate out the libomp side of things for another patch. I'll update the diff once I've implemented that.

In D109749#2999682 <https://reviews.llvm.org/D109749#2999682>, @Meinersbur wrote:

> I agree this should be part of mem2reg/SROA unless there is a specific reason against it (e.g. computational complexity higher s.t. that it should not also run with every occurance of SROA/mem2reg in the default pipeline).
>
> Your motivational code looks like it should be processable by LICM, s.t. it is promoted to registers while in the loop, then vectorized. Do you know why this doesn't happen?

mem2reg handles promotion to registers, but for LICM specifically there's a couple of things which would stop it.

1. Although the address is loop invariant, the data isn't.
2. For this loop in particular, the store is conditional so might never happen. We *could* add a second boolean reduction to determine whether or not to actually perform a store after the loop, but that's a bit more complicated than just letting mem2reg do what it should.

In D109749#2999818 <https://reviews.llvm.org/D109749#2999818>, @jdoerfert wrote:

> I scanned the diff for `nosync` without hits. I doubt any of this reasoning is valid if I can have synchronization between threads.

That's part of the reason my original patch only changed uses before a capture (the other being possible aliasing within a thread -- a terrible idea, but someone somewhere has probably written something which relies on it). I could restrict it to avoid converting any allocas which use atomic operations.

In D109749#2999837 <https://reviews.llvm.org/D109749#2999837>, @jdoerfert wrote:

> That said, I think we need to use the fact that we know the value stored in the alloca is not captured. There was an email thread on this problem and email threads on how we could encode that it is not captured.
> Given that this occurs in the OpenMP context, `nosync` is probably not an alternative.

I think we can use Roman's approach when the alloca is passed as a 'nocapture' argument at least, which will give us some benefit even if it doesn't solve all of my initial problem. Do you agree?

I'm not sure about the best way of marking the store of the first alloca pointer into the second alloca's memory as nocapture, though. If we have a way of doing it then I can extend the work in a later patch to cover that case, or if not maybe we can change the way clang and libomp handle openmp reductions to make it easier to optimize outlined functions.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109749/new/

https://reviews.llvm.org/D109749