[llvm-dev] Generalizing load/store promotion in LICM
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Wed Sep 12 14:41:40 PDT 2018
I'm thinking about making some semi-radical changes to how load/store
promotion works in LICM, and figured it would be a good idea to get buy-in
before I actually start writing code. :)
TLDR: legality of sinking stores to exits is hard; can we separate the load
handling into a super aggressive form of PRE, and use predicated stores
to avoid solving the legality question?
Background
We've been seeing an interesting class of problems recently that looks
roughly like this:
for (int i = 0; i < N; i++)
  if (a[i] == 0)  // some data dependent check
    g_count++;    // conditional load and store to shared location
The critical detail in this example is that g_count is a global location
which may be accessed concurrently* by multiple threads. The challenge
here is that we don't know whether the store ever executes, and we're
not allowed to insert a store along any path that didn't originally
contain one. Said differently, if all elements in "a" are non-zero,
we're not allowed to store to g_count. We do know that the g_count
location is dereferenceable though.
(*Please, let's avoid the memory model derailment here. I'm simplifying,
and C++ language rules aren't really useful for my Java language frontend
anyways. In practice, all the accesses are atomic, but unordered, but
we'll leave that out of the discussion otherwise.)
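For contrast, here's roughly the form classic promotion would like to
produce; the unconditional store after the loop is exactly what we're not
allowed to introduce when no iteration actually stored:

int LocalCount = g_count;
for (int i = 0; i < N; i++)
  if (a[i] == 0)
    LocalCount++;
g_count = LocalCount;  // illegal: stores even when every a[i] was non-zero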
I have two approaches I'm considering here. These are orthogonal, but I
suspect we'll want to implement both.
Proposal A - Use predicated stores in loop exits
The basic idea is that we don't worry about solving the legality
question above, and just insert a store which is predicated on a
condition which is true exactly when the original store ran. In pseudo
code, this looks something like:
bool StoreLegal = false;
int LocalCount = g_count;
for (int i = 0; i < N; i++)
  if (a[i] == 0) {
    LocalCount++;
    StoreLegal = true;
  }
if (StoreLegal) g_count = LocalCount;
There are two obvious concerns here:
1. The predicated store might be expensive in practice - true for most
current architectures.
2. We're introducing an extra boolean phi cycle around the loop.
From talking to a couple of folks offline at the socials over the last few
months, the former seems to be the main objection. I think we can
control this by restricting the transform to loops believed to have a
high trip count where the conditional path is taken with some
frequency. Getting feedback on this particular point is one of my main
reasons for writing this mail.
The second objection can frequently be resolved by finding another
expression which evaluates to the same boolean. (In this case:
LocalCount != LocalCountOrig, where LocalCountOrig is the value
originally loaded from g_count, assuming i doesn't overflow.) We already
have a framework with SCEV to do these replacements, though from some
quick testing it would definitely need strengthening. However, SCEV
can't remove the extra phi in all cases, so we have to be okay with the
extra phi cycle in the general case. This seems worthwhile to me, but
thoughts?
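To make that concrete, here's a hand-written sketch (not necessarily what
the SCEV-based rewrite would produce) of the same loop with the boolean
phi replaced by the equivalent comparison; LocalCountOrig is just a name
for the value originally loaded from g_count:

int LocalCountOrig = g_count;
int LocalCount = LocalCountOrig;
for (int i = 0; i < N; i++)
  if (a[i] == 0)
    LocalCount++;
// Equivalent to the StoreLegal flag, assuming i doesn't overflow.
if (LocalCount != LocalCountOrig) g_count = LocalCount;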
Proposal B - Separate load and store handling into distinct transforms
(For folks I've discussed this with before, this part is all new.)
Thinking about the problem, it finally occurred to me that we can
decompose the original example into two steps: getting the loads out of
the loop, and sinking the stores out of the loop. If we can accomplish
the former but not the latter, we've still made meaningful progress.
So, what we'd essentially have is a load-only transformation which
produces this:
int LocalCount = g_count;
for (int i = 0; i < N; i++)
  if (a[i] == 0) {
    LocalCount++;
    g_count = LocalCount;
  }
At this point, we've reduced memory traffic by half, and opened up the
possibility that other parts of the optimizer can exploit the simplified
form. The last point is particularly interesting since we generally try
to canonicalize loads out of loops, and much of the optimizer is tuned
for a form with as much as possible being loop invariant. As one
example, completely by accident, there's some work going on in the
LoopVectorizer right now to handle such stores to loop invariant
addresses during vectorization. Putting the two pieces together would
let us fully vectorize this loop without needing to sink the stores at all.
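As a rough illustration only (this is hand-written, not actual
LoopVectorizer output, CountZeroElems is a made-up stand-in for the
vectorized compare-and-reduce, and I'm glossing over exactly when the
intermediate values become visible), the combined result might look
something like:

int LocalCount = g_count;
for (int i = 0; i < N; i += VF) {           // VF = vector width; remainder loop omitted
  int Matches = CountZeroElems(&a[i], VF);  // hypothetical: vectorized compare of VF elements against zero
  LocalCount += Matches;
  if (Matches != 0)
    g_count = LocalCount;                   // store to the loop invariant address stays in the loop
}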
In practice, the existing promotion implementation in LICM would split
cleanly into the load and store halves with little problem.
One way of looking at this load specific transform is as an extreme form
of PRE (partial redundancy elimination). Our current PRE implementation
doesn't handle cases this complicated.
Thoughts?
Philip