[llvm-dev] [RFC] Simple GVN hoist
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Tue Sep 14 13:09:37 PDT 2021
Realized that the detailed nature of my response may have missed the
high order bit. I generally think that doing this transformation in GVN
is entirely reasonable, and would be a useful enhancement. That's why I
spent the time to give the detailed feedback on implementation.
Philip
p.s. It's worth noting that what we're implementing here is a restricted
form of "very busy expressions", which is basically the dual of PRE.
There's a bunch of existing literature on the general problem.
p.p.s. Once we're through with the hoisting variant, there's a
restricted sinking variant too. If we structure the hoisting code
right, the sinking should be straightforward.
On 9/14/21 12:12 PM, Philip Reames wrote:
>
> Ok, this response is focused on the proposed transform. See inline.
>
> On 9/14/21 7:19 AM, Momchil Velikov via llvm-dev wrote:
>> I was looking at a particularly hot function in SPEC's x264 benchmark that
>> LLVM fails to vectorise.
>>
>> typedef short int16_t;
>> typedef unsigned short uint16_t;
>>
>> int q(int16_t d[], uint16_t m[], uint16_t b[]) {
>>     int n = 0;
>>     for (int i = 0; i < 16; i++) {
>>         if (d[i] > 0)
>>             d[i] = (b[i] + d[i]) * m[i] >> 16;
>>         else
>>             d[i] = -((b[i] - d[i]) * m[i] >> 16);
>>         n |= d[i];
>>     }
>>     return n;
>> }
>>
>> As it turns out, LLVM adds runtime alias checks for this function and then
>> considers it legal to vectorise with if-conversion. However, the vectorisation
>> cost model effectively bans that particular if-conversion by assigning a
>> ridiculous cost to emulated masked loads and stores.
>>
>> Originally, each branch of the if statement in the loop body contains three
>> identical loads. Manually hoisting these allows the loop to be vectorised at
>> `-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation).
>> There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting of
>> instructions from the two successors of a block; it does indeed hoist two of
>> the three loads, but gives up before hoisting the third one.
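>>
>> For reference, the manually hoisted loop body looks roughly like this (just a
>> sketch of the source-level change, with illustrative names for the locals
>> holding the three hoisted loads):
>>
>>     for (int i = 0; i < 16; i++) {
>>         int bi = b[i], di = d[i], mi = m[i];  /* the three loads, hoisted */
>>         if (di > 0)
>>             d[i] = (bi + di) * mi >> 16;
>>         else
>>             d[i] = -((bi - di) * mi >> 16);
>>         n |= d[i];
>>     }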
>>
>> We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist`
>> can do that, but that pass is disabled by default and has been disabled for a
>> long, long time.
>
> I haven't looked at this pass recently, but I would strongly recommend
> distrusting the current implementation. The software engineering
> process that went into writing it doesn't give me any confidence. I
> remember a long string of enable-revert-bugfix without obvious
> justification as to why the bugs being found had to be found with such
> an expensive approach. If you do decide to pursue this, expect the
> need for a) targeted review and b) a bunch of dedicated fuzzing.
>
> Having said that, *conceptually* this pass takes the proper approach
> by using MemorySSA. Your current implementation has a bunch of
> complexity to basically do a local renumbering of memory
> instructions. Some of that can be improved (see below), but if you
> were doing this with MemorySSA, you wouldn't need to do anything fancy
> as GVN would have numbered the loads correctly. NewGVN is a MemorySSA
> version of GVN which I'd nominally say is the right place for this,
> but it has never been productionized. A huge blocker was MemorySSA
> itself which was recently productionized, but I don't know what's left
> for NewGVN itself. (Alina might be able to add context here?)
>
>> As an alternative, I was thinking of a simpler hoisting transformation that
>> just handles moving instructions from two single-predecessor blocks to their
>> common predecessor. That could be made reasonably fast by pairing instructions
>> by their GVN value number. Limiting hoisting to the immediate predecessor block
>> would, for the most part, avoid excessive increases in value lifetimes and
>> would also simplify the correctness checks.
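>>
>> Concretely, the shape being targeted is just a conditional branch whose two
>> successors each have the branching block as their single predecessor. A rough
>> sketch of the gating check (illustrative only; the helper name is made up and
>> this is not the actual code in the patch):
>>
>>     #include "llvm/IR/BasicBlock.h"
>>     #include "llvm/IR/Instructions.h"
>>
>>     // Is Pred a block we would try to hoist into?
>>     static bool isSimpleHoistCandidate(llvm::BasicBlock *Pred) {
>>       auto *BI = llvm::dyn_cast<llvm::BranchInst>(Pred->getTerminator());
>>       if (!BI || !BI->isConditional())
>>         return false;
>>       llvm::BasicBlock *S0 = BI->getSuccessor(0), *S1 = BI->getSuccessor(1);
>>       return S0 != S1 && S0->getSinglePredecessor() == Pred &&
>>              S1->getSinglePredecessor() == Pred;
>>     }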
>>
>> I've written such a transformation as a subroutine of `GVN`; it seemed like a
>> good place for it and is in a similar spirit to the various PREs that GVN does.
>> The Phabricator review is at https://reviews.llvm.org/D109760.
>
> The current structure of the algorithm you proposed can probably be
> simplified. Here's what I'd suggest:
>
> Start by handling all of the non-memory instructions. This should be
> as simple as constructing a hash map from value number to instruction
> for one block, and then scanning the other block. If you find the
> same value number in both blocks, the (by assumption scalar)
> instruction must exist in both blocks and can be hoisted. This can be
> reviewed, tested, submitted, etc.
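>
> In code, the scan might look roughly like this (a sketch only, assuming it
> runs inside GVN with its value table VN and dominator tree DT in scope, and
> that Succ0/Succ1 are the two successors and BI the conditional branch in
> their common predecessor; none of the names are the patch's actual code):
>
>    llvm::DenseMap<uint32_t, llvm::Instruction *> Leaders;
>    for (llvm::Instruction &I : *Succ0)
>      if (!I.mayReadOrWriteMemory() && !I.isTerminator() &&
>          !llvm::isa<llvm::PHINode>(I) &&
>          llvm::isSafeToSpeculativelyExecute(&I))
>        Leaders.try_emplace(VN.lookupOrAdd(&I), &I);
>
>    for (llvm::Instruction &I : *Succ1) {
>      if (I.mayReadOrWriteMemory() || I.isTerminator() ||
>          llvm::isa<llvm::PHINode>(I))
>        continue;
>      auto It = Leaders.find(VN.lookupOrAdd(&I));
>      if (It == Leaders.end())
>        continue;
>      // The hoisted copy's operands must already be available above the branch.
>      if (!llvm::all_of(It->second->operands(), [&](llvm::Value *Op) {
>            auto *OpI = llvm::dyn_cast<llvm::Instruction>(Op);
>            return !OpI || DT.dominates(OpI, BI);
>          }))
>        continue;
>      // Merge step: hoist It->second before BI and CSE I into it (sketched
>      // further down).
>    }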
>
> Then add a single pre-pass which "renumbers" all of the memory
> instructions. You can either use a trivial "no clobbers seen" like
> EarlyCSE does, or only handle loads where MD->getDependency() is
> nonlocal. (We already do this query in processLoad, so it should be
> cached.) The same algorithm as above can be used; all that changes is
> where the value number comes from. (The "pre-pass" can probably be
> implemented as a helper function which renumbers memory instructions
> on demand.)
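>
> For instance, the renumber-on-demand helper might look roughly like this (a
> sketch only; it assumes GVN's MemoryDependenceResults MD and value table VN
> are in scope, handles only simple loads, and hands out surrogate numbers
> keyed on the pointer operand's value number and the loaded type, since GVN's
> own table gives every load a fresh number):
>
>    llvm::DenseMap<std::pair<uint32_t, llvm::Type *>, uint32_t> LoadNumbers;
>    uint32_t NextLoadNum = ~0u;  // count down, away from VN's own numbers
>
>    auto renumberMemoryInst =
>        [&](llvm::Instruction *I) -> llvm::Optional<uint32_t> {
>      auto *LI = llvm::dyn_cast<llvm::LoadInst>(I);
>      if (!LI || !LI->isSimple())
>        return llvm::None;  // punt on stores, atomics, volatiles, calls, ...
>      // A non-local dependence means nothing earlier in this block clobbers
>      // the load, so moving it to the top of the block is fine w.r.t. memory.
>      if (!MD->getDependency(LI).isNonLocal())
>        return llvm::None;
>      auto Key = std::make_pair(VN.lookupOrAdd(LI->getPointerOperand()),
>                                LI->getType());
>      return LoadNumbers.try_emplace(Key, NextLoadNum--).first->second;
>    };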
>
> I would specifically *not* add an extra layer of iteration here. If
> you find you need iteration to handle the cases you care about, I
> suspect we should actually be adjusting the value re-numbering scheme
> for loads described just above.
>
> A couple of assorted notes:
>
> * MemDep should be responsible for handling all aliasing barriers. If it
>   returns a non-local dependence, reordering that instruction to the
>   beginning of the block should be legal w.r.t. memory. Your use of
>   isHoistBarrier is confusing to me. You may be trying to handle speculation
>   safety (i.e. avoid introducing faults), but if so, the special casing
>   around volatile and atomics seems odd? (Also, why allow hoisting of the
>   barrier instruction?) I suspect there's a particular case you're trying to
>   handle. I would strongly advise dropping that from the initial patch, then
>   adding back the complexity in a separate change which is well motivated
>   with tests, etc.
> * In terms of the actual merge logic, I'd suggest implementing it as
>   hoist-one, then CSE. That would allow you to reuse the existing patching
>   logic and avoid reimplementing a bunch of complexity.
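>
> A rough sketch of that merge step (illustrative only; the helper name is made
> up, and a real patch would reuse GVN's existing patching logic and also tell
> MemDep about the erased instruction):
>
>    #include "llvm/IR/Instructions.h"
>    #include "llvm/Transforms/Utils/Local.h"
>
>    // Keep I0, hoist it above the conditional branch BI in the common
>    // predecessor, and CSE the congruent copy I1 into it.
>    static void hoistAndCSE(llvm::Instruction *I0, llvm::Instruction *I1,
>                            llvm::BranchInst *BI) {
>      I0->moveBefore(BI);                    // hoist one copy ...
>      // Merge metadata and poison-generating flags as CSE would, since I0
>      // now covers both paths.
>      llvm::combineMetadataForCSE(I0, I1, /*DoesKMove=*/true);
>      I0->andIRFlags(I1);
>      I1->replaceAllUsesWith(I0);            // ... then CSE the other into it
>      I1->eraseFromParent();
>    }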
>
>
>> Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):
>>
>>   500.perlbench_r    1.13%
>>   502.gcc_r          0.00%
>>   505.mcf_r         -1.89%
>>   520.omnetpp_r      0.00%
>>   523.xalancbmk_r    0.00%
>>   525.x264_r         7.67%
>>   531.deepsjeng_r    0.60%
>>   541.leela_r        0.24%
>>   548.exchange2_r    0.00%
>>   557.xz_r           0.75%
> This looks generally positive. Do you know why MCF regresses? If
> not, taking a glance to make sure you understand the interaction is
> probably warranted.
>> Comments?
>>
>> ~chill