[llvm-dev] [RFC] Simple GVN hoist
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Tue Sep 14 13:09:37 PDT 2021
Realized that the detailed nature of my response may have missed the
high order bit. I generally think that doing this transformation in GVN
is entirely reasonable, and would be a useful enhancement. That's why I
spent the time to give the detailed feedback on implementation.
Philip
p.s. It's worth noting that what we're implementing here is a restricted
form of "very busy expressions", which is basically the dual of PRE.
There's a bunch of existing literature on the general problem.
p.p.s. Once we're through with the hoisting variant, there's a
restricted sinking variant too. If we structure the hoisting code
right, the sinking should be straightforward.
On 9/14/21 12:12 PM, Philip Reames wrote:
>
> Ok, this response is focused on the proposed transform. See inline.
>
> On 9/14/21 7:19 AM, Momchil Velikov via llvm-dev wrote:
>> I was looking at a particularly hot function in SPEC's x264 benchmark that
>> LLVM fails to vectorise.
>>
>> typedef short int16_t;
>> typedef unsigned short uint16_t;
>>
>> int q(int16_t d[], uint16_t m[], uint16_t b[]) {
>>     int n = 0;
>>     for (int i = 0; i < 16; i++) {
>>         if (d[i] > 0)
>>             d[i] = (b[i] + d[i]) * m[i] >> 16;
>>         else
>>             d[i] = -((b[i] - d[i]) * m[i] >> 16);
>>         n |= d[i];
>>     }
>>     return n;
>> }
>>
>> As it turns out, LLVM adds runtime alias checks for this function and then
>> considers it legal to vectorise with if-conversion. However, the vectorisation
>> cost model effectively bans that particular if-conversion by assigning a
>> ridiculous cost to emulated masked loads and stores.
>>
>> Originally, each branch of the if statement in the loop body contains three
>> identical loads. Manually hoisting these allows the loop to be vectorised at
>> `-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation).
>> There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting of
>> instructions from the two successors of a block; it does indeed hoist two of
>> the three loads, but gives up before hoisting the third one.
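>>
>> For reference, the manually hoisted loop body looks roughly like this (just a
>> sketch of the source-level change, with illustrative names for the locals
>> holding the three hoisted loads):
>>
>>     for (int i = 0; i < 16; i++) {
>>         int bi = b[i], di = d[i], mi = m[i];  /* the three loads, hoisted */
>>         if (di > 0)
>>             d[i] = (bi + di) * mi >> 16;
>>         else
>>             d[i] = -((bi - di) * mi >> 16);
>>         n |= d[i];
>>     }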
>>
>> We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist`
>> can do that, but that pass is disabled by default and has been disabled for a
>> long, long time.
>
> I haven't looked at this pass recently, but I would strongly recommend
> distrusting the current implementation. The software engineering
> process that went into writing it doesn't give me any confidence. I
> remember a long string of enable-revert-bugfix without obvious
> justification as to why the bugs being found had to be found with such
> an expensive approach. If you do decide to pursue this, expect the
> need for a) targeted review and b) a bunch of dedicated fuzzing.
>
> Having said that, *conceptually* this pass takes the proper approach
> by using MemorySSA. Your current implementation has a bunch of
> complexity to basically do a local renumbering of memory
> instructions. Some of that can be improved (see below), but if you
> were doing this with MemorySSA, you wouldn't need to do anything fancy
> as GVN would have numbered the loads correctly. NewGVN is a MemorySSA
> version of GVN which I'd nominally say is the right place for this,
> but it has never been productionized. A huge blocker was MemorySSA
> itself which was recently productionized, but I don't know what's left
> for NewGVN itself. (Alina might be able to add context here?)
>
>> As an alternative, I was thinking of a simpler hoisting transformation that
>> just handles moving instructions from two single-predecessor blocks to their
>> common predecessor. That could be made reasonably fast by pairing instructions
>> by their GVN value number. Limiting hoisting to the immediate predecessor block
>> would, for the most part, avoid excessive increases in value lifetimes and
>> would also simplify the correctness checks.
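>>
>> Concretely, the shape being targeted is just a conditional branch whose two
>> successors each have the branching block as their single predecessor. A rough
>> sketch of the gating check (illustrative only; the helper name is made up and
>> this is not the actual code in the patch):
>>
>>     #include "llvm/IR/BasicBlock.h"
>>     #include "llvm/IR/Instructions.h"
>>
>>     // Is Pred a block we would try to hoist into?
>>     static bool isSimpleHoistCandidate(llvm::BasicBlock *Pred) {
>>       auto *BI = llvm::dyn_cast<llvm::BranchInst>(Pred->getTerminator());
>>       if (!BI || !BI->isConditional())
>>         return false;
>>       llvm::BasicBlock *S0 = BI->getSuccessor(0), *S1 = BI->getSuccessor(1);
>>       return S0 != S1 && S0->getSinglePredecessor() == Pred &&
>>              S1->getSinglePredecessor() == Pred;
>>     }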
>>
>> I've written such a transformation as a subroutine of `GVN`; it seemed like a
>> good place for it and is in a similar spirit to the various PREs that GVN does.
>> The Phabricator review is at https://reviews.llvm.org/D109760.
>
> The current structure of the algorithm you proposed can probably be
> simplified. Here's what I'd suggest:
>
> Start by handling all of the non-memory instructions. This should be
> as simple as constructing a hash map from value number to instruction
> for one block, and then scanning the other block. If you find the
> same value number in both blocks, the (by assumption scalar)
> instruction must exist in both blocks and can be hoisted. This can be
> reviewed, tested, submitted, etc.
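>
> In code, the scan might look roughly like this (a sketch only, assuming it
> runs inside GVN with its value table VN and dominator tree DT in scope, and
> that Succ0/Succ1 are the two successors and BI the conditional branch in
> their common predecessor; none of the names are the patch's actual code):
>
>    llvm::DenseMap<uint32_t, llvm::Instruction *> Leaders;
>    for (llvm::Instruction &I : *Succ0)
>      if (!I.mayReadOrWriteMemory() && !I.isTerminator() &&
>          !llvm::isa<llvm::PHINode>(I) &&
>          llvm::isSafeToSpeculativelyExecute(&I))
>        Leaders.try_emplace(VN.lookupOrAdd(&I), &I);
>
>    for (llvm::Instruction &I : *Succ1) {
>      if (I.mayReadOrWriteMemory() || I.isTerminator() ||
>          llvm::isa<llvm::PHINode>(I))
>        continue;
>      auto It = Leaders.find(VN.lookupOrAdd(&I));
>      if (It == Leaders.end())
>        continue;
>      // The hoisted copy's operands must already be available above the branch.
>      if (!llvm::all_of(It->second->operands(), [&](llvm::Value *Op) {
>            auto *OpI = llvm::dyn_cast<llvm::Instruction>(Op);
>            return !OpI || DT.dominates(OpI, BI);
>          }))
>        continue;
>      // Merge step: hoist It->second before BI and CSE I into it (sketched
>      // further down).
>    }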
>
> Then add a single pre-pass which "renumbers" all of the memory
> instructions. You can either use a trivial "no clobbers seen" like
> EarlyCSE does, or only handle loads where MD->getDependency() is
> nonlocal. (We already do this query in processLoad, so it should be
> cached.) The same algorithm as above can be used; all that changes is
> where the value number comes from. (The "pre-pass" can probably be
> implemented as a helper function which renumbers memory instructions
> on demand.)
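>
> For instance, the renumber-on-demand helper might look roughly like this (a
> sketch only; it assumes GVN's MemoryDependenceResults MD and value table VN
> are in scope, handles only simple loads, and hands out surrogate numbers
> keyed on the pointer operand's value number and the loaded type, since GVN's
> own table gives every load a fresh number):
>
>    llvm::DenseMap<std::pair<uint32_t, llvm::Type *>, uint32_t> LoadNumbers;
>    uint32_t NextLoadNum = ~0u;  // count down, away from VN's own numbers
>
>    auto renumberMemoryInst =
>        [&](llvm::Instruction *I) -> llvm::Optional<uint32_t> {
>      auto *LI = llvm::dyn_cast<llvm::LoadInst>(I);
>      if (!LI || !LI->isSimple())
>        return llvm::None;  // punt on stores, atomics, volatiles, calls, ...
>      // A non-local dependence means nothing earlier in this block clobbers
>      // the load, so moving it to the top of the block is fine w.r.t. memory.
>      if (!MD->getDependency(LI).isNonLocal())
>        return llvm::None;
>      auto Key = std::make_pair(VN.lookupOrAdd(LI->getPointerOperand()),
>                                LI->getType());
>      return LoadNumbers.try_emplace(Key, NextLoadNum--).first->second;
>    };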
>
> I would specifically *not* add an extra layer of iteration here. If
> you find you need iteration to handle the cases you care about, I
> suspect we should actually be adjusting the value re-numbering scheme
> for loads described just above.
>
> A couple of assorted notes:
>
> * MemDep should be responsible for handling all aliasing barriers. If it
>   returns a non-local dependence, reordering that instruction to the
>   beginning of the block should be legal w.r.t. memory. Your use of
>   isHoistBarrier is confusing to me. You may be trying to handle speculation
>   safety (i.e. avoid introducing faults), but if so, the special casing
>   around volatile and atomics seems odd? (Also, why allow hoisting of the
>   barrier instruction?) I suspect there's a particular case you're trying to
>   handle. I would strongly advise dropping that from the initial patch, then
>   adding back the complexity in a separate change which is well motivated
>   with tests, etc.
> * In terms of the actual merge logic, I'd suggest implementing it as
>   hoist-one, then CSE. That would allow you to reuse the existing patching
>   logic and avoid reimplementing a bunch of complexity.
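>
> A rough sketch of that merge step (illustrative only; the helper name is made
> up, and a real patch would reuse GVN's existing patching logic and also tell
> MemDep about the erased instruction):
>
>    #include "llvm/IR/Instructions.h"
>    #include "llvm/Transforms/Utils/Local.h"
>
>    // Keep I0, hoist it above the conditional branch BI in the common
>    // predecessor, and CSE the congruent copy I1 into it.
>    static void hoistAndCSE(llvm::Instruction *I0, llvm::Instruction *I1,
>                            llvm::BranchInst *BI) {
>      I0->moveBefore(BI);                    // hoist one copy ...
>      // Merge metadata and poison-generating flags as CSE would, since I0
>      // now covers both paths.
>      llvm::combineMetadataForCSE(I0, I1, /*DoesKMove=*/true);
>      I0->andIRFlags(I1);
>      I1->replaceAllUsesWith(I0);            // ... then CSE the other into it
>      I1->eraseFromParent();
>    }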
>
>
>> Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):
>>
>>   500.perlbench_r    1.13%
>>   502.gcc_r          0.00%
>>   505.mcf_r         -1.89%
>>   520.omnetpp_r      0.00%
>>   523.xalancbmk_r    0.00%
>>   525.x264_r         7.67%
>>   531.deepsjeng_r    0.60%
>>   541.leela_r        0.24%
>>   548.exchange2_r    0.00%
>>   557.xz_r           0.75%
> This looks generally positive. Do you know why MCF regresses? If
> not, taking a glance to make sure you understand the interaction is
> probably warranted.
>> Comments?
>>
>> ~chill