[PATCH] D16435: [RS4GC] Effective rematerialization at non-entry polls

Fri Jan 22 17:06:04 PST 2016

Hi

On 2016-01-22 02:10, Philip Reames wrote:
> reames created this revision.
> reames added reviewers: JosephTremoulet, sanjoy, igor-laevsky, mjacob.
> reames added a subscriber: llvm-commits.
> Herald added subscribers: mcrosier, sanjoy, MatzeB.
> 
> This is an attempt at addressing 26223.  Specifically, try to avoid
> unfortunate register spilling by trying to place rematerializations
> introduced by rewrite-statepoints-for-gc in order to maximize folding
> and simplification opportunities rather than to minimize execution
> frequency.
> 
> If we have a bit of code like this:
> %addr = gep %o, 8
> loop {
>   if (poll) {
>      safepoint();
>   }
>   load %addr
> }
> 
> We currently end up rewriting this as:
> %addr = gep %o, 8
> loop {
>   %addr1 = phi (%addr, %addr2)
>   if (poll) {
>      safepoint();
>      %remat = gep %o.relocated, 8
>   }
>   %addr2 = phi (%addr1, %remat)
>   load %addr2
> }
> This ends up forcing us to rematerialize the address explicitly and
> likely will cause us to spill/fill the address if register
> constrained.  This creates a bunch of dependent loads (fill from
> stack, load from result) which show up as hot in a couple of
> benchmarks.
> 
> A much better result would be:
> %addr = gep %o, 8
> loop {
>   if (poll) {
>      safepoint();
>   }
>   %remat = gep %o.relocated, 8
>   load %remat
> }
> 
> This version allows the GEP to be folded directly into x86's native
> addressing modes.
> 
> (Note: For conciseness, I'm not writing the phis for relocating %o,
> assume they're all there.)
> 
> The particular heuristic chosen here is to push each given remat as
> late as possible.  This has the effect of moving remats closer to uses
> and preventing the creation of unnecessary and confusing PHI nodes.
> Empirically, this does appear to help in some of the benchmarks where I
> encountered this, but I'm getting increasingly uncomfortable with the
> coupling between RS4GC and CGP.  In particular, a better version of
> this heuristic is already present in CGP.
> 
> I think we should probably take this incremental step, but before
> going much further, factoring the code to share parts of the
> implementation of CGP might be a good idea.  The general problem is
> that many CGP transforms are hard to perform after RS4GC has run.  It
> may make sense to selectively run them beforehand.

I understand that RS4GC can make CGP's work harder.  However, I don't see
how that is relevant here.  What is the difference between doing the
addressing-mode-aware placement in RS4GC and doing it in CGP afterwards?
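
For concreteness, here is how I read the "push each remat as late as
possible" heuristic quoted above.  This is only a toy model, not code from
the patch: DomTree, nearestCommonDominator and latestRematBlock are
hypothetical names, blocks are plain integers, and I am assuming "latest"
means the deepest block that still dominates every use of the remat.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy dominator tree: blocks are indices, idom[b] is the
// immediate dominator of block b, and the entry block is its own idom.
struct DomTree {
  std::vector<int> idom;

  // Number of idom steps from block b up to the entry block.
  int depth(int b) const {
    int d = 0;
    while (idom[b] != b) {
      b = idom[b];
      ++d;
    }
    return d;
  }

  // Deepest block that dominates both a and b.
  int nearestCommonDominator(int a, int b) const {
    int da = depth(a), db = depth(b);
    while (da > db) { a = idom[a]; --da; }
    while (db > da) { b = idom[b]; --db; }
    while (a != b) { a = idom[a]; b = idom[b]; }
    return a;
  }
};

// Assumed reading of "as late as possible": place the remat in the
// deepest block that still dominates every block containing a use.
int latestRematBlock(const DomTree &dt, const std::vector<int> &useBlocks) {
  assert(!useBlocks.empty() && "a remat with no uses needs no placement");
  int best = useBlocks.front();
  for (int b : useBlocks)
    best = dt.nearestCommonDominator(best, b);
  return best;
}
```

With blocks numbered 0 = preheader, 1 = loop header, 2 = poll block and
3 = the block holding the load (idom = {0, 0, 1, 1}), a remat whose only
use is the load lands in block 3, next to its use, which matches the
"much better result" in the quoted example.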

-Manuel

> http://reviews.llvm.org/D16435
> 
> Files:
>   lib/Transforms/Scalar/RewriteStatepointsForGC.cpp
>   test/Transforms/RewriteStatepointsForGC/remat-schedule.ll
>   test/Transforms/RewriteStatepointsForGC/rematerialize-derived-pointers.ll