[LLVMdev] "Machine LICM" for Constants?

Wed Mar 7 22:57:40 PST 2012

On Mar 7, 2012, at 6:28 PM, Matt Johnson <johnso87 at crhc.illinois.edu> wrote:

> Thanks for the tip!  I looked into it and it looks like the problem as of SVN HEAD is that the lui and ori instructions in Mips are considered cheap (1-cycle def-use latency) by MachineLICM::IsCheapInstruction(), but are not trivially materializable because their register operands are not always available.  This makes MachineLICM::IsProfitableToHoist() return false, preventing the hoist even though MachineLICM::IsLoopInvariantInst() returns true.
> 
> The comment in IsProfitableToHoist() is:
> 
> // If the instruction is cheap, only hoist if it is re-materilizable [sic]. LICM
> // will increase register pressure. It's probably not worth it if the
> // instruction is cheap.
> 
> The function then proceeds to actually *estimate* register pressure for non-cheap instructions to determine whether or not to hoist them.
> This heuristic seems reasonable, but doesn't seem to do the right thing in this case.  Hacking the instruction itineraries to make the instructions not seem cheap doesn't seem like the right answer either.  I'm guessing the motivation for this heuristic is that, in a loop with many possible hoists, some cheap and some expensive, we would prefer to hoist the expensive ones rather than wasting all our register slack on the cheap ones.
> 
> Is there another way to accomplish this goal while still performing the hoist in situations where register pressure is low enough?  Say, considering the instructions in a loop for hoisting in descending order of cost, rather than in program order?
> 
> Note that ARM gets around this by creating a pseudo-instruction for 32-bit immediate loads (MOVi32imm) , rather than putting a pattern directly in ARMInstrInfo.td.  This fused instruction *is* rematerializable (since it defines the entire register), even though either of the two half-register instructions by themselves cannot be.  This is one way my target and Mips could hack around the problem, but for my target at least it has the disadvantage of having to add an ExpandPseudo pass to my backend and put logic in C++ that seems (IMO) to belong in TableGen.

The pseudo instruction approach is the only quick solution I can think of. MachineLICM is designed to be conservative so I don't think we should change the register pressure heuristics.

Evan

> 
> -Matt
> 
> On 03/07/2012 02:45 PM, Evan Cheng wrote:
>> Yes machine-licm can and should hoist constant materialization instructions out of the loop. If it's not doing that, it's probably because the target is not modeling the instruction correctly. I would walk through MachineLICM::IsLoopInvariantInst() in the debugger to figure it out. You can also try compiling the same bitcode for a target like ARM or X86 as a comparison.
>> 
>> Evan
>> 
>> On Mar 7, 2012, at 10:38 AM, Matt Johnson wrote:
>> 
>>> Hi All,
>>>     I work on a backend for a target similar to Mips, where large
>>> immediates are loaded into registers with 2 instructions, 1 to load the
>>> MSBits and 1 to load the LSBits.  I've noticed a recurring pattern
>>> where, despite low register pressure, these constants will be
>>> rematerialized in every iteration of a loop, rather than being hoisted.
>>> Here's an example using the mips-unknown-unknown target and Clang/LLVM
>>> HEAD.  From newlib's implementation of strncat:
>>> 
>>> #define DETECTNULL(X) (((X) - 0x01010101)&  ~(X)&  0x80808080)
>>> while (!DETECTNULL (*aligned_s1))
>>>   aligned_s1++;
>>> 
>>> This loop gets lowered under -O3 to:
>>> 
>>> $BB0_5:
>>>   lui $3, 32896
>>>   lui $7, 65278
>>>   ori $3, $3, 32896 ###### Materialize 0x80808080
>>>   lw  $8, 4($2)
>>>   nop
>>>   and $9, $8, $3
>>>   ori $7, $7, 65279 ###### Materialize -(0x01010101)
>>>   addiu $2, $2, 4
>>>   xor $3, $9, $3
>>>   addu  $7, $8, $7
>>>   and $3, $3, $7
>>>   beq $3, $zero, $BB0_5
>>> 
>>> 
>>> There are a ton of unused caller-saved registers in this small function,
>>> so I expected the constant materialization to be hoisted out of the
>>> tight loop.  I'm still learning about the new register allocator and am
>>> not immediately able to make sense of its debug output (and the
>>> 'problem' may be elsewhere in any case).  I'm happy to post the results
>>> of -debug-only regalloc if they're useful.
>>> 
>>> Is my desire to hoist the constants out of the loop reasonable?  Is
>>> there something I can do (hints or passes in my backend, clang/opt flag,
>>> etc.) to make this happen today?  If not, what is the root cause?  Maybe
>>> there's no way to hoist things out of a loop once IR is lowered into a
>>> SelectionDAG?
>>> 
>>> Thanks,
>>> Matt
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev