Folding nodes in instruction selection - please review

Demikhovsky, Elena elena.demikhovsky at intel.com
Thu Jul 11 06:04:06 PDT 2013


I'm trying to do a slightly different thing, but I didn't explain myself correctly.
Let's forget about loads, I just have the following graph:

           A
           |
           B
         /   \
       C1     C2

And I have the patterns:

def : Pat<(C1 $src1, (B $src2, (A $src3))),
          (D1 $src1, $src2, $src3)>;

def : Pat<(C2 $src1, (B $src2, (A $src3))),
          (D2 $src1, $src2, $src3)>;

(C1 (B ..(A))) may be replaced with D1
(C2 (B ..(A))) may be replaced with D2

How does it work?
At OPC_CheckFoldableChainNode the matcher checks whether A may be folded into D1. But B has multiple uses, so the automatic answer in this case is "no".
I want to turn the "HasMultipleUses" check into a target-specific analysis that can inspect the opcodes involved. The default behavior will not change.

    case OPC_CheckFoldableChainNode: {
      assert(NodeStack.size() != 1 && "No parent node");
      // Verify that all intermediate nodes between the root and this one have
      // a single use.
      bool HasMultipleUses = false;
      for (unsigned i = 1, e = NodeStack.size()-1; i != e; ++i)
        if (!NodeStack[i].hasOneUse()) {
          HasMultipleUses = true;
          break;
        }
      if (HasMultipleUses) break;  // << I want to replace this "break" with deeper, target-specific analysis.


-  Elena


-----Original Message-----
From: Jakob Stoklund Olesen [mailto:stoklund at 2pi.dk] 
Sent: Wednesday, July 10, 2013 21:51
To: Demikhovsky, Elena
Cc: llvm-commits at cs.uiuc.edu; Nadav Rotem
Subject: Re: Folding nodes in instruction selection - please review


On Jul 8, 2013, at 4:47 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:

> Hi,
>  
> I analyzed the DAG node folding algorithm in LLVM, compared it to code generated by the Intel compiler, and came to the conclusion that the LLVM code is not always optimal.
> This is an example:
>  
>   %b1 = fadd <8 x float> %a1, <float AAA, float AAA, float AAA, float AAA, float AAA, float AAA, float AAA, float AAA >
>   %b2 = fadd <8 x float> %a2, <float AAA, float AAA, float AAA, float AAA, float AAA, float AAA, float AAA, float AAA >
>   %c = fmul <8 x float> %b1, %b2
>  
> The result (1) below is not better than (2): loading the constant is not a problem, but the spill of %ymm2 that may be required in (1) is not cheap.
>  
> (1)
>         vmovaps .LCPI1_0(%rip), %ymm2
>         vaddps  %ymm2, %ymm1, %ymm1
>         vaddps  %ymm2, %ymm0, %ymm0
>         vmulps  %ymm1, %ymm0, %ymm0
>  
> (2)
>         vaddps  .LCPI1_0(%rip), %ymm1, %ymm1
>         vaddps  .LCPI1_0(%rip),  %ymm0, %ymm0
>         vmulps  %ymm1, %ymm0, %ymm0

Hi Elena,

(2) has more micro-ops in the load/store unit, so I don't believe that it is always better. The instruction selector can't make this decision without knowing the register pressure.

The register allocator should turn (1) into (2) when it runs out of registers; it shouldn't spill %ymm2. Make sure that the MI::canFoldAsLoad() property is working correctly. See InlineSpiller.cpp:

  // Before rematerializing into a register for a single instruction, try to
  // fold a load into the instruction. That avoids allocating a new register.
  if (RM.OrigMI->canFoldAsLoad() &&
      foldMemoryOperand(Ops, RM.OrigMI)) {
    Edit->markRematerialized(RM.ParentVNI);
    ++NumFoldedLoads;
    return true;
  }

Thanks,
/jakob
