Folding nodes in instruction selection - please review

Wed Jul 10 11:51:06 PDT 2013

On Jul 8, 2013, at 4:47 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:

> Hi,
>  
> I analyzed LLVM's algorithm for folding DAG nodes, compared its output to code generated by the Intel compiler, and came to the conclusion that the LLVM code is not always optimal.
> This is an example:
>  
>   %b1 = fadd <8 x float> %a1, <float AAA, float AAA, float AAA, float AAA, float AAA, float AAA, float AAA, float AAA >
>   %b2 = fadd <8 x float> %a2, <float AAA, float AAA, float AAA, float AAA, float AAA, float AAA, float AAA, float AAA >
>   %c = fmul <8 x float> %b1, %b2
>  
> The result (1) below is not better than (2): loading the constant is not a problem, but spilling %ymm2, which may be required in (1), is not cheap.
>  
> (1)
>         vmovaps .LCPI1_0(%rip), %ymm2
>         vaddps  %ymm2, %ymm1, %ymm1
>         vaddps  %ymm2, %ymm0, %ymm0
>         vmulps  %ymm1, %ymm0, %ymm0
>  
> (2)
>         vaddps  .LCPI1_0(%rip), %ymm1, %ymm1
>         vaddps  .LCPI1_0(%rip), %ymm0, %ymm0
>         vmulps  %ymm1, %ymm0, %ymm0

Hi Elena,

(2) has more micro-ops in the load/store unit, so I don’t believe that it is always better. The instruction selector can’t make this decision without knowing the register pressure.
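The trade-off can be illustrated with a toy cost model. This is a hypothetical sketch, not LLVM code: the function names are made up, and it assumes that keeping the constant in a register when none is free forces some other live value to be spilled (one store plus one reload). It counts only memory micro-ops for the two shapes, (1) and (2).

```cpp
// Hypothetical toy cost model, not part of LLVM: memory micro-ops needed to
// feed one constant-pool vector (e.g. .LCPI1_0) to `uses` instructions.
// folded == false models (1): a single vmovaps, then register operands.
// folded == true  models (2): every user carries its own memory operand.
int constantLoadOps(int uses, bool folded) {
  return folded ? uses : 1;
}

// Extra vector registers that must stay live across all the uses.
int extraRegsNeeded(bool folded) { return folded ? 0 : 1; }

// Memory micro-ops once register pressure is taken into account.
// Assumption: if (1) needs a register that is not free, some live value is
// spilled, costing one store and one reload.
int memOpsWithPressure(int uses, bool folded, int freeRegs) {
  int ops = constantLoadOps(uses, folded);
  if (freeRegs < extraRegsNeeded(folded))
    ops += 2;  // spill store + reload
  return ops;
}
```

With two uses, as in the fadd example, `memOpsWithPressure(2, false, 1)` is 1 and `memOpsWithPressure(2, true, 1)` is 2, so (1) wins while a register is free. With `freeRegs = 0` the costs become 3 and 2, and (2) wins. The right choice depends on a quantity the instruction selector does not know.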

The register allocator should turn (1) into (2) when it runs out of registers; it shouldn't spill %ymm2. Make sure that the MI::canFoldAsLoad() property is working correctly. See InlineSpiller.cpp:

  // Before rematerializing into a register for a single instruction, try to
  // fold a load into the instruction. That avoids allocating a new register.
  if (RM.OrigMI->canFoldAsLoad() &&
      foldMemoryOperand(Ops, RM.OrigMI)) {
    Edit->markRematerialized(RM.ParentVNI);
    ++NumFoldedLoads;
    return true;
  }
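The order of decisions in that snippet (fold first, rematerialize into a new register only if folding fails) can be sketched with simplified stand-in types. This is a hypothetical illustration, not LLVM's API: `Inst`, `hasMemoryForm`, `tryFoldLoad` and `handleUse` are invented for the example, and a real `foldMemoryOperand` consults target-specific folding tables rather than a single flag.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for a machine instruction; not LLVM's MachineInstr.
struct Inst {
  std::string opcode;          // e.g. "vmovaps" or "vaddps"
  bool canFoldAsLoad = false;  // def is a load that may be folded into users
  bool hasMemoryForm = false;  // user has a variant taking a memory operand
  std::string memOperand;      // e.g. ".LCPI1_0(%rip)"
};

// Try to replace the register operand of `user` with the memory operand of
// `def`, producing e.g.  vaddps .LCPI1_0(%rip), %ymm1, %ymm1
std::optional<Inst> tryFoldLoad(const Inst &def, const Inst &user) {
  if (!def.canFoldAsLoad || !user.hasMemoryForm || def.memOperand.empty())
    return std::nullopt;
  Inst folded = user;
  folded.memOperand = def.memOperand;
  return folded;
}

enum class Action { FoldedLoad, Rematerialized };

// Fold when possible; otherwise fall back to rematerializing `def` into a
// fresh register for this use (not modeled further here).
Action handleUse(const Inst &def, Inst &user) {
  if (auto folded = tryFoldLoad(def, user)) {
    user = *folded;
    return Action::FoldedLoad;
  }
  return Action::Rematerialized;
}
```

If the constant-pool load is not marked as foldable, this sketch always takes the rematerialization path, which is consistent with the suggestion to check that canFoldAsLoad() is set correctly.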

Thanks,
/jakob