[LLVMbugs] [Bug 4637] New: Some FP optimization opportunities missed

bugzilla-daemon at cs.uiuc.edu
Tue Jul 28 00:19:48 PDT 2009


http://llvm.org/bugs/show_bug.cgi?id=4637

           Summary: Some FP optimization opportunities missed
           Product: new-bugs
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: new bugs
        AssignedTo: unassignedbugs at nondot.org
        ReportedBy: edwintorok at gmail.com
                CC: llvmbugs at cs.uiuc.edu


The following piece of code from Blender's fluid simulation accounts for
~20-25% of the fluid simulation time (with both gcc and llvm). It looks like
we are missing some optimization opportunities here:

void bar(float *y);
void foo(float rho, float uy, float ux, float usqr, float uz) {
float lcsmeq[19];
 lcsmeq[1 ] = ( (1.0/18.0)*(rho + uy*(4.5*uy + 3.0) - usqr)) ;
 lcsmeq[2 ] = ( (1.0/18.0)*(rho + uy*(4.5*uy - 3.0) - usqr)) ;
 lcsmeq[3 ] = ( (1.0/18.0)*(rho + ux*(4.5*ux + 3.0) - usqr)) ;
 lcsmeq[4 ] = ( (1.0/18.0)*(rho + ux*(4.5*ux - 3.0) - usqr)) ;
 lcsmeq[5 ] = ( (1.0/18.0)*(rho + uz*(4.5*uz + 3.0) - usqr)) ;
 lcsmeq[6 ] = ( (1.0/18.0)*(rho + uz*(4.5*uz - 3.0) - usqr)) ;
 lcsmeq[7] = ( (1.0/36.0)*(rho + (+ux+uy)*(4.5*(+ux+uy) + 3.0) - usqr));
 lcsmeq[8] = ( (1.0/36.0)*(rho + (-ux+uy)*(4.5*(-ux+uy) + 3.0) - usqr));
 lcsmeq[9] = ( (1.0/36.0)*(rho + (+ux-uy)*(4.5*(+ux-uy) + 3.0) - usqr));
 lcsmeq[10] = ( (1.0/36.0)*(rho + (-ux-uy)*(4.5*(-ux-uy) + 3.0) - usqr));
 lcsmeq[11] = ( (1.0/36.0)*(rho + (+uy+uz)*(4.5*(+uy+uz) + 3.0) - usqr));
 lcsmeq[12] = ( (1.0/36.0)*(rho + (+uy-uz)*(4.5*(+uy-uz) + 3.0) - usqr));
 lcsmeq[13] = ( (1.0/36.0)*(rho + (-uy+uz)*(4.5*(-uy+uz) + 3.0) - usqr));
 lcsmeq[14] = ( (1.0/36.0)*(rho + (-uy-uz)*(4.5*(-uy-uz) + 3.0) - usqr));
 lcsmeq[15] = ( (1.0/36.0)*(rho + (+ux+uz)*(4.5*(+ux+uz) + 3.0) - usqr));
 lcsmeq[16] = ( (1.0/36.0)*(rho + (+ux-uz)*(4.5*(+ux-uz) + 3.0) - usqr));
 lcsmeq[17] = ( (1.0/36.0)*(rho + (-ux+uz)*(4.5*(-ux+uz) + 3.0) - usqr));
 lcsmeq[18] = ( (1.0/36.0)*(rho + (-ux-uz)*(4.5*(-ux-uz) + 3.0) - usqr));
 bar(lcsmeq);
}

Attached is the .bc file we produce; the generated assembly contains sequences like this:
        cvtss2sd        %xmm1, %xmm5
        movsd   .LCPI1_0, %xmm6
        movapd  %xmm5, %xmm7
        mulsd   %xmm6, %xmm7
        movsd   .LCPI1_1, %xmm8
        movapd  %xmm7, %xmm9
        addsd   %xmm8, %xmm9
        mulsd   %xmm5, %xmm9
        cvtss2sd        %xmm0, %xmm0
        addsd   %xmm0, %xmm9
        cvtss2sd        %xmm3, %xmm3
        subsd   %xmm3, %xmm9
        movsd   .LCPI1_2, %xmm10
        mulsd   %xmm10, %xmm9
        cvtsd2ss        %xmm9, %xmm9
        movss   %xmm9, 16(%rsp)
        movsd   .LCPI1_3, %xmm9
        addsd   %xmm9, %xmm7
        mulsd   %xmm5, %xmm7
        addsd   %xmm0, %xmm7
        subsd   %xmm3, %xmm7
        mulsd   %xmm10, %xmm7
...

I see several opportunities to optimize here:

1) rho - usqr is recomputed for every array entry:

        %13 = fadd double %0, %12                 ; <double> [#uses=1]
        %14 = fsub double %13, %6                 ; <double> [#uses=1]
        %15 = fmul double %14, 0x3FAC71C71C71C71C ; <double> [#uses=1]
        %16 = fptrunc double %15 to float         ; <float> [#uses=1]

        %22 = fadd double %0, %21                 ; <double> [#uses=1]
        %23 = fsub double %22, %6                 ; <double> [#uses=1]
        %24 = fmul double %23, 0x3FAC71C71C71C71C ; <double> [#uses=1]
        %25 = fptrunc double %24 to float

Instead, a temporary could be introduced that stores the result of
(rho - usqr). FADD/FSUB isn't associative, so this transform is not safe in
general (unless unsafe-fpmath is enabled?), but in this case the result is
truncated to a float, so I think the following is sufficient to guarantee the
same results:

Calculate the rounding error introduced by applying associativity to the
fadd/fsub; it can only affect the last bit of the mantissa. Then apply
whatever operations are performed on the result (in this case, the multiply
by 1/18) and convert to float. If the resulting error is zero, then we can
apply associativity, since it only changes bits of the double's mantissa that
get truncated away anyway.
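
At the source level the transform would amount to something like the sketch
below (illustrative only, reusing the declarations above; rho_usqr is a name
I made up):

void foo_cse(float rho, float uy, float ux, float usqr, float uz) {
 float lcsmeq[19];
 /* hoisted common subexpression: computed once instead of once per entry */
 double rho_usqr = (double)rho - (double)usqr;
 lcsmeq[1] = (1.0/18.0)*(rho_usqr + uy*(4.5*uy + 3.0));
 lcsmeq[2] = (1.0/18.0)*(rho_usqr + uy*(4.5*uy - 3.0));
 /* ... entries 3 through 18 follow the same pattern ... */
 bar(lcsmeq);
}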

2) There are lots of floating-point extensions of fsub results:
        %60 = fsub float %uy, %ux               ; <float> [#uses=1]
        %61 = fpext float %60 to double         ; <double> [#uses=2]

But uy and ux are already available in extended form:
        %1 = fpext float %uy to double          ; <double> [#uses=3]
        %18 = fpext float %ux to double         ; <double> [#uses=3]

%61 could be calculated as:
%61 = fsub double %1, %18

Or would that violate some IEEE FP rules?
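
(For what it's worth, the two forms can differ. Take uy = 1.0f and
ux = 2^-30, both exactly representable as floats. The fsub float rounds the
exact difference 1 - 2^-30 to the nearest float, which is 1.0f, so the fpext
yields 1.0. The fsub double on the already-extended operands yields
1 - 2^-30 exactly, since that value fits in a double's 52-bit mantissa. So
the substitution changes the intermediate value whenever the correctly
rounded double difference is not itself a float; whether the final fptrunc
hides the difference again is the same double-rounding question as in
point 1.)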

3) This sort of code is a very good candidate for vectorization, since it
applies the exact same sequence of operations to different operands.
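
For example, entries 7 through 10 share one formula over the four
(±ux ±uy) combinations, so they could in principle be computed in a single
SSE register. A rough hand-written illustration (not what the compiler emits
today; it keeps everything in single precision, so it also bakes in the
reassociation and narrowing questions from points 1 and 2, and
lcsmeq_7_to_10 is a made-up name):

#include <xmmintrin.h>

/* Sketch: compute lcsmeq[7..10], one per SSE lane. */
static void lcsmeq_7_to_10(float rho, float ux, float uy, float usqr,
                           float *out /* receives lcsmeq[7..10] */)
{
 /* lanes 0..3 hold (+ux+uy), (-ux+uy), (+ux-uy), (-ux-uy) */
 __m128 u   = _mm_set_ps(-ux - uy, ux - uy, -ux + uy, ux + uy);
 /* t = 4.5*u + 3.0 */
 __m128 t   = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(4.5f), u),
                         _mm_set1_ps(3.0f));
 /* acc = (rho - usqr) + u*t  (the reassociated formula) */
 __m128 acc = _mm_add_ps(_mm_set1_ps(rho - usqr), _mm_mul_ps(u, t));
 /* scale by 1/36 and store the four results at once */
 _mm_storeu_ps(out, _mm_mul_ps(_mm_set1_ps(1.0f/36.0f), acc));
}

This would be called as lcsmeq_7_to_10(rho, ux, uy, usqr, &lcsmeq[7]); the
uy/uz and ux/uz groups would work the same way.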

If any of the above optimizations are still unsafe, then maybe we should
perform them at least under -ffast-math, or under -ffinite-math-only
-fno-trapping-math -fno-signaling-nans.
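
(For reference, that would mean compiling the attached testcase with
something like

 llvm-gcc -O3 -ffinite-math-only -fno-trapping-math -fno-signaling-nans foo.c

where the exact driver doesn't matter; these are the usual gcc flags.)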

