Hi there, I noticed that our current implementation of the fast division transformation (turning `a / b` into `a * (1/b)`) is less precise than GCC's.  For example, on ppc64le:

        float fdiv(unsigned int a, unsigned int b) {
                return (float)a / (float)b;
        }

With Clang -Ofast the result is 41A00001 (in hex), while GCC produces 41A00000, which is the same as the result with optimizations disabled.

Currently, DAGCombiner uses `BuildReciprocalEstimate` to compute the reciprocal (`1/b`) first and then multiplies it by `a`.  But if we fold the operand `a` into the refinement iterations of the estimate function, the result would be more precise.

Such a change may break several existing test cases on different platforms, since this is target-independent code.  Any suggestions are welcome.  Thanks.

