[LLVMdev] MI scheduler produce badly code with inline function

Zakk zakk0610 at gmail.com
Tue Oct 15 21:28:03 PDT 2013


Hi Andy, thanks for your help!
With the new machine model, method A produces the same scheduled code as
method B.
That makes sense, but there is another problem: the scheduled code is
bad.

The load/store instructions always reuse the same register.

Source:

#define N  2000000
static double b[N], c[N];
void Scale () {
    double scalar = 3.0;
    for (int j=0;j<N;j++)
        b[j] = scalar*c[j];
}

$clang -O3 foo.c -static -S -o foo.s  -mllvm -unroll-count=4
-mcpu=cortex-a9 -fno-vectorize -fno-slp-vectorize --target=arm
-mfloat-abi=hard -mllvm -enable-misched -mllvm -scheditins=false

Code generated with the per-operand machine model:
Scale:
  push  {lr}
  movw  r12, :lower16:c
  movw  lr, :lower16:b
  movw  r3, #9216
  movt  r12, :upper16:c
  mov r1, #0
  vmov.f64  d16, #3.000000e+00
  movt  lr, :upper16:b
  movt  r3, #244
.LBB0_1:
  add r0, r12, r1
  add r2, lr, r1
  vldr  d17, [r0]
  add r1, r1, #32
  vmul.f64  d17, d17, d16
  cmp r1, r3
  vstr  d17, [r2]
  vldr  d17, [r0, #8]
  vmul.f64  d17, d17, d16
  vstr  d17, [r2, #8]
  vldr  d17, [r0, #16]
  vmul.f64  d17, d17, d16
  vstr  d17, [r2, #16]
  vldr  d17, [r0, #24]
  vmul.f64  d17, d17, d16
  vstr  d17, [r2, #24]
  vmul.f64  d17, d17, d16
  vstr  d17, [r2, #24]
  bne .LBB0_1
  pop {lr}
  bx  lr
.Ltmp0:

Using the itinerary generates better-scheduled code:
$clang -O3 foo.c -static -S -o foo.s -mllvm -unroll-count=4 -mcpu=cortex-a9
-fno-vectorize -fno-slp-vectorize --target=arm -mfloat-abi=hard -mllvm
-enable-misched

Scale:
  movw  r12, :lower16:c
  movw  r2, :lower16:b
  movw  r3, #9216
  movt  r12, :upper16:c
  mov r1, #0
  vmov.f64  d16, #3.000000e+00
  movt  r2, :upper16:b
  movt  r3, #244
.LBB0_1:
  add r0, r12, r1
  vldr  d17, [r0]
  vldr  d18, [r0, #8]
  vmul.f64  d17, d17, d16
  vldr  d19, [r0, #16]
  vldr  d20, [r0, #24]
  add r0, r2, r1
  vmul.f64  d18, d18, d16
  add r1, r1, #32
  cmp r1, r3
  vmul.f64  d19, d19, d16
  vmul.f64  d20, d20, d16
  vstmia  r0, {d17, d18, d19, d20}
  bne .LBB0_1
  bx  lr

Is this just because the A9's per-operand machine model is not implemented
well?
By the way, why do you want to use the new machine model for the MI scheduler?

Thanks,

Kind regards
Kuan-Hsu



2013/10/15 Andrew Trick <atrick at apple.com>

>
> On Oct 14, 2013, at 3:27 AM, Zakk <zakk0610 at gmail.com> wrote:
>
> Hi all,
> I ran into this problem when compiling the STREAM benchmark (
> http://www.cs.virginia.edu/stream/FTP/Code/) with -enable-misched.
>
> A small function is scheduled well on its own, but if opt inlines the
> function, the inlined part is scheduled badly.
>
>
> A bug for this is welcome. Pretty soon, I’ll be verifying A9 performance
> and changing the default scheduler. When I do this, I’ll be using the new
> machine model:
>
> (-mllvm) -sched-itins=false
>
> However, some scheduler changes are required for that mode to fully
> enforce pipeline hazards.
>
> So I wrote a simple test case, attached as a link (foo.c), and compiled it
> in two different ways:
>
> Method A:
> $clang -O3 foo.c -static -S -o foo.s -mllvm -enable-misched  -mllvm
> -unroll-count=4 --target=arm -mfloat-abi=hard -mcpu=cortex-a9
> -fno-vectorize -fno-slp-vectorize
>
> and
>
> Method B:
> $clang foo.c -S -emit-llvm -o foo.bc --target=arm -mfloat-abi=hard
> -mcpu=cortex-a9
> $opt foo.bc -O3 -unroll-count=4 -o foo.opt.bc
> $llc foo.opt.bc -o foo.opt.s -march=arm -mcpu=cortex-a9 -enable-misched
>
>
> You can try “clang -O3 -mllvm -disable-llvm-optzns …”. clang should
> generate the same bitcode, but skip the “opt” step.
>
> If that doesn’t work, it can be a nightmare trying to decompose the
> compilation steps with fidelity. You can try:
> - clang -### …
> - clang -mllvm -print-options …
> - Passing a full triple to all tools with -mtriple
> - Debug the TargetOptions fields
> - -print-after-all to see which phase is different
>
> Even if you get all the options right, the process of serializing and
> rereading the IR can affect the optimizations.
>
> Sorry. I’ve been trying to think of a way to improve this situation.
>
> -Andy
>
> (P.S. I checked with debug-pass=structure, so I think the two methods are
> equivalent.)
>
> But the results differ:
> in LBB1_4 of foo.s the same register is always reused for the
> computation, while in LBB1_4 of foo.opt.s it is not.
>
> My question is: how can I get the result of method B using only clang
> (method A)? Or am I missing something here?
>
> I really appreciate any help and suggestions.
> Thanks
>
> Kuan-Hsu
>
> ------- file link -------
> foo.c: http://goo.gl/nVa2K0
> foo.s: http://goo.gl/ML9eNj
> foo.opt.s: http://goo.gl/31PCnf
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>


-- 
Best regards,
Kuan-Hsu