[LLVMdev] MI scheduler produces bad code with inlined function

Zakk zakk0610 at gmail.com
Thu Nov 7 02:38:58 PST 2013


Hi all,
The problem is that the stack-coloring pass affects the MI scheduler,
so I pass the -no-stack-coloring option to work around it.

I don't know whether this is a potential bug or I am missing something,
but I found this "old" mail (http://marc.info/?l=llvm-commits&m=134635331431884&w=2),
which says: "Merging stack slots before the MI scheduler could
invalidate this form of alias analysis since two IR allocas can share a
stack slot."
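
To illustrate what that mail describes (this is just my own sketch in C, not
the original test case): two allocas whose lifetimes do not overlap can be
folded into one stack slot, and an alias analysis that still treats the two
IR allocas as distinct objects could then let the scheduler reorder memory
accesses that actually hit the same slot.

/* Sketch only: a[] and b[] have disjoint lifetimes, so stack coloring
 * may assign them the same stack slot.  After merging, an access through
 * b[] and an access through a[] can touch the same memory even though
 * the two IR allocas look independent. */
double shared_slot_example(void) {
  double s = 0.0;
  {
    double a[16];
    a[0] = 1.0;
    s += a[0];      /* last use of a[] */
  }
  {
    double b[16];   /* may reuse the stack slot of a[] */
    b[0] = 2.0;
    s += b[0];
  }
  return s;
}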


===========================
simple test program: http://goo.gl/zxmvOV

MI scheduler with new model:

foo.new.noSC.s  (-no-stack-coloring -scheditins=false)
.LBB2_4:
  add r1, r8, r0
  add r2, r4, r0
  vldr  d16, [r1]
  add r0, r0, #128
  vldr  d17, [r2]
  cmp r0, r10
  vmla.f64  d17, d16, d10   <====
  vldr  d16, [r1, #8]
  vstr  d17, [r2]           <====
  vldr  d17, [r2, #8]
  vmla.f64  d17, d16, d10
  vldr  d16, [r1, #16]
  vstr  d17, [r2, #8]
  vldr  d17, [r2, #16]
  vmla.f64  d17, d16, d10
  vldr  d16, [r1, #24]
  vstr  d17, [r2, #16]
  vldr  d17, [r2, #24]
  vmla.f64  d17, d16, d10
...

foo.new.s   (-scheditins=false)
  .LBB2_4:
    add r1, r4, r0
    add r2, r6, r0
    vldr  d16, [r1]
    add r0, r0, #128
    vldr  d17, [r2]
    cmp r0, r8
    vmla.f64  d17, d16, d10 <====
    vstr  d17, [r2]  <====
    vldr  d16, [r1, #8]
    vldr  d17, [r2, #8]
    vmla.f64  d17, d16, d10
    vstr  d17, [r2, #8]
    vldr  d16, [r1, #16]
    vldr  d17, [r2, #16]
    vmla.f64  d17, d16, d10
    vstr  d17, [r2, #16]
    vldr  d16, [r1, #24]
    vldr  d17, [r2, #24]
    vmla.f64  d17, d16, d10

MI scheduler with Itinerary model:

foo.old.noSC.s  (-no-stack-coloring)
.LBB2_4:
  add r1, r8, r0
  vldr  d22, [r1, #48]
  vldr  d23, [r1, #56]<==
  vldr  d24, [r1, #64]
....
  vmul.f64  d22, d22, d9
  vmul.f64  d23, d23, d9<==
  vmul.f64  d24, d24, d9
...

foo.old.s   (always reuses d16)
.LBB2_4:
  add r1, r4, r0
  add r2, r6, r0
  vldr  d16, [r1]
  add r0, r0, #128
  cmp r0, r9
  vmul.f64  d16, d16, d9
  vstr  d16, [r2]
  vldr  d16, [r1, #8] <==
  vmul.f64  d16, d16, d9 <==
  vstr  d16, [r2, #8] <==
  vldr  d16, [r1, #16]
  vmul.f64  d16, d16, d9
  vstr  d16, [r2, #16]
  vldr  d16, [r1, #24]
  vmul.f64  d16, d16, d9


By the way, has anyone used the new machine model to describe an in-order
pipelined machine?
Does the new model support that now?
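
In case it helps to see what I mean, this is the rough direction I have been
guessing at (the model and resource names below are made up, it assumes a
WriteALU SchedWrite is already defined for the target, and I am not sure these
are the right knobs):

// Sketch only: a single-issue, in-order core in the new machine model.
def MyInOrderModel : SchedMachineModel {
  let IssueWidth = 1;          // issue one instruction per cycle
  let MicroOpBufferSize = 0;   // no out-of-order buffering, i.e. in-order
  let LoadLatency = 2;
}

// Only the execute stage is modeled as a resource; the other stages are
// expressed through latency instead of separate resources.
let BufferSize = 0 in
def MyEX : ProcResource<1>;

let SchedModel = MyInOrderModel in {
  def : WriteRes<WriteALU, [MyEX]> { let Latency = 1; }
}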


Thanks,
Kuan-Hsu



2013/10/21 Zakk <zakk0610 at gmail.com>

> Hi Andy, I'm working on defining a new machine model for my target,
> but I don't understand how to define an in-order machine (reservation
> tables) in the new model.
>
> For example, if the target has IF, ID, EX, and WB stages,
>
> should I do this:
>
> let BufferSize = 0 in {
>   def IF : ProcResource<1>;   // one unbuffered unit per pipeline stage
>   def ID : ProcResource<1>;
>   def EX : ProcResource<1>;
>   def WB : ProcResource<1>;
> }
> def : WriteRes<WriteALU, [IF, ID, EX, WB]>;
> or
>
> should I define each stage as a SchedWrite type and use WriteSequence to
> chain them?
>
> Thanks,
> Kuan-Hsu
>
>
> 2013/10/16 Andrew Trick <atrick at apple.com>
>
>>
>> On Oct 15, 2013, at 9:28 PM, Zakk <zakk0610 at gmail.com> wrote:
>>
>> Hi Andy, thanks for your help!
>> The code scheduled by method A is the same as by method B when using the new
>> machine model.
>> That makes sense, but there is another problem: the scheduling is bad.
>>
>> The load/store instructions always reuse the same register.
>>
>>
>> I filed PR17593 with this information. However, I see opposite results
>> from what you’re expecting. The code that uses fewer registers runs 4%
>> faster on my cortex-a9. The integer unit is out-of-order.
>>
>> Is this just because the A9's per-operand machine model is not implemented
>> well?
>> By the way, why do you want to use the new machine model for mi-sched?
>>
>>
>> I want to move all the targets we support to the new machine model so it
>> will be easier to maintain the scheduler. Additionally, the new model is
>> much more efficient and simpler (if you don’t use special features). It is
>> also correct for both preRA and postRA. Note that in the case of A9, the
>> .td file for the new machine model is horribly complicated because it
>> handles load multiple instructions. The A9 itinerary doesn’t even attempt
>> to do that. (This was done mainly to demonstrate the feature set of the new
>> model, not because it’s terribly important). The new model for A9 is also
>> complicated by a mapping from the old itinerary classes to the new machine
>> model.
>>
>> -Andy
>>
>> Thanks,
>>
>> Kind regards
>> Kuan-Hsu
>>
>>
>>
>> 2013/10/15 Andrew Trick <atrick at apple.com>
>>
>>>
>>> On Oct 14, 2013, at 3:27 AM, Zakk <zakk0610 at gmail.com> wrote:
>>>
>>> Hi all,
>>> I ran into this problem when compiling the STREAM benchmark (
>>> http://www.cs.virginia.edu/stream/FTP/Code/) with -enable-misched.
>>>
>>> The small function is scheduled well on its own, but when opt inlines
>>> it, the inlined part is scheduled badly.
>>>
>>>
>>> A bug for this is welcome. Pretty soon, I’ll be verifying A9 performance
>>> and changing the default scheduler. When I do this, I’ll be using the new
>>> machine model:
>>>
>>> (-mllvm) -sched-itins=false
>>>
>>> However, some scheduler changes are required for that mode to fully
>>> enforce pipeline hazards.
>>>
>>> So I wrote a simple test case (foo.c, linked below) and compiled it with
>>> two different methods:
>>>
>>> method A:
>>> $clang -O3 foo.c -static -S -o foo.s -mllvm -enable-misched -mllvm
>>> -unroll-count=4 --target=arm -mfloat-abi=hard -mcpu=cortex-a9
>>> -fno-vectorize -fno-slp-vectorize
>>>
>>> and
>>>
>>> method B:
>>>
>>> $clang foo.c -S -emit-llvm -o foo.bc --target=arm -mfloat-abi=hard
>>> -mcpu=cortex-a9
>>> $opt foo.bc -O3 -unroll-count=4 -o foo.opt.bc
>>> $llc foo.opt.bc -o foo.opt.s -march=arm -mcpu=cortex-a9 -enable-misched
>>>
>>>
>>> You can try “clang -O3 -mllvm -disable-llvm-optzns …”. clang should
>>> generate the same bitcode, but skip the “opt” step.
>>>
>>> If that doesn’t work, it can be a nightmare trying to decompose the
>>> compilation steps with fidelity. You can try:
>>> - clang -### …
>>> - clang -mllvm -print-options …
>>> - Passing a full triple to all tools with -mtriple
>>> - Debug the TargetOptions fields
>>> - -print-after-all to see which phase is different
>>>
>>> Even if you get all the options right, the process of serializing and
>>> rereading the IR can affect the optimizations.
>>>
>>> Sorry. I’ve been trying to think of a way to improve this situation.
>>>
>>> -Andy
>>>
>>> (P.S. I had checked with -debug-pass=Structure, so I think they are
>>> equivalent.)
>>>
>>> But the results are different:
>>> In .LBB1_4 of foo.s, the code always reuses the same register for the
>>> computation, while .LBB1_4 of foo.opt.s does not.
>>>
>>> My question is: how can I get method B's result using clang alone (method A)?
>>> Or am I missing something here?
>>>
>>> I really appreciate any help and suggestions.
>>> Thanks
>>>
>>> Kuan-Hsu
>>>
>>> ------- file link -------
>>> foo.c: http://goo.gl/nVa2K0
>>> foo.s: http://goo.gl/ML9eNj
>>> foo.opt.s: http://goo.gl/31PCnf
>>>
>>>
>>>
>>
>>
>> --
>> Best regards,
>> Kuan-Hsu
>>
>>
>>
>>
>
>
>