[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!
Eric Christopher via llvm-dev
llvm-dev at lists.llvm.org
Tue May 9 11:55:01 PDT 2017
Hi Quentin,
On Tue, May 9, 2017 at 11:47 AM Quentin Colombet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi Kristof,
>
> On May 9, 2017, at 3:41 AM, Kristof Beyls <kristof.beyls at arm.com> wrote:
>
> Great Quentin :).
>
> I've rerun the benchmarks comparing "-O0 -g" with "-O0 -g -mllvm
> -global-isel -mllvm -global-isel-abort=2" on r302453, on AArch64 Cortex-A57.
> I indeed see almost no moves between GPR and FPR registers anymore (see
> details below for where I still see some).
> On geomean, I see 13% slow down (down from 17% on my previous run).
> On geomean, code size increase is about 3% (down from 11% on my previous
> run).
> I also checked debug info quality with the DIVA tool on the test-suite;
> and asked one of our DS-5 validation guys to test debug experience for
> GlobalISel in combination with the ARM DS-5 Development Studio
> <https://developer.arm.com/products/software-development-tools/ds-5-development-studio> debugger.
> Both of us didn't find any issues with the debug info produced at -O0 -g
> with GlobalISel.
> I didn't check compile time, nor do I intend to check that as I don't have
> a good infrastructure set up for that.
>
> Overall, my impression is that GlobalISel is ready to be enabled by
> default for AArch64 at -O0, if others also believe it's ready.
>
>
> Sounds good to me!
>
> Renato, Diana, how does that sound?
>
I'd definitely like to give this a run through on my side as well before
you flip the switch. I'll try to do that this week.
-eric
>
>
> Thanks,
>
> Kristof
>
>
>
> Detailed observations:
>
>
> 1. Debug info analysis with DIVA
> <https://github.com/SNSystems/DIVA/blob/master/linux/doc/DIVA_user_guide.pdf>
> :
> I used DIVA options "-a --show-generated --show-level --show-location"
> to dump and diff debug info differences for the test-suite for "-O0 -g"
> comparing GlobalISel with non-GlobalISel.
> The only difference I found was on MultiSource/Benchmarks/tramp3d-v4:
> 001 37577 {Function} "operator*<double>" -> "Zero<double>"
> - No declaration
> - Template
> 002 {TemplateParameter} "T" <- "double"
> - 002 37577 {Parameter} <- "Zero<double>"
> 002 37577 {Parameter} <- "const double &"
> Which indicates that for non-GlobalISel, the Zero<double> parameter
> isn't present in the debug info for the instantiation of the below template
> function with T=double:
> template<class T>
> inline Zero<T> operator*(Zero<T>, const T&) { return Zero<T>(); }
> In other words, the debug info looks ever so slightly better with
> GlobalISel.
> 2. Quick analysis of reasons for slow down for the 10 programs that
> regress the most when enabling GlobalISel at -O0:
> - SingleSource/Benchmarks/BenchmarkGame/nsieve-bits (114%
> slowdown): no conversion of divide by power-of-2 to shift right. I think
> this is an improvement for -O0: no such optimization should be done when
> not requesting optimizations.
> - MultiSource/Benchmarks/MallocBench/gs/gs (88%): "interp"
> function: switch statement lowered as
> cascaded-sequence-of-conditional-branches.
> Same issue causes MultiSource/Applications/sqlite3/sqlite3 (71%).
> Same issue causes MultiSource/Applications/lua/lua (46%).
> - SingleSource/Benchmarks/Misc/flops-2 (75%): Poor lowering of
> fneg:
> - FastISel:
> ldur d0, [x29,#-16]
> fneg d0, d0
> stur d0, [x29,#-16]
> - GlobalISel:
> ldur d0, [x29,#-64]
> orr x8, xzr, #0x8000000000000000
> fmov d1, x8
> fsub d0, d1, d0
> fmov x8, d0
> stur x8, [x29,#-64]
> - MultiSource/Benchmarks/Prolangs-C++/city/city (74%): a call to
> memcpy for copying 4 bytes is present with GlobalISel that isn't present
> with FastISel, in function vehicle::select_move().
> Same issue causes
> SingleSource/Benchmarks/Shootout-C++/Shootout-C++-moments (58.5%)
> - MultiSource/Applications/siod/siod (72%): Seems to be mainly due
> to loading constants in the entry BB, but I'm not sure that indeed is the
> biggest cause. (https://bugs.llvm.org//show_bug.cgi?id=32561)
> - MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset
> (48%): Due to creating all constants in the entry BB, see function
> CopyObject. (https://bugs.llvm.org//show_bug.cgi?id=32561)
> - MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode
> (46%): Function Reference_IDCT: Probably due to creating all constants in
> the entry BB + spilling floating point data through an X register:
> - FastISel:
> fadd d0, d1, d0
> str d0, [sp,#528]
> - GlobalISel:
> fadd d0, d1, d0
> fmov x9, d0
> stur x9, [x29,#-48]
>
>
> Good finding, I forgot to do stores in my previous fix. I’ll do them
> shortly.
>
>
> 1.
> -
> 2. Other overall impression when comparing code generation: The code
> size increase is probably mainly due to creating constants always in the
> entry BB of the function. For functions with many constants, this leads to
> lots and lots of constant creation and then immediately spilling it to the
> stack. (https://bugs.llvm.org//show_bug.cgi?id=32561)
>
>
> About a month (r299283), I mistakenly committed a prototype I was working
> on to solve this kind of issue. I reverted it in r299287, but you can
> give it a try.
>
> Basically, that’s an additional generic pass that runs only at O0. It
> localizes (as in basic block local) the definition of the constants to
> workaround the limitations of O0 regalloc.
>
> Cheers,
> -Quentin
>
>
>
> On 8 May 2017, at 20:38, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Hi Kristof,
>
> I made a fix for PR32550, that is much less involved in r302453.
>
> In particular, it does not rely on the greedy mode of the regbankselect
> pass, thus there is no compile time implication.
>
> Could you try it and check where we stand?
>
> Thanks,
> -Quentin
>
> On Apr 27, 2017, at 10:40 AM, Quentin Colombet via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hi Kristof,
>
> On Apr 27, 2017, at 9:47 AM, Kristof Beyls <kristof.beyls at arm.com> wrote:
>
> Hi Quentin,
>
> On 27 Apr 2017, at 00:48, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Hi Kristof,
>
> On Apr 6, 2017, at 6:53 AM, Kristof Beyls <kristof.beyls at arm.com> wrote:
>
> I've been digging a little bit deeper into the biggest performance
> regressions I've observed.
>
> What I've observed so far is:
> * A lot of the biggest regressions are caused by unnecessarily moving
> floating point values through general purpose registers. I've raised
> http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one
> definitely needs fixing before enabling GlobalISel by default at -O0.
>
>
> I commented in the PR. This is a known problem and we have a solution.
> Given this is an optimization in the sense that it does not affect the
> correctness of the program, we didn’t push for fixing it now.
>
> For O0 we wanted to focus ourselves on generating correct code. Unless the
> regressions you are seeing are preventing debugging/running of the program,
> I wouldn’t block the flip of the switch on that.
>
> What do you think?
>
>
> For O0, I think most users care about fast compile time and an excellent
> debug experience.
> Next to that, I am told that some users also use -O0 to have a
> straightforward, simple, mapping between source code and the generated
> assembly.
>
> Out of the issues I've seen so far, I think this is the single
> "optimization" issue that I feel gives a negative first impression of
> GlobalISel.
> If users look at the generated assembly for floating point code it looks
> more like active de-optimization rather than "no optimization".
> My guess is also that this may be the biggest reason for the about 20%
> performance slow-down and 11% code size increase I measured earlier.
> OTOH, I clearly agree this is an optimization issue, not a correctness
> issue.
> Combining the above, I think it would be best if this issue got fixed
> before we turn on GlobalISel by default at -O0, or we may end up hearing
> quite a few (non-critical) complaints about this from users.
> Basically I think this is a tradeoff between giving a better first
> impression of GlobalISel vs getting more people to use and test it earlier.
>
> Thanks for the write-up on the PR, that is very useful.
> Do you have any rough idea how much effort would be involved in getting
> this fixed?
>
>
> I’d say 2-3 weeks. Could actually be shorter if we don’t do all the
> refactoring I have in mind to make the table for the alternative mappings
> smaller, but I don’t think it is worth taking any shortcut here.
>
> I got the impression Daniel is making good progress on the tablegen
> generation, which is key to getting this issue fixed?
>
>
> To be accurate, Daniel’s work on table gen is for the select phase, not
> regbankselect. In other words, right now, all the table for mappings for
> regbankselect are hand written and Daniel’s work is not changing that. The
> (weak) rationale is that it is not on the critical path :).
>
> I think that no matter whether we decide to switch the default before or
> after fixing this issue, this is one of the most urgent issues to fix as
> far as I can see.
>
>
> Agree. This is one of the top items of my todo list.
>
>
> If there is some way I can help contribute to fixing PR32550, I would like
> to help; but with the dependency on tablegen generation, I'm not sure what
> the best way is to help make that PR get fixed faster?
>
>
> Thanks for offering to help. I agree that with the dependency on the
> tabelgen generation in the way, this is probably not a good use of your
> time. Depending when I can spare some time on this, I’ll either do it or
> explain the refactoring in the google doc.
>
> In the meantime, I’d suggest to focus on validating the debug info on your
> side.
>
>
>
> * FastISel seems to transform division-by-constant-power-of-2 into right
> shift (see
> https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468).
> GlobalISel does not. It seems to me that at -O0 there may be reasons not
> perform this transformation, but maybe there is a good reason why FastISel
> does this?
>
>
> I think FastISel tries to generate the best code it can no matter what.
> For GISel O0 however, not doing this optimization sounds sensible to me.
> Now, I would say that the same remark as the previous bullet point apply:
> we shouldn’t do it unless it gets in the way of running/debugging the
> program.
>
>
> I agree that these optimizations should not be done at -O0. I think not
> doing them is actually an improvement: you give the user what they asked,
> i.e. "no optimization", and an as-straigthforward-as-possible mapping from
> source to assembly.
>
> * FastISel doesn’t\ seem to handle functions with switch statements, so it
> falls back to DAGISel. DAGISel produces code that's a lot better than
> GlobalISel for switch statement at -O0. I'm not sure if we need to do
> something here before enabling GlobalISel by default. I'm thinking we may
> need to add a smarter way to lower switch statements rather than just a
> cascaded sequence of conditional branches.
>
>
> Sounds optimization-ish to me. Same remark.
>
>
> Agreed, optimization-ish. In comparison to the above point on FastISel
> "peepholes", I think that lowering switch statements to something else than
> a cascaded sequence of conditional branches doesn't make the generated code
> harder to map to the source. So, in comparison to the above point on
> FastISel "peepholes", I'm not actively against being smarter about lowering
> switch statements at -O0.
> But I agree, this shouldn't hold up turning on GlobalISel by default at
> -O0.
>
>
> Other than the above remarks, before turning on GlobalISel by default,
> we'd better test/verify that debug info quality remains good.
> I haven't looked into that at all, but am hoping to start looking into
> that soon, with the help of the DIVA tool (
> https://github.com/SNSystems/DIVA) presented at last EuroLLVM (
> http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf).
> I don't recall anyone so far making any statements about the quality of the
> debug info they've measured or experienced with GlobalISel enabled?
>
>
> We ran the lldb test suite with GISel. IIRC, Tim has the details, GISel
> was on part with SDISel.
>
>
> Other than the all of the above, I just wanted to mention that Oliver
> recently started running csmith on AArch64 GlobalISel and found one issue
> so far that already got fixed (
> https://bugs.llvm.org//show_bug.cgi?id=32733).
> If he finds other correctness issues, I'm sure he'll keep on reporting
> them.
>
>
> Great!
>
> Thanks,
> -Quentin
>
>
> I'll try to add the above content to the document Diana created at
> https://goo.gl/IS2Bdw too.
>
> Thanks,
>
> Kristof
>
>
>
> On 3 Apr 2017, at 17:10, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>
> I've kicked off a run to compare "-O0 -g" versus "-O0 -g -mllvm
> -global-isel -mllvm -global-isel-abort=2".
> I've selected the test-suite (albeit a version which is a couple of months
> old now) and a few short-running proprietary benchmarks to get data back
> quickly for an initial feel of where things are.
> This was running on Cortex-A57 AArch64 Linux.
>
> I saw one assertion failure in GlobalISel, see
> http://bugs.llvm.org/show_bug.cgi?id=32471. This is in a program compiled
> at -O2 (my out-dated test-suite still overrides -O0 and instead uses -O for
> that program). The root cause of the failure seems to be due to
> LowLevelType not supporting vectors of pointers. I think this demonstrates
> that for correctness, we should be trying to test more than -O0, or even
> more than just LLVM-IR produced by clang, as other front-ends could run
> into this even at -O0.
>
> Due to this assertion failure and the infrastructure I used, the numbers
> below do not include test-suite/MultiSource/Benchmarks results.
>
> On the non-correctness aspects, LNT tells me that:
> - The programs that report execution time, on geomean are about 17% slower.
> - The programs that report scores, on geomean are about 21% slower.
> - Code size is up on geomean about 11%.
> I'm afraid I don't have compile time numbers, nor any feel for debug info
> quality.
>
> I'll need quite a bit more time to dig into the details to come up with
> something actionable, although the fact that LowLevelType doesn't support
> vectors of pointers is already actionable.
> Nevertheless, I thought to share what I see as is, to see if others see
> similar results so far.
>
> I thought Diana was going to look into fallback rate on the test-suite on
> AArch64 linux?
>
> Thanks,
>
> Kristof
>
> On 30 Mar 2017, at 10:54, Renato Golin <renato.golin at linaro.org> wrote:
>
> On 30 March 2017 at 00:27, Quentin Colombet <qcolombet at apple.com> wrote:
>
> On iOS we are at 100% pass rate in 00 g for the LLVM test suite, standard
> benchmarks and unit tests. In about 5% of all functions GlobalIsel falls
> back to SDIsel.
> (Kristof Beyls would have the linux numbers.)
> The self host compiler correctly builds and runs the LLVM test suite in O0.
>
>
> Having done no tests at all on my side, I think we need to have
> similar numbers on Linux to be able to flip across the board.
>
> I don't want to flip it only for Darwin and not Linux, as that will
> fragment the effort too much.
>
> I'll check with Diana and Kristof to know what's the best way forward,
> but it should be reasonably quick.
>
> cheers,
> --renato
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170509/70a9ed5b/attachment-0001.html>
More information about the llvm-dev
mailing list