[llvm-dev] GlobalISel design update and goals

Thu Aug 2 20:56:20 PDT 2018

Hi LLVM and HotSpot developers,

I have just spent 4+ months porting HotSpot C1 to MIPS, and my sincere 
thanks go to the HotSpot developers, who taught me a lot!

As an apprentice in the compiler world, I have some questions:

* Is there no instruction selection concept in HotSpot C1 equivalent to 
LLVM's SelectionDAG and GlobalISel? I ask because I wrote the assembly[1] 
by hand when lowering HIR to LIR in HotSpot C1. Which approach is better: 
LLVM's instruction selection or HotSpot's selection by hand?

* Why not use a greedy allocator, like LLVM's RegAllocGreedy, in place of 
Linear Scan[2] for HotSpot C1's register allocation?

Please teach me, thanks a lot!

1. 
http://hg.loongnix.org/jdk8-mips64-public/hotspot/file/tip/src/cpu/mips/vm/c1_LIRAssembler_mips.cpp#l1542

2. 
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2018-March/028545.html

On 2018-08-03 10:46, Leslie Zhai wrote:
> Hi Amara,
>
> Thanks for your great job!
>
> MIPS, RISC-V and other targets need some refactoring: 
> http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html
>
> Please give us some suggestions for supporting custom CCState and 
> CCAssignFn in D41700, and also RegisterBank in D41653, because it 
> needs to consider how to support the variable-sized register classes 
> concept implemented in D24631.
>
> I am building the Linux kernel and OpenJDK8 with the LLVM toolchain for mips64el:
>
> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124620.html
>
> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124717.html
>
> I am also migrating to GlobalISel and the Machine Scheduler for LoongISA: 
> http://lists.llvm.org/pipermail/llvm-dev/2018-May/123608.html
>
> My sincere thanks go to the LLVM, Linux kernel and OpenJDK developers, 
> who have taught me a lot!
>
>
> On 2018-07-30 22:01, Amara Emerson via llvm-dev wrote:
>> Hi all,
>>
>> Over the past few months we’ve been doing work on the foundations for 
>> the next stages of GlobalISel development. In terms of changes from 
>> this time last year, the IR translator, the legalizer, and 
>> instruction selector have seen moderate to major changes. The most 
>> significant of these was the change to the legalizer API, allowing 
>> targets to use predicates to express legality, which gives more 
>> precise control over what forms of instructions are legal, and how to 
>> legalize them. This was necessary to implement support for the new 
>> extending loads and truncating stores, but also results in more 
>> concise and elegant expressions of legality for each target. For 
>> example, you can now apply a single definition to multiple opcodes 
>> (G_ADD, G_SUB, G_MUL etc).
>>
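To make the predicate idea above concrete, here is a minimal sketch of a rule table where one predicate-based definition covers several opcodes. It is a toy model under stated assumptions: ToyLegalizerInfo, Query, Rule and makeToyTarget are hypothetical names invented for illustration and are not the LLVM legalizer API.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model only: the names mirror GlobalISel concepts but are NOT LLVM's API.
enum class Opcode { G_ADD, G_SUB, G_MUL, G_LOAD };
enum class Action { Legal, WidenScalar, NarrowScalar, Unsupported };

struct Query {
  Opcode Op;
  unsigned Bits; // scalar width of the result type
};

struct Rule {
  std::function<bool(const Query &)> Predicate; // when does the rule apply?
  Action Result;                                // what to do when it applies
};

class ToyLegalizerInfo {
  std::vector<std::pair<Opcode, Rule>> Rules;

public:
  // Register one definition for several opcodes at once.
  void addRule(const std::vector<Opcode> &Ops, const Rule &R) {
    for (Opcode Op : Ops)
      Rules.emplace_back(Op, R);
  }

  // The first rule whose opcode and predicate both match wins.
  Action getAction(const Query &Q) const {
    for (const auto &Entry : Rules)
      if (Entry.first == Q.Op && Entry.second.Predicate(Q))
        return Entry.second.Result;
    return Action::Unsupported;
  }
};

// Hypothetical target: 32- and 64-bit arithmetic is legal, narrower types
// are widened, wider types are narrowed.
ToyLegalizerInfo makeToyTarget() {
  ToyLegalizerInfo LI;
  const std::vector<Opcode> Arith{Opcode::G_ADD, Opcode::G_SUB, Opcode::G_MUL};
  LI.addRule(Arith, {[](const Query &Q) { return Q.Bits == 32 || Q.Bits == 64; },
                     Action::Legal});
  LI.addRule(Arith, {[](const Query &Q) { return Q.Bits < 32; },
                     Action::WidenScalar});
  LI.addRule(Arith, {[](const Query &Q) { return Q.Bits > 64; },
                     Action::NarrowScalar});
  return LI;
}
```

With this table, makeToyTarget().getAction(Query{Opcode::G_MUL, 8}) yields Action::WidenScalar, and G_LOAD has no rule, so it falls through to Action::Unsupported.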
>> The IR translator has been modified to split aggregates rather than 
>> handling them as one single large scalar. This change fixed some bugs 
>> and was necessary in order to handle big endian code correctly in future.
>>
>> The tablegen instruction selector also saw significant improvements 
>> in performance, helping to keep the overall compile time regression 
>> vs FastISel to <5% geomean on CTMark. There are still a few outliers 
>> like sqlite3 which has a significant regression compared to FastISel, 
>> but most of the other benchmarks show little difference or even 
>> improvement.
>>
>> The tablegen importer has had improvements made to it, so that we can 
>> import more SelectionDAG selection rules. For example, currently on 
>> AArch64 we have about 40% of the rules being successfully imported.
>>
>> New additions from last year include the beginnings of a new 
>> combiner, although there’s still significant work to be done here in 
>> terms of the final design. The combiner will become a critical part 
>> of the pipeline in order to begin improving runtime performance.
>>
>> *High level goals*
>>
>> Going forward, we plan to improve GlobalISel in a number of key areas 
>> to achieve the following targets:
>>  * Keeping compile time under control, ideally within 5% of FastISel, 
>> and when optimizations are enabled to maintain a compile time 
>> advantage over SelectionDAG.
>>  * Begin improving runtime performance by adding the most important 
>> optimizations required to be competitive at -Os. We will be targeting 
>> and measuring AArch64 for this goal but will endeavor to implement as 
>> many optimizations as possible in generic code to benefit other targets.
>>  * Improving overall stability and test coverage. Maintaining a high 
>> level of code quality and minimizing regressions in correctness and 
>> performance will be a significant challenge.
>>  * Ensure that the overall design meets the needs of general targets, 
>> not being overly tuned to a specific implementation.
>>
>> *Design work planned*
>>
>> These are some design changes coming in the near to medium term future:
>>
>>  * The G_MERGE and G_UNMERGE opcodes will be split into separate 
>> opcodes to handle different use cases. At the moment the opcode is 
>> too powerful, resulting in overly complex handling in places like the 
>> legalizer. G_MERGE will be split so that it only handles merging of 
>> scalars into one larger scalar. For other cases like merging scalars 
>> into a vector we will create a new G_BUILD_VECTOR opcode, with a new 
>> counterpart opcode for doing the opposite. For the current vector + 
>> vector case a new G_CONCAT_VECTOR will be introduced. These 
>> changes should simplify implementations for all targets.
>>
>>  * Constant representation at the MI level needs some investigation. 
>> We currently represent constants as generic instructions, with each 
>> instance of a constant being largely independent of the others and 
>> placed in the entry block, except for a few places in IR translation 
>> where we emit at the point of use. As a result we run a localizer 
>> pass in an effort to reduce the live ranges of the constants (and the 
>> consequent spilling), using some heuristics to decide where to sink 
>> the constant definitions to.
>>
>> Since we don’t do any real caching of MI constants, multiple 
>> G_CONSTANT definitions can exist for the same constant. This can also 
>> result in a lot of redundant constants being created, especially for 
>> things like address computation. Reducing the number of constants can 
>> help reduce compile time and memory usage. Given this situation, one 
>> possible approach is to encode constants into the operands of the 
>> users, rather than have dedicated machine instructions. At 
>> instruction selection time the constant can then be materialized into 
>> a register or encoded as an immediate. Further investigation is 
>> needed to find the right way forward here.
>>
>>  * For optimizations to be supported, the combiner will become a 
>> crucial part of the GISel pipeline. We have already done some 
>> preliminary work in a generic combiner, which will be used to 
>> eventually support combines of extloads/truncstores. We’ve had 
>> discussions on and off list about what we need from the new combiner. 
>> The summary is that we want the combiner to be flexible enough for 
>> each target to select from a library of combines, while being as 
>> efficient as possible. The combines are currently written in C++, 
>> but one piece of investigation work we might do is to prototype 
>> using the same tablegen driven instruction selector code to match 
>> declarative combine patterns written in tablegen. Regardless, we will 
>> need to support the custom C++ use case.
>>
>>  * CSE throughout the pipeline. From a theoretical perspective, 
>> having a self contained CSE pass that operates as a single phase in 
>> the pipeline is attractive for the simplicity and elegance. However, 
>> we know empirically that this is expensive in compile time. Not only 
>> does the CSE pass itself take a non-negligible time to run, but 
>> having it as a late pass can result in the non-CSE’d code from the 
>> IRTranslator onwards surviving for a long time, taking up time in 
>> analysis at each stage of compilation. We believe running a light 
>> weight CSE early is a win. SelectionDAG currently does CSE by default 
>> when building the DAG, and this is something we could explore as part 
>> of a custom IRBuilder.
>>
>>  * Known bits computation. Some optimizations require the knowledge 
>> of which bits in a value are known to be 1 or 0, and do this by using 
>> the computeKnownBits() capability for SelectionDAG nodes. We will 
>> need some way of getting the same information. In an ideal scenario 
>> the replacement infrastructure for this will be more efficient, as 
>> this part of the codebase seems to be disproportionately responsible 
>> for pathological compile time regressions.
>>
>>  * Load/store ordering needs some thought, as we currently don’t have 
>> a way to easily check at the MI level what the ordering requirements 
>> are on a set of memory operations. SelectionDAG uses the chains to 
>> ensure that they’re scheduled to respect the orderings. How to 
>> achieve the same thing remains an open question for GlobalISel.
>>
>>  * More extensive tests that exercise multiple stages of the 
>> pipeline. One advantage of using MIR with GISel is that individual 
>> passes can be easily tested by feeding the exact input expected for a 
>> particular pass, and checking the immediate output of the pass. 
>> However this approach can leave holes in the test coverage. To help 
>> mitigate this, we will be exploring writing/generating whole pipeline 
>> tests, tracking some IR through each pass and checking how the MIR is 
>> mutated. We currently also have a proposed change to allow usage of 
>> FileCheck as a library, not just as a stand-alone tool. This would 
>> allow us to use FileCheck style checks and improve testing of 
>> currently unused code paths.
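
As an illustration of the G_MERGE/G_UNMERGE split described in the first design item above, the three separated use cases can be modelled on plain values. This is only a sketch of the intended semantics: the helper names are hypothetical, and it assumes the first merge operand holds the least significant bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy semantics only; these helpers are hypothetical, not LLVM functions.

// Scalar + scalar -> one wider scalar (the narrowed G_MERGE case).
// Assumption: the first operand supplies the least significant bits.
uint64_t mergeValues(uint32_t Lo, uint32_t Hi) {
  return (static_cast<uint64_t>(Hi) << 32) | Lo;
}

// The opposite direction: split one wide scalar into two narrower ones.
void unmergeValues(uint64_t Wide, uint32_t &Lo, uint32_t &Hi) {
  Lo = static_cast<uint32_t>(Wide);
  Hi = static_cast<uint32_t>(Wide >> 32);
}

// Scalars -> vector (the proposed G_BUILD_VECTOR case).
std::vector<uint32_t> buildVector(uint32_t A, uint32_t B) {
  return std::vector<uint32_t>{A, B};
}

// Vector + vector -> longer vector (the proposed G_CONCAT_VECTOR case).
std::vector<uint32_t> concatVectors(const std::vector<uint32_t> &A,
                                    const std::vector<uint32_t> &B) {
  std::vector<uint32_t> Result(A);
  Result.insert(Result.end(), B.begin(), B.end());
  return Result;
}
```

Keeping the three cases as separate functions mirrors the motivation in the proposal: each one has a single input/output shape, so code handling it never needs to ask whether it is looking at scalars or vectors.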
>>
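The lightweight CSE idea from the design list (deduplicating at build time, which also addresses the duplicated G_CONSTANT definitions) could look roughly like the sketch below. CSEBuilder and ToyInstr are hypothetical names for illustration, not LLVM's MachineIRBuilder, and the sketch assumes every opcode passed in is free of side effects.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model: a hypothetical builder, not LLVM's MachineIRBuilder.
// Operands are plain integers: an immediate for G_CONSTANT, otherwise the
// virtual register numbers of earlier instructions.
struct ToyInstr {
  std::string Opcode;
  std::vector<int> Operands;
};

class CSEBuilder {
  using Key = std::pair<std::string, std::vector<int>>;
  std::map<Key, int> Seen;      // (opcode, operands) -> defining vreg
  std::vector<ToyInstr> Instrs; // index doubles as the vreg number

public:
  // Returns the vreg defined by the instruction, reusing an identical
  // earlier instruction when one exists. Only valid for side-effect-free
  // opcodes (an assumption of this sketch).
  int build(const std::string &Opcode, const std::vector<int> &Operands) {
    Key K(Opcode, Operands);
    auto It = Seen.find(K);
    if (It != Seen.end())
      return It->second; // CSE hit: no new instruction is emitted
    int VReg = static_cast<int>(Instrs.size());
    Instrs.push_back(ToyInstr{Opcode, Operands});
    Seen.emplace(K, VReg);
    return VReg;
  }

  std::size_t numInstrs() const { return Instrs.size(); }
};
```

Because duplicates are never emitted, later passes see less code, which is the compile time argument made for doing CSE early rather than as a late standalone pass. A real implementation would also have to respect dominance and block boundaries, which this sketch ignores.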
>>
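For the known bits item in the design list, the kind of information involved can be shown with a small 32-bit model. ToyKnownBits and its helper functions are hypothetical and written only for illustration; they are neither llvm::KnownBits nor the computeKnownBits() implementation.

```cpp
#include <cassert>
#include <cstdint>

// Toy 32-bit known-bits lattice; hypothetical, not llvm::KnownBits.
struct ToyKnownBits {
  uint32_t Zero; // bits known to be 0
  uint32_t One;  // bits known to be 1
};

// Every bit of a constant is known.
ToyKnownBits knownConstant(uint32_t C) { return ToyKnownBits{~C, C}; }

// Nothing is known about an arbitrary value.
ToyKnownBits knownNothing() { return ToyKnownBits{0u, 0u}; }

// AND: a result bit is 0 if either input bit is 0, and 1 only if both are 1.
ToyKnownBits knownAnd(ToyKnownBits A, ToyKnownBits B) {
  return ToyKnownBits{A.Zero | B.Zero, A.One & B.One};
}

// OR: a result bit is 1 if either input bit is 1, and 0 only if both are 0.
ToyKnownBits knownOr(ToyKnownBits A, ToyKnownBits B) {
  return ToyKnownBits{A.Zero & B.Zero, A.One | B.One};
}

// Left shift by a constant: known bits move up, vacated low bits become 0.
ToyKnownBits knownShl(ToyKnownBits A, unsigned Amt) {
  assert(Amt < 32 && "shift amount out of range");
  return ToyKnownBits{(A.Zero << Amt) | ((1u << Amt) - 1u), A.One << Amt};
}
```

For example, for an unknown x, knownShl(knownAnd(knownNothing(), knownConstant(0xFF)), 4) reports that only bits 4 to 11 of (x & 0xFF) << 4 can be set, which is the sort of fact a combine might use to drop a redundant mask.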
>> *Roadmap for enabling optimizations*
>>
>> I’ve filed a few PRs that people can follow or comment on to track 
>> the progress towards enabling the -Os optimization level. The rough 
>> outline is:
>>
>> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or 
>> benchmarks (Darwin)
>> PR 38366 - GlobalISel: Lightweight CSE
>> PR 32561 - GlobalISel: placement of constants in the entry-block and 
>> fast regalloc result in lots of reloaded constant
>> PR 38367 - GlobalISel: Implement support for obtaining known bits 
>> information
>> PR 38368 - GlobalISel: Investigate an efficient way to ensure 
>> load/store orderings
>>
>> These, along with general design and implementation work on the 
>> combiner, will then lead onto a long road of performance analysis, 
>> inevitable bug fixing, and implementing more optimizations.
>>
>> If anyone is interested in discussing in more detail, feel free to 
>> reach out on the list, or to any of the GlobalISel developers. We’d 
>> especially like to hear about any issues or concerns about porting 
>> targets to GlobalISel.
>>
>> Thanks,
>> Amara
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 
Regards,
Leslie Zhai