[llvm-dev] GlobalISel design update and goals

Tue Jul 31 08:40:16 PDT 2018

> On 30 Jul 2018, at 16:04, Quentin Colombet via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Hi Amara,
> 
> Thanks for sharing the plan going forward.
> 
> Inlined a couple of comments.
> 
> 2018-07-30 7:01 GMT-07:00 Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>:
>> Hi all,
>> 
>> Over the past few months we’ve been doing work on the foundations for the
>> next stages of GlobalISel development. In terms of changes from this time
>> last year, the IR translator, the legalizer, and instruction selector have
>> seen moderate to major changes. The most significant of these was the change
>> to the legalizer API, allowing targets to use predicates to express
>> legality, which gives more precise control over what forms of instructions
>> are legal, and how to legalize them. This was necessary to implement support
>> for the new extending loads and truncating stores, but also results in more
>> concise and elegant expressions of legality for each target. For example,
>> you can now apple a single definition to apply to multiples opcodes (G_ADD,
>> G_SUB, G_MUL etc).
>> 
>> The IR translator has been modified to split aggregates rather than handling
>> them as one single large scalar. This change fixed some bugs and was
>> necessary in order handle big endian code correctly in future.
>> 
>> The tablegen instruction selector also saw significant improvements in
>> performance, helping to keep overall compile time regression vs fastisel to
>> be <5% geomean on CTMark. There are still a few outliers like sqlite3 which
>> has a significant regression compared to FastISel, but most of the other
>> benchmarks show little difference or even improvement.
>> 
>> The tablegen importer has had improvements made to it, so that we can import
>> more SelectionDAG selection rules. For example, currently on AArch64 we have
>> about 40% of the rules being successfully imported.
>> 
>> New additions from last year include the beginnings of a new combiner,
>> although there’s still significant work to be done here in terms of the
>> final design. The combiner will become a critical part of the pipeline in
>> order to begin improving runtime performance.
>> 
>> High levels goals
>> 
>> Going forward, we plan to improve GlobalISel in a number of key areas to
>> achieve the following targets:
>> * Keeping compile time under control, ideally within 5% of FastISel, and
>> when optimizations are enabled to maintain a compile time advantage of
>> SelectionDAG.
>> * Begin improving runtime performance by adding the most important
>> optimizations required to be competitive at -Os. We will be targeting and
>> measuring AArch64 for this goal but will endeavor to implement as many
>> optimizations as possible in generic code to benefit other targets.
>> * Improving overall stability and test coverage. Maintaining a high level
>> of code quality and minimizing regressions in correctness and performance
>> will be a significant challenge.
>> * Ensure that the overall design meets the needs of general targets, not
>> being overly tuned to a specific implementation.
>> 
>> Design work planned
>> 
>> These are some design changes coming in the near to medium term future:
>> 
>> * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to
>> handle different use cases. At the moment the opcode is too powerful,
>> resulting in overly complex handling in places like the legalizer. G_MERGE
>> will be split so that it only handles merging of scalars into one larger
>> scalar. For other cases like merging scalars into a vector we will create a
>> new G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the
>> opposite. For the current vector + vector case a new G_CONCAT_VECTOR will be
>> introduced. With these changes it should simplify implementations for all
>> targets.
>> 
>> * Constant representation at the MI level needs some investigation. We
>> currently represent constants as generic instructions, with each instance of
>> a constant being largely independent of each other, being stored in the
>> entry block except for a few places in IR translation where we emit at the
>> point of use. As a result we run a localizer pass in an effort to reduce the
>> live ranges of the constants (and the consequent spilling), using some
>> heuristics to decide where to sink the constant definitions to.
>> 
>> Since we don’t do any real caching of MI constants, multiple G_CONSTANT
>> definitions can exist for the same constant. This can also result in a lot
>> of redundant constants being created, especially for things like address
>> computation. Reducing the number of constants can help reduce compile time
>> and memory usage. Given this situation, one possible approach is to encode
>> constants into the operands of the users, rather than have dedicated machine
>> instructions. At instruction selection time the constant can then be
>> materialized into a register or encoded as an immediate. Further
>> investigation is needed to find the right way forward here.
> 
> The initial design was to not have constant in machine operands. The
> main rational is materializing a constant may be expensive, so we
> better not sprinkle it around.
> Now, in practice, this indeed means that we need to keep a table of
> all the constants created so that we don't start to duplicate them,
> which we fail to do. That should be easy to fix that just a map
> virtual register + type to constant that could be kept at the function
> level (or even module level). Better yet, this could be handled by the
> IRBuilder. E.g., when instantiating a IR builder, it could scan the
> function to see which constants are already there and build this
> mapping and then, create only the constants that are missing.
> 
> Moreover, another the advantage of this model is that optimizations
> don't have to deal with two variants of the same instruction (ADDr and
> ADDi), same for patterns. Alternatively, if we don't change the
> opcode, but only the MachineOperands, then every optimization has to
> deal with two different kind of opcodes. Which is bad IMHO.
> 
> Also, this design was meant to absorb what the constant hoisting pass
> does on the IR, so that we can kill that pass while having a better
> cost model.
> 
> Finally, regarding the localizer pass, this was meant as a workaround
> for the fast regalloc problem and should be killed asap. In
> particular, the main motivation was to avoid code size bloat, but
> AFAIR during our last measurements with Ahmed, we only saved a few %
> on a handful of benchmarks, so maybe we can just kill it.
> 
>> 
>> * For optimizations to be supported, the combiner will become a crucial
>> part of the GISel pipeline. We have already done some preliminary work in a
>> generic combiner, which will be used to eventually support combines of
>> extloads/truncstores. We’ve had discussions on and off list about what we
>> need from the new combiner. The summary is that we want the combiner to be
>> flexible for each target to select from a library of combines, being as
>> efficient as possible. The expression of the combines are currently written
>> in C++, but one piece of investigation work we might do is to prototype
>> using the same tablegen driven instruction selector code to match
>> declarative combine patterns written in tablegen. Regardless, we will need
>> to support the custom C++ use case.
>> 
>> * CSE throughout the pipeline. From a theoretical perspective, having a
>> self contained CSE pass that operates as a single phase in the pipeline is
>> attractive for the simplicity and elegance. However, we know empirically
>> that this is expensive in compile time. Not only does the CSE pass itself
>> take a non-negligible time to run, but having it as a late pass can result
>> in the non-CSE’d code from the IRTranslator onwards surviving for a long
>> time, taking up time in analysis at each stage of compilation. We believe
>> running a light weight CSE early is a win. SelectionDAG currently does CSE
>> by default when building the DAG, and this is something we could explore as
>> part of a custom IRBuilder.
> 
> I have been pushing for having the IRBuilder being smarter. Having
> this doing CSE was something I wanted we try.
> 
>> 
>> * Known bits computation. Some optimizations require the knowledge of which
>> bits in a value are known to be 1 or 0, and do this by using the
>> computeKnownBits() capability for SelectionDAG nodes. We will need some way
>> of getting the same information. In an ideal scenario the replacement
>> infrastructure for this will be more efficient, as this part of the codebase
>> seems to be disproportionately responsible for pathological compile time
>> regressions.
>> 
>> * Load/store ordering needs some thought, as we currently don’t have a way
>> to easily check at the MI level what the ordering requirements are on a set
>> of memory operations. SelectionDAG uses the chains to ensure that they’re
>> scheduled to respect the orderings. How to achieve the same thing remains an
>> open question for GlobalISel.
> 
> I don't get this problem.
> GISel has a sequential IR, the order is already modeled here.

%t1 = load %addr1 :: (load 4 from @a)
%t2 = load %addr2 :: (load 4 from @a + 4)
store %t1 %addr3 :: (store 4 to @b)
store %t2 %addr4 :: (store 4 to @b + 4)

MIR specifies the order:
	load @a -> load @a+4 -> store @b -> store @b+4
but the dependencies are actually:
	load @a -> store @b
	load @a+4 -> store @b+4

We only have a representation of the former at the moment and have to re-calculate the latter each time we want to use that information to sink/hoist/fold/etc. instructions. We need to build a means of preserving that dependency graph within a pass and between passes.

> Dominance should give us the relative ordering, then if we want more,
> we should do exactly what we do at the IR level (alias analysis, etc.)
> 
> Cheers,
> -Quentin
>> * More extensive tests that exercise multiple stages of the pipeline. One
>> advantage of using MIR with GISel is that individual passes can be easily
>> tested by feeding the exact input expected for a particular pass, and
>> checking the immediate output of the pass. However this approach can leave
>> holes in the test coverage. To help mitigate this, we will be exploring
>> writing/generating whole pipeline tests, tracking some IR through each pass
>> and checking how the MIR is mutated. We currently also have a proposed
>> change to allow usage of FileCheck as a library, not just as a stand-alone
>> tool. This would allow us to use FileCheck style checks and Improve testing
>> of currently unused code paths.
>> 
>> 
>> Roadmap for enabling optimizations
>> 
>> I’ve filed a few PRs that people can follow or comment on to track the
>> progress towards enabling the -Os optimization level. The rough outline is:
>> 
>> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks
>> (Darwin)
>> PR 38366 - GlobalISel: Lightweight CSE
>> PR 32561 - GlobalISel: placement of constants in the entry-block and fast
>> regalloc result in lots of reloaded constant
>> PR 38367 - GlobalISel: Implement support for obtaining known bits
>> information
>> PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store
>> orderings
>> 
>> These, along with general design and implementation work on the combiner,
>> will then lead onto a long road of performance analysis, inevitable bug
>> fixing, and implementing more optimizations.
>> 
>> If anyone is interested in discussing in more detail, feel free to reach out
>> on the list, or to any of the GlobalISel developers. We’d especially like to
>> hear about any issues or concerns about porting targets to GlobalISel.
>> 
>> Thanks,
>> Amara
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180731/e482f555/attachment.html>