[llvm-dev] GlobalISel design update and goals
Quentin Colombet via llvm-dev
llvm-dev at lists.llvm.org
Tue Jul 31 08:29:20 PDT 2018
On Tue, Jul 31, 2018 at 08:18, Amara Emerson <aemerson at apple.com> wrote:
> Hi Quentin,
>
> On Jul 31, 2018, at 12:04 AM, Quentin Colombet <quentin.colombet at gmail.com>
> wrote:
>
> Hi Amara,
>
> Thanks for sharing the plan going forward.
>
> Inlined a couple of comments.
>
> 2018-07-30 7:01 GMT-07:00 Amara Emerson via llvm-dev <
> llvm-dev at lists.llvm.org>:
>
> Hi all,
>
>
>
> Over the past few months we’ve been doing work on the foundations for the
> next stages of GlobalISel development. In terms of changes from this time
> last year, the IR translator, the legalizer, and instruction selector have
> seen moderate to major changes. The most significant of these was the
> change
> to the legalizer API, allowing targets to use predicates to express
> legality, which gives more precise control over what forms of instructions
> are legal, and how to legalize them. This was necessary to implement
> support
> for the new extending loads and truncating stores, but also results in more
> concise and elegant expressions of legality for each target. For example,
> you can now write a single definition that applies to multiple opcodes (G_ADD,
> G_SUB, G_MUL, etc.).
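>
> As an illustration, a minimal sketch of such a rule, in the style of a
> target's LegalizerInfo constructor (the opcodes and types chosen here are
> only examples, not taken from any particular target):
>
>   // Assumes we are inside a LegalizerInfo subclass's constructor and that
>   // `using namespace TargetOpcode;` is in effect, as targets usually do.
>   const LLT s32 = LLT::scalar(32);
>   const LLT s64 = LLT::scalar(64);
>   getActionDefinitionsBuilder({G_ADD, G_SUB, G_MUL})
>       .legalFor({s32, s64})        // one rule covers several opcodes
>       .clampScalar(0, s32, s64);   // everything else is widened/narrowed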
>
> The IR translator has been modified to split aggregates rather than
> handling
> them as one single large scalar. This change fixed some bugs and was
> necessary in order to handle big-endian code correctly in the future.
>
> The tablegen instruction selector also saw significant improvements in
> performance, helping to keep the overall compile time regression vs. FastISel
> under 5% geomean on CTMark. There are still a few outliers like sqlite3, which
> has a significant regression compared to FastISel, but most of the other
> benchmarks show little difference or even improvement.
>
> The tablegen importer has also been improved, so that we can import more
> SelectionDAG selection rules. For example, on AArch64 about 40% of the rules
> are currently imported successfully.
>
> New additions from last year include the beginnings of a new combiner,
> although there’s still significant work to be done here in terms of the
> final design. The combiner will become a critical part of the pipeline in
> order to begin improving runtime performance.
>
> High level goals
>
>
>
> Going forward, we plan to improve GlobalISel in a number of key areas to
> achieve the following targets:
> * Keeping compile time under control, ideally within 5% of FastISel, and,
> when optimizations are enabled, maintaining a compile time advantage over
> SelectionDAG.
> * Begin improving runtime performance by adding the most important
> optimizations required to be competitive at -Os. We will be targeting and
> measuring AArch64 for this goal but will endeavor to implement as many
> optimizations as possible in generic code to benefit other targets.
> * Improving overall stability and test coverage. Maintaining a high level
> of code quality and minimizing regressions in correctness and performance
> will be a significant challenge.
> * Ensure that the overall design meets the needs of general targets, not
> being overly tuned to a specific implementation.
>
> Design work planned
>
>
>
> These are some design changes coming in the near to medium term future:
>
> * The G_MERGE and G_UNMERGE opcodes will be split into separate opcodes to
> handle different use cases. At the moment the opcode is too powerful,
> resulting in overly complex handling in places like the legalizer. G_MERGE
> will be split so that it only handles merging of scalars into one larger
> scalar. For other cases like merging scalars into a vector we will create a
> new G_BUILD_VECTOR opcode, with a new counterpart opcode for doing the
> opposite. For the current vector + vector case a new G_CONCAT_VECTOR will
> be
> introduced. These changes should simplify implementations for all targets.
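>
> To make the intended split concrete, here is a rough sketch in terms of
> MachineIRBuilder calls (the builder methods and vector-type spellings for
> the new opcodes are hypothetical, since the opcodes are still being
> proposed; Lo32, Hi32, A..D, VLo and VHi stand for virtual registers):
>
>   // scalar + scalar -> wider scalar: stays G_MERGE_VALUES
>   auto Wide = MIB.buildMerge(LLT::scalar(64), {Lo32, Hi32});
>   // scalars -> vector: the new G_BUILD_VECTOR
>   auto Vec = MIB.buildBuildVector(LLT::vector(4, 32), {A, B, C, D});
>   // vector + vector -> wider vector: the new G_CONCAT_VECTOR
>   auto Cat = MIB.buildConcatVectors(LLT::vector(8, 32), {VLo, VHi});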
>
> * Constant representation at the MI level needs some investigation. We
> currently represent constants as generic instructions, with each instance of
> a constant largely independent of the others, and emitted in the entry block
> except for a few places in IR translation where we emit at the point of use.
> As a result we run a localizer pass in an effort to reduce the live ranges of
> the constants (and the consequent spilling), using some heuristics to decide
> where to sink the constant definitions.
>
> Since we don’t do any real caching of MI constants, multiple G_CONSTANT
> definitions can exist for the same constant. This can also result in a lot
> of redundant constants being created, especially for things like address
> computation. Reducing the number of constants can help reduce compile time
> and memory usage. Given this situation, one possible approach is to encode
> constants into the operands of the users, rather than have dedicated
> machine
> instructions. At instruction selection time the constant can then be
> materialized into a register or encoded as an immediate. Further
> investigation is needed to find the right way forward here.
>
>
> The initial design was to not have constants in machine operands. The
> main rationale is that materializing a constant may be expensive, so we
> had better not sprinkle them around.
> Now, in practice, this indeed means that we need to keep a table of
> all the constants created so that we don't start to duplicate them,
> which we currently fail to do. That should be easy to fix: just a map
> from constant + type to virtual register that could be kept at the function
> level (or even module level). Better yet, this could be handled by the
> IRBuilder. E.g., when instantiating an IR builder, it could scan the
> function to see which constants are already there, build this
> mapping, and then create only the constants that are missing.
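>
> A minimal sketch of that idea (the helper name and its placement are
> hypothetical; only the DenseMap and MachineIRBuilder calls are real APIs):
>
>   #include "llvm/ADT/DenseMap.h"
>   #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
>   #include "llvm/IR/Constants.h"
>   using namespace llvm;
>
>   // Map each IR constant to the vreg of the G_CONSTANT already emitted for
>   // it, so the builder reuses that vreg instead of creating a duplicate.
>   static DenseMap<const Constant *, unsigned> EmittedConstants;
>
>   unsigned getOrBuildConstant(MachineIRBuilder &MIB, const ConstantInt &CI) {
>     if (unsigned VReg = EmittedConstants.lookup(&CI))
>       return VReg;                                  // already materialized
>     unsigned VReg =
>         MIB.buildConstant(LLT::scalar(CI.getBitWidth()), CI).getReg(0);
>     EmittedConstants[&CI] = VReg;
>     return VReg;
>   }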
>
> Moreover, another advantage of this model is that optimizations
> don't have to deal with two variants of the same instruction (ADDr and
> ADDi), and the same goes for patterns. Alternatively, if we don't change the
> opcode, but only the MachineOperands, then every optimization has to
> deal with two different kinds of operands. Which is bad IMHO.
>
> Also, this design was meant to absorb what the constant hoisting pass
> does on the IR, so that we can kill that pass while having a better
> cost model.
>
> Finally, regarding the localizer pass, this was meant as a workaround
> for the fast regalloc problem and should be killed asap. In
> particular, the main motivation was to avoid code size bloat, but
> AFAIR during our last measurements with Ahmed, we only saved a few %
> on a handful of benchmarks, so maybe we can just kill it.
>
> All valid points, however we're now seeing more and more constants created
> especially as part of address computation, e.g. GEP offsets.
>
I can see that, but if constants are not duplicated (which I believe is
easy to do), most of the problem is fixed, isn’t it?
What I am saying is that we shouldn’t compromise by weakening the
representation with a mix of representations if other options work.
> Without a solution to the regalloc problem, code size and compile time are
> taking a hit. IMO a few % is significant enough to warrant considerable
> effort. For example, a quick constant deduplication experiment I did saved
> around 1.5% in overall compile time alone.
>
> If the fast regalloc issue can be fixed without a significant compile time
> impact then I agree it sounds like the best approach combined with early
> deduplication. It’s definitely something we’ll look into.
>
One thing that we should consider is killing the fast regalloc and using the
basic one for fast compilation. I don’t know if the time budget would fit, but
if it does, we kill a redundant piece of code (why do we have so many
allocators?) while improving the generated code.
Cheers,
Q
>
>
> * For optimizations to be supported, the combiner will become a crucial
> part of the GISel pipeline. We have already done some preliminary work in a
> generic combiner, which will be used to eventually support combines of
> extloads/truncstores. We’ve had discussions on and off list about what we
> need from the new combiner. The summary is that we want the combiner to be
> flexible for each target to select from a library of combines, being as
> efficient as possible. The combines are currently expressed in C++, but one
> piece of investigation work we might do is to prototype
> using the same tablegen driven instruction selector code to match
> declarative combine patterns written in tablegen. Regardless, we will need
> to support the custom C++ use case.
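>
> For a flavor of what one of these combines looks like when written by hand
> in C++, here is a rough sketch of the extending-load fold mentioned above
> (the helper and its structure are made up for illustration and are not the
> actual combiner API):
>
>   #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
>   #include "llvm/CodeGen/MachineRegisterInfo.h"
>   using namespace llvm;
>
>   // Fold a zero-extend of a load into one extending load:
>   //   %v(s8) = G_LOAD %addr ; %x(s32) = G_ZEXT %v  ==>  %x = G_ZEXTLOAD %addr
>   bool tryCombineZExtLoad(MachineInstr &ZExt, MachineRegisterInfo &MRI,
>                           MachineIRBuilder &MIB) {
>     if (ZExt.getOpcode() != TargetOpcode::G_ZEXT)
>       return false;
>     MachineInstr *Load = MRI.getVRegDef(ZExt.getOperand(1).getReg());
>     if (!Load || Load->getOpcode() != TargetOpcode::G_LOAD ||
>         !Load->hasOneMemOperand() ||
>         !MRI.hasOneUse(Load->getOperand(0).getReg()))
>       return false;
>     MIB.setInstr(ZExt);                       // insert at the G_ZEXT
>     MIB.buildInstr(TargetOpcode::G_ZEXTLOAD)
>         .addDef(ZExt.getOperand(0).getReg())
>         .addUse(Load->getOperand(1).getReg()) // the address operand
>         .addMemOperand(*Load->memoperands_begin());
>     ZExt.eraseFromParent();
>     Load->eraseFromParent();
>     return true;
>   }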
>
> * CSE throughout the pipeline. From a theoretical perspective, having a
> self contained CSE pass that operates as a single phase in the pipeline is
> attractive for its simplicity and elegance. However, we know empirically
> that this is expensive in compile time. Not only does the CSE pass itself
> take a non-negligible time to run, but having it as a late pass can result
> in the non-CSE’d code from the IRTranslator onwards surviving for a long
> time, taking up time in analysis at each stage of compilation. We believe
> running a lightweight CSE early is a win. SelectionDAG currently does CSE
> by default when building the DAG, and this is something we could explore as
> part of a custom IRBuilder.
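>
> We have not settled on a mechanism, but the rough idea of CSE while
> building can be sketched like this (purely illustrative; nothing here is
> an existing class, and vregs are shown as plain unsigned values):
>
>   #include <functional>
>   #include <map>
>   #include <tuple>
>
>   // Key a would-be generic instruction by (opcode, operand vregs) and
>   // remember the vreg already holding its result in the current block.
>   using InstrKey = std::tuple<unsigned, unsigned, unsigned>;
>   std::map<InstrKey, unsigned> SeenInBlock;
>
>   // A CSE-aware builder would consult the map before emitting; Emit()
>   // stands in for the normal "create the instruction" path.
>   unsigned getOrEmit(unsigned Opc, unsigned LHS, unsigned RHS,
>                      const std::function<unsigned()> &Emit) {
>     InstrKey Key{Opc, LHS, RHS};
>     auto It = SeenInBlock.find(Key);
>     if (It != SeenInBlock.end())
>       return It->second;            // identical instruction already built
>     unsigned Res = Emit();          // build it once
>     SeenInBlock[Key] = Res;
>     return Res;
>   }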
>
>
> I have been pushing for making the IRBuilder smarter. Having
> it do CSE was something I wanted us to try.
>
> Yes, the IRBuilder is one of the candidates we’ll try for this.
>
>
>
> * Known bits computation. Some optimizations require knowledge of which
> bits in a value are known to be 1 or 0, and obtain it through the
> computeKnownBits() capability for SelectionDAG nodes. We will need some way
> of getting the same information. In an ideal scenario the replacement
> infrastructure for this will be more efficient, as this part of the
> codebase
> seems to be disproportionately responsible for pathological compile time
> regressions.
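>
> As a sketch of the kind of query this implies (the GISel-side analysis
> object and method name below are made up; only SelectionDAG has
> computeKnownBits today):
>
>   // Ask whether the high bits of a vreg are known zero, e.g. to prove
>   // that a later zero-extension is redundant.
>   KnownBits Known = KB.getKnownBits(Reg);  // KB: hypothetical GISel analysis
>   if (Known.countMinLeadingZeros() >= 32) {
>     // the top 32 bits are known zero; a combine could drop a G_ZEXT
>   }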
>
> * Load/store ordering needs some thought, as we currently don’t have a way
> to easily check at the MI level what the ordering requirements are on a set
> of memory operations. SelectionDAG uses the chains to ensure that they’re
> scheduled to respect the orderings. How to achieve the same thing remains
> an
> open question for GlobalISel.
>
>
> I don't get this problem.
> GISel has a sequential IR, the order is already modeled here.
> Dominance should give us the relative ordering, then if we want more,
> we should do exactly what we do at the IR level (alias analysis, etc.)
>
> Sorry, I didn’t elaborate this properly. The information is there, you’re
> right about that. The problem is that finding if a load/store A precedes
> another in a block potentially requires a scan of every instruction since
> they’re stored as ilists. What chains give as a side effect of the
> implementation is a way to walk through the dependent memory operations
> without doing a backwards scan of every instruction. A simple block level
> cache might work here. As you say if we want more information then
> something akin to MemorySSA might be useful one day for whole CFG analysis.
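>
> To make the "block level cache" idea slightly more concrete, a minimal
> sketch (hypothetical helpers, not an existing analysis):
>
>   #include "llvm/ADT/DenseMap.h"
>   #include "llvm/CodeGen/MachineBasicBlock.h"
>   #include "llvm/CodeGen/MachineInstr.h"
>   using namespace llvm;
>
>   // Number the memory operations of a block once, so "does A precede B?"
>   // becomes a map lookup instead of a walk over the instruction ilist.
>   static DenseMap<const MachineInstr *, unsigned> MemOpIndex;
>
>   void buildMemOpOrder(const MachineBasicBlock &MBB) {
>     unsigned Idx = 0;
>     for (const MachineInstr &MI : MBB)
>       if (MI.mayLoadOrStore())
>         MemOpIndex[&MI] = Idx++;
>   }
>
>   bool precedes(const MachineInstr &A, const MachineInstr &B) {
>     return MemOpIndex.lookup(&A) < MemOpIndex.lookup(&B);
>   }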
>
>
> Cheers,
> -Quentin
>
>
> * More extensive tests that exercise multiple stages of the pipeline. One
> advantage of using MIR with GISel is that individual passes can be easily
> tested by feeding the exact input expected for a particular pass, and
> checking the immediate output of the pass. However this approach can leave
> holes in the test coverage. To help mitigate this, we will be exploring
> writing/generating whole pipeline tests, tracking some IR through each pass
> and checking how the MIR is mutated. We currently also have a proposed
> change to allow usage of FileCheck as a library, not just as a stand-alone
> tool. This would allow us to use FileCheck-style checks and improve testing
> of currently unused code paths.
>
>
> Roadmap for enabling optimizations
>
>
>
> I’ve filed a few PRs that people can follow or comment on to track the
> progress towards enabling the -Os optimization level. The rough outline is:
>
> PR 38365 - [AArch64][GlobalISel] Never fall back on CTMark or benchmarks
> (Darwin)
> PR 38366 - GlobalISel: Lightweight CSE
> PR 32561 - GlobalISel: placement of constants in the entry-block and fast
> regalloc result in lots of reloaded constant
> PR 38367 - GlobalISel: Implement support for obtaining known bits
> information
> PR 38368 - GlobalISel: Investigate an efficient way to ensure load/store
> orderings
>
> These, along with general design and implementation work on the combiner,
> will then lead onto a long road of performance analysis, inevitable bug
> fixing, and implementing more optimizations.
>
> If anyone is interested in discussing in more detail, feel free to reach
> out
> on the list, or to any of the GlobalISel developers. We’d especially like
> to
> hear about any issues or concerns about porting targets to GlobalISel.
>
> Thanks,
> Amara
>