[llvm-commits] [llvm] r166865 - /llvm/trunk/lib/Transforms/Vectorize/BBVectorize.cpp

Hal Finkel hfinkel at anl.gov
Mon Oct 29 09:53:11 PDT 2012


----- Original Message -----
> From: "David Tweed" <david.tweed at arm.com>
> To: "Nadav Rotem" <nrotem at apple.com>, "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-commits at cs.uiuc.edu
> Sent: Monday, October 29, 2012 7:12:41 AM
> Subject: RE: [llvm-commits] [llvm] r166865 - /llvm/trunk/lib/Transforms/Vectorize/BBVectorize.cpp
> 
> On Oct 26, 2012, at 10:08 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> To be fair, I think we expected this. I see two general solutions:
> 1. Cover all such cases with target hooks
> 
> This is possible.
> 
> 2. When the cost of some operation is requested, actually compile it
> (build a small SelectionDAG and run instruction selection), and then
> use the result (we could just count the number of machine
> instructions, for example). We'd need to cache the results for this
> to be practical.
> 
> I don't think that this is a good approach. The compile time for this
> kind of mechanism would be very high.
> 
> We can build an off-line tool that would compile IR to assembly. We
> can use this (off-line) tool to generate cost-tables.
> 
> I've been thinking of looking at something along those lines
> (obviously partly ARM-specific, but large bits of the code would be
> target-independent). One of the other issues is that, from
> experience on x86, when using the TSC to time pieces of code you
> really need to run a noticeable number of "iterations", not just
> because the TSC doesn't serialise, but because you're really trying
> to estimate the cost of an instruction in the context of many
> executions (e.g., for tight loops it might be served from the
> decoded-instruction queue and not need decoding, etc.). All this
> suggests it's ill-suited to running at run-time. (Note this is for
> estimating the cost of individual instructions; trying different
> machine-code variants of your entire function kernel at run-time
> might yield a benefit significant enough that the cost is worth it.)
> The obvious issue is that IR might be compiled differently in the
> off-line system (particularly if it doesn't have surrounding
> instructions to affect decisions) compared to when it's compiled
> in-situ. I'm not sure there is a good solution for that.
> 

Given that the mapping between IR and machine code is fairly indirect, directly assigning costs from first principles is difficult. What we'll really need, I think, is some kind of autotuner. This will need to run on actual hardware (or, at least a cycle-accurate simulator), and produce a set of best-compromise costs to feed the IR optimizers. Given a relatively small number of costs, a well-selected training set, and a good initial guess, implementing an autotuner should be straightforward. Without those three things, I think this is an interesting research problem (not that we should not do it, but it is certainly a larger project). Once we get past the basics (like type-legalization costs), we'll quickly find out, I think, how much additional complexity will be required.

 -Hal

> 
> Cheers,
> 
> Dave

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
