[llvm-commits] Speeding up instruction selection

Roman Levenstein romix.llvm at googlemail.com
Wed Apr 23 00:35:46 PDT 2008


Hi Dan,

2008/4/23, Dan Gohman <gohman at apple.com>:
>
>  On Apr 21, 2008, at 8:51 AM, Roman Levenstein wrote:
>  > Sorry, forgot to attach the patch ;-)
>  >
>  > 2008/4/21, Roman Levenstein <romix.llvm at googlemail.com>:
>  >> Hi,
>  >>
>  >> 2008/4/19, Roman Levenstein <romix.llvm at googlemail.com>:
>  >>
>  >>> Hi Dan, Hi Evan,
>  >>>
>  >>> 2008/4/19, Dan Gohman <gohman at apple.com>:
>  >>>
>  >>>>
>  >>>> On Apr 16, 2008, at 9:19 AM, Roman Levenstein wrote:
>  >>>>>
>  >>>>
>  >>>>> BTW, Dan, have you had any time to look at the patch introducing
>  >>>>> queues instead of heaps in the target specific instruction
>  >>>>> selectors?
>  >>>>
>  >>>>
>  >>>> Is this the ScheduleDAGRRList patch that changes
>  >>>> std::priority_queue
>  >>>> to std::set? I know you checked in part of those changes already;
>  >>>> could you send me an updated patch with just the outstanding
>  >>>> changes?
>  >>>
>  >>>
>  >>> No, I don't mean the ScheduleDAGRRList.patch. For that one, I
>  >>> proposed the remaining patch with just the outstanding changes
>  >>> yesterday (see
>  >>> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20080414/061236.html).
>  >>> And it is also about using std::set :-)
>  >>>
>  >>> The patch for instruction selectors was proposed by me here:
>  >>> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20080303/059215.html
>  >>> It is related to the Select() and SelectRoot implementations inside
>  >>> the target-specific code selectors generated by tablegen.
>  >>>
>  >>> Regarding my questions about std::set vs. std::multiset in the
>  >>> mentioned patch:
>  >>> Having looked into it again, I have the impression that std::set
>  >>> can be used and there is no need for std::multiset. Basically, this
>  >>> code:
>  >>>   void AddToISelQueue(SDOperand N) DISABLE_INLINE {
>  >>>     int Id = N.Val->getNodeId();
>  >>>     if (Id != -1 && !isQueued(Id)) {
>  >>>       ...
>  >>>   }
>  >>> ensures that only one node with a given NodeId can be inserted into
>  >>> the ISelQueue.
>  >>>
>  >>> One possible further improvement I'm thinking about is the
>  >>> following:
>  >>> Why do we use a sorted data structure (e.g. a sorted vector, or a
>  >>> set) for the ISelQueue at all?
>  >>>
>  >>> 1) If we look into the SelectRoot logic, it basically takes the node
>  >>> with the SMALLEST NodeId from the pending selection queue.
>  >>> 2) Moreover, it seems to me (please confirm if it is true!) that
>  >>> the next node taken for selection in the loop inside SelectRoot, at
>  >>> any point in time, has a NodeId that is GREATER than the NodeId of
>  >>> the last selected node.
>  >>> 3) Even if the selection of the node puts new nodes into the
>  >>> selection
>  >>> queue, they ALWAYS have NodeIds GREATER than that of this node.
>  >>> Therefore the property (2) is preserved.
>  >>>
>  >>> I checked (2) and (3) by adding some debug output to SelectRoot.
>  >>> Only one test of all the llvm/test test-cases does not fulfill
>  >>> these assumptions, and even there it is not quite obvious whether
>  >>> it really contradicts (2) and (3) or is a bug.
>  >>>
>  >>> So, assuming that properties (2) and (3) do hold, the following
>  >>> can probably be done:
>  >>> - We iterate over the NodeId numbers from 0 till DAGSize.
>  >>> - We check whether the node with a given NodeId (found by looking
>  >>> into the TopOrder?) has isQueued(NodeId) == true, i.e. whether it
>  >>> is in the queue.
>  >>> - We then check whether isSelected(NodeId) is true or false, and
>  >>> call Select() for this node if it is not selected yet.
>  >>> - If a new node needs to be inserted into the queue, only
>  >>> setQueued(NodeId) is called.
>  >>>
>  >>> With these changes, no online sorting would be required at all,
>  >>> which
>  >>> is a great improvement over the current approach.
>  >>>
>  >>> What do you think about it?  Is it a correct analysis?
>  >>
>  >>
>  >> I created a working implementation based on these ideas. All
>  >> llvm/test test-cases now pass. Please find it in the attached patch.
>  >> The patch is rather ugly and uses a lot of #ifdefs, but I think it
>  >> is OK for a proof of concept.
>  >>
>  >> Short explanation of the macros:
>  >> OLD - enables the old-style selector queue, using heaps. Can be
>  >> used for a step-by-step comparison with the std::set-based approach.
>  >> USE_QUEUE - enables the std::set-based approach.
>  >> NO_QUEUE - enables the implementation that does not use any special
>  >> sorted queue at all, as outlined in the previous mail.
>  >> (Note: USE_QUEUE should not be used at the same time as NO_QUEUE.)
>  >>
>  >> Both the USE_QUEUE and NO_QUEUE modes speed up the instruction
>  >> selection significantly (on my test, compilation time goes down
>  >> from 12 to 9 seconds).
>  >>
>  >> I'm very interested in your comments about this approach.
>
>
> This looks like a nice approach. Re-using and updating TopOrder
>  instead of building up a std::set that ends up with the same contents
>  is neat. I was a little concerned about the priority-queue code being
>  removed since I don't know if there were future plans for it, however
>  it looks like doing anything with ordering more specific than
>  simple topological ordering would require a fair amount of work
>  anyway, so I'm not worrying about it.

OK.

>  Do you see any compile-time difference between using a std::set
>  queue and the NO_QUEUE approach, out of curiosity?

Well, the difference between using a std::set queue and the NO_QUEUE
approach is rather small, especially on small and medium-sized
test-cases. I have the impression that it comes mainly from the fact
that NO_QUEUE does not do any dynamic heap memory allocation. Both of
these approaches are significantly faster than the current
heap-sorting approach.


>  In addition to regular testing, since this patch doesn't modify any
>  heuristics, it should be possible to compare assembly output
>  between an unmodified compiler and one with this patch applied.
>  Can you verify that this patch doesn't change any output on some
>  interesting testcases?

I tried it with Kimwitu, which is a very big and complex test-case.
All approaches produce exactly the same assembler code. On my
test-cases with very big BBs, all approaches also produce exactly the
same results.

For the sake of comparison, here are also some time statistics for the
compilation of Kimwitu, measured by means of the Linux "time" command:

Current LLVM approach:
real    1m51.464s
user    1m21.161s
sys     0m8.529s

std::set approach:
real    1m37.938s (13% faster than current LLVM implementation)
user    1m16.169s
sys     0m7.732s

NO_QUEUE approach:
real    1m29.548s (20% faster than current LLVM implementation)
user    1m14.949s
sys     0m7.824s


-Roman
