[llvm-commits] Speeding up instruction selection

Thu Mar 6 05:44:11 PST 2008

Hi,

One more observation:
For big MBBs (like the one in big4.bc),
llvm::SelectionDAG::ReplaceAllUsesOfValueWith can become a bottleneck
if there are thousands of uses. SDNode.Uses is currently a
small-vector, so each deletion by means of removeUser has to search
the vector linearly and becomes VERY, VERY slow. I thought that maybe
a set, map or hash-table should be used instead? So, I tried with
std::set (see the attached proof-of-concept patch). This improves the
overall compilation time by 10%-15% on big MBBs. Thus, maybe there
should be a combined approach decided at run-time? If there are just a
few uses, then SmallVector is used. But if the number of uses becomes
much bigger, then std::set or something similar should be used.
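
To make the combined approach concrete, here is a minimal sketch. It is
not the attached patch and not LLVM code: the names HybridUseList, Node
and Threshold are hypothetical, a plain std::vector stands in for
SmallVector, and the cut-over value of 16 is an untuned guess. It also
assumes that each user is recorded at most once.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for SDNode; only its address is used here.
struct Node {};

// Hypothetical hybrid use list: linear storage while the list is small,
// std::set once it grows past Threshold. It never switches back.
class HybridUseList {
  static constexpr std::size_t Threshold = 16; // assumed, untuned cut-over
  std::vector<Node *> Small; // holds the users before the cut-over
  std::set<Node *> Large;    // holds the users after the cut-over
  bool UsingSet = false;

public:
  // Record N as a user; adding the same user twice has no effect.
  void addUser(Node *N) {
    if (UsingSet) {
      Large.insert(N);
      return;
    }
    if (std::find(Small.begin(), Small.end(), N) != Small.end())
      return;
    Small.push_back(N);
    if (Small.size() > Threshold) {
      // Too many users for linear search: migrate everything to the set.
      Large.insert(Small.begin(), Small.end());
      Small.clear();
      UsingSet = true;
    }
  }

  // Remove N; returns false if N was not a user.
  bool removeUser(Node *N) {
    if (UsingSet)
      return Large.erase(N) != 0; // O(log n)
    auto It = std::find(Small.begin(), Small.end(), N); // O(n), n is small
    if (It == Small.end())
      return false;
    *It = Small.back(); // swap-and-pop; order is not preserved
    Small.pop_back();
    return true;
  }

  std::size_t size() const { return UsingSet ? Large.size() : Small.size(); }
  bool usesSet() const { return UsingSet; }
};
```

One caveat with this sketch: a std::set of pointers is ordered by
address, so anything that iterates over the uses may see a different
order from run to run. A real implementation would probably need a
stable key instead.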

With all my recent changes to ScheduleDAG and the instruction selector,
the compilation time for big4.bc went down from 45-50 seconds to 6-9
seconds!!! This is almost a 10 times speed-up, which makes the compiler
much more scalable.

According to the profiler, after all those changes, there are no
functions that are obvious bottlenecks. The biggest remaining
performance hogs are the following functions (with both the linear
scan and bigblock register allocators). They are mostly related to the
FoldingSet implementation.

      9.2%  llvm::FoldingSetNodeID::ComputeHash
      7.7%  llvm::SmallVectorImpl::push_back
      6.1%  llvm::SmallVectorImpl::destroy_range
      4.2%  llvm::FoldingSetNodeID::AddPointer
      3.7%  AddNodeIDOperands

I know too little about the FoldingSet to improve it, so I'll stop here
for the time being ;-)

Please review the idea of the patch and tell me whether it makes sense
and whether I should prepare a cleaned-up one.

-Roman

2008/3/5, Roman Levenstein <romix.llvm at googlemail.com>:
> Hi Evan,
>
>  2008/3/4, Evan Cheng <evan.cheng at apple.com>:
>  >  >> There's make_heap/push_heap/etc. in <algorithm> that let a
>  >  >> plain std::vector (or a SmallVector I guess) be used as a heap.
>  >  >
>  >  > Yes, this is possible but produces much more overhead than std::set in
>  >  > my tests. BTW, this approach is used in DAGISel.inc files generated by
>  >  > tablegen. I tried changing it to std::set as well and, again, it
>  >  > works much (25%-30%) faster on BBs with a few hundred or thousands of
>  >  > instructions.
>  >
>  > If you give me a patch, I'll test it on my end. Thanks.
>
>  Here is a patch for the DAGISel.inc. It is generated as a diff against
>  the X86GenDAGISel.inc generated by tablegen. It is a bit ugly, but
>  gives you the idea and enables testing.
>
>  As a test, I used the big4.bc, which is one huge MBB. You can find it here:
>  http://llvm.org/bugs/attachment.cgi?id=1275&action=edit
>
>  I would be very interested if you could review it, test it and provide
>  some feedback.
>
>  One thing I do not quite understand about the instruction selector is:
>  1) Can there be more than one SDNode with the same NodeId in the ISelQueue?
>     I have the impression that it is possible, but I'm not sure.
>  2) Can _the same_ SDNode occur more than once in the ISelQueue?
>
>  These two questions are relevant if std::set is to be used. The set
>  uses the NodeId of a given SDNode as its key, and std::set ensures the
>  uniqueness of the elements in the ISelQueue. If (1) is true, then
>  std::multiset should probably be used instead of std::set. I tried
>  both set implementations and performance was roughly the same
>  between them.
>
>
>  I have also one more question regarding the ISelQueue:
>
>  What exactly does it represent and how is it built? My understanding
>  is that we start with the root element and then all of its
>  dependencies are pushed into the queue as instruction selection
>  proceeds. Then their dependencies and so on. But is it somehow
>  related/similar to scheduler's dependencies? Would it be possible to
>  do some sort of the topological sorting on the DAG first and then do
>  the selection? For the above mentioned big4.bc use-case, the ISelQueue
>  sometimes has up to 2000 SDNodes in the queue, which makes make_heap()
>  very inefficient. Is it normal that the queue becomes so long? Could
>  it be that some dependencies are just selected already and could be
>  safely removed?
>
>  I cannot really explain or justify it yet, but it seems
>  to me that a more efficient data structure than a priority queue could
>  be used during instruction selection.
>
>
>  -Roman
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SelectionDAG.patch
Type: text/x-diff
Size: 3856 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20080306/a8583c8d/attachment.patch>