[LLVMdev] FoldingSetNodeID operations inefficiency

Wed Apr 30 11:13:18 PDT 2008

Hi Dan,

Thanks for commenting on this topic.
See my comments in-line.

----- Ursprüngliche Mail ----
> Von: Dan Gohman <gohman at apple.com>
> An: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
> Gesendet: Mittwoch, den 30. April 2008, 21:38:26 Uhr
> Betreff: Re: [LLVMdev] FoldingSetNodeID operations inefficiency
> 
> 
> On Apr 28, 2008, at 6:21 AM, Roman Levenstein wrote:
> 
> > Hi Chris,
> >
> > Your were totally right with your suggestion.
> >
> > I have implemented the code that :
> >
> > a) does not merge multiple TokenFactor nodes in the  
> > DAGCombiner::visitTokenFactor(), if the resulting TF node would  
> > contain more than 64 operands.
> >
> > b) produces a bunch of TokenFactor nodes with at most 64 operands,
> > instead of one huge TokenFactor in the  
> > SelectionDAGLowering::getRoot().
> >
> >   If we have n pending loads, they can be combined into TFs in two  
> > ways:
> >   1) The first 64 loads are put into a new node TF1. Then TF1 and  
> > the next 64 loads are put into a new node TF2 and so on. That is,  
> > each generated TF contains a previous TF as its first operand and a  
> > bunch of pending loads:
> >      ___TF2______
> >     /           |          \
> >  TF1    LD2.1 .. LD2.64
> >  /  \
> > LD1.1..LD1.64
> >
> >
> >   2) Every 64 loads are put into a new TF node TFi. Then all such  
> > TFs are put into a big parent TF, that has only these TFi nodes as  
> > operands:
> >      _____TF______
> >     /                        \
> >  TF1     ...            __TFk__
> >  /         \            /                \
> > LD1.1..LD1.64  LDk.1...LDk.64
> >
> >
> > These changes (a) and (b) significantely reduce the compilation time  
> > on my pathological use-cases with huge TokenFactors.
> >
> > I attach the proposed patch to this mail for review.
> >
> > The only questions I still have are the following:
> > - Which approach is better, b.1 or b.2?
> 
> I think b.2 is slightly better than b.1, because b.2 leaves all
> the nodes at the same depth/height in the graph.

Good point.

> Either approach
> will restrict scheduling choices though. With huge numbers of
> loads, register pressure could be a prohibitive problem, so it
> may be desirable to give the list-burr scheduler as much
> flexibility as possible.

OK.

> I think it's worthwhile to investigate alternatives before going
> too far down the path of splitting TokenFactors.

I totally agree and I tried it.

> Do we know where the quadradic FoldingSetNodeID operations are coming from?

Re-computation of the FoldingSetNodeID hashes takes a lot of time for a node with 3200 operands, 
since the Bits array contains 6400 elements. Moreover, it looks like a temporary data structure is
allocated for this computation, which also takes time and memory. I still do not quite understand where
the quadratic factor comes from. 

> Is caching of TokenFactor hash id's effective?

I tried to insert some debugging code to see how efficient are the CSE sets. And my understanding is that we have hits
only for constants, registers and some other "simple" nodes. I have never seen on my tests e.g. any TokenFactor that would 
be found  in the CSE table  and reused.  If it is the case in general, may be  such nodes should not be put into CSE tables
at all?

> >
> > - If I have a huge number k of loads/nodes (i.e. nodes N1 ... Nk),  
> > to be put into a TokenFactor, what is a better order to put them  
> > into a TF? Is it left-to-right or right-to-left or something else? I  
> > have experimented with the approaches b.1 and b.2 and it seems that  
> > depending on the order, the scheduling decisions are affected in a  
> > positive or negative way.
> 
> If you're using the list-burr scheduler (the default on x86)

I do.

> then it is currently arbitrary. One way might get more lucky than the other, but
> it comes down to the NodeQueueId comparison in many cases.

Yes. I still remember this fact, since I added this code to SVN just few days ago ;-)

> In the future I'd like to see the schedulers be able to sort memory
> operations by their addresses. Among other things, it would make them
> less dependent on arbitrary choices like the order of operands in a
> TokenFactor.

Good idea. I was thinking about it as well. Besides introducing a more
deterministic choice, it would also help to achieve a better memory
locality. And this may produce dramatic performance wins for run-time of compiled applications.
I've observed something similar while working on the spill slot coalescing for my register
allocator. 

> If you're going to be splitting large TokenFactors, then theoretically
> you'd want to use the same ordering that the schedulers would use,
> such as sorting the loads ascending by addresses or whatever. That
> might not be very simple though :-}.

Well, if the addresses are not constants and have different bases, it may be a bit tricky.
Or do you have something specific in mind?

- Roman

      E-Mails jetzt auf Ihrem Handy.
www.yahoo.de/go