[LLVMdev] FoldingSetNodeID operations inefficiency

Mon Apr 28 06:21:47 PDT 2008

Hi Chris,

Your were totally right with your suggestion. 

I have implemented the code that :

a) does not merge multiple TokenFactor nodes in the DAGCombiner::visitTokenFactor(), if the resulting TF node would contain more than 64 operands.

b) produces a bunch of TokenFactor nodes with at most 64 operands, 
instead of one huge TokenFactor in the SelectionDAGLowering::getRoot().

   If we have n pending loads, they can be combined into TFs in two ways:
   1) The first 64 loads are put into a new node TF1. Then TF1 and the next 64 loads are put into a new node TF2 and so on. That is, each generated TF contains a previous TF as its first operand and a bunch of pending loads:
      ___TF2______
     /           |          \  
  TF1    LD2.1 .. LD2.64 
  /  \         
LD1.1..LD1.64  

   2) Every 64 loads are put into a new TF node TFi. Then all such TFs are put into a big parent TF, that has only these TFi nodes as operands: 
      _____TF______
     /                        \
  TF1     ...            __TFk__
  /         \            /                \
LD1.1..LD1.64  LDk.1...LDk.64

These changes (a) and (b) significantely reduce the compilation time on my pathological use-cases with huge TokenFactors.

I attach the proposed patch to this mail for review. 

The only questions I still have are the following:
 - Which approach is better, b.1 or b.2?
 - If I have a huge number k of loads/nodes (i.e. nodes N1 ... Nk), to be put into a TokenFactor, what is a better order to put them into a TF? Is it left-to-right or right-to-left or something else? I have experimented with the approaches b.1 and b.2 and it seems that depending on the order, the scheduling decisions are affected in a positive or negative way.

I do ask these questions, because I do not have a clear understanding of TokenFactors, how they are processed and how they affect the scheduling and code generation. Therefore any help is highly appreciated.

 Thanks,
   Roman

----- Ursprüngliche Mail ----
Von: Roman Levenstein <romixlev at yahoo.com>
An: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
Gesendet: Donnerstag, den 24. April 2008, 08:33:25 Uhr
Betreff: Re: [LLVMdev] FoldingSetNodeID operations inefficiency

Hi Chris,

This is a good idea and I started thinking in that direction already.
But what I don't quite understand the TFs, how TFs are formed and which rules they should obey to.

For example now: 
> PendingLoads created by the SelectionDAGLowering::getLoadFrom and then copied into the
> TokenFactor node by SelectionDAGLowering::getRoot called from the
> SelectionDAGLowering::visitStore

So, if I now detect in the getRoot that there are more than let's say 64 nodes in the  PendingLoads queue, what should I do?
 - Can I take only 64 of them and forget about the rest by ignoring them? 
 - Or should I produces a series of TFs connected to each other, each with max 64 operands? If so, how do I do it?
Basically, I need a better understanding of TFs and how they relate to each other.

Another question: My small experiments give me the impression that TF nodes in my use-cases are never found in the SelectionDAG's CSE map by FindOrInsert, even though they are inserted into it. That is, every time a new node is inserted into the map. If this is true, may be their insertion into CSE maps can be avoided alltogether? But may be I'm wrong...

- Roman

----- Ursprüngliche Mail ----
Von: Chris Lattner <sabre at nondot.org>
An: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
Gesendet: Donnerstag, den 24. April 2008, 02:07:55 Uhr
Betreff: Re: [LLVMdev] FoldingSetNodeID operations inefficiency

On Wed, 23 Apr 2008, Roman Levenstein wrote:
> Hi,
>
> While profiling LLVM using my test-cases with huge MBBs, I noticed that
> FoldingSetNodeID operations (ComputeHash,insertion,etc) may become
> really inefficient for the nodes, which have very many operands.

Here is a stupid idea, which is simple and probably really effective:

This only affects you because you have 3200 operands to your TF node. 
This only affects TF because TF can take an arbitrary number of operands, 
and we fold TFs together arbitrarily.

What if you changed the code generator to only merge TF nodes if they have 
less than some threshold (64?) of operands?  If you avoided making totally 
crazy nodes, the problem would be defined away.  Could this work?

-Chris

> test-case (you can fetch it from here
> http://llvm.org/bugs/attachment.cgi?id=1275), which is just one HUGE MBB
> with tens of thousends of instructions, there is a TokenFactor node N
> that has ... 3200 operands! (These operands are PendingLoads created by
> the SelectionDAGLowering::getLoadFrom and then copied into the
> TokenFactor node by SelectionDAGLowering::getRoot called from the
> SelectionDAGLowering::visitStore).
>
> During the code selection phase, the SelectionDAG::ReplaceAllUsesWith is
> called many times and in many cases it tried to update this node, by
> replacing some of its 3200 operands by a different operand. After each
> such operand replacement, this snippen of code is typically executed
> (see SelectionDAG.cpp):
>
> // Now that we have modified U, add it back to the CSE maps.
> // If it already exists there, recursively merge the results together.
>  if (SDNode *Existing = AddNonLeafNodeToCSEMaps(U)) {
>      ReplaceAllUsesWith(U, Existing, UpdateListener);
>       // U is now dead.  Inform the listener if it exists and delete it.
>       if (UpdateListener)
>         UpdateListener->NodeDeleted(U);
>       DeleteNodeNotInCSEMaps(U);
>   } else {
> // If the node doesn't already exist, we updated it.
> // Inform a listener if it exists.
>       if (UpdateListener)
>         UpdateListener->NodeUpdated(U);
>  }
>
> So, basically CSE map should be updated. To achieve that, a hash of the
> updated node N is computed (and its internal FoldingSetNodeID::Bits
> array contains about 6400 elements!) and the node is replaced.
>
> My observation shows that amount of FoldingSetNodeID operations on this
> node N is almost quadratic. It manifests itself by consuming about 20%
> of the compilation time only for these operations.
>
> Question:
> Does anyone see an opportunity to solve this issue?
> My current ideas are:
>  - avoid recomputation of the hash values for nodes that were not
> changed since the hash was computed last time
>  - do some sort of lazy CSE maps update. Accumulate operand changes for
> a given node and then do only one update. But I don't see yet, how it
> can be done practically
>  - Optimize the SelectionDAG::ReplaceAllUsesWith to do bulk updates,
> i.e. updating many user nodes of the given node at once or updating all
> operands of the given node at once. Again, I  don't see how to do it
> practically.
>  - May be the issue is related to the strange TokenFactor node with the
> 3200 operands. May be creation of the node with so many operands can be
> avoided?
>
> Any ideas are appriciated!
>
> -Roman
>
>
>
>
>
>
>      Lesen Sie Ihre E-Mails jetzt einfach von unterwegs.
> www.yahoo.de/go
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu        http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

-Chris

-- 
http://nondot.org/sabre/
http://llvm.org/
_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu        http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

      __________________________________________________________
Gesendet von Yahoo! Mail.
Mehr Möglichkeiten, in Kontakt zu bleiben. http://de.overview.mail.yahoo.com

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

      E-Mails jetzt auf Ihrem Handy.
www.yahoo.de/go
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TokenFactor.patch
Type: application/octet-stream
Size: 5005 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20080428/f8a5f4ed/attachment.obj>