<div dir="ltr"><div><div><div><div>Hey,<br><br></div>Thanks for that idea; I had not considered it. I ran perf on the same examples as before, but I do not see a large difference in branch mispredictions. <a href="https://gist.github.com/bollu/7a9989a727ed4bf1c118dbcf386d4fc1">Link to gist with numbers here</a>. Is there something else that I am missing? The instruction mix, perhaps? The C version has roughly one more instruction per cycle, though I am not sure how significant that is. <br><br></div>Is there some other way to pin down the cause of the slowdown (reading the asm, profiling)?<br><br></div>Thanks,<br></div>~Siddharth<br></div><br><div class="gmail_quote"><div dir="ltr">On Sat, 16 Dec 2017 at 04:25 Kuba Ober via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

<br>

> On 15 Dec. 2017, at 20:51, (IIIT) Siddharth Bhat via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>

><br>

> One major difference is that GHC uses a "stack-in-the-heap", in the sense that there is a chunk of heap memory that functions effectively as the call stack. It is managed explicitly by GHC. GHC does not use the native call stack _at all_. This is to implement continuation passing, in a sense.<br>

><br>

> I want to check that the difference in performance is truly from this "stack in the heap" behavior, so I want to teach a backend to generate code that looks like this.<br>

<br>

If the indirect jumps via this stack are all made from the same location in the GHC runtime, then perhaps this might kill branch prediction. call/ret addresses are duplicated in a branch predictor’s own hardware stack. If the runtime doesn’t use call/ret, not even indirect calls, then this functionality is lost and general branch prediction logic has to kick in. The predictors may then either fail for a long time before a pattern is recognized via the jump site in the runtime, or may even not recognize it and always mispredict. There may be differences between perceptron and other techniques in this case, so see if it’s equally bad on chips that use perceptrons (some AMD?) if yours doesn’t.<br>

<br>

There should be a smoking gun somewhere in the performance monitoring registers then, I’d hope. It’d be very obvious - persistent branch misprediction at the level of functions costs dearly.<br>

<br>

I have rather cursory knowledge in this area so perhaps the state of the art is way ahead of my imagination, or perhaps I’m dead wrong.<br>

<br>

Cheers, Kuba<br>


</blockquote></div><div dir="ltr">-- <br></div><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Sending this from my phone, please excuse any typos!</div></div>