<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 24, 2016 at 1:04 PM, Sean Silva <span dir="ltr"><<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">silvas added a subscriber: silvas.<br>

silvas added a comment.<br>

<br>

I can see why this would help iTLB/paging, but I'm not grokking why it would help icache very much compared to per-function machine block placement ensuring that the cold stuff ends up at the end on a separate cacheline (does MBP already do that?). In fact (playing devil's advocate) the MBP approach could be more beneficial because it could allow branches to be relaxed to smaller encodings.<br></blockquote><div><br></div><div>true.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

The scenarios I can see this being a substantial win for icache over MBP is when you e.g. have two functions with 1.5 cachelines of hot text (and say 1 cacheline of cold text). With MBP, each function would end up using ceiling(1.5) = 2 cachelines for the hot and one cacheline for the cold, but with the splitting the linker would see 2x 1.5 cacheline hot + 2x 1 cachline cold and so you could put the two 1.5's together and only use 3 cachelines for the hot part. How often does that occur (and does the linker actually manage to exploit this?).<br></blockquote><div><br></div><div>yes -- the latest prefix based function grouping patch will teach linker to do this.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Since the benefit is based on the "rounding", we save at most just under ("just under" is determined by the text alignment) one cacheline every time we can pack these densely. The benefit is at most #hotFunctions * (sizeof(Cacheline) - alignof(Function)) text size for the hot working set.<br></blockquote><div><br></div><div> On x86 is 64bytes in size (small), and it can reduce the overall savings. Another factor is the use of long jumps with function splitting can offset some of the savings here.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

That being said, this kind of low-level function splitting is a really powerful tool and I fully support adding it, but I agree with Mehdi that I'd like to see some supporting benchmark results.<br>

<br></blockquote><div><br></div><div>Agree -- I think this functionality is useful to have but probably off by default and controlled by an user option.</div><div><br></div><div>David </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<a href="http://reviews.llvm.org/D17555" rel="noreferrer" target="_blank">http://reviews.llvm.org/D17555</a><br>

<br>

<br>

<br>

</blockquote></div><br></div></div>