<div dir="ltr">David, <br><br>I have no particular insight into the performance variance, but I recently landed a patch that should remove the vast majority of pathological cases in MergeConsecutiveStores (r332490). If you can apply that locally, you'd likely sidestep this issue entirely. Note that you'll probably want to pick up the earlier associated cleanups as well, i.e., r328233 and r332489. <div><br></div><div><div>If compilation still takes too long, feel free to send me a test case and I'll take a look. </div><div><br></div><div>-Nirav</div><div><br>On Tue, May 29, 2018 at 8:41 AM Dean Michael Berris via llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>&gt; wrote:<br>><br>><br>><br>> > On 29 May 2018, at 22:02, David Jones via llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>&gt; wrote:<br>> ><br>> > My back-end code generator uses LLVM 5.0.1 to optimize and generate code for x86_64.<br>> ><br>> > If I run it on a given sample of IR, it takes almost 5 minutes to generate object code. 95%+ of this time is spent in MergeConsecutiveStores(). (One function has a basic block with 14000 instructions, which is a pathological case for MergeConsecutiveStores.)<br>> ><br>> > If, instead, I dump out the LLVM IR and manually run both opt and llc on it with -O2, the whole affair takes only 2 minutes.<br>> ><br>> > I am using a dynamically linked LLVM library. I have verified using GDB that both my code generator and llc are invoking the shared library (i.e. the exact same code), so I would not expect to see a 2.5x performance difference.<br>> ><br>> > What could explain this?<br>> ><br>><br>> Without more information, I would think this has something to do with the locality of the memory when you’re using the LLVM API to generate the basic blocks and instructions versus when you’re reading the data in from files (as llc and opt would be doing). 
Without seeing how you’re constructing the basic blocks and instructions, I suspect that you’re doing it one instruction at a time and relying on vectors/lists growing one element at a time (instead of using an object pool that pre-allocates elements colocated in the same page of memory).<br>><br>> There are many factors that could explain why you’re seeing a marked performance difference here. If you’re able, you might want to build your code generator with XRay and see whether it points out where your latency is coming from.<br>><br>> <a href="https://llvm.org/docs/XRayExample.html">https://llvm.org/docs/XRayExample.html</a><br>><br>> -- Dean<br>><br>> _______________________________________________<br>> LLVM Developers mailing list<br>> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a></div></div></div>