<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Hello all!<br class=""><br class="">I was looking through the disassembly of a short, heavily used function<br class="">in the program I'm working on, and ended up wondering why LLVM generates<br class="">that assembly and what changes would be necessary to improve the code. I asked<br class="">on #llvm, but it seems that the people with the necessary expertise weren't<br class="">around.<br class=""><br class="">Here is a condensed version of the code: <a href="https://godbolt.org/g/ec5cP7" class="">https://godbolt.org/g/ec5cP7</a><br class=""><br class="">My main question concerns assembly lines 37/38 and 59/60, where xmm0 is spilled<br class="">to the stack, only to be immediately reloaded into xmm1. Google tells me that<br class="">there are register-to-register mov instructions for the XMM registers, so I<br class="">found it odd that LLVM missed what looks like an easy optimization. tstellar on<br class="">#llvm pointed me towards using -debug-only=regalloc with llc to see what LLVM is<br class="">thinking (the regalloc log is linked at [0], since I'm not sure what's considered<br class="">"too large" for mailing lists). It seemed to me that the store and the load were<br class="">introduced separately, that llc never looked at them at the same time, and so<br class="">it never realized that they could be folded. Is that what is happening? I know<br class="">little about compilers, so I wouldn't be surprised if I were wrong.<br class=""><br class="">The other two questions are tangential, so please let me know if I should ask<br class="">them somewhere else.<br class=""><br class="">On assembly lines 24 and 46, I think the vtable pointer for the Quad object is<br class="">being reloaded on every iteration of the loop. 
nbjoerg on #llvm said that's due to<br class="">the possibility of placement new being used somewhere inside the called<br class="">function, which makes sense to me. Is there a way to indicate to LLVM that this<br class="">will not happen? I tried [[gnu::pure]], since the function doesn't write to<br class="">externally visible memory, but the vtable pointer reload remained.<br class=""><br class="">Finally, I'm inclined to say that this routine should be vectorizable, since<br class="">it's essentially just an accumulate, but I suspect Clang can't prove that<br class="">GetLocalValue doesn't have side effects that will affect later iterations. Is<br class="">this correct, and if so, are there any hints I can give Clang besides just<br class="">manually parallelizing it with #pragma omp or something similar?<br class=""><br class="">I do intend to change this loop to something a bit less messy, but it'll be<br class="">part of a larger refactoring, so it's still a ways off.<br class=""><br class="">Thanks!<div class=""><br class=""></div><div class="">Alex<br class=""><br class="">   [0]: <a href="https://hastebin.com/raw/oqamesahos" class="">https://hastebin.com/raw/oqamesahos</a></div></body></html>