[cfe-dev] Clang generates absurd amount of assembly for libc++ std::vector::emplace

Wed Jul 25 09:17:59 PDT 2018

The patch that removes __always_inline__ is https://reviews.llvm.org/D49240 (ready to go, but waiting for sign-off from Eric).

In this specific case, though, it does not appear to be related to inlining. As has been said, there seems to be loop unrolling going on, and vectorization as well (look for vmovups). I checked your example with my patch that removes always_inline and the result is roughly the same, so I don’t think it’s related to the fact that libc++ uses __always_inline__ for linkage purposes.
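For context, here is a simplified model of the copy loop that emplace() needs when the vector already has spare capacity. It is a hypothetical sketch, not libc++’s actual code (which also handles the reallocation path and goes through its own internal helpers), and `shift_and_insert` is an illustrative name. For a trivially copyable type like double, a loop of this shape is the kind of code Clang at -O3 may unroll and vectorize, which could account for instructions like vmovups in the output.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, NOT libc++'s implementation: models emplace(pos, val)
// for the case where the buffer already has room for one more element.
// `data` must have space for size + 1 doubles.
void shift_and_insert(double* data, std::size_t size, std::size_t pos,
                      double val) {
    assert(pos <= size);  // inserting past the end is not allowed
    // Shift [pos, size) up by one slot, back to front, so every element is
    // read before its slot is overwritten. This plain copy loop is what an
    // optimizer may unroll and/or vectorize.
    for (std::size_t i = size; i > pos; --i) {
        data[i] = data[i - 1];
    }
    data[pos] = val;  // write the new element into the gap
}
```

For example, a buffer holding {1.0, 2.0, 3.0} plus one spare slot, called with pos = 0 and val = 9.0, ends up as {9.0, 1.0, 2.0, 3.0}.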

Louis

> On Jul 25, 2018, at 12:12, David Blaikie <dblaikie at gmail.com> wrote:
> 
> On Wed, Jul 25, 2018 at 8:57 AM <aw1621107 at gmail.com> wrote:
> The number of lines of assembly isn't really a good proxy for the performance of some code, mostly due to inlining. One piece of code may be many more lines of assembly because it's not calling large/complicated external functions. Or, even taken as a whole (including those external functions), it might still be more efficient to have longer code because it's more specialized (i.e., two calls to one generic function were inlined into two places, and each copy was simplified/optimized a bit for its situation).
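A small illustration of that last point, assuming the compiler chooses to inline; `scale`, `double_all`, and `flip_all` are hypothetical names. Out of line, `scale` must test `negate` for every element. Inlined into each caller with constant arguments, each copy can fold the flag and the factor away, so total code size may grow even as each individual path gets cheaper.

```cpp
#include <cassert>
#include <cstddef>

// Generic helper whose behavior depends on runtime arguments.
inline void scale(double* p, std::size_t n, double factor, bool negate) {
    assert(p != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        double v = p[i] * factor;
        p[i] = negate ? -v : v;  // branch (or select) on the flag per element
    }
}

// Two call sites with constant arguments. If scale() is inlined into each,
// the optimizer can specialize both loops separately: one that only
// multiplies by 2.0, and one that only negates.
void double_all(double* p, std::size_t n) { scale(p, n, 2.0, false); }
void flip_all(double* p, std::size_t n) { scale(p, n, 1.0, true); }
```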
> 
> Yeah, you’re right; that was also pointed out to me by someone on one of the IRC channels I lurk on. A bit more investigation on Godbolt revealed that the difference could be due to unrolling. It was certainly a surprise to me, as I expected libstdc++ and libc++ to have relatively similar implementations that would produce relatively similar output. I guess something about libc++’s implementation is a bit easier for Clang to analyze? In any case, let that be a lesson to me to be more careful about drawing conclusions from code size.
> 
> That said, libc++ does have a bunch of forced inlining that's not for performance reasons but for linkage reasons (to ensure that certain kinds of changes/updates to libc++ don't break existing compiled code/libraries). It's a tradeoff that not every user of libc++ needs to make, and there are steps being taken to make that tradeoff more configurable/optional, so far as I understand it.
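A minimal sketch of the kind of annotation being described, assuming GCC/Clang attribute syntax. `MY_INLINE_VISIBILITY` and `add_one` are hypothetical names; libc++ of this period used a macro combining hidden visibility with `__always_inline__` (the behavior D49240 proposes to change), but the real `_LIBCPP_INLINE_VISIBILITY` definition varies across versions and compilers, so this only approximates it.

```cpp
#include <cassert>

// Hypothetical macro approximating libc++'s forced-inline annotation:
// hidden visibility keeps the symbol out of the shared library's exported
// interface, and always_inline asks the compiler to inline every call.
#if defined(__GNUC__) || defined(__clang__)
#  define MY_INLINE_VISIBILITY \
      __attribute__((__visibility__("hidden"), __always_inline__))
#else
#  define MY_INLINE_VISIBILITY
#endif

// Each caller gets its own inlined copy and the function is not exported,
// so already-compiled binaries never link against this body; a later
// release can change it without breaking them.
MY_INLINE_VISIBILITY inline int add_one(int x) { return x + 1; }
```

The exact definition used by a given libc++ release is in that release's `__config` header.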
> 
> Huh, that’s interesting. That isn’t what is happening here, though, right? I didn’t see anything like that around the declarations/implementations of emplace() and friends.
> 
> Yeah, it probably doesn't come up for the fully dependent templates in the standard library, but some implementation details used there (allocators, etc.) might have some of these features. There's a lot of stuff in there, so it's hard for me to check at a glance (though you can see it around other functions in the form of _LIBCPP_INLINE_VISIBILITY).
> 
> - Dave
> 
> On Mon, Jul 23, 2018 at 4:43 PM via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 
> Hello all,
> 
> Just a quick question to make sure I’m not missing something.
> 
> This program:
> 
> #include <vector>
> 
> void f(std::vector<double>& vec, double val) {
>     vec.emplace(std::cbegin(vec), val);
> }
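The same function made self-contained, plus a hypothetical helper `f_needed_realloc` (not part of either standard library) that makes the two paths inside emplace() visible: when size() < capacity() the existing elements are shifted in place; otherwise new storage must be allocated, which is the path GCC’s output routes through _M_realloc_insert. The generated code for emplace() has to cover both paths on either library.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// The function from the message above, using `vec` consistently.
void f(std::vector<double>& vec, double val) {
    vec.emplace(std::cbegin(vec), val);
}

// Hypothetical helper: calls f() and reports whether that call needed to
// reallocate. emplace() shifts elements in place when size() < capacity()
// and allocates new storage when the vector is full.
bool f_needed_realloc(std::vector<double>& vec, double val) {
    const bool full = vec.size() == vec.capacity();
    f(vec, val);
    assert(!vec.empty());  // f() always adds exactly one element
    return full;
}
```

With `v.reserve(4); v.push_back(1.0);`, a call `f_needed_realloc(v, 0.5)` returns false and leaves v as {0.5, 1.0}.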
> 
> When compiled with trunk Clang on Godbolt with -O3 -march=haswell -std=c++17 -stdlib=libstdc++, 132 lines of assembly are produced. If -stdlib=libc++ is used, though, 638 (!) lines of assembly are produced. A few of those lines are due to f() itself, but it appears the vast majority are due to the implementation of emplace(). As a partial comparison, GCC trunk produced 136 lines of assembly, and seems to have partially inlined emplace(), leaving 94 lines of assembly for _M_realloc_insert.
> 
> I can roughly reproduce this on Debian sid with libc++-dev 6.0.1-1 and clang++-7 (--version doesn’t appear to give a revision number, unfortunately). Using libstdc++ results in 176 lines of assembly, and libc++ results in 803 lines of assembly (counted by wc -l).
> 
> Is this something to be worried about? I’m still rather new to performance-related work, so I’m working from a relatively simplistic view of what could affect performance. A roughly 4-5x difference in what could be a commonly used function seems rather unusual to me, though.
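One way to answer that empirically is to time the operation rather than count lines. Below is a hypothetical harness (`time_front_emplace` is an illustrative name) built on std::chrono::steady_clock; building the same file once with -stdlib=libstdc++ and once with -stdlib=libc++ and comparing timings would say more than wc -l. A single timing is noisy, so a serious comparison should use many repetitions and a dedicated benchmarking library.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical micro-benchmark: times `reps` insertions at the front of
// `vec` (the same emplace(cbegin, val) call as f() above) and returns the
// elapsed wall-clock time. Each front insertion is O(size), so keep `reps`
// moderate.
std::chrono::nanoseconds time_front_emplace(std::size_t reps,
                                            std::vector<double>& vec) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < reps; ++i) {
        vec.emplace(std::cbegin(vec), static_cast<double>(i));
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```

Keeping the vector alive and inspecting it afterwards (for example, checking its size) also helps stop the optimizer from discarding the work being measured.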
> 
> Thanks,
> Alex
> 
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev