[Lldb-commits] [PATCH] D31451: New C++ function name parsing logic

Thu Mar 30 18:42:14 PDT 2017

eugene marked 4 inline comments as done.
eugene added a comment.

> I think we should do some performance measurements to see whether this needs more optimising or it's fine as is.
> 
> I propose the following benchmark:
> 
>   bin/lldb bin/clang
> 
> make sure clang is statically linked
>  breakpoint set --name non_exisiting_function
> 
> Setting the breakpoint on a nonexistent function avoids us timing the breakpoint setting machinery, while still getting every symbol in the executable parsed.

I did some micro-benchmarking and on average new parser is ~3 time slower than the old one. (new parser - ~200k string/s, old parser - ~700k string/s) 
clang::Lexer appears to be the slowest part of it.

On the clang breakpoint benchmark you proposed, it is hard to notice much of a difference. clang binary has about 500k functions.
With new parser it takes about 11s to try to set a breakpoint, on the old one it's about 10s. (release version of  LLDB, debug static version of clang)

I decided to use a hybrid approach, when we use old parsing code for simple cases and call new parser only when it fails.
80% of clang functions are simple enough that we don't really need the new parser, so it helped to bring clang breakpoint test back to 10s.
I think it's reasonable to assume that similar distribution is true for most programs, and most of their functions can be parsed with the old code.

I don't think we can really improve performance of a new parser without giving up on clang::Lexer, and I'm reluctant to do it. 
So I propose to keep hybrid approach and call a new parser only for complicated cases.

================
Comment at: source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp:94
   if (!m_parsed && m_full) {
-    //        ConstString mangled;
-    //        m_full.GetMangledCounterpart(mangled);
-    //        printf ("\n   parsing = '%s'\n", m_full.GetCString());
-    //        if (mangled)
-    //            printf ("   mangled = '%s'\n", mangled.GetCString());
-    m_parse_error = false;
-    m_parsed = true;
-    llvm::StringRef full(m_full.GetCString());
-
-    size_t arg_start, arg_end;
-    llvm::StringRef parens("()", 2);
-    if (ReverseFindMatchingChars(full, parens, arg_start, arg_end)) {
-      m_arguments = full.substr(arg_start, arg_end - arg_start + 1);
-      if (arg_end + 1 < full.size())
-        m_qualifiers = full.substr(arg_end + 1);
-      if (arg_start > 0) {
-        size_t basename_end = arg_start;
-        size_t context_start = 0;
-        size_t context_end = llvm::StringRef::npos;
-        if (basename_end > 0 && full[basename_end - 1] == '>') {
-          // TODO: handle template junk...
-          // Templated function
-          size_t template_start, template_end;
-          llvm::StringRef lt_gt("<>", 2);
-          if (ReverseFindMatchingChars(full, lt_gt, template_start,
-                                       template_end, basename_end)) {
-            // Check for templated functions that include return type like:
-            // 'void foo<Int>()'
-            context_start = full.rfind(' ', template_start);
-            if (context_start == llvm::StringRef::npos)
-              context_start = 0;
-            else
-              ++context_start;
-
-            context_end = full.rfind(':', template_start);
-            if (context_end == llvm::StringRef::npos ||
-                context_end < context_start)
-              context_end = context_start;
-          } else {
-            context_end = full.rfind(':', basename_end);
-          }
-        } else if (context_end == llvm::StringRef::npos) {
-          context_end = full.rfind(':', basename_end);
-        }
-
-        if (context_end == llvm::StringRef::npos)
-          m_basename = full.substr(0, basename_end);
-        else {
-          if (context_start < context_end)
-            m_context =
-                full.substr(context_start, context_end - 1 - context_start);
-          const size_t basename_begin = context_end + 1;
-          m_basename =
-              full.substr(basename_begin, basename_end - basename_begin);
-        }
-        m_type = eTypeUnknownMethod;
-      } else {
-        m_parse_error = true;
-        return;
-      }
-
-      if (!IsValidBasename(m_basename)) {
-        // The C++ basename doesn't match our regular expressions so this can't
-        // be a valid C++ method, clear everything out and indicate an error
-        m_context = llvm::StringRef();
-        m_basename = llvm::StringRef();
-        m_arguments = llvm::StringRef();
-        m_qualifiers = llvm::StringRef();
-        m_parse_error = true;
-      }
+    CPlusPlusNameParser parser(m_full.GetStringRef());
+    auto function = parser.ParseAsFunctionDefinition();
----------------
labath wrote:
> How about the following api:
> ```
> if (auto function = CPlusPlusNameParser::ParseAsFunctionDefinition(m_full.GetStringRef())) {
>   ...
> ```
If you don't mind I'll leave it as it is. 

I understand that it's very tempting to have two simple functions ParseAsFunctionDefinition and ParseAsFullName instead of a class, but I can imagine calling second one if first one fails, and in this case it'll be good that parser doesn't need to tokenize string all over again.

================
Comment at: source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp:160
+  auto full_name = parser.ParseAsFullName();
+  if (full_name.hasValue()) {
+    identifier = full_name.getValue().m_basename;
----------------
labath wrote:
> Same here
see above

================
Comment at: source/Plugins/Language/CPlusPlus/CPlusPlusNameParser.cpp:62
+bool CPlusPlusNameParser::HasMoreTokens() {
+  return m_next_token_index < static_cast<int>(m_tokens.size());
+}
----------------
labath wrote:
> Wouldn't it be better to change the type of m_next_token_index to size_t?
Unsigned types are dangerous. I prefer to avoid them wherever I can.

As far as I remember many members of C++ standard commit publicly regretted the fact that .size() returns an unsigned type.

================
Comment at: source/Plugins/Language/CPlusPlus/CPlusPlusNameParser.cpp:188
+  // as arithmetic or shift operators not as template brackets.
+  // Examples: std::enable_if<(10u)<(64), bool>
+  //           f<A<operator<(X,Y)::Subclass>>
----------------
labath wrote:
> Is this really the case for the types we are interested in? I would have hoped that this would get simplified to `std::enable_if<true, bool> before it reaches us?
Yes. I was very surprised as well. But apparently compiler sometimes is being lazy and doesn't simplify everything :(

Running:
~$ nm -C --defined-only clang  | grep std::enable_if

Gives a few symbols like this:
std::enable_if<(10u)<(64), bool>::type llvm::isUInt<10u>(unsigned long)

================
Comment at: source/Plugins/Language/CPlusPlus/CPlusPlusNameParser.cpp:572
+  clang::SourceLocation end_location = last_token.getLocation();
+  const char *begin_ptr = m_text.data() + begin_location.getRawEncoding();
+  const char *end_ptr =
----------------
labath wrote:
> Could this be written as: 
> `m_text.take_front(end_pos).drop_front(start_pos)`?
Thanks! I still have a lot to learn about llvm classes. 

================
Comment at: source/Plugins/Language/CPlusPlus/CPlusPlusNameParser.h:94
+  // when destructed unless it's manually removed with Remove().
+  class Bookmark {
+  public:
----------------
labath wrote:
> Please make the type move-only. Otherwise you will have a fun time debugging accidental copies. (You already have one, although it is benign now)
Good catch! Thanks!

================
Comment at: unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp:106
+    EXPECT_EQ(test.qualifiers, method.GetQualifiers())
+        << method.GetQualifiers();
     EXPECT_EQ(test.scope_qualified_name, method.GetScopeQualifiedName());
----------------
labath wrote:
> Would defining operator<< for std::ostream and StringRef ( local tomthis test) enable you to get rid of these?
> I've ran into this before but was never annoyed enough to actually do it.. :/
I tried to to it, but it didn't work for me after a few tries. 
I decided to just convert it to std::string,  it solves the same problem.

https://reviews.llvm.org/D31451