[llvm] r340154 - [llvm] Make YAML serialization up to 2.5 times faster

Mon Aug 20 00:00:36 PDT 2018

Author: omtcyfz
Date: Mon Aug 20 00:00:36 2018
New Revision: 340154

URL: http://llvm.org/viewvc/llvm-project?rev=340154&view=rev
Log:
[llvm] Make YAML serialization up to 2.5 times faster

This patch significantly improves performance of the YAML serializer by
optimizing `YAML::isNumeric` function. This function is called on the
most strings and is highly inefficient for two reasons:

* It uses `Regex`, which is parsed and compiled each time this
  function is called
* It uses multiple passes which are not necessary

This patch introduces stateful ad hoc YAML number parser which does not
rely on `Regex`. It also fixes YAML number format inconsistency: current
implementation supports C-stile octal number format (`01234567`) which
was present in YAML 1.0 specialization (http://yaml.org/spec/1.0/),
[Section 2.4. Tags, Example 2.19] but was deprecated and is no longer
present in latest YAML 1.2 specification
(http://yaml.org/spec/1.2/spec.html), see [Section 10.3.2. Tag
Resolution]. Since the rest of the rest of the implementation does not
support other deprecated YAML 1.0 numeric features such as sexagecimal
numbers, commas as delimiters it is treated as inconsistency and not
longer supported. This patch also adds unit tests to ensure the validity
of proposed implementation.

This performance bottleneck was identified while profiling Clangd's
global-symbol-builder tool with my colleague @ilya-biryukov. The
substantial part of the runtime was spent during a single-thread Reduce
phase, which concludes with YAML serialization of collected symbol
collection. Regex matching was accountable for approximately 45% of the
whole runtime (which involves sharded Map phase), now it is reduced to
18% (which is spent in `clang::clangd::CanonicalIncludes` and can be
also optimized because all used regexes are in fact either suffix
matches or exact matches).

`llvm-yaml-numeric-parser-fuzzer` was used to ensure the validity of the
proposed regex replacement. Fuzzing for ~60 hours using 10 threads did
not expose any bugs.

Benchmarking `global-symbol-builder` (using `hyperfine --warmup 2
--min-runs 5 'command 1' 'command 2'`) tool by processing a reasonable
amount of code (26 source files matched by
`clang-tools-extra/clangd/*.cpp` with all transitive includes) confirmed
our understanding of the performance bottleneck nature as it speeds up
the command by the factor of 1.6x:

| Command | Mean [s] | Min…Max [s] |
| this patch (D50839) | 84.7 ± 0.6 | 83.3…84.7 |
| master (rL339849) | 133.1 ± 0.8 | 132.4…134.6 |

Using smaller samples (e.g. by collecting symbols from
`clang-tools-extra/clangd/AST.cpp` only) yields even better performance
improvement, which is expected because Map phase takes less time
compared to Reduce and is 2.05x faster and therefore would significantly
improve the performance of standalone YAML serializations.

| Command | Mean [ms] | Min…Max [ms] |
| this patch (D50839) | 3702.2 ± 48.7 | 3635.1…3752.3 |
| master (rL339849) | 7607.6 ± 109.5 | 7533.3…7796.4 |

Reviewed by: zturner, ilya-biryukov

Differential revision: https://reviews.llvm.org/D50839

Added:
    llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/
    llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/CMakeLists.txt
    llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/DummyYAMLNumericParserFuzzer.cpp
    llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/yaml-numeric-parser-fuzzer.cpp
Modified:
    llvm/trunk/include/llvm/Support/YAMLTraits.h
    llvm/trunk/unittests/Support/YAMLIOTest.cpp

Modified: llvm/trunk/include/llvm/Support/YAMLTraits.h
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Support/YAMLTraits.h?rev=340154&r1=340153&r2=340154&view=diff
==============================================================================

--- llvm/trunk/include/llvm/Support/YAMLTraits.h (original)
+++ llvm/trunk/include/llvm/Support/YAMLTraits.h Mon Aug 20 00:00:36 2018
@@ -27,6 +27,7 @@
 #include <cctype>
 #include <cstddef>
 #include <cstdint>
+#include <iterator>
 #include <map>
 #include <memory>
 #include <new>
@@ -449,46 +450,100 @@ public:
   static bool const value = (sizeof(test<DocumentListTraits<T>>(nullptr))==1);
 };
 
-inline bool isNumber(StringRef S) {
-  static const char OctalChars[] = "01234567";
-  if (S.startswith("0") &&
-      S.drop_front().find_first_not_of(OctalChars) == StringRef::npos)
-    return true;
+inline bool isNumeric(StringRef S) {
+  const static auto skipDigits = [](StringRef Input) {
+    return Input.drop_front(
+        std::min(Input.find_first_not_of("0123456789"), Input.size()));
+  };
+
+  // Make S.front() and S.drop_front().front() (if S.front() is [+-]) calls
+  // safe.
+  if (S.empty() || S.equals("+") || S.equals("-"))
+    return false;
 
-  if (S.startswith("0o") &&
-      S.drop_front(2).find_first_not_of(OctalChars) == StringRef::npos)
+  if (S.equals(".nan") || S.equals(".NaN") || S.equals(".NAN"))
     return true;
 
-  static const char HexChars[] = "0123456789abcdefABCDEF";
-  if (S.startswith("0x") &&
-      S.drop_front(2).find_first_not_of(HexChars) == StringRef::npos)
-    return true;
+  // Infinity and decimal numbers can be prefixed with sign.
+  StringRef Tail = (S.front() == '-' || S.front() == '+') ? S.drop_front() : S;
 
-  static const char DecChars[] = "0123456789";
-  if (S.find_first_not_of(DecChars) == StringRef::npos)
+  // Check for infinity first, because checking for hex and oct numbers is more
+  // expensive.
+  if (Tail.equals(".inf") || Tail.equals(".Inf") || Tail.equals(".INF"))
     return true;
 
-  if (S.equals(".inf") || S.equals(".Inf") || S.equals(".INF"))
-    return true;
+  // Section 10.3.2 Tag Resolution
+  // YAML 1.2 Specification prohibits Base 8 and Base 16 numbers prefixed with
+  // [-+], so S should be used instead of Tail.
+  if (S.startswith("0o"))
+    return S.size() > 2 &&
+           S.drop_front(2).find_first_not_of("01234567") == StringRef::npos;
+
+  if (S.startswith("0x"))
+    return S.size() > 2 && S.drop_front(2).find_first_not_of(
+                               "0123456789abcdefABCDEF") == StringRef::npos;
+
+  // Parse float: [-+]? (\. [0-9]+ | [0-9]+ (\. [0-9]* )?) ([eE] [-+]? [0-9]+)?
+  S = Tail;
+
+  // Handle cases when the number starts with '.' and hence needs at least one
+  // digit after dot (as opposed by number which has digits before the dot), but
+  // doesn't have one.
+  if (S.startswith(".") &&
+      (S.equals(".") ||
+       (S.size() > 1 && std::strchr("0123456789", S[1]) == nullptr)))
+    return false;
+
+  if (S.startswith("E") || S.startswith("e"))
+    return false;
+
+  enum ParseState {
+    Default,
+    FoundDot,
+    FoundExponent,
+  };
+  ParseState State = Default;
 
-  Regex FloatMatcher("^(\\.[0-9]+|[0-9]+(\\.[0-9]*)?)([eE][-+]?[0-9]+)?$");
-  if (FloatMatcher.match(S))
-    return true;
+  S = skipDigits(S);
 
-  return false;
-}
-
-inline bool isNumeric(StringRef S) {
-  if ((S.front() == '-' || S.front() == '+') && isNumber(S.drop_front()))
-    return true;
-
-  if (isNumber(S))
+  // Accept decimal integer.
+  if (S.empty())
     return true;
 
-  if (S.equals(".nan") || S.equals(".NaN") || S.equals(".NAN"))
-    return true;
+  if (S.front() == '.') {
+    State = FoundDot;
+    S = S.drop_front();
+  } else if (S.front() == 'e' || S.front() == 'E') {
+    State = FoundExponent;
+    S = S.drop_front();
+  } else {
+    return false;
+  }
+
+  if (State == FoundDot) {
+    S = skipDigits(S);
+    if (S.empty())
+      return true;
+
+    if (S.front() == 'e' || S.front() == 'E') {
+      State = FoundExponent;
+      S = S.drop_front();
+    } else {
+      return false;
+    }
+  }
+
+  assert(FoundExponent && "Should have found exponent at this point.");
+  if (S.empty())
+    return false;
+
+  if (S.front() == '+' || S.front() == '-') {
+    S = S.drop_front();
+    if (S.empty())
+      return false;
+  }
 
-  return false;
+  return skipDigits(S).empty();
 }
 
 inline bool isNull(StringRef S) {

Added: llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/CMakeLists.txt
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/CMakeLists.txt?rev=340154&view=auto
==============================================================================
--- llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/CMakeLists.txt (added)
+++ llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/CMakeLists.txt Mon Aug 20 00:00:36 2018
@@ -0,0 +1,9 @@
+set(LLVM_LINK_COMPONENTS
+  Support
+  FuzzMutate
+)
+
+add_llvm_fuzzer(llvm-yaml-numeric-parser-fuzzer
+  yaml-numeric-parser-fuzzer.cpp
+  DUMMY_MAIN DummyYAMLNumericParserFuzzer.cpp
+  )

Added: llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/DummyYAMLNumericParserFuzzer.cpp
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/DummyYAMLNumericParserFuzzer.cpp?rev=340154&view=auto
==============================================================================
--- llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/DummyYAMLNumericParserFuzzer.cpp (added)
+++ llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/DummyYAMLNumericParserFuzzer.cpp Mon Aug 20 00:00:36 2018
@@ -0,0 +1,19 @@
+//===--- DummyYAMLNumericParserFuzzer.cpp ---------------------------------===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// Implementation of main so we can build and test without linking libFuzzer.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/FuzzMutate/FuzzerCLI.h"
+
+extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size);
+int main(int argc, char *argv[]) {
+  return llvm::runFuzzerOnInputs(argc, argv, LLVMFuzzerTestOneInput);
+}

Added: llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/yaml-numeric-parser-fuzzer.cpp
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/yaml-numeric-parser-fuzzer.cpp?rev=340154&view=auto
==============================================================================
--- llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/yaml-numeric-parser-fuzzer.cpp (added)
+++ llvm/trunk/tools/llvm-yaml-numeric-parser-fuzzer/yaml-numeric-parser-fuzzer.cpp Mon Aug 20 00:00:36 2018
@@ -0,0 +1,47 @@
+//===--- special-case-list-fuzzer.cpp - Fuzzer for special case lists -----===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/Regex.h"
+#include "llvm/Support/YAMLTraits.h"
+#include <cassert>
+#include <string>
+
+llvm::Regex Infinity("^[-+]?(\\.inf|\\.Inf|\\.INF)$");
+llvm::Regex Base8("^0o[0-7]+$");
+llvm::Regex Base16("^0x[0-9a-fA-F]+$");
+llvm::Regex Float("^[-+]?(\\.[0-9]+|[0-9]+(\\.[0-9]*)?)([eE][-+]?[0-9]+)?$");
+
+inline bool isNumericRegex(llvm::StringRef S) {
+
+  if (S.equals(".nan") || S.equals(".NaN") || S.equals(".NAN"))
+    return true;
+
+  if (Infinity.match(S))
+    return true;
+
+  if (Base8.match(S))
+    return true;
+
+  if (Base16.match(S))
+    return true;
+
+  if (Float.match(S))
+    return true;
+
+  return false;
+}
+
+extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
+  std::string Input(reinterpret_cast<const char *>(Data), Size);
+  Input.erase(std::remove(Input.begin(), Input.end(), 0), Input.end());
+  if (!Input.empty() && llvm::yaml::isNumeric(Input) != isNumericRegex(Input))
+    __builtin_trap();
+  return 0;
+}

Modified: llvm/trunk/unittests/Support/YAMLIOTest.cpp
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/unittests/Support/YAMLIOTest.cpp?rev=340154&r1=340153&r2=340154&view=diff
==============================================================================
--- llvm/trunk/unittests/Support/YAMLIOTest.cpp (original)
+++ llvm/trunk/unittests/Support/YAMLIOTest.cpp Mon Aug 20 00:00:36 2018
@@ -16,16 +16,17 @@
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
 
+using llvm::yaml::Hex16;
+using llvm::yaml::Hex32;
+using llvm::yaml::Hex64;
+using llvm::yaml::Hex8;
 using llvm::yaml::Input;
-using llvm::yaml::Output;
 using llvm::yaml::IO;
-using llvm::yaml::MappingTraits;
+using llvm::yaml::isNumeric;
 using llvm::yaml::MappingNormalization;
+using llvm::yaml::MappingTraits;
+using llvm::yaml::Output;
 using llvm::yaml::ScalarTraits;
-using llvm::yaml::Hex8;
-using llvm::yaml::Hex16;
-using llvm::yaml::Hex32;
-using llvm::yaml::Hex64;
 using ::testing::StartsWith;
 
 
@@ -2569,3 +2570,73 @@ TEST(YAMLIO, TestEscaped) {
     TestEscaped((char const *)foobar, "\"foo\\u200Bbar\"");
   }
 }
+
+TEST(YAMLIO, Numeric) {
+  EXPECT_TRUE(isNumeric(".inf"));
+  EXPECT_TRUE(isNumeric(".INF"));
+  EXPECT_TRUE(isNumeric(".Inf"));
+  EXPECT_TRUE(isNumeric("-.inf"));
+  EXPECT_TRUE(isNumeric("+.inf"));
+
+  EXPECT_TRUE(isNumeric(".nan"));
+  EXPECT_TRUE(isNumeric(".NaN"));
+  EXPECT_TRUE(isNumeric(".NAN"));
+
+  EXPECT_TRUE(isNumeric("0"));
+  EXPECT_TRUE(isNumeric("0."));
+  EXPECT_TRUE(isNumeric("0.0"));
+  EXPECT_TRUE(isNumeric("-0.0"));
+  EXPECT_TRUE(isNumeric("+0.0"));
+
+  EXPECT_TRUE(isNumeric("12345"));
+  EXPECT_TRUE(isNumeric("012345"));
+  EXPECT_TRUE(isNumeric("+12.0"));
+  EXPECT_TRUE(isNumeric(".5"));
+  EXPECT_TRUE(isNumeric("+.5"));
+  EXPECT_TRUE(isNumeric("-1.0"));
+
+  EXPECT_TRUE(isNumeric("2.3e4"));
+  EXPECT_TRUE(isNumeric("-2E+05"));
+  EXPECT_TRUE(isNumeric("+12e03"));
+  EXPECT_TRUE(isNumeric("6.8523015e+5"));
+
+  EXPECT_TRUE(isNumeric("1.e+1"));
+  EXPECT_TRUE(isNumeric(".0e+1"));
+
+  EXPECT_TRUE(isNumeric("0x2aF3"));
+  EXPECT_TRUE(isNumeric("0o01234567"));
+
+  EXPECT_FALSE(isNumeric("not a number"));
+  EXPECT_FALSE(isNumeric("."));
+  EXPECT_FALSE(isNumeric(".e+1"));
+  EXPECT_FALSE(isNumeric(".1e"));
+  EXPECT_FALSE(isNumeric(".1e+"));
+  EXPECT_FALSE(isNumeric(".1e++1"));
+
+  EXPECT_FALSE(isNumeric("ABCD"));
+  EXPECT_FALSE(isNumeric("+0x2AF3"));
+  EXPECT_FALSE(isNumeric("-0x2AF3"));
+  EXPECT_FALSE(isNumeric("0x2AF3Z"));
+  EXPECT_FALSE(isNumeric("0o012345678"));
+  EXPECT_FALSE(isNumeric("0xZ"));
+  EXPECT_FALSE(isNumeric("-0o012345678"));
+  EXPECT_FALSE(isNumeric("000003A8229434B839616A25C16B0291F77A438B"));
+
+  EXPECT_FALSE(isNumeric(""));
+  EXPECT_FALSE(isNumeric("."));
+  EXPECT_FALSE(isNumeric(".e+1"));
+  EXPECT_FALSE(isNumeric(".e+"));
+  EXPECT_FALSE(isNumeric(".e"));
+  EXPECT_FALSE(isNumeric("e1"));
+
+  // Deprecated formats: as for YAML 1.2 specification, the following are not
+  // valid numbers anymore:
+  //
+  // * Sexagecimal numbers
+  // * Decimal numbers with comma s the delimiter
+  // * "inf", "nan" without '.' prefix
+  EXPECT_FALSE(isNumeric("3:25:45"));
+  EXPECT_FALSE(isNumeric("+12,345"));
+  EXPECT_FALSE(isNumeric("-inf"));
+  EXPECT_FALSE(isNumeric("1,230.15"));
+}