[PATCH] D16529: [clang-tidy] Add modernize-raw-string-literal check

Wed Jan 27 17:44:19 PST 2016

alexfh added inline comments.

================
Comment at: clang-tidy/modernize/RawStringLiteralCheck.cpp:27
@@ +26,3 @@
+  // we only transform ASCII string literals
+  if (!Literal->isAscii())
+    return false;
----------------
LegalizeAdulthood wrote:
> alexfh wrote:
> > Why can't the check convert non-ascii string literals to corresponding raw string literals?
> Just doing one thing at a time and not trying to create a "kitchen sink" check on the first pass.
It should be rather straightforward to implement handling of the other types of string literals, but if you still want to postpone this, then a TODO would be more useful than the comment that just states that "we only transform ASCII string literals" (without explaining why).

================
Comment at: clang-tidy/modernize/RawStringLiteralCheck.cpp:57
@@ +56,3 @@
+
+  // already a raw string literal if R comes before "
+  if (Text.find_first_of("R") < Text.find_first_of(R"(")"))
----------------
aaron.ballman wrote:
> LegalizeAdulthood wrote:
> > alexfh wrote:
> > > nit: Capitalization, punctuation. Same for other comments in the file.
> > I don't understand what exactly you are asking me to change.  There is something wrong with the comment?
> ` // already a raw string literal if R comes before "`
> 
> Should be:
> 
> ` // Already a raw string literal if R comes before ".`
Sorry for the brevity. Aaron explained what I meant and here's the motivation (http://llvm.org/docs/CodingStandards.html#commenting):

> When writing comments, write them as English prose, which means they should use proper capitalization, punctuation, etc.

================
Comment at: clang-tidy/modernize/RawStringLiteralCheck.cpp:77
@@ +76,3 @@
+  std::string Delimiter;
+  for (int Counter = 0; containsDelimiter(Bytes, Delimiter); ++Counter) {
+    if (Counter == 0) {
----------------
LegalizeAdulthood wrote:
> alexfh wrote:
> > Why don't you try an empty delimiter? It will cover a majority of cases.
> > 
> > Also, even though the function is used only when generating fix-it hints, it still should be more effective than it is now. Most of the heap allocations and copies caused by the usage of std::ostringstream, std::string and std::string::operator+ can be avoided, e.g. like this:
> > 
> >   static const StringRef Delimiters[2][] =
> >     {{"", "R\"(", ")\""}, {"lit", "R\"lit(", ")lit\""}, {"str", "R\"str(", ")str\""},
> >     /* add more different ones to make it unlikely to meet all of these in a single literal in the wild */};
> >   for (const auto &Delim : Delimiters) {
> >     if (Bytes.find(Delim[2]) != StringRef::npos)
> >       return (Delim[1] + Bytes + Delim[2]).str();
> >   }
> >   // Give up gracefully on literals having all our delimiters.
> The very first thing tried is an empty delimiter.
> 
> I coded a solution to the problem that always works.
> 
> I find it less understandable to try a bunch of random delimiters and then fail if they are all present in the string.  Then the whole algorithm becomes more complicated because even though I **could** construct a fixit, I'm failing stupidly because all my "favorite" delimiters were used in your string.
> 
> Sometimes I feel these reviews over-obsess with allocations and efficiency instead of implementing a simple general algorithm that works reliably.
> 
> We don't even have any measurements to show that this performance consideration is dominant or even noticeable, but I'm being asked to take a perfectly correct algorithm and cram it into a StringRef corset.
> The very first thing tried is an empty delimiter.

Ah, sorry, this wasn't clear to me at first.

> I coded a solution to the problem that always works.

We first need to agree on whether we're trying to create a solution that works for hypothetical worst-case that has no chances to appear outside of the tests for your check, or we're trying to target real code.

The solution I suggest is more than enough for real code. For example, I've just searched for the `\)[a-z]\\\"` regex (a sequence that would prevent a single-letter delimiter) in a huuuuuge codebase and there were only ~20 hits (roughly half of them in tests for handling raw string literals in clang-format and other C++ tools). Needless to say that `)lit\"` wasn't found, nor `)str\"`.

> I find it less understandable to try a bunch of random delimiters and then fail if they are all present in the string. Then the whole algorithm becomes more complicated because even though I could construct a fixit, I'm failing stupidly because all my "favorite" delimiters were used in your string.

Good news is that this code will "fail" (and it should fail gracefully, e.g. just skip the fixit) extremely infrequently (see above). And by having a list of "favorite" delimiters we can make the resulting code look better in the (extremely rare) case when the version you suggest would have to use `)lit0"`, for example.

> Sometimes I feel these reviews over-obsess with allocations and efficiency instead of implementing a simple general algorithm that works reliably.

Unfortunately, in the code that deals with C++ analysis it's very important to be thoughtful about memory allocations and other possible sources of performance issues. It's not a hypothetical problem: the set of checks we normally use already takes much longer than just parsing the code, which sometimes results in minutes for a single translation unit. 

> We don't even have any measurements to show that this performance consideration is dominant or even noticeable, but I'm being asked to take a perfectly correct algorithm and cram it into a StringRef corset.

There are currently about 100 checks in clang-tidy and in order to be dominant time consumer a single check needs to be about 100 times slower than the average other check. Don't aim for that. Instead, please try to understand that this "obsession with allocations and efficiency" comes from experience of engineers who spent lots of time analyzing performance issues in various parts of LLVM/Clang that frequently enough boil down to excessive allocations and/or copying of strings and other data structures. This experience resulted in a set of recommendations and an extensive library of data structures that help to avoid common performance problems, when used appropriately. See http://llvm.org/docs/ProgrammersManual.html#picking-the-right-data-structure-for-a-task for some relevant documentation.

================
Comment at: clang-tidy/modernize/RawStringLiteralCheck.cpp:107
@@ +106,3 @@
+  if (const auto *Literal = Result.Nodes.getNodeAs<StringLiteral>("lit"))
+    preferRawStringLiterals(Result, Literal);
+}
----------------
LegalizeAdulthood wrote:
> alexfh wrote:
> > I don't see a reason to have a separate function, I'd just inline it here.
> There is no reason to inline it.
I apologize for the form of this comment. I wanted to ask whether you have any reasons to have this as a separate function.

Here are a few reasons to not have a separate `preferRawStringLiterals` function:
  * this function has a single caller that's small enough to fit the whole body of this function;
  * the name of this function doesn't make the responsibilities of this function clear to the reader;
  * the code will become shorter (and IMO, easier to read), if you just remove the function and put its body where it is now called.

================
Comment at: clang-tidy/modernize/RawStringLiteralCheck.cpp:117
@@ +116,3 @@
+        Result.Context->getLangOpts());
+    StringRef Replacement = asRawStringLiteral(Literal);
+    diag(Literal->getLocStart(),
----------------
LegalizeAdulthood wrote:
> alexfh wrote:
> > `Replacement` will point to deleted memory right after this line.
> For the record: I **hate** StringRef.  I spend about 80% of my time on clang-tidy submissions dicking around with StringRef based on all the review comments.  It is a very hard class to use properly.  If a usage like this results in referencing deleted memory, it shouldn't even compile.
> For the record: I hate StringRef. I spend about 80% of my time on clang-tidy submissions dicking around with StringRef based on all the review comments.

And there's a bunch of good reasons to use StringRef instead of `std::string` (and `const char*`) where appropriate. See my comment above regarding optimizations.

> It is a very hard class to use properly. 

Yes, it's like pointers. They are also hard to use to a certain degree. However, we don't have a better tool to work with read-only fragments of strings, pass strings around without copying them unintentionally, etc.

> If a usage like this results in referencing deleted memory, it shouldn't even compile.

If it's possible to achieve this without breaking valid use cases, I believe, many LLVM/Clang developers will say thank you to the person proposing and implementing this solution.

But until we have this solution, we still need to be careful about creating `StringRef` from temporary `std::string`. One tool that can help finding this kind of a bug is ASan, it can be enabled using CMake LLVM_USE_SANITIZER option (see http://llvm.org/docs/CMake.html for relevant docs).

http://reviews.llvm.org/D16529