[PATCH] D16529: [clang-tidy] Add modernize-raw-string-literal check

Sun Feb 7 16:51:28 PST 2016

LegalizeAdulthood marked 5 inline comments as done.

================
Comment at: clang-tidy/modernize/RawStringLiteralCheck.cpp:96
@@ +95,3 @@
+  std::string Delimiter;
+  for (int Counter = 0; containsDelimiter(Bytes, Delimiter); ++Counter) {
+    Delimiter = (Counter == 0) ? "lit" : "lit" + std::to_string(Counter);
----------------
alexfh wrote:
> LegalizeAdulthood wrote:
> > alexfh wrote:
> > > Please address my comment above that relates to this code. Specifically, I find this generic algorithm unnecessarily inefficient and too rigid, i.e. one can't configure a custom set of delimiters that don't follow the <prefix><number> pattern. I suggest using a list of pre-concatenated pairs of strings (like `R"lit(` and `)lit"`).
> > Raw strings are new and not many people are using them because the don't have a good way to automatically convert disgusting quoted strings to raw strings.  So I don't think it is reasonable to draw conclusions by looking in existing code bases for raw strings.
> > 
> > We're having the same conversation we've had before.  I'm trying to do a basic check and get things working properly and the review comments are tending towards a desire for some kind of perfection.
> > 
> > I don't see why we have to make the perfect the enemy of the good.  Nothing you are suggesting **must** be present now in order for this check to function properly and reasonably.
> We're looking at this from different perspectives. From my point of view, preventing a potentially wasteful code in ClangTidy checks before it's even committed is much easier than tracking down performance issues when tens of checks run on multiple machines analyzing millions lines of code. So I'm asking to avoid writing code using string allocations and concatenations when there are good alternatives. Apart from possible performance issues, in this case the generic solution you suggest is targets extremely rare cases, while most real-world situations can be handled with a fixed set of delimiters (possibly, configured by the user, while I expect the preferences for the choice of delimiters to be very different in different teams).
I believe we agree on the following:

  - In order to turn a non-raw string literal into a raw string literal I have to surround it with `R"(` and `)"`.
  - If the existing string literal contains `)"` already, then surrounding the literal with the above will result in a syntax error.  Therefore, every potential raw string literal will have to be searched at least once for this non-delimited form of raw string literal.
  - `clang-tidy` checks should never introduce syntax errors.

Therefore, I can either not transform the string literal if it contains `)"`, or I can come up with some delimited form of raw string literal where the delimiter does not appear in the string.  In order to not transform the literal, I first have to search for `)"` in order to know that it should not be transformed.  Therefore, the search for `)"` must be performed, no matter what algorithm is used to determine a delimited or non-delimited raw string literal.

Where we seem to disagree is what algorithm should be used to determine the delimiter when a delimited raw string literal is required.

My implementation is the minimum performance impact for the typical case where the string does not contain `)"`.

Now let's talk about the cases where searching the string repeatedly for a delimiter would be expensive.

First, the string literal will have to contain `)"`.

Second, the string literal would have to be lengthy for the delimiter searching to be expensive.  Most of the time lengthy string literals are when someone has transformed a text file into a giant string literal.  Such text files include newlines and in the latest version of the code, I've decided to skip any string literal containing a newline.  It simply results in too many strings being converted to raw string literals and obfuscates the newlines.  The one case where the embedded newlines makes sense is the case of the text file converted into a string literal.

Repeated searching of the string for the literal delimiter could be avoided by changing the routine to search for the delimiter.  If present, it could examine the characters following the literal to make the literal more unique and then continue searching from there to look for more occurrences of the extended delimiter.  It would proceed incrementally through the rest of the string to create a unique delimiter in a single pass through the string.  I think the resulting literals would be even more non-sensical than the current implementation, but it would result in a delimiter obtained by a single pass through the string.  It's a significantly more complicated implementation and results in an even weirder delimiter to handle a very edge case.

If I follow your suggested approach of using a fixed number of string delimiters, I still have to check the strings for the presence of those delimiters surrounded by `)` and `"`, so I am not saving anything in terms of performance.  In the worst case i still have to perform multiple searches through the string and then fail the entire transformation if all of them are present in the string.  Failing the transformation significantly complicates the entire check because more state has to be passed around up and down the call chain of the functions evaluating the string literal so that we can abort the whole replacement.

I ran this check on the entire LLVM code base and not once did the check produce a raw string literal containing a delimiter.  Therefore the performance question of how to choose a unique delimiter we are discussing is moot.  It never entered the picture.

I'm happy to run this check on other large code bases to determine the likelihood of `)"` appearing in a string and the need for determining a unique delimiter.

================
Comment at: clang-tidy/modernize/RawStringLiteralCheck.cpp:116
@@ +115,3 @@
+    if (containsEscapedCharacters(Result, Literal))
+      replaceWithRawStringLiteral(Result, Literal);
+  }
----------------
alexfh wrote:
> aaron.ballman wrote:
> > A configuration option makes sense to me, but I would be fine if it was a follow-on patch. I think that someone running this check is likely fine with the level of noise it will produce (which should be moderately low).
> Agreed. This can be done in a follow up.
In my current version of the code, I've decided to skip strings containing newlines.  There are too many heuristics involved and subjective evaluations about readability when newlines are present in raw string literals.

The one case where I see it to be useful is when you process a text file into a C++ source file as a giant string literal.  However, I think the better solution there is to have the generating utility use raw string literals directly instead of running `clang-tidy` on it afterwards.

================
Comment at: test/clang-tidy/modernize-raw-string-literal.cpp:1
@@ +1,2 @@
+// RUN: %check_clang_tidy %s modernize-raw-string-literal %t
+
----------------
alexfh wrote:
> As usual, please add tests ensuring that replacements are done correctly in templates with multiple instantiations and in macro arguments (but not in macro bodies), e.g.
> 
>   template<typename T>
>   void f() {
>     (void)"asdf\\\\\\";
>     // CHECK-FIXES: void f() {
>     // CHECK-FIXES-NEXT: {{^  }}(void)R"(asdf\\\)";{{$}}
>     // CHECK-FIXES-NEXT: {{^}$}}
>   }
>   void g() {
>     f<int>();
>     f<double>();
>   }
> 
>   #define M(x) x x x "a\\b\\"
>   void h() {
>     const char *const s = M("c\\d\\");
>     // CHECK-FIXES: {{^}}#define M(x) x x x "a\\b\\"{{$}}
>     // CHECK-FIXES: {{^  }}const char *const s = M(R"(c\d\)");
>   }
> 
> 
I thought I could get it to handle macro arguments, but when I applied the check, it ended up inlining the macro instead, so I don't know exactly how to do that.  In the meantime, I've simply skipped any location that came from macro body or argument expansion.  This could be improved in the future, but prevents unwanted changes in the source file now.

http://reviews.llvm.org/D16529