<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/122070>122070</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Multi-line C++ R-strings give extra newlines with -E.
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
mrolle45
</td>
</tr>
</table>
<pre>
Take this example, saved as `t.c`:
```c
int main()
{
const char * s;
s = R"(Line 1.
Line 2.)";
s = R"(Line 1.\
Line 2.)";
foobar;
return 1;
}
__LINE__
```
I run `clang -E -xc++ t.c` on a Windows platform. The resulting output is
```c
# 1 "t.c"
int main()
{
const char * s;
s = R"(Line 1.

Line 2.)";

s = R"(Line 1.\

Line 2.)";

foobar;
return 1;
}
11
```
**There are four blank lines that don't belong!**
If I change the line endings of `t.c` from Windows CR/LF to Unix LF, the blank lines between Line 1 and Line 2 in the raw strings go away. **But the other two blank lines remain.**
The first and third blank lines appear because the clang lexer takes the text of a raw string verbatim from the original source file. While debugging the lexer, I found that it sees the original text as `... Line 1\r\nLine 2...`. When the token is written out via -E, the \r and the \n each become a new line.
I think that the solution is for clang to read the input files in text mode, rather than binary mode, so that the line endings are all the \n newline character.
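As a rough illustration of the normalization that reading in text mode would provide, here is a minimal sketch. The helper name `normalize_line_endings` is hypothetical and is not part of clang; it only shows the intended effect on a buffer, under the assumption that the text is NUL-terminated.

```c
/* Hypothetical helper, not clang code: collapse every "\r\n" pair in a
   NUL-terminated buffer into a single "\n", in place.
   Returns the new length of the text. */
unsigned normalize_line_endings(char *buf) {
  char *out = buf;
  for (const char *in = buf; *in != '\0'; ++in) {
    /* Skip a '\r' only when a '\n' follows; a lone '\r' is kept. */
    if (in[0] == '\r' && in[1] == '\n')
      continue;
    *out++ = *in;
  }
  *out = '\0';
  return (unsigned)(out - buf);
}
```

Applied to the raw string text `Line 1.\r\nLine 2.`, this leaves a single \n between the two lines, which is what the LF version of the file already contains.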
The second and fourth blank lines come from the `HandleWhitespaceBeforeTok` function, which treats the raw string as being on a single output line, even though it has an embedded newline. The next input token comes from two lines later, and so a blank line is written to the output.
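The line bookkeeping described above can be modeled with a small sketch. Both function names are hypothetical, and the model is a simplification I am assuming for illustration, not the actual `HandleWhitespaceBeforeTok` logic.

```c
/* Hypothetical helper: number of '\n' characters inside a token's spelling. */
unsigned count_embedded_newlines(const char *spelling) {
  unsigned n = 0;
  for (; *spelling != '\0'; ++spelling)
    if (*spelling == '\n')
      ++n;
  return n;
}

/* Simplified model (an assumption, not clang's code): how many newlines the
   printer writes to move its output from cur_line to tok_line. */
unsigned newlines_before_token(unsigned cur_line, unsigned tok_line) {
  return tok_line > cur_line ? tok_line - cur_line : 0;
}
```

In `t.c` the first raw string starts on line 4 and the next line-start token is on line 6. If the printer still believes it is on line 4, `newlines_before_token(4, 6)` is 2, which produces one blank line. If it first advanced its line count by `count_embedded_newlines` of the raw string (1 here), `newlines_before_token(5, 6)` is 1 and no blank line is written.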
This causes two problems for anyone who feeds the -E output back into clang or any other compiler. Compiling the original file without -E seems to avoid both of them.
## Wrong token value (on Windows platform).
Since the extra \r characters become part of the raw string literal token, the compiled program will contain the wrong string value. Any code that prints or otherwise uses this literal will get the wrong result.
## Wrong line numbers on diagnostics (on _any_ platform).
Each extra blank line pushes everything after it one line later than it should be, so diagnostics will have incorrect line numbers.
I ran clang on the output file. I put the `foobar` token there to force a diagnostic. The `__LINE__` expands to its line number in the source file and also triggers a diagnostic.
Here is the output from running clang:
```
t.c:10:5: error: use of undeclared identifier 'foobar'
10 | foobar;
| ^
t.c:13:1: error: expected unqualified-id
13 | 11
| ^
2 errors generated.
```
These diagnostics _should have been_ on lines 8 and 11, respectively.
</pre>