<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>Okay, this isn't a patch, I just want to know if this is the direction you're looking for. The included changes will result in an error message like:</div><div><br></div><div>tmp.cpp:3:10: error: illegal sequence in character literal<br>char a = '<f1>';<br>         ^<br><br></div><div>It doesn't do anything to try to make the highlighted ranges look correct for the modified source line. This will cause a problem similar to the one that already exists with multi-byte UTF-8 encodings, where carets and ranges do not appear correct after these sequences.</div><div><br></div><div><img id="b0a53f43-6dab-4afc-9771-1e4585346087" height="34" width="718" apple-width="yes" apple-height="yes" src="cid:BCCFADDA-545F-4A19-8081-6A92BFED35A1@zoomtown.com"></div><div><br></div><div>Also, I think it would be better if these <XX> strings could be displayed with a reversed color scheme, just like vi, if the console supports it. Is there any possibility of doing that?</div><div><br></div><div>Anyway, here are the changes I am contemplating:</div><div><br></div><div>diff --git a/lib/Frontend/TextDiagnostic.cpp b/lib/Frontend/TextDiagnostic.cpp<br>index d2b8660..464b81b 100644<br>--- a/lib/Frontend/TextDiagnostic.cpp<br>+++ b/lib/Frontend/TextDiagnostic.cpp<br>@@ -10,6 +10,7 @@<br> #include "clang/Frontend/TextDiagnostic.h"<br> #include "clang/Basic/FileManager.h"<br> #include "clang/Basic/SourceManager.h"<br>+#include "clang/Basic/ConvertUTF.h"<br> #include "clang/Frontend/DiagnosticOptions.h"<br> #include "clang/Lex/Lexer.h"<br> #include "llvm/Support/MemoryBuffer.h"<br>@@ -17,6 +18,10 @@<br> #include "llvm/Support/ErrorHandling.h"<br> #include "llvm/ADT/SmallString.h"<br> #include <algorithm><br>+<br>+#include <sstream><br>+#include <iomanip><br>+<br> using namespace clang;<br> <br> static const enum raw_ostream::Colors noteColor =<br>@@ -592,6 +597,36 @@ void TextDiagnostic::emitSnippetAndCaret(<br>   // Copy the line of code into an std::string for ease of manipulation.<br>   std::string SourceLine(LineStart, LineEnd);<br> <br>+<br>+  static const char trailingBytesForUTF8[256] = {<br>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<br>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<br>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<br>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<br>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<br>+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<br>+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,<br>+    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5<br>+  };<br>+<br>+  unsigned char const *i = reinterpret_cast<unsigned char const *>(LineStart);<br>+  unsigned char const *end = reinterpret_cast<unsigned char const *>(LineEnd);<br>+  std::string SourceLineMod;<br>+  while(i!=end) {<br>+    if(isLegalUTF8Sequence(i,end)) {<br>+      std::copy(i,i+trailingBytesForUTF8[*i]+1,back_inserter(SourceLineMod));<br>+      i+=trailingBytesForUTF8[*i]+1;<br>+    } else {<br>+      std::stringstream ss;<br>+      ss.fill('0');<br>+      ss << std::hex << '<' << std::setw(2) << (unsigned int)*i << '>';<br>+      SourceLineMod.append(ss.str());<br>+      ++i;<br>+    }<br>+  }<br>+<br>+  SourceLine = SourceLineMod;<br>+<br>   // Create a line for the caret that is filled with spaces that is the same<br>   // length as the line of source code.<br>   std::string CaretLine(LineEnd-LineStart, ' ');<br><br><br></div><div><br></div><br><div><div>On Jan 19, 2012, at 10:36 PM, Seth Cantrell wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Here's a patch that improves that error message. A patch that improves the display of encoding errors will take me longer to get to.<br><br>On Jan 18, 2012, at 8:09 PM, Eli Friedman wrote:<br><br><blockquote type="cite">On Wed, Jan 18, 2012 at 4:44 PM, Seth Cantrell <<a href="mailto:seth.cantrell@gmail.com">seth.cantrell@gmail.com</a>> wrote:<br></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">On Jan 18, 2012, at 5:49 PM, Eli Friedman wrote:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">On Wed, Jan 18, 2012 at 4:27 AM, Seth Cantrell <<a href="mailto:seth.cantrell@gmail.com">seth.cantrell@gmail.com</a>> wrote:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+  while (begin!=end) {<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+    // Is this a span of non-escape characters?<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+    if (begin[0] != '\\') {<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+      char const *start = begin;<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+      do {<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+        ++begin;<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+      } while (begin != end && *begin != '\\');<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+      uint32_t *tmp_begin = buffer_begin;<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+      ConversionResult res =<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+      ConvertUTF8toUTF32(reinterpret_cast<UTF8 const **>(&start),<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+                         reinterpret_cast<UTF8 const *>(begin),<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+                         &buffer_begin,buffer_end,strictConversion);<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+      if (res!=conversionOK) {<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">+        PP.Diag(Loc, diag::err_bad_character_encoding);<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">This error message can lead to rather uninformative complaints which<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">look like the following:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">fribidi_char_sets_cp1256.c:214:9: error:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">     illegal sequence in character literal<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">return '?';<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">       ^<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Any ideas for how we could improve this diagnostic?<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">-Eli<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">I suppose a marginally better message could be 'illegal character encoding in character literal'.<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Yes, that would be a bit better.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite">It'd also be good if the actual bytes could be highlighted. Something like vi's method of displaying illegal encodings using reversed colors would work to display them on the command line. Also adding a range to highlight the exact issue inside the literal. We'd need a way to calculate the locations for bytes inside the literal (there's a method there that looks like it works only for purely ascii strings). The console display for such ranges would need to be smarter about displaying ranges for lines that include multi-byte characters and also know about whatever method is chosen to show illegal bytes.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">fribidi_char_sets_cp1256.c:214:9: error:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">     illegal character encoding in character literal<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">return '123<F1>';<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">       ^   ~~~~<br></blockquote></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Displaying illegal bytes vi-style would be a big improvement.  Hacking<br></blockquote><blockquote type="cite">up TextDiagnostic::emitSnippetAndCaret to do that should be<br></blockquote><blockquote type="cite">straightforward, given a function to figure out whether a byte is<br></blockquote><blockquote type="cite">illegal.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">-Eli<br></blockquote><span><0002-improve-error-message-for-file-encoding-errors.patch></span></div></blockquote></div><br></body></html>