[cfe-commits] Patch for review - unicode step 1

Sun Jul 12 14:43:36 PDT 2009

On Jul 10, 2009, at 1:30 PM, AlisdairM wrote:
> Attached is a patch for step 1 of my Unicode and enhanced literals  
> support.
> Looking for approval to commit.
>
> For step 1, I updated StringLiteralParser to cope with all varieties  
> of
> prefix for a literal, but have not yet started work on suffices for
> user-defined literals.
> Note that although the logic is here, there is no change in  
> functionality
> yet!  The next step is to update the lexer to pass along these new  
> literals.
>
> As there is not supposed to be a change of functionality, there are  
> no tests
> associated with this update.

This looks pretty nice to me, just a couple of things:

-    // Remember if we see any wide strings.
-    AnyWide |= StringToks[i].is(tok::wide_string_literal);
+    // Check for concatenating literals of different types
+    ThisTokBuf = &TokenBuf[0];
+    unsigned ThisTokLen = PP.getSpelling(StringToks[i], ThisTokBuf);
+    PrefixCode = DecodeStringLiteralPrefix(ThisTokBuf) & 3;

This can overflow the 'TokenBuf' buffer if the subsequent string was  
longer than the first string.  Just make sure to use something like:
   TokenBuf.resize(StringToks[i].getLength());

before you call PP.getSpelling there.  This also generates a warning  
because "ThisTokLen" is dead.

+/// u8 literals are stored as char strings, but cannot promote.  u8  
can be
+/// checked for by testing the 3rd bit of the result.
+/// On return, the passed buffer pointer should be pointing at the  
initial
+/// '"' character.
+static unsigned int DecodeStringLiteralPrefix(char const *& TokenBuf) {

Please use an enum to make the magic values explicit.  Returning them  
as an 'unsigned' is still fine, but it would be nice to have:

+  unsigned int Representation = PrefixCode & PREFIX_IS_U8;
instead of:
+  unsigned int Representation = PrefixCode & 3;

Also, in DecodeStringLiteralPrefix, please use |= instead of += to  
emphasize that you're setting a bit (even though your code is  
functionally correct).

A very trivial point is that I generally prefer "unsigned" instead of  
"unsigned int", just because it is shorter and less redundant.

+  case 1:  // char16_t
+    CodeUnitWidth = 2U; // FIXME : lookup correct width of char16_t  
for target
+    break;
+  case 2:  // wchar_t
+    CodeUnitWidth = PP.getTargetInfo().getWCharWidth();
+    assert((CodeUnitWidth & 7) == 0 && "Assumes wchar_t is byte  
multiple!");
+    CodeUnitWidth /= 8;
+    break;
+  case 3:  // char32_t
+    CodeUnitWidth = 4U; // FIXME : lookup correct width of char32_t  
for target
+    break;

Please stub these to go through PP.getTargetInfo(), even if the  
methods on targetinfo are hard coded to return 2/4 for now.  That puts  
the fixme in targetinfo, not in the lexer.

-                         hadError, StringToks[i].getLocation(),  
ThisIsWide, PP);
+                         hadError, StringToks[i].getLocation(),  
RepKind > 0, PP);

it would be nice to use something less magic than "RepKind > 0" for  
this.  Maybe using enums can help.

Thanks for working on this Alisdair!

-Chris