[LLVMdev] LLVM supports Unicode?

bagel bagel99 at gmail.com
Sun Aug 28 18:00:15 PDT 2011


I think a very related question is "Does LLVM support UTF-8?"  The answer has 
two parts:
1. As strings (arrays of bytes) - yes
2. As identifiers - no

The fix for the second part depends partly on the object file formats.  But to 
at least accept UTF-8 in identifiers, the following patch helps.  (I know that 
it does not discriminate between valid and invalid UTF-8.)

--- lib/AsmParser/LLLexer.cpp	(revision 138730)
+++ lib/AsmParser/LLLexer.cpp	(working copy)
@@ -348,10 +348,10 @@
  bool LLLexer::ReadVarName() {
    const char *NameStart = CurPtr;
    if (isalpha(CurPtr[0]) || CurPtr[0] == '-' || CurPtr[0] == '$' ||
-      CurPtr[0] == '.' || CurPtr[0] == '_') {
+      CurPtr[0] == '.' || CurPtr[0] == '_' || (CurPtr[0]&0x80) != 0) {
      ++CurPtr;
      while (isalnum(CurPtr[0]) || CurPtr[0] == '-' || CurPtr[0] == '$' ||
-           CurPtr[0] == '.' || CurPtr[0] == '_')
+           CurPtr[0] == '.' || CurPtr[0] == '_' || (CurPtr[0]&0x80) != 0)
        ++CurPtr;

      StrVal.assign(NameStart, CurPtr);
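
For anyone who wants to try the rule outside the lexer, here is a minimal standalone sketch of the same byte test. The names isNameByte and scanName are made up for illustration and are not part of LLVM. Like the patch, it accepts any byte with the high bit set and does not check that the bytes form valid UTF-8. One deliberate difference: it casts to unsigned char before calling isalpha/isalnum, since passing a negative char to those functions is undefined behavior where char is signed.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical helper (not LLVM API): may byte c appear in a name?
// `first` selects the rule for the leading byte, which may not be a digit.
static bool isNameByte(char c, bool first) {
  unsigned char uc = static_cast<unsigned char>(c);
  if (uc & 0x80)  // lead or continuation byte of a multi-byte UTF-8 sequence
    return true;
  if (c == '-' || c == '$' || c == '.' || c == '_')
    return true;
  return first ? std::isalpha(uc) != 0 : std::isalnum(uc) != 0;
}

// Hypothetical helper: length in bytes of the name starting at s,
// or 0 if s does not start a name. s must be NUL-terminated.
static std::size_t scanName(const char *s) {
  if (!isNameByte(s[0], true))
    return 0;
  std::size_t n = 1;
  while (isNameByte(s[n], false))
    ++n;
  return n;
}
```

With this, scanName("foo.bar = 1") stops at the space after "foo.bar", and a name containing two-byte UTF-8 characters is consumed whole, with its length counted in bytes rather than characters.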


