[PATCH] (Part 1/2) non-Unicode response file on Windows: UTF-8 BOM

Rafael Espíndola rafael.espindola at gmail.com
Fri Jan 23 15:50:21 PST 2015


Please use a test with one of the existing tools instead of a unit test.

In fact, you should be able to just add this to test/Other/ResponseFile.ll.

On 23 January 2015 at 18:44, Yunzhong Gao <Yunzhong_Gao at playstation.sony.com> wrote:

> Hi all,
> This is spun off from D7133, based on Rafael's suggestion that the UTF-8
> BOM changes should be in a separate patch.
> In the process of writing a regression test case, I looked at what was
> done for UTF-16 BOM, and tried to add a similar unit test for UTF-8 BOM in
> the same file, but this also means that hasUTF8ByteOrderMark() needs to be
> exposed as an external function instead of a static helper function. I hope
> that is okay.
> - Gao
>
> http://reviews.llvm.org/D7156
>
> Files:
>   llvm/include/llvm/Support/ConvertUTF.h
>   llvm/lib/Support/CommandLine.cpp
>   llvm/lib/Support/ConvertUTFWrapper.cpp
>   llvm/test/Other/ResponseFile.ll
>   llvm/unittests/Support/ConvertUTFTest.cpp
>
> Index: llvm/include/llvm/Support/ConvertUTF.h
> ===================================================================
> --- llvm/include/llvm/Support/ConvertUTF.h
> +++ llvm/include/llvm/Support/ConvertUTF.h
> @@ -243,6 +243,13 @@
>  bool hasUTF16ByteOrderMark(ArrayRef<char> SrcBytes);
>
>  /**
> + * Returns true if a blob of text starts with a UTF-8 byte order mark.
> + * UTF-8 BOM is a sequence of bytes on Windows and is not affected by the host
> + * system's endianness.
> + */
> +bool hasUTF8ByteOrderMark(ArrayRef<char> SrcBytes);
> +
> +/**
>   * Converts a stream of raw bytes assumed to be UTF16 into a UTF8 std::string.
>   *
>   * \param [in] SrcBytes A buffer of what is assumed to be UTF-16 encoded text.
> Index: llvm/lib/Support/CommandLine.cpp
> ===================================================================
> --- llvm/lib/Support/CommandLine.cpp
> +++ llvm/lib/Support/CommandLine.cpp
> @@ -674,6 +674,11 @@
>        return false;
>      Str = StringRef(UTF8Buf);
>    }
> +  // If we see UTF-8 BOM sequence at the beginning of a file, we shall remove
> +  // these bytes before parsing.
> +  // Reference: http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
> +  else if (hasUTF8ByteOrderMark(BufRef))
> +    Str = StringRef(BufRef.data() + 3, BufRef.size() - 3);
>
>    // Tokenize the contents into NewArgv.
>    Tokenizer(Str, Saver, NewArgv, MarkEOLs);
> Index: llvm/lib/Support/ConvertUTFWrapper.cpp
> ===================================================================
> --- llvm/lib/Support/ConvertUTFWrapper.cpp
> +++ llvm/lib/Support/ConvertUTFWrapper.cpp
> @@ -81,6 +81,13 @@
>             (S[0] == '\xfe' && S[1] == '\xff')));
>  }
>
> +// It is called byte order marker but the UTF-8 BOM is actually not affected
> +// by the host system's endianness.
> +bool hasUTF8ByteOrderMark(ArrayRef<char> S) {
> +  return (S.size() >= 3 &&
> +          S[0] == '\xef' && S[1] == '\xbb' && S[2] == '\xbf');
> +}
> +
>  bool convertUTF16ToUTF8String(ArrayRef<char> SrcBytes, std::string &Out) {
>    assert(Out.empty());
>
> Index: llvm/test/Other/ResponseFile.ll
> ===================================================================
> --- llvm/test/Other/ResponseFile.ll
> +++ llvm/test/Other/ResponseFile.ll
> @@ -6,6 +6,13 @@
>  ; RUN: llvm-as @%t.list2 -o %t.bc
>  ; RUN: llvm-nm %t.bc 2>&1 | FileCheck %s
>
> +; When the response file begins with UTF8 BOM sequence, we shall remove them.
> +; RUN: echo -e "\xef\xbb\xbf" > %t.list3
> +; RUN: echo %s >> %t.list3
> +; RUN: echo -e "\xef\xbb\xbf-time-passes @%t.list3" > %t.list4
> +; RUN: llvm-as @%t.list4 -o %t.bc
> +; RUN: llvm-nm %t.bc 2>&1 | FileCheck %s
> +
>  ; CHECK: T foobar
>
>  define void @foobar() {
> Index: llvm/unittests/Support/ConvertUTFTest.cpp
> ===================================================================
> --- llvm/unittests/Support/ConvertUTFTest.cpp
> +++ llvm/unittests/Support/ConvertUTFTest.cpp
> @@ -66,6 +66,20 @@
>    EXPECT_FALSE(HasBOM);
>  }
>
> +TEST(ConvertUTFTest, HasUTF8BOM) {
> +  bool HasBOM = hasUTF8ByteOrderMark(makeArrayRef("\xef\xbb\xbf", 3));
> +  EXPECT_TRUE(HasBOM);
> +  HasBOM = hasUTF8ByteOrderMark(makeArrayRef("\xef\xbb\xbf ", 4));
> +  EXPECT_TRUE(HasBOM);
> +  HasBOM = hasUTF8ByteOrderMark(makeArrayRef("\xef\xbb\xbf\x00asdf", 7));
> +  EXPECT_TRUE(HasBOM);
> +
> +  HasBOM = hasUTF8ByteOrderMark(None);
> +  EXPECT_FALSE(HasBOM);
> +  HasBOM = hasUTF8ByteOrderMark(makeArrayRef("\xef", 1));
> +  EXPECT_FALSE(HasBOM);
> +}
> +
>  struct ConvertUTFResultContainer {
>    ConversionResult ErrorCode;
>    std::vector<unsigned> UnicodeScalars;
>
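
For readers following along, the BOM handling in the patch above can be sketched outside of LLVM roughly as follows. This is only an illustration under assumptions, not the code from D7156: the names startsWithUTF8BOM and stripUTF8BOM are hypothetical, and std::string_view (C++17) stands in for the ArrayRef<char> and StringRef types used in the patch.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical standalone helper (not the LLVM function): returns true if
// the buffer starts with the three-byte UTF-8 BOM, EF BB BF. The sequence is
// a fixed run of bytes, so host endianness does not matter.
bool startsWithUTF8BOM(std::string_view Bytes) {
  return Bytes.size() >= 3 && Bytes[0] == '\xef' && Bytes[1] == '\xbb' &&
         Bytes[2] == '\xbf';
}

// Hypothetical helper mirroring the CommandLine.cpp hunk: returns a view of
// the same buffer with a leading BOM skipped, or the buffer unchanged if no
// BOM is present. No bytes are copied.
std::string_view stripUTF8BOM(std::string_view Bytes) {
  if (startsWithUTF8BOM(Bytes))
    Bytes.remove_prefix(3);
  return Bytes;
}
```

A response file whose contents are the BOM followed by "-time-passes" would then tokenize as if it contained only "-time-passes", which is the behaviour the new RUN lines in ResponseFile.ll are meant to exercise.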