[cfe-commits] [PATCH] -finput-charset, multi-byte character and BOM support

Fri Jul 22 22:51:00 PDT 2011

I hate to ping the list again, but since I didn't get a response in a
week it makes me wonder if I committed a faux pas by mailing the wrong
list or having an over-large patch.

I'd like to get something working, even if it's a smaller patch.

-Scott

On Sat, Jul 16, 2011 at 8:23 PM, Scott Conger <scott.conger at gmail.com> wrote:
> And I attached the wrong llvm diff. Here is the correct one.
>
> On Sat, Jul 16, 2011 at 8:21 PM, Scott Conger <scott.conger at gmail.com> wrote:
>> Attached patch adds support for -finput-charset and automatic text
>> conversion when there are multibyte characters or a byte-order-mark is
>> present. The net effect is that all internal text should now be in
>> UTF-8.
>>
>> I have the exec charset options mostly working, but I trimmed it down
>> to this for now, as it's a decently sized patch as-is.
>>
>>
>> Performance impact:
>>
>> At a minimum, we have to scan through the input text to see if there
>> are any multi-byte characters. There are usually none as portable code
>> won't have any. The cost of this is lower if you have SSE2 support as
>> I added an optimized version using intrinsics:
>>
>> For 1000 calls against a 16 MB ASCII buffer, on an AMD Athlon 7850
>> (2.81 GHz), rough costs with GCC were:
>> Default checkAscii - 13050 ms
>> SSE2 checkAscii - 4025 ms
>>
>> If you do use -finput-charset, or if the input contains multi-byte
>> text or a byte-order-mark, the cost to convert the text to UTF-8 is
>> somewhere between 10 and 20 times higher than the default checkAscii
>> implementation. It varies considerably depending on the input and
>> character set.
>>
>> As a special case, UTF-8 input avoids most of this cost: it is only
>> checked for validity, not converted.
>>
>> GCC differences:
>>
>> * Didn't add GCC's support for IBM character encodings, although
>> -finput-charset should work if iconv supports it.
>> * Didn't add GCC's special handling of a few character sets like
>> Shift-JIS when no iconv is present.
>> * GCC only seems to do byte-order-mark detection if the underlying
>> iconv does, which apparently varies.
>>
>> Issues:
>>
>> * It turned out to be quite ugly to get iconv working on Windows. See
>> comment in NativeIconv.cpp. If what's there is objectionable, I'd
>> prefer to rip out Windows support of iconv for now.
>> * Difficult to automatically test as iconv implementations support
>> very different sets of encodings.
>> * It looks like I picked up some changes that were never checked in
>> when I regenerated configure, relating to a bug report URL?
>>
>> Testing:
>>
>> Did Linux GCC, Windows Visual Studio 10 and Cygwin GCC builds. Ran all
>> tests on Linux.
>>
>> You can run a simple input conversion test like so:
>>
>> sconger@scott-ubuntu:~/dev/llvmpatch/build$ iconv -f ASCII -t UTF-16BE test.c > test_utf16be.c
>> sconger@scott-ubuntu:~/dev/llvmpatch/build$ ./bin/clang -finput-charset=UTF-16BE test_utf16be.c
>> sconger@scott-ubuntu:~/dev/llvmpatch/build$ ./a.out
>> Hello World
>>
>> -Scott
>>
>
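
For readers without the attachment, the ASCII pre-scan described under
"Performance impact" could look roughly like the sketch below. This is not
the code from the patch: the names isAsciiScalar and isAsciiFast are
hypothetical, and the approach (testing the high bit of 16 bytes at a time
with SSE2, falling back to a byte loop) is only one plausible way to write
such a check. It assumes that any byte with the high bit set means the
buffer is not pure ASCII and needs the slower conversion path.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define ASCII_SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: portable byte-at-a-time check.
// Returns true if no byte in [p, p + n) has its high bit set.
static bool isAsciiScalar(const unsigned char *p, std::size_t n) {
  unsigned char acc = 0;
  for (std::size_t i = 0; i < n; ++i)
    acc |= p[i];               // accumulate all bits seen so far
  return (acc & 0x80) == 0;    // high bit set anywhere => not ASCII
}

// Hypothetical helper: uses SSE2 when available, else the scalar loop.
static bool isAsciiFast(const unsigned char *p, std::size_t n) {
#ifdef ASCII_SKETCH_HAVE_SSE2
  std::size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    // Unaligned 16-byte load; the buffer need not be aligned.
    __m128i chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(p + i));
    // movemask gathers the top bit of each of the 16 bytes.
    if (_mm_movemask_epi8(chunk) != 0)
      return false;
  }
  // Handle the remaining 0..15 bytes with the scalar loop.
  return isAsciiScalar(p + i, n - i);
#else
  return isAsciiScalar(p, n);
#endif
}
```

A caller would run isAsciiFast over the file buffer once and only hand the
buffer to the converter when it returns false (or when -finput-charset or a
byte-order-mark forces conversion anyway).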