[cfe-commits] [PATCH] -finput-charset, multi-byte character and BOM support

Sat Jul 16 20:23:15 PDT 2011

And I attached the wrong llvm diff. Here is the correct one.

On Sat, Jul 16, 2011 at 8:21 PM, Scott Conger <scott.conger at gmail.com> wrote:
> Attached patch adds support for -finput-charset and automatic text
> conversion when there are multi-byte characters or a byte-order-mark
> is present. The net effect is that all internal text should now be in
> UTF-8.
>
> I have the exec charset options mostly working, but I trimmed it down
> to this for now, as it's a decently sized patch as-is.
>
>
> Performance impact:
>
> At a minimum, we have to scan through the input text to see if there
> are any multi-byte characters. There are usually none as portable code
> won't have any. The cost of this is lower if you have SSE2 support as
> I added an optimized version using intrinsics:
>
> For 1000 calls against a 16 MB ASCII buffer, on an AMD Athlon 7850
> (2.81 GHz) rough costs with GCC were:
> Default checkAscii - 13050 ms
> SSE2 checkAscii - 4025 ms
>
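For readers without the diff at hand, here is a minimal sketch of what an
ASCII scan with an SSE2 fast path could look like. It is not the patch's
checkAscii; the name isAllAscii and the __SSE2__ guard are assumptions made
for illustration. The idea is that a byte is 7-bit ASCII exactly when its
high bit is clear, and SSE2 can test the high bit of 16 bytes at once.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not the function from the patch): returns true if
// every byte in [data, data + size) has its high bit clear.
inline bool isAllAscii(const unsigned char *data, std::size_t size) {
  std::size_t i = 0;
#if defined(__SSE2__)
  // Process 16 bytes per iteration. _mm_movemask_epi8 gathers the top bit
  // of each byte, so a non-zero mask means some byte is >= 0x80.
  for (; i + 16 <= size; i += 16) {
    __m128i chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(data + i));
    if (_mm_movemask_epi8(chunk) != 0)
      return false;
  }
#endif
  // Scalar tail, or the whole buffer when SSE2 is not available.
  for (; i < size; ++i)
    if (data[i] & 0x80)
      return false;
  return true;
}
```

The unaligned load keeps the sketch simple. Compilers that do not define
__SSE2__ (MSVC, for example) would need a different guard to take the fast
path and otherwise fall back to the scalar loop.
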
> If you do use -finput-charset, or the input contains multi-byte text
> or a byte-order-mark, the cost of converting the text to UTF-8 is
> somewhere between 10 and 20 times that of the default checkAscii
> implementation. It varies considerably depending on the input and
> character set.
>
> As a special case, UTF-8 input avoids most of this cost and it just
> checks that it's valid UTF-8.
>
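As an illustration of the "just check that it's valid UTF-8" case, a
validation pass might look roughly like the sketch below. This is an
assumption about the general technique, not the code in the patch, and
isValidUtf8 is a hypothetical name. It rejects truncated sequences, stray
continuation bytes, overlong encodings, UTF-16 surrogates, and code points
above U+10FFFF.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if [s, s + n) is well-formed UTF-8.
inline bool isValidUtf8(const unsigned char *s, std::size_t n) {
  std::size_t i = 0;
  while (i < n) {
    unsigned char b = s[i];
    if (b < 0x80) { // single-byte (ASCII) character
      ++i;
      continue;
    }
    std::size_t len;   // total bytes in this sequence
    unsigned long cp;  // decoded code point
    unsigned long min; // smallest code point that needs this many bytes
    if ((b & 0xE0) == 0xC0) {
      len = 2; cp = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
      len = 3; cp = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
      len = 4; cp = b & 0x07; min = 0x10000;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (n - i < len)
      return false; // sequence is cut off by the end of the buffer
    for (std::size_t k = 1; k < len; ++k) {
      unsigned char c = s[i + k];
      if ((c & 0xC0) != 0x80)
        return false; // continuation bytes must look like 10xxxxxx
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min)
      return false; // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return false; // surrogates are not valid scalar values
    if (cp > 0x10FFFF)
      return false; // beyond the Unicode range
    i += len;
  }
  return true;
}
```
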
> GCC differences:
>
> * Didn't add GCC's support for IBM character encodings, although
> -finput-charset should work if iconv supports it.
> * Didn't add GCC's special handling of a few character sets like
> Shift-JIS when no iconv is present.
> * GCC only seems to do byte-order-mark detection if the underlying
> iconv does, which apparently varies.
>
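Doing byte-order-mark detection independently of iconv can be sketched as
a simple prefix check on the raw bytes. The code below is an illustrative
assumption, not taken from the patch; BomKind, BomResult and detectBom are
hypothetical names. The byte sequences themselves are the standard Unicode
BOM encodings.

```cpp
#include <cstddef>

// Hypothetical types for illustration only.
enum class BomKind { None, UTF8, UTF16BE, UTF16LE, UTF32BE, UTF32LE };

struct BomResult {
  BomKind kind;
  std::size_t length; // number of BOM bytes to skip before the text
};

inline BomResult detectBom(const unsigned char *p, std::size_t n) {
  // The UTF-32 checks come first: the UTF-32LE BOM (FF FE 00 00) begins
  // with the UTF-16LE BOM (FF FE). Note that this order treats a UTF-16LE
  // file whose first character is U+0000 as UTF-32LE.
  if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
    return {BomKind::UTF32BE, 4};
  if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
    return {BomKind::UTF32LE, 4};
  if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
    return {BomKind::UTF8, 3};
  if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
    return {BomKind::UTF16BE, 2};
  if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
    return {BomKind::UTF16LE, 2};
  return {BomKind::None, 0};
}
```

A caller would skip `length` bytes and then pick the conversion based on
`kind`, leaving the iconv implementation out of the detection step.
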
> Issues:
>
> * It turned out to be quite ugly to get iconv working on Windows. See
> comment in NativeIconv.cpp. If what's there is objectionable, I'd
> prefer to rip out Windows support of iconv for now.
> * Difficult to automatically test as iconv implementations support
> very different sets of encodings.
> * It looks like I picked up some changes that were never checked in,
> relating to a bug report URL, when I regenerated configure?
>
> Testing:
>
> Did Linux GCC, Windows Visual Studio 10 and Cygwin GCC builds. Ran all
> tests on Linux.
>
> You can run a simple input conversion test like so:
>
> sconger@scott-ubuntu:~/dev/llvmpatch/build$ iconv -f ASCII -t UTF-16BE test.c > test_utf16be.c
> sconger@scott-ubuntu:~/dev/llvmpatch/build$ ./bin/clang -finput-charset=UTF-16BE test_utf16be.c
> sconger@scott-ubuntu:~/dev/llvmpatch/build$ ./a.out
> Hello World
>
> -Scott
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_135359.diff
Type: text/x-patch
Size: 10369 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20110716/0a674043/attachment.bin>