[cfe-commits] [PATCH] -finput-charset, multi-byte character and BOM support

Scott Conger scott.conger at gmail.com
Sat Jul 16 20:21:19 PDT 2011


The attached patch adds support for -finput-charset and automatic text
conversion when multi-byte characters or a byte-order mark are
present. The net effect is that all internal text should now be in
UTF-8.

I have the exec charset options mostly working, but I trimmed it down
to this for now, as it's a decently sized patch as-is.


Performance impact:

At a minimum, we have to scan through the input text to see if there
are any multi-byte characters. There are usually none as portable code
won't have any. The cost of this is lower if you have SSE2 support as
I added an optimized version using intrinsics:

For 1000 calls against a 16 MB ASCII buffer on an AMD Athlon 7850
(2.81 GHz), rough costs with GCC were:

  Default checkAscii - 13050 ms
  SSE2 checkAscii    -  4025 ms

If you do use -finput-charset, or if the input contains multi-byte
text or a byte-order mark, the cost to convert the text to UTF-8 is
somewhere between 10 and 20 times higher than that of the default
checkAscii implementation. It varies considerably depending on the
input and character set.

As a special case, UTF-8 input avoids most of this cost: the text is
only checked to confirm that it is valid UTF-8.
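
To illustrate what "valid UTF-8" involves, here is a minimal sketch of
a validator. It is not taken from the patch; isValidUtf8 is a
hypothetical name, and the rules it applies (no overlong forms, no
surrogates, nothing above U+10FFFF) are the standard well-formedness
rules rather than anything the patch is confirmed to enforce.

```cpp
#include <cstddef>

// Hypothetical helper (not from the patch): returns true if the buffer
// is well-formed UTF-8.
inline bool isValidUtf8(const unsigned char *p, std::size_t n) {
  std::size_t i = 0;
  while (i < n) {
    unsigned char lead = p[i];
    if (lead < 0x80) {  // single-byte (ASCII) character
      ++i;
      continue;
    }
    std::size_t len;    // total bytes in this sequence
    unsigned long cp;   // code point decoded so far
    unsigned long min;  // smallest code point legal for this length
    if ((lead & 0xE0) == 0xC0) {
      len = 2; cp = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3; cp = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4; cp = lead & 0x07; min = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (n - i < len)
      return false;  // sequence is truncated
    for (std::size_t k = 1; k < len; ++k) {
      unsigned char c = p[i + k];
      if ((c & 0xC0) != 0x80)
        return false;  // continuation bytes must look like 10xxxxxx
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min)
      return false;  // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return false;  // UTF-16 surrogates are not valid in UTF-8
    if (cp > 0x10FFFF)
      return false;  // beyond the Unicode range
    i += len;
  }
  return true;
}
```

A check like this only reads the input once and writes nothing, which
is consistent with UTF-8 input being much cheaper than a full
conversion.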

GCC differences:

* Didn't add GCC's support for IBM character encodings, although
-finput-charset should work if iconv supports it.
* Didn't add GCC's special handling of a few character sets like
Shift-JIS when no iconv is present.
* GCC only seems to do byte-order-mark detection if the underlying
iconv does, which apparently varies.
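
Since byte-order-mark detection comes up here, a minimal sketch of
detection done independently of iconv may help. It is not the patch's
code: the Bom enum, BomResult struct and detectBom function are
hypothetical names, and the set of marks it recognizes (UTF-8, UTF-16
and UTF-32 in both byte orders) is an assumption rather than a
statement of what the patch handles.

```cpp
#include <cstddef>

// Hypothetical types (not from the patch).
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

struct BomResult {
  Bom kind;            // which byte-order mark was found, if any
  std::size_t length;  // number of bytes to skip past the mark
};

// Hypothetical helper (not from the patch): inspects the first bytes
// of a buffer for a byte-order mark.
inline BomResult detectBom(const unsigned char *p, std::size_t n) {
  // The UTF-32 marks are tested first because the UTF-32LE mark
  // (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
  if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
    return {Bom::Utf32BE, 4};
  if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
    return {Bom::Utf32LE, 4};
  if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
    return {Bom::Utf8, 3};
  if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
    return {Bom::Utf16BE, 2};
  if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
    return {Bom::Utf16LE, 2};
  return {Bom::None, 0};
}
```

Note that FF FE 00 00 is ambiguous in principle (it could be UTF-16LE
text starting with U+0000); this sketch resolves it in favour of
UTF-32LE, which is a design choice and not necessarily what the patch
does.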

Issues:

* It turned out to be quite ugly to get iconv working on Windows. See
comment in NativeIconv.cpp. If what's there is objectionable, I'd
prefer to rip out Windows support of iconv for now.
* Difficult to automatically test as iconv implementations support
very different sets of encodings.
* It looks like I picked up some changes that weren't checked in when
I regenerated configure, relating to a bug report URL?

Testing:

Did Linux GCC, Windows Visual Studio 10 and Cygwin GCC builds. Ran all
tests on Linux.

You can run a simple input conversion test like so:

sconger@scott-ubuntu:~/dev/llvmpatch/build$ iconv -f ASCII -t UTF-16BE test.c > test_utf16be.c
sconger@scott-ubuntu:~/dev/llvmpatch/build$ ./bin/clang -finput-charset=UTF-16BE test_utf16be.c
sconger@scott-ubuntu:~/dev/llvmpatch/build$ ./a.out
Hello World

-Scott
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clang_135359.diff
Type: text/x-patch
Size: 91444 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20110716/c3b2d2aa/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm.diff
Type: text/x-patch
Size: 10361 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20110716/c3b2d2aa/attachment-0001.bin>

