[cfe-commits] [PATCH] -finput-charset, multi-byte character and BOM support

Chris Lattner clattner at apple.com
Mon Jul 25 23:52:10 PDT 2011


Sorry for the delay, have been very backed up on patch review:

On Jul 16, 2011, at 8:21 PM, Scott Conger wrote:
> Attached patch adds support for -finput-charset and automatic text
> conversion when there are multibyte characters or a byte-order-mark is
> present. The net effect is that all internal text should now be in
> UTF-8.

Wow, awesome!  I'm thrilled you're tackling this!

> I have the exec charset options mostly working, but I trimmed it down
> to this for now, as it's a decently sized patch as-is.

Smallish incremental steps greatly appreciated.

> Issues:
> 
> * It turned out to be quite ugly to get iconv working on Windows. See
> comment in NativeIconv.cpp. If what's there is objectionable, I'd
> prefer to rip out Windows support of iconv for now.


I have a higher level concern: the functionality of "an iconv portability wrapper" should be sunk out of clang into the llvm lib/Support library.  That is where we isolate system specific stuff (see lib/Support/Unix/* for example) and ifdefs.

I would split this part of the patch out and get it in (on llvm-commits) after your autoconfigury patch goes in.  Here are some specific thoughts on the functionality at this level:

+#ifdef NEED_FOR_EXEC_CHARSET
+// Encoder for converting from UTF-8 to any iconv supported encoding
+class IconvEncoder : public Encoder {
+public:

There shouldn't be configuration-specific #ifdefs like this in header files, these should be in .cpp files.  Just make both the encoder and the decoder interfaces unconditionally available.

Also, instead of a base class with multiple derived classes and a virtual decode method, I'd suggest a simple interface like this (in llvm namespace, not codec):


// In include/llvm/Support/UnicodeTranscoder.h:

class UnicodeDecoder {
public:
  UnicodeDecoder();
  ~UnicodeDecoder();

  /// Initialize the decoder, returning true on failure.
  bool init(const char *sourceEncoding);

  /// Converts the passed source string from its encoding specified by 'init' to 
  /// UTF-8 and places it in dest.  If there is a decoding error, this returns true.
  /// 
  bool decodeToUtf8(std::vector<char> &dest, StringRef Source) = 0;

private:
  UnicodeDecoder(const UnicodeDecoder&);
  const UnicodeDecoder& operator=(const UnicodeDecoder&);
  void *Impl;  // Backing implementation.
};

Then put all of the configury based logic and any basic implementations that we want to provide (for hosts without iconv) into the UnicodeTranscoder.cpp file.

Also, as a follow-on patch after this basic support is in, the BOM detection and UnicodeUtil.h stuff can be added as static functions to the UnicodeDecoder/UnicoderEncoder classes as makes sense.


> * It turned out to be quite ugly to get iconv working on Windows. See
> comment in NativeIconv.cpp. If what's there is objectionable, I'd
> prefer to rip out Windows support of iconv for now.

I'd be completely fine leaving this out and either tackling it as a follow-on patch or leaving it for someone who cares about windows to tackle it.  It would also be nice to eventually add iconv-less support for Shift-Jis, but again, you don't have to be the one to do it, just get the interface set up so someone can drop the code in.

> * Difficult to automatically test as iconv implementations support
> very different sets of encodings.


You can add some simple tests to llvm/unittests, and those tests can be #ifdef'd on iconv support.  It doesn't need to cover every possible encoding, just something to prove validity of the theme.

Once this and the configury patch go into LLVM, we can take another look at the Clang patch.

-Chris



More information about the cfe-commits mailing list