[cfe-commits] [patch] Fix fixme about duplicated utf conversion code in codegen and lex

Wed Jun 27 07:06:58 PDT 2012

On Mon, Jun 25, 2012 at 4:54 AM, Nico Weber <thakis at chromium.org> wrote:
> Hi,
>
> the attached patch moves the duplicated code in CGExpr.cpp and
> LiteralSupport.cpp into a new file Basic/ConvertUTFWrapper.h/cpp.
> Please let me know what you think.
>
> Nico

Looks reasonable as far as I can tell. Just some nits:

> +#ifndef LLVM_CLANG_BASIC_CONVERT_UTF_WRAPPER_H
> +#define LLVM_CLANG_BASIC_CONVERT_UTF_WRAPPER_H
> +
> +#include "llvm/ADT/StringRef.h"
> +
> +namespace clang {
> +
> +/// Convert an UTF8 StringRef to UTF8, UTF16, or UTF32 depending on
> +/// WideCharWidth. On success, ResultPtr will point one after the end of the
> +/// copied string.
> +/// \returns true on success.
> +bool ConvertUTF8toWide(unsigned WideCharWidth, llvm::StringRef Source,
> +                       char*& ResultPtr);

Should the *& be on ResultPtr here? (and in the .cpp file)
Should the comment say that it's copying into ResultPtr?

> Index: lib/Basic/ConvertUTFWrapper.cpp
> ===================================================================
> --- lib/Basic/ConvertUTFWrapper.cpp	(revision 0)
> +++ lib/Basic/ConvertUTFWrapper.cpp	(revision 0)
> @@ -0,0 +1,55 @@
> +//===-- ConvertUTFWrapper.cpp - Wrap ConvertUTF.h with clang data types -----===
> +//
> +//                     The LLVM Compiler Infrastructure
> +//
> +// This file is distributed under the University of Illinois Open Source
> +// License. See LICENSE.TXT for details.
> +//
> +//===----------------------------------------------------------------------===//
> +
> +#include "clang/Basic/ConvertUTFWrapper.h"
> +#include "clang/Basic/ConvertUTF.h"
> +#include "clang/Basic/LLVM.h"
> +
> +namespace clang {
> +
> +bool ConvertUTF8toWide(unsigned WideCharWidth, llvm::StringRef Source,
> +                       char*& ResultPtr) {
> +  assert(WideCharWidth==1 || WideCharWidth==2 || WideCharWidth==4);
> +  ConversionResult result = conversionOK;
> +  // Copy the character span over.
> +  if (WideCharWidth == 1) {
> +    if (!isLegalUTF8String(reinterpret_cast<const UTF8*>(&*Source.begin()),
> +                           reinterpret_cast<const UTF8*>(&*Source.end())))

The &* isn't really necessary for Source.begin() / end(). Do we
normally do this for StringRefs? (I see it wasn't there in
lib/Lex/LiteralSupport.cpp)

> +      result = sourceIllegal;
> +    memcpy(ResultPtr, Source.data(), Source.size());
> +    ResultPtr += Source.size();
> +  } else if (WideCharWidth == 2) {
> +    UTF8 const *sourceStart = (UTF8 const *)Source.data();

This is equivalent to "const UTF8 *sourceStart", right? At least I
find that easier to read. And maybe static_cast it?

> +    // FIXME: Make the type of the result buffer correct instead of
> +    // using reinterpret_cast.
> +    UTF16 *targetStart = reinterpret_cast<UTF16*>(ResultPtr);
> +    ConversionFlags flags = strictConversion;
> +    result = ConvertUTF8toUTF16(
> +      &sourceStart,sourceStart + Source.size(),

Nit: space between &targetStart and targetStart?

> +        &targetStart,targetStart + 2*Source.size(),flags);

And some more spaces here too..

> +    if (result==conversionOK)

And here :)

> +      ResultPtr = reinterpret_cast<char*>(targetStart);
> +  } else if (WideCharWidth == 4) {
> +    UTF8 const *sourceStart = (UTF8 const *)Source.data();

Ditto about "const UTF8* sourceStart" etc.

> +    // FIXME: Make the type of the result buffer correct instead of
> +    // using reinterpret_cast.
> +    UTF32 *targetStart = reinterpret_cast<UTF32*>(ResultPtr);
> +    ConversionFlags flags = strictConversion;
> +    result = ConvertUTF8toUTF32(

> +        &sourceStart,sourceStart + Source.size(),
> +        &targetStart,targetStart + 4*Source.size(),flags);
> +    if (result==conversionOK)

Ditto about spaces.

Thanks,
Hans