[llvm-commits] [PATCH] YAML parser.

Wed Mar 7 17:12:14 PST 2012

On Fri, Feb 24, 2012 at 4:40 PM, Manuel Klimek <klimek at google.com> wrote:
> Ok, generally looks good for our use case minus style (which I don't
> feel comfortable making calls on). I assume you'll need to get
> approval from somebody else...
>
> Cheers,
> /Manuel

Ping.

- Michael Spencer

> On Tue, Feb 21, 2012 at 11:12 AM, Michael Spencer <bigcheesegs at gmail.com> wrote:
>> On Mon, Feb 20, 2012 at 5:04 PM, Manuel Klimek <klimek at google.com> wrote:
>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
>>> File lib/Support/YAMLParser.cpp (right):
>>>
>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
>>> std::string(HexStr.size() - 2, '0') + HexStr;
>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>> > Why's the middle part there?
>>>>
>>>> \x requires two characters after it. This adds the 0 if needed.
>>>
>>> Ok, shouldn't that be 2 - HexStr.size() then?
>>
>> Oh, you're right. I fixed that later in the code, but not up here. Fixed.
>>
>>>
>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>> > Somehow I cannot match this up with 5.1.
>>>> > Especially:
>>>> > """On input, a YAML processor must accept all Unicode characters except
>>> those
>>>> > explicitly excluded above."""
>>>> >
>>>> > And, why do we care about non-printable chars at all?
>>>>
>>>> This excludes the characters listed in 5.1 in addition to the other stuff
>>>> removed from nb-char.
>>>>
>>>> This is needed to skip entire UTF-8 minimal well formed subsequences at a time
>>>> to keep proper column counts.
>>>
>>> If I'm not completely misreading, this excludes much more:
>>> """To ensure readability, YAML streams use only the printable subset of the
>>> Unicode character set. The allowed character range explicitly excludes the C0
>>> control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are
>>> allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is
>>> allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF."""
>>>
>>> The surrogate block cannot actually come out of decodeUTF8.
>>
>> All of the ranges excluded there are excluded by skip_nb_char. Yes,
>> the surrogate block check is redundant. It documents the exact ranges
>> that will be accepted.
>>
>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>> > Do we want to support all possible escapes, or can we get away with the
>>> subset
>>>> > we really need?
>>>>
>>>> It's only a few extra lines with no extra complexity...
>>>
>>> It is code that needs to be maintained ... shrug.
>>>
>>> On Thu, Feb 16, 2012 at 2:38 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
>>>> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h
>>>> File include/llvm/Support/YAMLParser.h (right):
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h#newcode71
>>>> include/llvm/Support/YAMLParser.h:71: std::string escape(StringRef
>>>> Input);
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> This is not called anywhere yet(?) Why do we need it? Are we planning
>>>>
>>>> to write
>>>>>
>>>>> yaml or do we only need to parse it?
>>>>
>>>>
>>>> It's used when writing out YAML files.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
>>>> File lib/Support/YAMLParser.cpp (right):
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode667
>>>> lib/Support/YAMLParser.cpp:667: std::string EscapedInput;
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> Curious: is there a reason not to use a Twine here (which would be my
>>>>
>>>> default
>>>>>
>>>>> choice for string concatenation).
>>>>
>>>>
>>>> Twine cannot be used here as it is only valid until the end of the
>>>> statement. See Twine.h.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode693
>>>> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
>>>> std::string(HexStr.size() - 2, '0') + HexStr;
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> Why's the middle part there?
>>>>
>>>>
>>>> \x requires two characters after it. This adds the 0 if needed.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode694
>>>> lib/Support/YAMLParser.cpp:694: } else if (*i & 0x80) { // utf8
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> // utf8
>>>>> is confusing to me here:
>>>>> I thought we always assume utf8, and 0x80 merely means that we have a
>>>>
>>>> multi-byte
>>>>>
>>>>> utf8 sequence...
>>>>
>>>>
>>>>> So I'd expect a comment like:
>>>>> // utf8 multi-byte sequence
>>>>
>>>>
>>>> Done.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode788
>>>> lib/Support/YAMLParser.cpp:788: if (   *Position == 0x09
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> Indentation off?
>>>>
>>>>
>>>> It lines up with the other side of the || below.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode795
>>>> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> Somehow I cannot match this up with 5.1.
>>>>> Especially:
>>>>> """On input, a YAML processor must accept all Unicode characters
>>>>
>>>> except those
>>>>>
>>>>> explicitly excluded above."""
>>>>
>>>>
>>>>> And, why do we care about non-printable chars at all?
>>>>
>>>>
>>>> This excludes the characters listed in 5.1 in addition to the other
>>>> stuff removed from nb-char.
>>>>
>>>> This is needed to skip entire UTF-8 minimal well formed subsequences at
>>>> a time to keep proper column counts.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode839
>>>> lib/Support/YAMLParser.cpp:839: if (  (C >= '0' && C <= '9')
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> Just return.
>>>>
>>>>
>>>> Done.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode847
>>>> lib/Support/YAMLParser.cpp:847: if (  C == '-'
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> Just return.
>>>>
>>>>
>>>> Done.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode860
>>>> lib/Support/YAMLParser.cpp:860: && Current + 2 < End
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> So we're not inc'ing Current by 3 when encountering %NN because NN
>>>>
>>>> will be
>>>>>
>>>>> parsable anyway?
>>>>
>>>>
>>>> Yes, however, there's no reason it can't skip all 3.
>>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode1627
>>>> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>
>>>>> Do we want to support all possible escapes, or can we get away with
>>>>
>>>> the subset
>>>>>
>>>>> we really need?
>>>>
>>>>
>>>> It's only a few extra lines with no extra complexity...
>>>>
>>>> http://codereview.appspot.com/5604054/
>>>>
>>>> - Michael Spencer
>>
>> Attached is a new patch that includes your changes, and the updated
>> decodeUTF8 you posted on IRC.
>>
>> - Michael Spencer