[llvm-commits] [PATCH] YAML parser.

Tue Feb 21 11:12:32 PST 2012

On Mon, Feb 20, 2012 at 5:04 PM, Manuel Klimek <klimek at google.com> wrote:
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
> File lib/Support/YAMLParser.cpp (right):
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
> std::string(HexStr.size() - 2, '0') + HexStr;
> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>> On 2012/02/16 15:52:35, klimek wrote:
>> > Why's the middle part there?
>>
>> \x requires two characters after it. This adds the 0 if needed.
>
> Ok, shouldn't that be 2 - HexStr.size() then?

Oh, you're right. I fixed that later in the code, but not up here. Fixed.

>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>> On 2012/02/16 15:52:35, klimek wrote:
>> > Somehow I cannot match this up with 5.1.
>> > Especially:
>> > """On input, a YAML processor must accept all Unicode characters except
> those
>> > explicitly excluded above."""
>> >
>> > And, why do we care about non-printable chars at all?
>>
>> This excludes the characters listed in 5.1 in addition to the other stuff
>> removed from nb-char.
>>
>> This is needed to skip entire UTF-8 minimal well formed subsequences at a time
>> to keep proper column counts.
>
> If I'm not completely misreading, this excludes much more:
> """To ensure readability, YAML streams use only the printable subset of the
> Unicode character set. The allowed character range explicitly excludes the C0
> control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are
> allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is
> allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF."""
>
> The surrogate block cannot actually come out of decodeUTF8.

All of the ranges excluded there are excluded by skip_nb_char. Yes,
the surrogate block check is redundant. It documents the exact ranges
that will be accepted.

> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>> On 2012/02/16 15:52:35, klimek wrote:
>> > Do we want to support all possible escapes, or can we get away with the
> subset
>> > we really need?
>>
>> It's only a few extra lines with no extra complexity...
>
> It is code that needs to be maintained ... shrug.
>
> On Thu, Feb 16, 2012 at 2:38 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
>> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h
>> File include/llvm/Support/YAMLParser.h (right):
>>
>> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h#newcode71
>> include/llvm/Support/YAMLParser.h:71: std::string escape(StringRef
>> Input);
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> This is not called anywhere yet(?) Why do we need it? Are we planning
>>
>> to write
>>>
>>> yaml or do we only need to parse it?
>>
>>
>> It's used when writing out YAML files.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
>> File lib/Support/YAMLParser.cpp (right):
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode667
>> lib/Support/YAMLParser.cpp:667: std::string EscapedInput;
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> Curious: is there a reason not to use a Twine here (which would be my
>>
>> default
>>>
>>> choice for string concatenation).
>>
>>
>> Twine cannot be used here as it is only valid until the end of the
>> statement. See Twine.h.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode693
>> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
>> std::string(HexStr.size() - 2, '0') + HexStr;
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> Why's the middle part there?
>>
>>
>> \x requires two characters after it. This adds the 0 if needed.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode694
>> lib/Support/YAMLParser.cpp:694: } else if (*i & 0x80) { // utf8
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> // utf8
>>> is confusing to me here:
>>> I thought we always assume utf8, and 0x80 merely means that we have a
>>
>> multi-byte
>>>
>>> utf8 sequence...
>>
>>
>>> So I'd expect a comment like:
>>> // utf8 multi-byte sequence
>>
>>
>> Done.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode788
>> lib/Support/YAMLParser.cpp:788: if (   *Position == 0x09
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> Indentation off?
>>
>>
>> It lines up with the other side of the || below.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode795
>> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> Somehow I cannot match this up with 5.1.
>>> Especially:
>>> """On input, a YAML processor must accept all Unicode characters
>>
>> except those
>>>
>>> explicitly excluded above."""
>>
>>
>>> And, why do we care about non-printable chars at all?
>>
>>
>> This excludes the characters listed in 5.1 in addition to the other
>> stuff removed from nb-char.
>>
>> This is needed to skip entire UTF-8 minimal well formed subsequences at
>> a time to keep proper column counts.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode839
>> lib/Support/YAMLParser.cpp:839: if (  (C >= '0' && C <= '9')
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> Just return.
>>
>>
>> Done.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode847
>> lib/Support/YAMLParser.cpp:847: if (  C == '-'
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> Just return.
>>
>>
>> Done.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode860
>> lib/Support/YAMLParser.cpp:860: && Current + 2 < End
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> So we're not inc'ing Current by 3 when encountering %NN because NN
>>
>> will be
>>>
>>> parsable anyway?
>>
>>
>> Yes, however, there's no reason it can't skip all 3.
>>
>>
>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode1627
>> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
>> On 2012/02/16 15:52:35, klimek wrote:
>>>
>>> Do we want to support all possible escapes, or can we get away with
>>
>> the subset
>>>
>>> we really need?
>>
>>
>> It's only a few extra lines with no extra complexity...
>>
>> http://codereview.appspot.com/5604054/
>>
>> - Michael Spencer

Attached is a new patch that includes your changes, and the updated
decodeUTF8 you posted on IRC.

- Michael Spencer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: yaml-parser-21-02-2012.patch
Type: application/octet-stream
Size: 164929 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20120221/a2022c4a/attachment.obj>