[llvm-commits] [PATCH] YAML parser.

Michael Spencer bigcheesegs at gmail.com
Fri Mar 16 08:35:41 PDT 2012


On Wed, Mar 7, 2012 at 5:12 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
> On Fri, Feb 24, 2012 at 4:40 PM, Manuel Klimek <klimek at google.com> wrote:
>> Ok, generally looks good for our use case minus style (which I don't
>> feel comfortable making calls on). I assume you'll need to get
>> approval from somebody else...
>>
>> Cheers,
>> /Manuel
>
> Ping.
>
> - Michael Spencer
>
>> On Tue, Feb 21, 2012 at 11:12 AM, Michael Spencer <bigcheesegs at gmail.com> wrote:
>>> On Mon, Feb 20, 2012 at 5:04 PM, Manuel Klimek <klimek at google.com> wrote:
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
>>>> File lib/Support/YAMLParser.cpp (right):
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>>> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
>>>> std::string(HexStr.size() - 2, '0') + HexStr;
>>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>> > Why's the middle part there?
>>>>>
>>>>> \x requires two characters after it. This adds the 0 if needed.
>>>>
>>>> Ok, shouldn't that be 2 - HexStr.size() then?
>>>
>>> Oh, you're right. I fixed that later in the code, but not up here. Fixed.
>>>
>>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>>> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
>>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>> > Somehow I cannot match this up with 5.1.
>>>>> > Especially:
>>>>> > """On input, a YAML processor must accept all Unicode characters except
>>>> those
>>>>> > explicitly excluded above."""
>>>>> >
>>>>> > And, why do we care about non-printable chars at all?
>>>>>
>>>>> This excludes the characters listed in 5.1 in addition to the other stuff
>>>>> removed from nb-char.
>>>>>
>>>>> This is needed to skip entire UTF-8 minimal well formed subsequences at a time
>>>>> to keep proper column counts.
>>>>
>>>> If I'm not completely misreading, this excludes much more:
>>>> """To ensure readability, YAML streams use only the printable subset of the
>>>> Unicode character set. The allowed character range explicitly excludes the C0
>>>> control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are
>>>> allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is
>>>> allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF."""
>>>>
>>>> The surrogate block cannot actually come out of decodeUTF8.
>>>
>>> All of the ranges excluded there are excluded by skip_nb_char. Yes,
>>> the surrogate block check is redundant. It documents the exact ranges
>>> that will be accepted.
>>>
>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>>> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
>>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>> > Do we want to support all possible escapes, or can we get away with the
>>>> subset
>>>>> > we really need?
>>>>>
>>>>> It's only a few extra lines with no extra complexity...
>>>>
>>>> It is code that needs to be maintained ... shrug.
>>>>
>>>> On Thu, Feb 16, 2012 at 2:38 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
>>>>> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h
>>>>> File include/llvm/Support/YAMLParser.h (right):
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h#newcode71
>>>>> include/llvm/Support/YAMLParser.h:71: std::string escape(StringRef
>>>>> Input);
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> This is not called anywhere yet(?) Why do we need it? Are we planning
>>>>>
>>>>> to write
>>>>>>
>>>>>> yaml or do we only need to parse it?
>>>>>
>>>>>
>>>>> It's used when writing out YAML files.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
>>>>> File lib/Support/YAMLParser.cpp (right):
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode667
>>>>> lib/Support/YAMLParser.cpp:667: std::string EscapedInput;
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> Curious: is there a reason not to use a Twine here (which would be my
>>>>>
>>>>> default
>>>>>>
>>>>>> choice for string concatenation).
>>>>>
>>>>>
>>>>> Twine cannot be used here as it is only valid until the end of the
>>>>> statement. See Twine.h.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode693
>>>>> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
>>>>> std::string(HexStr.size() - 2, '0') + HexStr;
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> Why's the middle part there?
>>>>>
>>>>>
>>>>> \x requires two characters after it. This adds the 0 if needed.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode694
>>>>> lib/Support/YAMLParser.cpp:694: } else if (*i & 0x80) { // utf8
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> // utf8
>>>>>> is confusing to me here:
>>>>>> I thought we always assume utf8, and 0x80 merely means that we have a
>>>>>
>>>>> multi-byte
>>>>>>
>>>>>> utf8 sequence...
>>>>>
>>>>>
>>>>>> So I'd expect a comment like:
>>>>>> // utf8 multi-byte sequence
>>>>>
>>>>>
>>>>> Done.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode788
>>>>> lib/Support/YAMLParser.cpp:788: if (   *Position == 0x09
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> Indentation off?
>>>>>
>>>>>
>>>>> It lines up with the other side of the || below.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode795
>>>>> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> Somehow I cannot match this up with 5.1.
>>>>>> Especially:
>>>>>> """On input, a YAML processor must accept all Unicode characters
>>>>>
>>>>> except those
>>>>>>
>>>>>> explicitly excluded above."""
>>>>>
>>>>>
>>>>>> And, why do we care about non-printable chars at all?
>>>>>
>>>>>
>>>>> This excludes the characters listed in 5.1 in addition to the other
>>>>> stuff removed from nb-char.
>>>>>
>>>>> This is needed to skip entire UTF-8 minimal well formed subsequences at
>>>>> a time to keep proper column counts.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode839
>>>>> lib/Support/YAMLParser.cpp:839: if (  (C >= '0' && C <= '9')
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> Just return.
>>>>>
>>>>>
>>>>> Done.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode847
>>>>> lib/Support/YAMLParser.cpp:847: if (  C == '-'
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> Just return.
>>>>>
>>>>>
>>>>> Done.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode860
>>>>> lib/Support/YAMLParser.cpp:860: && Current + 2 < End
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> So we're not inc'ing Current by 3 when encountering %NN because NN
>>>>>
>>>>> will be
>>>>>>
>>>>>> parsable anyway?
>>>>>
>>>>>
>>>>> Yes, however, there's no reason it can't skip all 3.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode1627
>>>>> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>
>>>>>> Do we want to support all possible escapes, or can we get away with
>>>>>
>>>>> the subset
>>>>>>
>>>>>> we really need?
>>>>>
>>>>>
>>>>> It's only a few extra lines with no extra complexity...
>>>>>
>>>>> http://codereview.appspot.com/5604054/
>>>>>
>>>>> - Michael Spencer
>>>
>>> Attached is a new patch that includes your changes, and the updated
>>> decodeUTF8 you posted on IRC.
>>>
>>> - Michael Spencer

Ping

- Michael Spencer




More information about the llvm-commits mailing list