[llvm-commits] [PATCH] YAML parser.

Michael Spencer bigcheesegs at gmail.com
Mon Mar 19 11:44:42 PDT 2012


On Fri, Mar 16, 2012 at 8:35 AM, Michael Spencer <bigcheesegs at gmail.com> wrote:
> On Wed, Mar 7, 2012 at 5:12 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
>> On Fri, Feb 24, 2012 at 4:40 PM, Manuel Klimek <klimek at google.com> wrote:
>>> Ok, generally looks good for our use case minus style (which I don't
>>> feel comfortable making calls on). I assume you'll need to get
>>> approval from somebody else...
>>>
>>> Cheers,
>>> /Manuel
>>
>> Ping.
>>
>> - Michael Spencer
>>
>>> On Tue, Feb 21, 2012 at 11:12 AM, Michael Spencer <bigcheesegs at gmail.com> wrote:
>>>> On Mon, Feb 20, 2012 at 5:04 PM, Manuel Klimek <klimek at google.com> wrote:
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
>>>>> File lib/Support/YAMLParser.cpp (right):
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>>>> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
>>>>> std::string(HexStr.size() - 2, '0') + HexStr;
>>>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>> > Why's the middle part there?
>>>>>>
>>>>>> \x requires two characters after it. This adds the 0 if needed.
>>>>>
>>>>> Ok, shouldn't that be 2 - HexStr.size() then?
>>>>
>>>> Oh, you're right. I fixed that later in the code, but not up here. Fixed.
>>>>
>>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>>>> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
>>>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>> > Somehow I cannot match this up with 5.1.
>>>>>> > Especially:
>>>>>> > """On input, a YAML processor must accept all Unicode characters except
>>>>> those
>>>>>> > explicitly excluded above."""
>>>>>> >
>>>>>> > And, why do we care about non-printable chars at all?
>>>>>>
>>>>>> This excludes the characters listed in 5.1 in addition to the other stuff
>>>>>> removed from nb-char.
>>>>>>
>>>>>> This is needed to skip entire UTF-8 minimal well formed subsequences at a time
>>>>>> to keep proper column counts.
>>>>>
>>>>> If I'm not completely misreading, this excludes much more:
>>>>> """To ensure readability, YAML streams use only the printable subset of the
>>>>> Unicode character set. The allowed character range explicitly excludes the C0
>>>>> control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are
>>>>> allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is
>>>>> allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF."""
>>>>>
>>>>> The surrogate block cannot actually come out of decodeUTF8.
>>>>
>>>> All of the ranges excluded there are excluded by skip_nb_char. Yes,
>>>> the surrogate block check is redundant. It documents the exact ranges
>>>> that will be accepted.
>>>>
>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
>>>>> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
>>>>> On 2012/02/16 22:32:42, Bigcheesegs wrote:
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>> > Do we want to support all possible escapes, or can we get away with the
>>>>> subset
>>>>>> > we really need?
>>>>>>
>>>>>> It's only a few extra lines with no extra complexity...
>>>>>
>>>>> It is code that needs to be maintained ... shrug.
>>>>>
>>>>> On Thu, Feb 16, 2012 at 2:38 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
>>>>>> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h
>>>>>> File include/llvm/Support/YAMLParser.h (right):
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h#newcode71
>>>>>> include/llvm/Support/YAMLParser.h:71: std::string escape(StringRef
>>>>>> Input);
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> This is not called anywhere yet(?) Why do we need it? Are we planning
>>>>>>
>>>>>> to write
>>>>>>>
>>>>>>> yaml or do we only need to parse it?
>>>>>>
>>>>>>
>>>>>> It's used when writing out YAML files.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
>>>>>> File lib/Support/YAMLParser.cpp (right):
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode667
>>>>>> lib/Support/YAMLParser.cpp:667: std::string EscapedInput;
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> Curious: is there a reason not to use a Twine here (which would be my
>>>>>>
>>>>>> default
>>>>>>>
>>>>>>> choice for string concatenation).
>>>>>>
>>>>>>
>>>>>> Twine cannot be used here as it is only valid until the end of the
>>>>>> statement. See Twine.h.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode693
>>>>>> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
>>>>>> std::string(HexStr.size() - 2, '0') + HexStr;
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> Why's the middle part there?
>>>>>>
>>>>>>
>>>>>> \x requires two characters after it. This adds the 0 if needed.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode694
>>>>>> lib/Support/YAMLParser.cpp:694: } else if (*i & 0x80) { // utf8
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> // utf8
>>>>>>> is confusing to me here:
>>>>>>> I thought we always assume utf8, and 0x80 merely means that we have a
>>>>>>
>>>>>> multi-byte
>>>>>>>
>>>>>>> utf8 sequence...
>>>>>>
>>>>>>
>>>>>>> So I'd expect a comment like:
>>>>>>> // utf8 multi-byte sequence
>>>>>>
>>>>>>
>>>>>> Done.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode788
>>>>>> lib/Support/YAMLParser.cpp:788: if (   *Position == 0x09
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> Indentation off?
>>>>>>
>>>>>>
>>>>>> It lines up with the other side of the || below.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode795
>>>>>> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> Somehow I cannot match this up with 5.1.
>>>>>>> Especially:
>>>>>>> """On input, a YAML processor must accept all Unicode characters
>>>>>>
>>>>>> except those
>>>>>>>
>>>>>>> explicitly excluded above."""
>>>>>>
>>>>>>
>>>>>>> And, why do we care about non-printable chars at all?
>>>>>>
>>>>>>
>>>>>> This excludes the characters listed in 5.1 in addition to the other
>>>>>> stuff removed from nb-char.
>>>>>>
>>>>>> This is needed to skip entire UTF-8 minimal well formed subsequences at
>>>>>> a time to keep proper column counts.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode839
>>>>>> lib/Support/YAMLParser.cpp:839: if (  (C >= '0' && C <= '9')
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> Just return.
>>>>>>
>>>>>>
>>>>>> Done.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode847
>>>>>> lib/Support/YAMLParser.cpp:847: if (  C == '-'
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> Just return.
>>>>>>
>>>>>>
>>>>>> Done.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode860
>>>>>> lib/Support/YAMLParser.cpp:860: && Current + 2 < End
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> So we're not inc'ing Current by 3 when encountering %NN because NN
>>>>>>
>>>>>> will be
>>>>>>>
>>>>>>> parsable anyway?
>>>>>>
>>>>>>
>>>>>> Yes, however, there's no reason it can't skip all 3.
>>>>>>
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode1627
>>>>>> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
>>>>>> On 2012/02/16 15:52:35, klimek wrote:
>>>>>>>
>>>>>>> Do we want to support all possible escapes, or can we get away with
>>>>>>
>>>>>> the subset
>>>>>>>
>>>>>>> we really need?
>>>>>>
>>>>>>
>>>>>> It's only a few extra lines with no extra complexity...
>>>>>>
>>>>>> http://codereview.appspot.com/5604054/
>>>>>>
>>>>>> - Michael Spencer
>>>>
>>>> Attached is a new patch that includes your changes, and the updated
>>>> decodeUTF8 you posted on IRC.
>>>>
>>>> - Michael Spencer
>
> Ping
>
> - Michael Spencer

Ping

- Michael Spencer




More information about the llvm-commits mailing list