[llvm-commits] [PATCH] YAML parser.

Manuel Klimek klimek at google.com
Mon Feb 20 17:04:53 PST 2012


http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
File lib/Support/YAMLParser.cpp (right):

http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
std::string(HexStr.size() - 2, '0') + HexStr;
On 2012/02/16 22:32:42, Bigcheesegs wrote:
> On 2012/02/16 15:52:35, klimek wrote:
> > Why's the middle part there?
>
> \x requires two characters after it. This adds the 0 if needed.

Ok, shouldn't that be 2 - HexStr.size() then?

http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
On 2012/02/16 22:32:42, Bigcheesegs wrote:
> On 2012/02/16 15:52:35, klimek wrote:
> > Somehow I cannot match this up with 5.1.
> > Especially:
> > """On input, a YAML processor must accept all Unicode characters except
those
> > explicitly excluded above."""
> >
> > And, why do we care about non-printable chars at all?
>
> This excludes the characters listed in 5.1 in addition to the other stuff
> removed from nb-char.
>
> This is needed to skip entire UTF-8 minimal well formed subsequences at a time
> to keep proper column counts.

If I'm not completely misreading, this excludes much more:
"""To ensure readability, YAML streams use only the printable subset of the
Unicode character set. The allowed character range explicitly excludes the C0
control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are
allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is
allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF."""

The surrogate block cannot actually come out of decodeUTF8.

http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#ne...
lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
On 2012/02/16 22:32:42, Bigcheesegs wrote:
> On 2012/02/16 15:52:35, klimek wrote:
> > Do we want to support all possible escapes, or can we get away with the
subset
> > we really need?
>
> It's only a few extra lines with no extra complexity...

It is code that needs to be maintained ... shrug.

On Thu, Feb 16, 2012 at 2:38 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h
> File include/llvm/Support/YAMLParser.h (right):
>
> http://codereview.appspot.com/5604054/diff/6001/include/llvm/Support/YAMLParser.h#newcode71
> include/llvm/Support/YAMLParser.h:71: std::string escape(StringRef
> Input);
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> This is not called anywhere yet(?) Why do we need it? Are we planning
>
> to write
>>
>> yaml or do we only need to parse it?
>
>
> It's used when writing out YAML files.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp
> File lib/Support/YAMLParser.cpp (right):
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode667
> lib/Support/YAMLParser.cpp:667: std::string EscapedInput;
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> Curious: is there a reason not to use a Twine here (which would be my
>
> default
>>
>> choice for string concatenation).
>
>
> Twine cannot be used here as it is only valid until the end of the
> statement. See Twine.h.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode693
> lib/Support/YAMLParser.cpp:693: EscapedInput += "\\x" +
> std::string(HexStr.size() - 2, '0') + HexStr;
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> Why's the middle part there?
>
>
> \x requires two characters after it. This adds the 0 if needed.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode694
> lib/Support/YAMLParser.cpp:694: } else if (*i & 0x80) { // utf8
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> // utf8
>> is confusing to me here:
>> I thought we always assume utf8, and 0x80 merely means that we have a
>
> multi-byte
>>
>> utf8 sequence...
>
>
>> So I'd expect a comment like:
>> // utf8 multi-byte sequence
>
>
> Done.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode788
> lib/Support/YAMLParser.cpp:788: if (   *Position == 0x09
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> Indentation off?
>
>
> It lines up with the other side of the || below.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode795
> lib/Support/YAMLParser.cpp:795: if (   u8d.second != 0
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> Somehow I cannot match this up with 5.1.
>> Especially:
>> """On input, a YAML processor must accept all Unicode characters
>
> except those
>>
>> explicitly excluded above."""
>
>
>> And, why do we care about non-printable chars at all?
>
>
> This excludes the characters listed in 5.1 in addition to the other
> stuff removed from nb-char.
>
> This is needed to skip entire UTF-8 minimal well formed subsequences at
> a time to keep proper column counts.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode839
> lib/Support/YAMLParser.cpp:839: if (  (C >= '0' && C <= '9')
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> Just return.
>
>
> Done.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode847
> lib/Support/YAMLParser.cpp:847: if (  C == '-'
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> Just return.
>
>
> Done.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode860
> lib/Support/YAMLParser.cpp:860: && Current + 2 < End
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> So we're not inc'ing Current by 3 when encountering %NN because NN
>
> will be
>>
>> parsable anyway?
>
>
> Yes, however, there's no reason it can't skip all 3.
>
>
> http://codereview.appspot.com/5604054/diff/6001/lib/Support/YAMLParser.cpp#newcode1627
> lib/Support/YAMLParser.cpp:1627: switch (UnquotedValue[0]) {
> On 2012/02/16 15:52:35, klimek wrote:
>>
>> Do we want to support all possible escapes, or can we get away with
>
> the subset
>>
>> we really need?
>
>
> It's only a few extra lines with no extra complexity...
>
> http://codereview.appspot.com/5604054/
>
> - Michael Spencer




More information about the llvm-commits mailing list