[llvm-commits] [PATCH] YAML parser.

Thu Feb 2 02:24:02 PST 2012

Rietveld link: http://codereview.appspot.com/5604054/
This is an attempt to use rietveld for the review, but it looks like
the generated mails are not really up to the task :( Sorry for the
verbosity, will look at different ways in the future...

First run over the header. Generally, the tooling code doesn't change
too much with the new interface, so it's definitely in an "I can live
with it" state.

One main high-level design question: Do we really need all the
implemented YAML features?

Some general design things we probably just have different opinions
about, which I ran into while changing the tooling code to use the
YAMLParser. Feel free to strongly object :)
- I find it easier as a user to have the public section as first section
in a class definition; you seem to have a mixture, which I find a little
confusing
- const iterators are missing
- I find references to higher-level scope objects as members in a class
to be bug-prone (you cannot see that the reference is kept from the call
site)
- I'm a strong opponent of abbrs; I'm not sure whether as a non-native
speaker it's just harder to read them or I'm just obsessive, but I'm a
big fan of spending the 0.01 seconds it takes to spell out Current or
Position.

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h
File include/llvm/Support/YAMLParser.h (right):

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode18
include/llvm/Support/YAMLParser.h:18: //    * BOMs anywhere other than
the first Unicode scalar value in the file.
It would be cool to have a pointer to the first class I have to look at
as a user (Stream). In general, I'm wondering whether we can reduce the
size of the visible interface here...

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode44
include/llvm/Support/YAMLParser.h:44: UEF_UTF32_LE,
Perhaps document LE vs. BE...

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode64
include/llvm/Support/YAMLParser.h:64: struct StreamStartInfo {
Is this only used for the scanning? If so, would it make sense to split
up the header? Why's this a struct instead of a typedef? (do you intend
to extend this?) Some documentation might help. Same for the next 2
structs.

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode77
include/llvm/Support/YAMLParser.h:77: struct Token : ilist_node<Token> {
Are Tokens meant to be changed after being created? If not, I'd vote for
a non-modifiable design.

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode103
include/llvm/Support/YAMLParser.h:103: /// A string of length 0 or more
who's begin() points to the logical location
s/who's/whose/

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode108
include/llvm/Support/YAMLParser.h:108: ///       can't be in a union in
C++03.
Why not use inheritance instead?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode155
include/llvm/Support/YAMLParser.h:155: /// Simple keys are handled by
creating an entry in SimpleKeys for each Token
I don't find the concept of a "simple key" in the YAML spec - I think it
would be helpful to reference the standard.

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode161
include/llvm/Support/YAMLParser.h:161: /// others) before the
SimpleKey's Tok.
In general, this comment doesn't really help me understand the concept.
It basically explains some coupling, but that looks pretty random when
I'm contextually at this part of the file.

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode162
include/llvm/Support/YAMLParser.h:162: struct SimpleKey {
At a first glance this looks like an internal structure used for parsing
- can we hide this?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode177
include/llvm/Support/YAMLParser.h:177: Scanner(const StringRef input,
SourceMgr &sm);
Style: variables start with upper case letter (or is there a special
rule I'm missing?)

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode191
include/llvm/Support/YAMLParser.h:191: if (Pos >= End)
Why can that happen?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode215
include/llvm/Support/YAMLParser.h:215: bool isAtEnd(StringRef::iterator
i = 0) {
Unused?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode233
include/llvm/Support/YAMLParser.h:233: /// @brief Skip a single
nb-char[27] starting at Pos.
What does 'nb' stand for here? Same for the functions below, I have no
idea what those initializations mean.

Also, here and elsewhere - are those bracketed numbers referring to
documentation somewhere?

And lastly, style: shouldn't methods be camelCase?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode262
include/llvm/Support/YAMLParser.h:262: /// @brief Skip minimal
well-formed code unit subsequence's until Func
s/'//?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode270
include/llvm/Support/YAMLParser.h:270: StringRef::iterator i =
(this->*Func)(Pos);
Style: capital letter for variables?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode304
include/llvm/Support/YAMLParser.h:304: /// @brief Remove simple keys
that can no longer be valid simple keys.
If they can no longer be valid simple keys, perhaps call this
removeStaleSimpleKeyCandidates?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode319
include/llvm/Support/YAMLParser.h:319: bool rollIndent( int Col
Is the indent here intentional?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode413
include/llvm/Support/YAMLParser.h:413: TokenQueueT TokenQueue;
If this only holds a limited number of tokens (as simple keys are
limited to 1024 tokens if I understand it correctly), would using a
SmallVector lead to the same performance with less code?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode422
include/llvm/Support/YAMLParser.h:422: class document_iterator;
>From your naming I assume that using lower_case for a class is correct
style for iterators?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode437
include/llvm/Support/YAMLParser.h:437: Stream(StringRef input, SourceMgr
&sm);
Capitalization of variables?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode467
include/llvm/Support/YAMLParser.h:467: Node(unsigned int Type, Document
*D, StringRef A);
It's really hard to guess what 'A' stands for here without reading the
context - can we call it Anchor?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode477
include/llvm/Support/YAMLParser.h:477: // These functions forward to
Document and Scanner.
This makes the Node interface pretty large... Is that really necessary?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode493
include/llvm/Support/YAMLParser.h:493: static inline bool classof(const
NullNode *) { return true; }
Does the 'inline' keyword actually change anything here?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode512
include/llvm/Support/YAMLParser.h:512: // utf8).
Why not whatever the original encoding was?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode536
include/llvm/Support/YAMLParser.h:536: static inline bool classof(const
KeyValueNode *) { return true; }
Is there an underlying strategy how we order the type information vs.
the normal methods?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode541
include/llvm/Support/YAMLParser.h:541: /// @brief Parse and return the
key.
Document that calling this multiple times is ok?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode541
include/llvm/Support/YAMLParser.h:541: /// @brief Parse and return the
key.
I'd document that calling this multiple times is ok...

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode586
include/llvm/Support/YAMLParser.h:586: if ((Base && Other.Base) &&
Base->CurrentEntry != Other.Base->CurrentEntry)
Just return...

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode613
include/llvm/Support/YAMLParser.h:613: Type MType;
MappingType? Or call the enum MappingType and this Type?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode702
include/llvm/Support/YAMLParser.h:702: iterator begin() {
I'm not really happy about the duplication in begin() and skip(),
especially as the change to mid-parse skip() support probably will end
up to be exactly the same... Pull out a base class?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode745
include/llvm/Support/YAMLParser.h:745: Stream &S;
/me cries. But yea, I know, I'm not in the majority with my strong
aversion of Os.

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode756
include/llvm/Support/YAMLParser.h:756: return S.S.peekNext();
And this is why: I can deduce locally that the first S is the Stream,
but I had to go back and read the Stream implementation to see that the
second S is the Scanner.

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode771
include/llvm/Support/YAMLParser.h:771: void handleTagDirective(const
Token &t) {
I assume this is a todo?

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode777
include/llvm/Support/YAMLParser.h:777: bool dir = false;
Style: CapVariables. Also, 'dir' sounds like 'directory'... (yea yea, I
know, I'm strange).

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode809
include/llvm/Support/YAMLParser.h:809: Token &t = peekNext();
Style: CapVariables

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode846
include/llvm/Support/YAMLParser.h:846: document_iterator(Document *d) :
Doc(d) {}
Style: CapVariables

http://codereview.appspot.com/5604054/diff/1/include/llvm/Support/YAMLParser.h#newcode856
include/llvm/Support/YAMLParser.h:856: // In place destruct and new are
used here to avoid an unneeded
This looks a little over the top - I don't expect this to be performance
relevant at all...

Cheers,
/Manuel

On Mon, Jan 30, 2012 at 9:08 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
> Attached is the patch for the YAML parser I've been working on. YAML
> is a super set of JSON that adds many features that I want for writing
> tests for lld (the llvm linker) and other places where we use object
> files.
>
> The API is very similar to the existing JSON API, but is not exactly
> the same as YAML has an extended data model.
>
> This parser is slower than the currently existing JSON parser. For
> files with large scalars, there is almost no difference. For medium
> scalars, there's a ~2x slowdown. And for small scalars, there's a ~6x
> slowdown.
>
> Here are some performance numbers for {yaml,json}-bench -memory-limit
> 100 (a 32bit build can't do much more than that :P). Note that for
> YAML. The Parsing time includes the Tokenizing time.
>
> c:\Users\mspencer\Projects\llvm-project\llvm>yaml-bench -memory-limit 100
> ===-------------------------------------------------------------------------===
>                             YAML parser benchmark
> ===-------------------------------------------------------------------------===
>  Total Execution Time: 5.2104 seconds (5.2185 wall clock)
>
>   ---User Time---   --System Time--   --User+System--   ---Wall
> Time---  --- Name ---
>   2.8392 ( 56.3%)   0.1716 (100.0%)   3.0108 ( 57.8%)   3.0118 (
> 57.7%)  Small Values: Parsing
>   2.1216 ( 42.1%)   0.0000 (  0.0%)   2.1216 ( 40.7%)   2.1257 (
> 40.7%)  Small Values: Tokenizing
>   0.0780 (  1.5%)   0.0000 (  0.0%)   0.0780 (  1.5%)   0.0810 (
> 1.6%)  Small Values: Loop
>   5.0388 (100.0%)   0.1716 (100.0%)   5.2104 (100.0%)   5.2185 (100.0%)  Total
>
> ===-------------------------------------------------------------------------===
>                             YAML parser benchmark
> ===-------------------------------------------------------------------------===
>  Total Execution Time: 0.4836 seconds (0.4740 wall clock)
>
>   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>   0.2184 ( 45.2%)   0.2184 ( 45.2%)   0.2200 ( 46.4%)  Medium Values: Parsing
>   0.1716 ( 35.5%)   0.1716 ( 35.5%)   0.1710 ( 36.1%)  Medium Values:
> Tokenizing
>   0.0936 ( 19.4%)   0.0936 ( 19.4%)   0.0830 ( 17.5%)  Medium Values: Loop
>   0.4836 (100.0%)   0.4836 (100.0%)   0.4740 (100.0%)  Total
>
> ===-------------------------------------------------------------------------===
>                             YAML parser benchmark
> ===-------------------------------------------------------------------------===
>  Total Execution Time: 0.2496 seconds (0.2480 wall clock)
>
>   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>   0.0780 ( 31.3%)   0.0780 ( 31.3%)   0.0830 ( 33.5%)  Large Values: Parsing
>   0.0936 ( 37.5%)   0.0936 ( 37.5%)   0.0830 ( 33.5%)  Large Values: Tokenizing
>   0.0780 ( 31.3%)   0.0780 ( 31.3%)   0.0820 ( 33.1%)  Large Values: Loop
>   0.2496 (100.0%)   0.2496 (100.0%)   0.2480 (100.0%)  Total
>
> c:\Users\mspencer\Projects\llvm-project\llvm>json-bench -memory-limit 100
> ===-------------------------------------------------------------------------===
>                             JSON parser benchmark
> ===-------------------------------------------------------------------------===
>  Total Execution Time: 0.6552 seconds (0.6531 wall clock)
>
>   ---User Time---   --System Time--   --User+System--   ---Wall
> Time---  --- Name ---
>   0.5460 ( 87.5%)   0.0312 (100.0%)   0.5772 ( 88.1%)   0.5721 (
> 87.6%)  Small Values: Parsing
>   0.0780 ( 12.5%)   0.0000 (  0.0%)   0.0780 ( 11.9%)   0.0810 (
> 12.4%)  Small Values: Loop
>   0.6240 (100.0%)   0.0312 (100.0%)   0.6552 (100.0%)   0.6531 (100.0%)  Total
>
> ===-------------------------------------------------------------------------===
>                             JSON parser benchmark
> ===-------------------------------------------------------------------------===
>  Total Execution Time: 0.1872 seconds (0.1830 wall clock)
>
>   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>   0.1092 ( 58.3%)   0.1092 ( 58.3%)   0.1030 ( 56.3%)  Medium Values: Parsing
>   0.0780 ( 41.7%)   0.0780 ( 41.7%)   0.0800 ( 43.7%)  Medium Values: Loop
>   0.1872 (100.0%)   0.1872 (100.0%)   0.1830 (100.0%)  Total
>
> ===-------------------------------------------------------------------------===
>                             JSON parser benchmark
> ===-------------------------------------------------------------------------===
>  Total Execution Time: 0.1716 seconds (0.1620 wall clock)
>
>   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>   0.0780 ( 45.5%)   0.0780 ( 45.5%)   0.0810 ( 50.0%)  Large Values: Parsing
>   0.0936 ( 54.5%)   0.0936 ( 54.5%)   0.0810 ( 50.0%)  Large Values: Loop
>   0.1716 (100.0%)   0.1716 (100.0%)   0.1620 (100.0%)  Total
>
>
> - Michael Spencer