[llvm-commits] [PATCH] YAML parser.

Thu Mar 22 18:25:01 PDT 2012

On Mon, Mar 19, 2012 at 9:48 PM, Chris Lattner <clattner at apple.com> wrote:
> On Mar 19, 2012, at 5:44 PM, Michael Spencer wrote:
>>> Who do you want to review this?  Also, please attach the latest patch.
>>>
>>> -Chris
>>
>> Attached. I'm fine with anyone reviewing it, Manuel felt like it
>> needed another reviewer. However, you should take a look at the
>> inclusion of the pyyaml tests (specifically LICENSE.TXT and
>> test/YAMLParser/LICENSE.txt). This license _only_ applies to the files
>> in test/YAMLParser.
>
> I'm not particularly thrilled with adding another license dependency (even for testcases) for something like this: is it reasonably possible to keep it out of tree or plop it into the test-suite repository?

Keeping it out of tree or in the test-suite makes it very easy to
break, and I can't even run the test-suite tests. However, I can write
my own tests based on the standard.

> The patch is generally looking good, dramatically improved from the last time I looked at it.  Some comments:
>
> +++ b/include/llvm/Support/YAMLParser.h
> @@ -0,0 +1,541 @@
> +//===--- YAMLParser.h - Simple YAML parser --------------------------------===//
> ...
> +#include "llvm/ADT/APInt.h"
> +#include "llvm/ADT/OwningPtr.h"
> +#include "llvm/ADT/SmallString.h"
> +#include "llvm/ADT/StringRef.h"
> +#include "llvm/Support/Allocator.h"
> +#include "llvm/Support/Casting.h"
> +#include "llvm/Support/type_traits.h"
> +#include "llvm/Support/SMLoc.h"
>
> Can these be pruned at all?  It doesn't seem obvious that you're using APInt, type_traits.h, Casting.h, or SmallString.h (see below) here, for example.

Done.

> +namespace {
> +bool getAs(const ScalarNode *SN, bool &Result) {
> +  SmallString<4> Storage;
> +  StringRef Value = SN->getValue(Storage);
> +  if (Value == "true")
> +    Result = true;
> +  else if (Value == "false")
> +    Result = false;
> +  else
> +    return false;
> +  return true;
> +}
> +}
> +
>
> Please use 'static' instead of anon-namespace for functions.  Why is this a static function in a header?  It seems like it should be out of line, also it looks dead, along with:
>
> +template<class T>
> +typename enable_if_c<std::numeric_limits<T>::is_integer, bool>::type
> +getAs(const ScalarNode *SN, T &Result) {
> +  SmallString<4> Storage;
> +  StringRef Value = SN->getValue(Storage);
> +  if (Value.getAsInteger(0, Result))
> +    return false;
> +  return true;
> +}
>
> ... I don't get what this is doing, it looks dead to me.  If these are intended to be used by other clients, we should come up with a better and more elegant way to do this.

They are used by clients. More will be needed, but this can actually
be pushed into ScalarNode. The original intent was to allow client
template specializations to allow concise conversion to user defined
types of any node. I have code that would actually use this, but the
current interface wouldn't allow proper error reporting.

> +class Node {
> ...
> +  // These functions forward to Document and Scanner.
> +  Token &peekNext();
> +  Token getNext();
> +  Node *parseBlockNode();
>
> I'm not deeply familiar with the code, but it seems odd to have parser logic buried in the node classes.  Moving this out of the class would also allow Token to be dropped into an anonymous namespace.

These are required because the parser is incremental. Given a Sequence
or Mapping node, every time you ask for the next entry it must be
parsed.

> +  Document *Doc;
>
> It also seems strange for Node to have a pointer to its document.  If you have space to burn :), it would probably be more useful to have a PointerUnion<Node*,Document*> to represent parent pointer or document pointer.

Nodes don't need to know anything about their parent. It need document
for parsing reasons mentioned above.

> +  SMRange SourceRange;
>
> Also strange - a typical pattern for an AST would be for the sourcerange to be computed, allowing classes to store their minimal representation (ScalarNode could derive it from Value, for example, saving two words per ScalarNode).

This is only applicable to ScalarNode, the rest don't store any other
information relating to source range, as everything is parsed
incrementally.

> typo: modifcation

Done.

>
> +/// @brief A key and value pair. While not technically a Node under the YAML
> +///        representation graph, it is easier to treat them this way.
> +///
> +/// TODO: Consider making this not a child of Node.
> +class KeyValueNode : public Node {
>
> +/// @brief Represents an alias to a Node with an anchor.
>
> For each of these ASTs, it would be nice to give an example of what they look like in a YAML document, just for intuition.

Done.

>
> +class document_iterator {
> +  document_iterator operator ++() {
> +    if (!Doc->skip()) {
> +      delete Doc;
> +      Doc = 0;
> +    } else {
> +      Stream &S = Doc->stream;
> +      delete Doc;
> +      Doc = new Document(S);
> +    }
>
> Should "Doc" just be an OwningPtr?

Done.

> +++ b/lib/Support/YAMLParser.cpp
>
> You've got some redundant #includes.

Done.

> +namespace llvm {
> +namespace yaml {
>
> Please don't put huge amounts of stuff in namespaces like this, per the coding standards.

Done.

> +/// getBOM - Reads up to the first 4 bytes to determine the Unicode encoding
> +///          form of \a Input.
> +///
> +/// @param Input A string of length 0 or more.
> +/// @returns An EncodingInfo indicating the Unicode encoding form of the input
> +///          and how long the byte order mark is if one exists.
> +EncodingInfo getUnicodeEncoding(StringRef Input) {
>
> This should be static, the doxygen comment is out of date.

Done.

> This function looks like an over-optimized version of something like this:
>
>  if (Input.starts_with(StringRef("\x00\x00\xFE\xFF"))
>     return std::make_pair(UEF_UTF32_BE, 4);
> ...

It looks like it does because it was implemented from the table in the spec.

> +/// Token - A single YAML token.
> +struct Token : ilist_node<Token> {
>
> Unless I'm missing something, Tokens are all fixed size with no subclasses.  If this is the case, ilist is inappropriate, just use std::list<Token>, eliminating the ilist stuff.

std::list makes it more difficult to use the BumpPtrAllocator. It also
makes Iterators not directly convertible to Token*.

> +struct SimpleKey {
>
> Looks like it could be in an anon namespace.

Done.

> +UTF8Decoded decodeUTF8(StringRef Range) {
>
> Looks like it should be marked static and/or moved out to some other part of libsupport.  Perhaps a method on StringRef.

Marked static. If we actually had any other users of it I would want a
proper API.

> +/// @brief Scans YAML tokens from a MemoryBuffer.
> +class Scanner {
>
> Looks like it should be in an anonymous namespace.

Done.

> +  /// @brief Skip a single nb-char[27] starting at Position.
> +  ///
> +  /// A nb-char is 0x9 | [0x20-0x7E] | 0x85 | [0xA0-0xD7FF] | [0xE000-0xFEFE]
> +  ///                  | [0xFF00-0xFFFD] | [0x10000-0x10FFFF]
> +  ///
> +  /// @returns The code unit after the nb-char, or Position if it's not an
> +  ///          nb-char.
> +  StringRef::iterator skip_nb_char(StringRef::iterator Position);
>
> These sorts of methods seem like they should all be static helper functions, not methods on Scanner.  If you prefer to keep them as methods, please make them be static methods if they don't use any Scanner state.

They all use Scanner state to keep track of the end of file and to
report errors.

> +  template<StringRef::iterator (Scanner::*Func)(StringRef::iterator)>
> +  StringRef::iterator skip_while(StringRef::iterator Position) {
>
> It looks like you're just using a template here to force the compiler to instantiate it.  It would be more clear to use a standard function argument, and use always_inline if the optimizer isn't doing the right thing. Incidentally, if you can make all these methods static, then you can use a function pointer instead of member pointer.

Changed to a member function pointer.

> +  /// @brief Current column number in Unicode code points.
> +  unsigned Column;
> +
> +  /// @brief Current line number.
> +  unsigned Line;
>
> As a random comment, if you care about parser performance, it would be better to not calculate these on the fly: instead, just calculate them on demand (assuming they are rarely used).  This is a well established pattern in Clang and is important for lexer performance.

As YAML is a white-space sensitive format these are both used a lot.

> +namespace {
> +SmallString<4> encodeUTF8(uint32_t UnicodeScalarValue) {
>
> Please use static (see coding conventions).  Also, returning a SmallString isn't great.  It looks like a vast majority of clients already have a SmallVectorImpl<char> that they're appending to, just pass that in, instead of returning a temporary smallstring.  This sort of thing pains me:
>
> +              SmallString<4> Val = encodeUTF8(0xA0);
> +              Storage.insert(Storage.end(), Val.begin(), Val.end());
> +              break;

Static and made it return via SmallVectorImpl<char>& out param.

> +bool yaml::dumpTokens(StringRef Input, raw_ostream &OS) {
>
> It's fairly obvious, bug please add a doc comment for each of these main entrypoints.

These are all documented in the header.

> +void Scanner::skip(uint32_t Distance) {
> +  Current += Distance;
> +  Column += Distance;
> +}
>
> The column computation seems dubious to me here, best to compute it lazily on demand :)

Same as above.

> +  if (*Current == '[')
> +    return scanFlowCollectionStart(true);
> +
> +  if (*Current == '{')
> +    return scanFlowCollectionStart(false);
> +
> +  if (*Current == ']')
> +    return scanFlowCollectionEnd(true);
> ...
>
> switch statement?

I find it easier to read without going to a switch statement between
each conditional test that cannot be part of a switch statement, which
quite a few of them are.

> +void Node::setError(const Twine &Msg, Token &Tok) const {
> +  Doc->setError(Msg, Tok);
> +}
> +
> +bool Node::failed() const {
> +  return Doc->failed();
> +}
>
> Are these helper methods really that important?

The functions they call are private members of Document, so directly
accessing them would require explicitly making all nodes friends of
Document.

> +StringRef ScalarNode::getValue(SmallVectorImpl<char> &Storage) const {
>
> The nesting is too great :), please split out part of the body (perhaps the "Insert all previous chars into Storage" loop) out to its own helper function.

Done.

> This is an odd pattern:
> +      return Key = new (getAllocator().Allocate<NullNode>()) NullNode(Doc);
>
>
> Can't you overload "operator new" for nodes to take an allocator, allowing:
> +      return Key = new (getAllocator()) NullNode(Doc);
>
> ?

Done.

> +++ b/utils/yaml-bench/YAMLBench.cpp
>
> I know it doesn't matter much, but all the functions could be marked static.

Done.

> -Chris

Updated patch attached.

Thanks for the review!

- Michael Spencer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: yaml-22-03-2012.patch
Type: application/octet-stream
Size: 155755 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20120322/09561b19/attachment.obj>