[llvm-commits] [PATCH] YAML parser.

Chris Lattner clattner at apple.com
Mon Mar 19 21:48:51 PDT 2012


On Mar 19, 2012, at 5:44 PM, Michael Spencer wrote:
>> Who do you want to review this?  Also, please attach the latest patch.
>> 
>> -Chris
> 
> Attached. I'm fine with anyone reviewing it, Manuel felt like it
> needed another reviewer. However, you should take a look at the
> inclusion of the pyyaml tests (specifically LICENSE.TXT and
> test/YAMLParser/LICENSE.txt). This license _only_ applies to the files
> in test/YAMLParser.

I'm not particularly thrilled with adding another license dependency (even for testcases) for something like this: is it reasonably possible to keep it out of tree or plop it into the test-suite repository?

The patch is generally looking good, dramatically improved from the last time I looked at it.  Some comments:

+++ b/include/llvm/Support/YAMLParser.h
@@ -0,0 +1,541 @@
+//===--- YAMLParser.h - Simple YAML parser --------------------------------===//
...
+#include "llvm/ADT/APInt.h"
+#include "llvm/ADT/OwningPtr.h"
+#include "llvm/ADT/SmallString.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Support/type_traits.h"
+#include "llvm/Support/SMLoc.h"

Can these be pruned at all?  It doesn't seem obvious that you're using APInt, type_traits.h, Casting.h, or SmallString.h (see below) here, for example.

+namespace {
+bool getAs(const ScalarNode *SN, bool &Result) {
+  SmallString<4> Storage;
+  StringRef Value = SN->getValue(Storage);
+  if (Value == "true")
+    Result = true;
+  else if (Value == "false")
+    Result = false;
+  else
+    return false;
+  return true;
+}
+}
+

Please use 'static' instead of anon-namespace for functions.  Why is this a static function in a header?  It seems like it should be out of line, also it looks dead, along with:

+template<class T>
+typename enable_if_c<std::numeric_limits<T>::is_integer, bool>::type
+getAs(const ScalarNode *SN, T &Result) {
+  SmallString<4> Storage;
+  StringRef Value = SN->getValue(Storage);
+  if (Value.getAsInteger(0, Result))
+    return false;
+  return true;
+}

... I don't get what this is doing, it looks dead to me.  If these are intended to be used by other clients, we should come up with a better and more elegant way to do this.



+class Node {
...
+  // These functions forward to Document and Scanner.
+  Token &peekNext();
+  Token getNext();
+  Node *parseBlockNode();

I'm not deeply familiar with the code, but it seems odd to have parser logic buried in the node classes.  Moving this out of the class would also allow Token to be dropped into an anonymous namespace.

+  Document *Doc;

It also seems strange for Node to have a pointer to its document.  If you have space to burn :), it would probably be more useful to have a PointerUnion<Node*,Document*> to represent parent pointer or document pointer.

+  SMRange SourceRange;

Also strange - a typical pattern for an AST would be for the sourcerange to be computed, allowing classes to store their minimal representation (ScalarNode could derive it from Value, for example, saving two words per ScalarNode).  

typo: modifcation


+/// @brief A key and value pair. While not technically a Node under the YAML
+///        representation graph, it is easier to treat them this way.
+///
+/// TODO: Consider making this not a child of Node.
+class KeyValueNode : public Node {

+/// @brief Represents an alias to a Node with an anchor.

For each of these ASTs, it would be nice to give an example of what they look like in a YAML document, just for intuition.


+class document_iterator {
+  document_iterator operator ++() {
+    if (!Doc->skip()) {
+      delete Doc;
+      Doc = 0;
+    } else {
+      Stream &S = Doc->stream;
+      delete Doc;
+      Doc = new Document(S);
+    }

Should "Doc" just be an OwningPtr?


+++ b/lib/Support/YAMLParser.cpp

You've got some redundant #includes.

+namespace llvm {
+namespace yaml {

Please don't put huge amounts of stuff in namespaces like this, per the coding standards.


+/// getBOM - Reads up to the first 4 bytes to determine the Unicode encoding
+///          form of \a Input.
+///
+/// @param Input A string of length 0 or more.
+/// @returns An EncodingInfo indicating the Unicode encoding form of the input
+///          and how long the byte order mark is if one exists.
+EncodingInfo getUnicodeEncoding(StringRef Input) {

This should be static, the doxygen comment is out of date.


This function looks like an over-optimized version of something like this:

  if (Input.starts_with(StringRef("\x00\x00\xFE\xFF"))
     return std::make_pair(UEF_UTF32_BE, 4);
...


+/// Token - A single YAML token.
+struct Token : ilist_node<Token> {

Unless I'm missing something, Tokens are all fixed size with no subclasses.  If this is the case, ilist is inappropriate, just use std::list<Token>, eliminating the ilist stuff.


+struct SimpleKey {

Looks like it could be in an anon namespace.


+UTF8Decoded decodeUTF8(StringRef Range) {

Looks like it should be marked static and/or moved out to some other part of libsupport.  Perhaps a method on StringRef.


+/// @brief Scans YAML tokens from a MemoryBuffer.
+class Scanner {

Looks like it should be in an anonymous namespace.


+  /// @brief Skip a single nb-char[27] starting at Position.
+  ///
+  /// A nb-char is 0x9 | [0x20-0x7E] | 0x85 | [0xA0-0xD7FF] | [0xE000-0xFEFE]
+  ///                  | [0xFF00-0xFFFD] | [0x10000-0x10FFFF]
+  ///
+  /// @returns The code unit after the nb-char, or Position if it's not an
+  ///          nb-char.
+  StringRef::iterator skip_nb_char(StringRef::iterator Position);

These sorts of methods seem like they should all be static helper functions, not methods on Scanner.  If you prefer to keep them as methods, please make them be static methods if they don't use any Scanner state.

+  template<StringRef::iterator (Scanner::*Func)(StringRef::iterator)>
+  StringRef::iterator skip_while(StringRef::iterator Position) {

It looks like you're just using a template here to force the compiler to instantiate it.  It would be more clear to use a standard function argument, and use always_inline if the optimizer isn't doing the right thing. Incidentally, if you can make all these methods static, then you can use a function pointer instead of member pointer.


+  /// @brief Current column number in Unicode code points.
+  unsigned Column;
+
+  /// @brief Current line number.
+  unsigned Line;

As a random comment, if you care about parser performance, it would be better to not calculate these on the fly: instead, just calculate them on demand (assuming they are rarely used).  This is a well established pattern in Clang and is important for lexer performance.

+namespace {
+SmallString<4> encodeUTF8(uint32_t UnicodeScalarValue) {

Please use static (see coding conventions).  Also, returning a SmallString isn't great.  It looks like a vast majority of clients already have a SmallVectorImpl<char> that they're appending to, just pass that in, instead of returning a temporary smallstring.  This sort of thing pains me:

+              SmallString<4> Val = encodeUTF8(0xA0);
+              Storage.insert(Storage.end(), Val.begin(), Val.end());
+              break;



+bool yaml::dumpTokens(StringRef Input, raw_ostream &OS) {

It's fairly obvious, bug please add a doc comment for each of these main entrypoints.


+void Scanner::skip(uint32_t Distance) {
+  Current += Distance;
+  Column += Distance;
+}

The column computation seems dubious to me here, best to compute it lazily on demand :)


+  if (*Current == '[')
+    return scanFlowCollectionStart(true);
+
+  if (*Current == '{')
+    return scanFlowCollectionStart(false);
+
+  if (*Current == ']')
+    return scanFlowCollectionEnd(true);
... 

switch statement?


+void Node::setError(const Twine &Msg, Token &Tok) const {
+  Doc->setError(Msg, Tok);
+}
+
+bool Node::failed() const {
+  return Doc->failed();
+}

Are these helper methods really that important?


+StringRef ScalarNode::getValue(SmallVectorImpl<char> &Storage) const {

The nesting is too great :), please split out part of the body (perhaps the "Insert all previous chars into Storage" loop) out to its own helper function.


This is an odd pattern:
+      return Key = new (getAllocator().Allocate<NullNode>()) NullNode(Doc);


Can't you overload "operator new" for nodes to take an allocator, allowing:
+      return Key = new (getAllocator()) NullNode(Doc);

?

+++ b/utils/yaml-bench/YAMLBench.cpp

I know it doesn't matter much, but all the functions could be marked static.

-Chris


















More information about the llvm-commits mailing list