[llvm-commits] [PATCH] YAML I/O

Nick Kledzik kledzik at apple.com
Wed Aug 22 17:49:38 PDT 2012


Sean,

I've been working on reimplementing YAML I/O to use a traits-based approach.  I'm using lld's internal objects as a test bed.  The File/Atom/Reference objects in lld have no public ivars; everything is accessed through virtual methods.  So, if I can do yaml I/O on those classes just by defining trait specializations, then the mechanism should be very adaptable.

I have something working now, but it all uses C++11.  I still need to discover what issues will arise when it is used from C++03.

Here is a flavor of what I have working.  I want to make sure this is the right direction:

If you have an enum like:

   enum Color { cRed, cBlue, cGreen };

You can write a trait like this:

  template <>
  struct llvm::yaml::ScalarTrait<Color> {
    static void doScalar(IO &io, Color &value) {
      io.beginEnumScalar();
      io.enumScalarMatch(value, "red",   cRed);
      io.enumScalarMatch(value, "blue",  cBlue);
      io.enumScalarMatch(value, "green", cGreen);
      io.endEnumScalar();
    }
  };

Which describes how to convert the in-memory enum value to a yaml scalar (e.g. cGreen is written as the bare scalar "green") and back.  I'm also working on a way to do arbitrary conversions of scalars.
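
One shape I could imagine for that (purely a sketch of the direction, not code that is in this patch; MyHexNumber and the output()/input() names are made up for illustration) is a trait that owns both sides of the string conversion:

  struct MyHexNumber { uint32_t value; };   // hypothetical wrapper type

  template <>
  struct llvm::yaml::ScalarTrait<MyHexNumber> {
    // in-memory value -> yaml scalar text
    static void output(const MyHexNumber &num, llvm::raw_ostream &out) {
      out << llvm::format("0x%X", num.value);
    }
    // yaml scalar text -> in-memory value; returns true if the text parsed
    static bool input(StringRef scalar, MyHexNumber &num) {
      return !scalar.getAsInteger(0, num.value);   // radix 0 auto-detects "0x"
    }
  };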


If you have a simple POD struct like this:

struct MyInfo {
  int    hat_size;
  int    age;
  Color  hat_color;
};

You can write a trait like this:

template <>
struct llvm::yaml::MapTraits<MyInfo> {
  static void mapping(IO &io, MyInfo& info) {
    io.reqKey("hat-size",    info.hat_size);
    io.optKey("age",         info.age,         21);
    io.optKey("hat-color",   hat_color,        cBlue);
  }
};

Which is used to both read and write yaml.   The "age" and "hat-color" keys are optional in yaml.  If not specified (in yaml), they default to 21 and cBlue.  The "hat-size" key is required, and you will get an error if it is not present in the yaml.
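
For instance, a document like this (just an illustration of the mapping above):

  ---
  hat-size:  7
  hat-color: red

would read back as a MyInfo with hat_size == 7, hat_color == cRed, and age defaulted to 21, while a document with no hat-size key would be an error.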

There is no trait for yaml sequences.  Instead, if your data type is a class with begin, end, and push_back methods, it is assumed to be a sequence.
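
For example (not code from the patch, just an illustration of what qualifies), a std::vector<MyInfo> is handled as a sequence as-is, and so is a minimal hand-written container:

  #include <vector>

  class InfoList {
  public:
    typedef std::vector<MyInfo>::iterator iterator;
    iterator begin()                   { return _infos.begin(); }
    iterator end()                     { return _infos.end(); }
    void push_back(const MyInfo &info) { _infos.push_back(info); }
  private:
    std::vector<MyInfo> _infos;
  };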


Now, the interesting case is the handling of non-POD data types.  The reqKey() and optKey() methods need an lvalue so they can read it (when creating yaml) and write it (when parsing yaml).  It may also be the case that your existing data structure is not a container of structs but a container of pointers to structs, yet in both cases you want the same yaml representation.  Lastly, when parsing yaml you need to be able to instantiate an internal object, whereas when writing yaml you need to examine an existing object.

Here is an example of the lld Reference type and the trait for converting it to and from yaml:

  template <>
  struct MapTraits<const lld::Reference*> {

    class MyReference : public lld::Reference {
    public:
      MyReference()
        : _target(nullptr), _targetName(), _offset(0), _addend(0) , _kind(0) {
      }
      MyReference(const lld::Reference* ref)
        : _target(nullptr), 
        _targetName(ref->target() ? ref->target()->name() : ""), 
        _offset(ref->offsetInAtom()), 
        _addend(ref->addend()),
        _kind(ref->kind()) {
      }

      virtual uint64_t         offsetInAtom() const { return _offset; }
      virtual Kind             kind() const         { return _kind; }
      virtual const lld::Atom *target() const       { return _target; }
      virtual Addend           addend() const       { return _addend; }
      virtual void             setKind(Kind k)      { _kind = k; }
      virtual void             setAddend(Addend a)  { _addend = a; }
      virtual void             setTarget(const lld::Atom *a) { _target = a; }
      
      const lld::Atom *_target;
      StringRef        _targetName;
      uint32_t         _offset;
      Addend           _addend;
      Kind             _kind;
    };


    static void mapping(IO &io, const lld::Reference*& ref) {
      MappingHelper<MyReference, const lld::Reference*> keys(io, ref);

      io.reqKey("kind",           keys->_kind);
      io.optKey("offset",         keys->_offset);
      io.optKey("target",         keys->_targetName);
      io.optKey("addend",         keys->_addend);
    }
    
  };

Some salient points:
* The trait is on "const lld::Reference*" because only pointers to References are passed around inside lld.
* The lld class Reference is an abstract base class, so a concrete subclass must be defined (MyReference).
* There are two constructors for MyReference.  The default constructor is used when parsing yaml to create the initial object, whose fields are then overwritten as key/values are found in the yaml.  The other constructor is used when writing yaml to create a temporary (stack) instance containing the fields that mapping() needs to access.
* MappingHelper<> is a utility which detects whether you are reading or writing and constructs the appropriate object.  It is only needed for non-POD structs; a rough sketch follows below.
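
A simplified sketch of the idea behind MappingHelper<> (this is not the exact code in the patch; in particular io.outputting() is just my shorthand for asking the IO object whether it is writing yaml):

  template <typename ConcreteT, typename PtrT>
  class MappingHelper {
  public:
    MappingHelper(IO &io, PtrT &ptr) : _temp(), _which(&_temp) {
      if (io.outputting()) {
        // Writing yaml: snapshot the existing object's fields into a
        // temporary that mapping() can read.
        _temp = ConcreteT(ptr);
      } else {
        // Parsing yaml: create the object whose fields the keys will be
        // written into, and hand it back through the pointer reference.
        ConcreteT *obj = new ConcreteT();
        ptr    = obj;
        _which = obj;
      }
    }
    ConcreteT *operator->() { return _which; }
  private:
    ConcreteT  _temp;    // used only when writing yaml
    ConcreteT *_which;   // &_temp when writing, the new object when parsing
  };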

-Nick

On Aug 8, 2012, at 6:34 PM, Sean Silva wrote:
>> Your suggestion is to remove the intermediate data structures and instead define the schema via external trait templates.   I can see how this would seem easier (not having to write glue code to copy to and from the intermediate data types).  But that copying also does normalization.  For instance, your native object may have two ivars that together make one yaml key-value, or one ivar is best represented as a couple of yaml key-values.  Or your sequence may have a preferred sort order in yaml, but that is not the actual list order in memory.
> 
> I don't get what you're saying here. A traits class can handle
> all of those conversions easily.
> 
> It would look something like:
> 
> template<>
> class YamlMapTraits<Person> {
>  void yamlMapping(IO &io, Person *P) {
>    requiredKey(io, &P->name, "name");
>    optionalKey(io, &P->hatSize, "hat-size");
>  }
> };
> 
> Here I was just trying to mimic one of the examples from your
> documentation so the feel should be similar. However, the door is open
> to specifying the correspondence however you really want in the traits
> class.
> 
>> I think the hard part of a traits approach is figuring out how clients will write the normalization code.  And how to make the difficulty of that code scale to how denormalized the native objects are.
> 
> One possibility I can think of off the top of my head is to have the
> traits class declare a private intermediate struct which it
> deserializes to (similar to the intermediate that the current API
> _forces_ you to have), and then just construct the object from the
> intermediate. It's so much more flexible to do this with a traits
> class.
> 
> --Sean Silva
> 
> On Wed, Aug 8, 2012 at 5:34 PM, Nick Kledzik <kledzik at apple.com> wrote:
>> 
>> On Aug 8, 2012, at 12:46 PM, Sean Silva wrote:
>> 
>>>> But EnumValue is not quite right because it can be used with #defines too.
>>> 
>>> Do we really want to encourage people to use #defines? Is there any
>>> set of constants in the LLVM tree which are defined with #defines and
>>> not in an enum?
>>> 
>>> 
>>>> I'm not sure what you mean by traits-based in this context.
>>> 
>>> A traits-based design means that you have a class template which
>>> provides a collection of type-specific information which is provided
>>> by specializing the class template for a particular type. For example,
>>> see include/llvm/ADT/GraphTraits.h, which uses GraphTraits<T> to
>>> specify how to adapt T to a common interface that graph algorithms can
>>> use. This is noninvasive (maybe needing a friend declaration at most).
>>> Your current approach using inheritance and virtual functions is
>>> invasive, forces the serializable class to inherit (causing multiple
>>> inheritance in the case that the serializable class already has a
>>> base), and forces the serializable class to suddenly have virtual
>>> functions.
>>> 
>>> Overall, I think a traits-based design would be simpler, more loosely
>>> coupled, and seems to fit the use case more naturally.
>> As I wrote in the documentation, this was not intended to allow you to go directly from existing data structures to yaml and back.  Instead, the schema "language" is written in terms of new data structure declarations (a subclass of YamlMap and a specialization of Sequence<>).
>> 
>> Your suggestion is to remove the intermediate data structures and instead define the schema via external trait templates.   I can see how this would seem easier (not having to write glue code to copy to and from the intermediate data types).  But that copying also does normalization.  For instance, your native object may have two ivars that together make one yaml key-value, or one ivar is best represented as a couple of yaml key-values.  Or your sequence may have a preferred sort order in yaml, but that is not the actual list order in memory.
>> 
>> I think the hard part of a traits approach is figuring out how clients will write the normalization code.  And how to make the difficulty of that code scale to how denormalized the native objects are.
>> 
>> I'll play around with this idea and see what works and what does not.
>> 
>> -Nick
>> 
>> 
>>> 
>>> On Tue, Aug 7, 2012 at 4:57 PM, Nick Kledzik <kledzik at apple.com> wrote:
>>>> On Aug 7, 2012, at 2:07 PM, Sean Silva wrote:
>>>>> Thanks for writing awesome docs!
>>>>> 
>>>>> +Sometime sequences are known to be short and the one entry per line is too
>>>>> +verbose, so YAML offers an alternate syntax for sequences called a "Flow
>>>>> +Sequence" in which you put comma separated sequence elements into square
>>>>> +brackets.  The above example could then be simplified to :
>>>>> 
>>>>> It's probably worth mentioning here that the "Flow" syntax is
>>>>> (exactly?) JSON. Also, noting that JSON is a proper subset of YAML is
>>>>> in general is probably worth mentioning.
>>>>> 
>>>>> +   .. code-block:: none
>>>>> 
>>>>> pygments (and hence Sphinx) supports `yaml` highlighting
>>>>> <http://pygments.org/docs/lexers/>
>>>>> 
>>>>> +the following document:
>>>>> +
>>>>> +   .. code-block:: none
>>>>> 
>>>>> The precedent for code listings is generally that the `..
>>>>> code-block::` is at the same level of indentation as the paragraph
>>>>> introducing it.
>>>>> 
>>>>> +You can combine mappings and squences by indenting.  For example a sequence
>>>>> +of mappings in which one of the mapping values is itself a sequence:
>>>>> 
>>>>> s/squences/sequences/
>>>>> 
>>>>> +of a new document is denoted with "---".  So in order for Input to handle
>>>>> +multiple documents, it operators on an llvm::yaml::Document<>.
>>>>> 
>>>>> s/operators/operates/
>>>>> 
>>>>> +can set values in the context in the outer map's yamlMapping() method and
>>>>> +retrive those values in the inner map's yamlMapping() method.
>>>>> 
>>>>> s/retrive/retrieve/
>>>>> 
>>>>> +of a new document is denoted with "---".  So in order for Input to handle
>>>>> 
>>>>> For clarity, I would put the --- in monospace (e.g. "``---``"), here
>>>>> and in other places.
>>>> Thanks for the Sphinx tips.  I've incorporated them and ran a spell checker too ;-)
>>>> 
>>>> 
>>>>> 
>>>>> +UniqueValue
>>>>> +-----------
>>>>> 
>>>>> I think that EnumValue would be more self-documenting than UniqueValue.
>>>> I'm happy to give UniqueValue a better name.  But EnumValue is not quite right because it can be used with #defines too.  The real constraint is that there be a one-to-one mapping of strings to values.    I want it to contrast with BitValue which maps a set (sequence) of strings to a set of values OR'ed together.
>>>> 
>>>> 
>>>> 
>>>>> At a design level, what are the pros/cons of this approach compared
>>>>> with a traits-based approach? What made you choose this design versus
>>>>> a traits-based approach?
>>>> 
>>>> I'm not sure what you mean by traits-based in this context.    The back story is that for lld I've been writing code to read and write yaml documents.  Michael's YAMLParser.h certainly makes reading more robust, but there is still a ton of (semantic level) error checking you have to hand code.  It seemed like most of my code was checking for errors.  Also it was a pain to keep the yaml reading code in sync with the yaml writing code.
>>>> 
>>>> What we really needed was a way to describe the schema of the yaml documents and have some tool generate the code to read and write.  There is a tool called Kwalify which defines a way to express a yaml schema and can check it.  But it has a number of limitations.
>>>> 
>>>> Last month I wrote up a proposal for defining a yaml schema language and a tool that would use that schema to generate C++ code to read/validate and write yaml conforming to the schema.  The best feedback I got (from Daniel Dunbar) was that, rather than create another language (a yaml schema language) and tools, I should try to see if the schema could be expressed in C++ directly, using meta-programming or whatever.  I looked at Boost serialization for inspiration and came up with this Yaml I/O library.
>>>> 
>>>> -Nick
>>>> 
>>>> 
>>>>> 
>>>>> On Mon, Aug 6, 2012 at 12:17 PM, Nick Kledzik <kledzik at apple.com> wrote:
>>>>>> Attached is a patch for review which implements the Yaml I/O library I proposed on llvm-dev July 25th.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> The patch includes the implementation, test cases, and documentation.
>>>>>> 
>>>>>> I've included a PDF of the documentation, so you don't have to install the patch and run sphinx to read it.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> There are probably more aspects of yaml we can support in YAML I/O, but the current patch is enough to support my needs for encoding mach-o as yaml for lld test cases.
>>>>>> 
>>>>>> I was initially planning on just adding this code to lld, but I've had two requests to push it down into llvm.
>>>>>> 
>>>>>> Again, here are examples of the mach-o schema and an example mach-o document:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -Nick
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-commits at cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>> 
>>>> 
>> 
