[llvm-commits] [PATCH] YAML I/O

Sean Silva silvas at purdue.edu
Tue Aug 28 09:28:45 PDT 2012


> This format is the most compact.  It is also the format that is easiest for
> YAML I/O to validate, since the legal keys at any point are well defined.
> What is lost, though, is the original order of shapes.  For lld, that does
> not matter.  In fact the lld::File model does not have one list of all
> Atoms.  It already has four lists.  One for each Atom kind.

Ok, this seems satisfactory. I think it's important though that there
actually be an order here, for the sake of testing. E.g. it could be
that the keys are output in alphabetical order.

--Sean Silva

On Mon, Aug 27, 2012 at 6:51 PM, Nick Kledzik <kledzik at apple.com> wrote:
> On Aug 24, 2012, at 6:18 AM, Sean Silva wrote:
>
> Good points.
>
> The only thing that I'm still a little bit shaky on is the handling of
> dynamic classes (e.g. that have virtual methods). So I'd like to paint
> a use case, and then you can tell me whether this is something that
> you are hoping to deal with, and if so, how you intend to
> address it.
>
> The use case is about deserializing a type which, when manipulated in
> the program, is always held through a pointer to an "interface" type,
> as it appears that lld::Reference is. In particular, I'm shaky about
> how the fields of derived classes get serialized/deserialized; since
> you are holding them through a pointer to the interface, there needs
> to be some sort of dynamic dispatch. To make this concrete:
>
> class Shape {
>  virtual double getArea() = 0;
> };
>
> class Rectangle : public Shape {
>  double width, height;
>  double getArea() { return width * height; };
> };
>
> class Circle : public Shape {
>  double radius;
>  double getArea() { return M_PI * radius * radius; };
> };
>
> So in the program, you only hold Shape*'s. So how does a Shape* get
> serialized? When serializing a Circle, the mapping will need to have a
> `radius` field, while when serializing a Rectangle, it will need to
> have `width` and `height` fields.
>
> It seems like this inherently needs dynamic dispatch. For writing, the
> traits can hand off to a virtual method on Shape*, like
> toYamlMapping(). However, reading seems a lot more difficult, since
> you have to dynamically create the right class. The only approach that
> I can think of off the top of my head is that when serializing, the
> class needs to add its name as a field on the mapping (`__class:
> Circle`, maybe?), and then this is used to select which concrete
> subclass to instantiate.
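>
> (E.g., a hypothetical document using such a discriminator field:)
>
>     shapes:
>        - __class: Circle
>          radius:  5
>        - __class: Rectangle
>          width:   10
>          height:  15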
>
>
> Yes, handling dynamic subclasses is interesting.  We have this problem in
> lld with the various kinds of Atoms (all subclasses of lld::Atom).  I'll get
> to what I think we should do with Atoms and yaml in a bit.  First I want to
> look at your Shape example.
>
> It usually helps to try writing out examples in yaml.  Your write up implied
> that your native data structure is something like std::vector<Shape*>.  That
> is, a heterogeneous list of Shapes.  Here are a couple of ways to express
> that in yaml.
>
> A) Have a key/value that specifies the class type; only key values for that
> type are then legal.
>
>     shapes:
>        - type:    rectangle
>          width:   10
>          height:  15
>        - type:    circle
>          radius:  5
>        - type:    rectangle
>          width:   10
>          height:  12
>
> This is what we currently do for Atoms (and I'm not happy with it).  This is
> hard to use with YAML I/O because the semantic legality of some keys depends
> on other keys (for instance, the radius: key is only valid in a map that
> also contains the type: circle key/value).
>
>
> B) Have a key that specifies the class type, and whose value is itself a
> mapping.
>
>     shapes:
>        - rectangle:
>             width:   10
>             height:  15
>        - circle:
>             radius:  5
>        - rectangle:
>             width:   10
>             height:  12
>
> This makes it easier for YAML I/O to enforce that the lower keys are valid
> combinations (e.g. error if radius and width are used together).  But, since
> in any sequence element every key is optional (rectangle: and circle: are
> both optional), it is hard to enforce that exactly one of those keys is
> used.
>
>
> C) Do some more normalization such that you no longer have a heterogeneous
> sequence, but instead have a couple of homogeneous lists:
>
>     rectangles:
>        - width:   10
>          height:  15
>        - width:   10
>          height:  12
>     circles:
>        - radius:  5
>
> This format is the most compact.  It is also the format that is easiest for
> YAML I/O to validate, since the legal keys at any point are well defined.
> What is lost, though, is the original order of shapes.  For lld, that does
> not matter.  In fact the lld::File model does not have one list of all
> Atoms.  It already has four lists.  One for each Atom kind.
>
>
> In summary, the dynamic subclass problem can be fixed by normalizing.
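>
> (For illustration, a minimal sketch of what option C might look like with
> the MapTraits API from this patch; the NormalizedShapes type and its field
> names are made up for the example:)
>
>     struct Rect { double width, height; };
>     struct Circ { double radius; };
>     struct NormalizedShapes {
>       std::vector<Rect> rectangles;   // one homogeneous list per kind;
>       std::vector<Circ> circles;      // vectors act as yaml sequences
>     };
>
>     template <>
>     struct llvm::yaml::MapTraits<NormalizedShapes> {
>       static void mapping(IO &io, NormalizedShapes &s) {
>         // Rect and Circ would each need their own MapTraits as well.
>         io.optKey("rectangles", s.rectangles);
>         io.optKey("circles",    s.circles);
>       }
>     };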
>
> -Nick
>
>
>
> On Thu, Aug 23, 2012 at 9:38 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
>
> On Aug 22, 2012, at 6:30 PM, Sean Silva wrote:
>
>
> template <>
> struct llvm::yaml::ScalarTrait<Color> {
>   static void doScalar(IO &io, Color &value) {
>     io.beginEnumScalar();
>     io.enumScalarMatch(value, "red",   cRed);
>     io.enumScalarMatch(value, "blue",  cBlue);
>     io.enumScalarMatch(value, "green", cGreen);
>     io.endEnumScalar();
>   }
> };
>
>
> To be honest, I was quite fond of the static table based approach that
> the original patch was using. What happened to that? I prefer it
> because the current approach ends up emitting a bunch of code, whereas
> the static tables are much more compact and simple (IMO).
>
> I was never happy with the look/complexity of the UniqueValue<> template and
> the syntax for the list of pairs and its NULL termination.  I figured most
> enums are only a couple of cases, so the code overhead is not that much.
> Has anyone seen a clean syntax for constructing a static list of pairs?
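>
> (One possibility, sketched under the assumption that an explicit count
> replaces the NULL terminator; the names here are made up:)
>
>     static const struct { const char *name; Color value; } colorNames[] = {
>       { "red",   cRed   },
>       { "blue",  cBlue  },
>       { "green", cGreen },
>     };
>     // llvm::array_lengthof(colorNames) supplies the count, so no
>     // sentinel entry is needed.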
>
>
>
> If you do go for this more "code" approach, then I would prefer to get
> rid of the explicit begin/end here. Instead, I would have an ancillary
> type whose constructor does the "begin" stuff and destructor does the
> "end" stuff, and on which you call the methods. Something like:
>
> Helper h(io);
> h.enumScalarMatch(value, "red",   cRed);
> h.enumScalarMatch(value, "blue",  cBlue);
> h.enumScalarMatch(value, "green", cGreen);
> // ~Helper() does the "end" stuff.
>
> I think I can do better and get rid of the begin/end by having the code that
> calls doScalar() do the begin/end call.
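>
> (A quick sketch of that idea; the framework-side code here is
> hypothetical:)
>
>     // Inside YAML I/O, wherever an enum scalar is processed:
>     io.beginEnumScalar();
>     ScalarTrait<T>::doScalar(io, value);  // trait body no longer makes
>     io.endEnumScalar();                   // explicit begin/end calls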
>
>
>
>
>
> There is no trait for yaml sequences.  Instead, if your data type is a class
> with begin, end, and push_back methods, it is assumed to be a sequence.
>
> This is a really good idea, and really simple!
>
>
>   static void mapping(IO &io, const lld::Reference*& ref) {
>     MappingHelper<MyReference, const lld::Reference*> keys(io, ref);
>
>     io.reqKey("kind",           keys->_kind);
>     io.optKey("offset",         keys->_offset);
>     io.optKey("target",         keys->_targetName);
>     io.optKey("addend",         keys->_addend);
>   }
>
>
> This approach seems insanely convoluted. Why not just use pointer to
> member functions? E.g.
>
> "convoluted"? You should have seen my previous iterations ;-)
>
>
> But let's put this in context.  The POD case is simple.  It is the non-POD
> case that gets weird.  But there is a reason: you need code to
> "normalize" and "denormalize" the map values.  Your native data structure
> could be anything, but you have decided that the cleanest, least redundant
> yaml is a mapping with a particular set of key-value pairs.  So, when
> writing yaml, you need code to create those values out of your data
> structures, and when reading yaml you need a way to instantiate your data
> structures from those normalized values.
>
>
> For the normalization code, given that the mapping needs a set of fields
> which are initialized from some data structure, it is a clean fit to define
> an intermediate class with one ivar for each normalized field and a
> constructor which takes your data structure as an argument.
>
>
> For the denormalization code, you start with the normalized fields (or just
> an instance of the intermediate class) and instantiate whatever internal
> data structures you need.  In the case of lld, the intermediate class
> (MyReference) happens to also be able to stand in as the native data
> structure.  But that won't be the case for other clients, so I should rework
> this so that the denormalization code can return anything.
>
>
>
> struct MapTraits<const lld::Reference*> {
>   static void mapping(IO &io, const lld::Reference *&ref) {
>     io.reqKey("kind", &lld::Reference::kind, &lld::Reference::setKind);
>     ...
>   }
> };
>
>
> However, presumably there needs to be some form of dynamic dispatch
> here as well, otherwise how will you serialize/deserialize an
> arbitrary lld::Reference (where you don't necessarily know the dynamic
> type)?
>
>
> There are a couple of problems with having getter/setter methods instead of
> an lvalue as the parameter to reqKey().
>
> * This code is only needed in the non-POD case.  Your native data structures
> probably do not already have a getter and setter method that happens to
> match the normalization model.  So, now you would have to create wrapper
> objects just to add those methods.
>
> * The value for that field (e.g. "kind") may not be a scalar.  It may be a
> sequence type or another mapping type.
>
> * Because reqKey() is defined to take an lvalue (a reference to a variable),
> the same template expansions are used for reading and writing.  If you have
> separate parameters for reading and writing, then each expands separately
> and you can get a combinatorial expansion.
>
> * And there is the problem you pointed out of whether these are virtual or
> non-virtual methods.
>
>
> -Nick
>
>
>
>
> --Sean Silva
>
>
> On Wed, Aug 22, 2012 at 8:49 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
> Sean,
>
> I've been working on reimplementing YAML I/O to use a traits based
> approach.  I'm using lld's internal objects as a test bed.  The
> File/Atom/Reference objects in lld have no public ivars.  Everything is
> accessed through virtual methods.  So, if I can do yaml I/O on those
> classes just by defining trait specializations, then the mechanism should
> be very adaptable.
>
> I have something working now, but it is all using C++11.  I still need to
> discover what issues will arise when used by C++03.
>
> Here is a flavor of what I have working.  I want to make sure this is the
> right direction:
>
>
> If you have an enum like:
>
>   enum Color { cRed, cBlue, cGreen };
>
> You can write a trait like this:
>
>
> template <>
> struct llvm::yaml::ScalarTrait<Color> {
>   static void doScalar(IO &io, Color &value) {
>     io.beginEnumScalar();
>     io.enumScalarMatch(value, "red",   cRed);
>     io.enumScalarMatch(value, "blue",  cBlue);
>     io.enumScalarMatch(value, "green", cGreen);
>     io.endEnumScalar();
>   }
> };
>
>
> Which describes how to convert the in-memory enum value to a yaml scalar
> and back.  I'm also working on a way that you can do arbitrary conversion
> of scalars.
>
>
>
> If you have a simple POD struct like this:
>
> struct MyInfo {
>   int    hat_size;
>   int    age;
>   Color  hat_color;
> };
>
> You can write a trait like this:
>
>
> template <>
> struct llvm::yaml::MapTraits<MyInfo> {
>   static void mapping(IO &io, MyInfo& info) {
>     io.reqKey("hat-size",    info.hat_size);
>     io.optKey("age",         info.age,         21);
>     io.optKey("hat-color",   info.hat_color,   cBlue);
>   }
> };
>
>
> Which is used to both read and write yaml.  The "age" and "hat-color" keys
> are optional in yaml.  If not specified (in yaml), they default to 21 and
> cBlue.  The "hat-size" key is required, and you will get an error if it is
> not present in the yaml.
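>
> (To make the defaulting concrete, a hypothetical input document:)
>
>     hat-size:  7
>
> (After reading, info.hat_size == 7, and since "age" and "hat-color" were
> omitted, info.age == 21 and info.hat_color == cBlue.)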
>
>
> There is no trait for yaml sequences.  Instead, if your data type is a
> class with begin, end, and push_back methods, it is assumed to be a
> sequence.
>
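> (So, for example, a std::vector of a mapped type should need no extra
> code; assuming the MyInfo trait above:)
>
>     std::vector<MyInfo> people;  // has begin, end, and push_back, so
>                                  // YAML I/O treats it as a yaml sequence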
>
>
> Now, the interesting case is the handling of non-POD data types.  The
> reqKey() and optKey() methods need an lvalue so the value can be read (when
> creating yaml) and written (when parsing yaml).  It may also be the case
> that your existing data structure is not a container of structs, but a
> container of pointers to structs.  But in both those cases, you want to be
> able to have the same yaml representation.  Lastly, in the parsing yaml
> case, you need to be able to instantiate an internal object, whereas the
> writing yaml case needs to examine an existing object.
>
>
> Here is an example of the lld Reference type and the trait for converting
> it to and from yaml:
>
>
> template <>
> struct MapTraits<const lld::Reference*> {
>
>   class MyReference : public lld::Reference {
>   public:
>     MyReference()
>       : _target(nullptr), _targetName(), _offset(0), _addend(0), _kind(0) {
>     }
>     MyReference(const lld::Reference* ref)
>       : _target(nullptr),
>         _targetName(ref->target() ? ref->target()->name() : ""),
>         _offset(ref->offsetInAtom()),
>         _addend(ref->addend()),
>         _kind(ref->kind()) {
>     }
>
>     virtual uint64_t         offsetInAtom() const { return _offset; }
>     virtual Kind             kind() const         { return _kind; }
>     virtual const lld::Atom *target() const       { return _target; }
>     virtual Addend           addend() const       { return _addend; }
>     virtual void             setKind(Kind k)      { _kind = k; }
>     virtual void             setAddend(Addend a)  { _addend = a; }
>     virtual void             setTarget(const lld::Atom *a) { _target = a; }
>
>     const lld::Atom *_target;
>     StringRef        _targetName;
>     uint32_t         _offset;
>     Addend           _addend;
>     Kind             _kind;
>   };
>
>   static void mapping(IO &io, const lld::Reference*& ref) {
>     MappingHelper<MyReference, const lld::Reference*> keys(io, ref);
>
>     io.reqKey("kind",           keys->_kind);
>     io.optKey("offset",         keys->_offset);
>     io.optKey("target",         keys->_targetName);
>     io.optKey("addend",         keys->_addend);
>   }
> };
>
>
> Some salient points:
> * The trait is on "const lld::Reference*" because only pointers to
> References are passed around inside lld.
> * The lld class Reference is an abstract base class, so a concrete instance
> must be defined (MyReference).
> * There are two constructors for MyReference.  The default constructor is
> used when parsing yaml to create the initial object, which is then
> overwritten as key/values are found in the yaml.  The other constructor is
> used when writing yaml to create a temporary (stack) instance which contains
> the fields needed for mapping() to access.
> * MappingHelper<> is a utility which detects if you are reading or writing
> and constructs the appropriate object.  It is only needed for non-POD
> structs.
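>
> (A hedged sketch of how such a helper might be structured; it assumes the
> IO object exposes some way to ask whether it is writing, here called
> outputting():)
>
>     template <typename NormT, typename NativeT>
>     class MappingHelper {
>     public:
>       MappingHelper(IO &io, NativeT &native)
>         // writing yaml: snapshot the native object's fields into NormT;
>         // reading yaml: start from a default-constructed NormT
>         : _norm(io.outputting() ? NormT(native) : NormT()) { }
>       NormT *operator->() { return &_norm; }
>     private:
>       NormT _norm;
>     };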
>
>
> -Nick
>
>
> On Aug 8, 2012, at 6:34 PM, Sean Silva wrote:
>
>
> Your suggestion is to remove the intermediate data structures and instead
> define the schema via external trait templates.  I can see how this would
> seem easier (not having to write glue code to copy to and from the
> intermediate data types).  But that copying also does normalization.  For
> instance, your native object may have two ivars that together make one yaml
> key-value, or one ivar is best represented as a couple of yaml key-values.
> Or your sequence may have a preferred sort order in yaml, but that is not
> the actual list order in memory.
>
>
>
> I don't get what you're saying here. A traits class can easily handle
> all those conversions.
>
>
> It would look something like:
>
> template<>
> class YamlMapTraits<Person> {
>   void yamlMapping(IO &io, Person *P) {
>     requiredKey(io, &P->name, "name");
>     optionalKey(io, &P->hatSize, "hat-size");
>   }
> };
>
>
> Here I was just trying to mimic one of the examples from your
> documentation so the feel should be similar. However, the door is open
> to specifying the correspondence however you really want in the traits
> class.
>
>
> I think the hard part of a traits approach is figuring out how clients will
> write the normalization code.  And how to make the difficulty of that code
> scale to how denormalized the native objects are.
>
>
>
> One possibility I can think of off the top of my head is to have the
> traits class declare a private intermediate struct which it
> deserializes to (similar to the intermediate that the current API
> _forces_ you to have), and then just construct the object from the
> intermediate. It's so much more flexible to do this with a traits
> class.
>
>
> --Sean Silva
>
>
> On Wed, Aug 8, 2012 at 5:34 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
>
>
> On Aug 8, 2012, at 12:46 PM, Sean Silva wrote:
>
> But EnumValue is not quite right because it can be used with #defines too.
>
> Do we really want to encourage people to use #defines? Is there any
> set of constants in the LLVM tree which are defined with #defines and
> not in an enum?
>
> I'm not sure what you mean by traits-based in this context.
>
> A traits-based design means that you have a class template which
> provides a collection of type-specific information which is provided
> by specializing the class template for a particular type. For example,
> see include/llvm/ADT/GraphTraits.h, which uses GraphTraits<T> to
> specify how to adapt T to a common interface that graph algorithms can
> use. This is noninvasive (maybe needing a friend declaration at most).
> Your current approach using inheritance and virtual functions is
> invasive, forces the serializable class to inherit (causing multiple
> inheritance in the case that the serializable class already has a
> base), and forces the serializable class to suddenly have virtual
> functions.
>
> Overall, I think a traits-based design would be simpler, more loosely
> coupled, and seems to fit the use case more naturally.
>
>
> As I wrote in the documentation, this was not intended to allow you to go
> directly from existing data structures to yaml and back.  Instead the schema
> "language" is written in terms of new data structure declarations (subclass
> of YamlMap and specialization of Sequence<>).
>
>
>
> Your suggestion is to remove the intermediate data structures and instead
> define the schema via external trait templates.  I can see how this would
> seem easier (not having to write glue code to copy to and from the
> intermediate data types).  But that copying also does normalization.  For
> instance, your native object may have two ivars that together make one yaml
> key-value, or one ivar is best represented as a couple of yaml key-values.
> Or your sequence may have a preferred sort order in yaml, but that is not
> the actual list order in memory.
>
>
>
> I think the hard part of a traits approach is figuring out how clients will
> write the normalization code.  And how to make the difficulty of that code
> scale to how denormalized the native objects are.
>
>
>
> I'll play around with this idea and see what works and what does not.
>
>
>
> -Nick
>
>
>
>
>
> On Tue, Aug 7, 2012 at 4:57 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
>
> On Aug 7, 2012, at 2:07 PM, Sean Silva wrote:
>
>
> Thanks for writing awesome docs!
>
>
>
> +Sometime sequences are known to be short and the one entry per line is too
> +verbose, so YAML offers an alternate syntax for sequences called a "Flow
> +Sequence" in which you put comma separated sequence elements into square
> +brackets.  The above example could then be simplified to :
>
> It's probably worth mentioning here that the "Flow" syntax is
> (exactly?) JSON. Also, noting that JSON is a proper subset of YAML
> in general is probably worth mentioning.
>
>
>
> +   .. code-block:: none
>
> pygments (and hence Sphinx) supports `yaml` highlighting
> <http://pygments.org/docs/lexers/>
>
>
>
> +the following document:
> +
> +   .. code-block:: none
>
> The precedent for code listings is generally that the `..
> code-block::` is at the same level of indentation as the paragraph
> introducing it.
>
>
>
> +You can combine mappings and squences by indenting.  For example a sequence
> +of mappings in which one of the mapping values is itself a sequence:
>
> s/squences/sequences/
>
>
>
> +of a new document is denoted with "---".  So in order for Input to handle
> +multiple documents, it operators on an llvm::yaml::Document<>.
>
> s/operators/operates/
>
>
>
> +can set values in the context in the outer map's yamlMapping() method and
> +retrive those values in the inner map's yamlMapping() method.
>
> s/retrive/retrieve/
>
>
>
> +of a new document is denoted with "---".  So in order for Input to handle
>
> For clarity, I would put the --- in monospace (e.g. "``---``"), here
> and in other places.
>
>
> Thanks for the Sphinx tips.  I've incorporated them and ran a spell checker
>
> too ;-)
>
>
>
>
>
> +UniqueValue
> +-----------
>
> I think that EnumValue would be more self-documenting than UniqueValue.
>
>
> I'm happy to give UniqueValue a better name.  But EnumValue is not quite
> right because it can be used with #defines too.  The real constraint is that
> there be a one-to-one mapping of strings to values.  I want it to contrast
> with BitValue which maps a set (sequence) of strings to a set of values
> OR'ed together.
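>
> (For illustration, a hypothetical BitValue-style key could read a flow
> sequence of flag names and OR the matched values together:)
>
>     permissions: [ read, write ]   # hypothetical; yields kRead | kWrite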
>
>
>
>
>
> At a design level, what are the pros/cons of this approach compared
> with a traits-based approach? What made you choose this design versus
> a traits-based approach?
>
>
>
> I'm not sure what you mean by traits-based in this context.  The back
> story is that for lld I've been writing code to read and write yaml
> documents.  Michael's YAMLParser.h certainly makes reading more robust, but
> there is still tons of (semantic level) error checking you have to hand
> code.  It seemed like most of my code was checking for errors.  Also it was
> a pain to keep the yaml reading code in sync with the yaml writing code.
>
>
>
> What we really needed was a way to describe the schema of the yaml documents
> and have some tool generate the code to read and write.  There is a tool
> called Kwalify which defines a way to express a yaml schema and can check
> it.  But it has a number of limitations.
>
>
>
> Last month I wrote up a proposal for defining a yaml schema language and a
> tool that would use that schema to generate C++ code to read/validate and
> write yaml conforming to the schema.  The best feedback I got (from Daniel
> Dunbar) was that rather than create another language (yaml schema language)
> and tools, to try to see if you could express the schema in C++ directly,
> using meta-programming or whatever.  I looked at Boost serialization for
> inspiration and came up with this Yaml I/O library.
>
>
>
> -Nick
>
>
>
>
>
> On Mon, Aug 6, 2012 at 12:17 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
>
> Attached is a patch for review which implements the Yaml I/O library I
>
> proposed on llvm-dev July 25th.
>
>
>
>
>
>
> The patch includes the implementation, test cases, and documentation.
>
>
>
> I've included a PDF of the documentation, so you don't have to install the
>
> patch and run sphinx to read it.
>
>
>
>
>
> There are probably more aspects of yaml we can support in YAML I/O, but the
>
> current patch is enough to support my needs for encoding mach-o as yaml for
>
> lld test cases.
>
>
>
> I was initially planning on just adding this code to lld, but I've had two
>
> requests to push it down into llvm.
>
>
>
> Again, here are examples of the mach-o schema and an example mach-o
>
> document:
>
>
>
>
>
>
>
>
> -Nick
>