[llvm-commits] [PATCH] YAML I/O
Sean Silva
silvas at purdue.edu
Tue Aug 28 09:28:45 PDT 2012
> This format is the most compact. It is also the format that is easiest for
> YAML I/O to validate, since the legal keys at any point are well defined.
> What is lost, though, is the original order of shapes. For lld, that does
> not matter. In fact the lld::File model does not have one list of all
> Atoms. It already has four lists. One for each Atom kind.
Ok, this seems satisfactory. I think it's important though that there
actually be an order here, for the sake of testing. E.g. it could be
that the keys are output in alphabetical order.
--Sean Silva
On Mon, Aug 27, 2012 at 6:51 PM, Nick Kledzik <kledzik at apple.com> wrote:
> On Aug 24, 2012, at 6:18 AM, Sean Silva wrote:
>
> Good points.
>
> The only thing that I'm still a little bit shaky on is the handling of
> dynamic classes (e.g. that have virtual methods). So I'd like to paint
> a use case, and then you can tell me whether this is something that
> you are hoping to deal with, and if so, how you intend to
> address it.
>
> The use case is about deserializing a type which, when manipulated in
> the program, is always held through a pointer to an "interface" type,
> as it appears that lld::Reference is. In particular, I'm shaky about
> how the fields of derived classes get serialized/deserialized; since
> you are holding them through a pointer to the interface, there needs
> to be some sort of dynamic dispatch. To make this concrete:
>
> class Shape {
>   virtual double getArea() = 0;
> };
>
> class Rectangle : public Shape {
>   double width, height;
>   double getArea() { return width * height; }
> };
>
> class Circle : public Shape {
>   double radius;
>   double getArea() { return M_PI * radius * radius; }
> };
>
> So in the program, you only hold Shape*'s. So how does a Shape* get
> serialized? When serializing a Circle, the mapping will need to have a
> `radius` field, while when serializing a Rectangle, it will need to
> have `width` and `height` fields.
>
> It seems like this inherently needs dynamic dispatch. For writing, the
> traits can hand off to a virtual method on Shape*, like
> toYamlMapping(). However, reading seems a lot more difficult, since
> you have to dynamically create the right class. The only approach that
> I can think of off the top of my head is that when serializing, the
> class needs to add its name as a field on the mapping (`__class:
> Circle`, maybe?), and then this is used to select which concrete
> subclass to instantiate.
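For what it's worth, here is a minimal sketch of that `__class`-discriminator idea (purely illustrative: `Factory`, `registry`, and `createByName` are invented names, not part of the proposed YAML I/O API):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// The Shape hierarchy from the example above.
struct Shape {
  virtual ~Shape() {}
  virtual double getArea() const = 0;
};
struct Circle : Shape {
  double radius;
  Circle() : radius(0) {}
  double getArea() const { return 3.14159265358979 * radius * radius; }
};
struct Rectangle : Shape {
  double width, height;
  Rectangle() : width(0), height(0) {}
  double getArea() const { return width * height; }
};

// A registry mapping the class-name string (the value of the hypothetical
// "__class" key) to a factory for the matching concrete subclass.
typedef std::function<std::unique_ptr<Shape>()> Factory;

static std::map<std::string, Factory> &registry() {
  static std::map<std::string, Factory> r;
  if (r.empty()) {
    r["Circle"]    = [] { return std::unique_ptr<Shape>(new Circle()); };
    r["Rectangle"] = [] { return std::unique_ptr<Shape>(new Rectangle()); };
  }
  return r;
}

// When reading, the parser would look at the "__class" value first,
// instantiate the right subclass, then fill in the remaining keys.
std::unique_ptr<Shape> createByName(const std::string &className) {
  std::map<std::string, Factory>::iterator it = registry().find(className);
  return it == registry().end() ? std::unique_ptr<Shape>() : it->second();
}
```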
>
>
> Yes, handling dynamic subclasses is interesting. We have this problem in
> lld with the various kinds of Atoms (all subclasses of lld::Atom). I'll get
> to what I think we should do with Atoms and yaml in a bit. First I want to
> look at your Shape example.
>
> It usually helps to try writing out examples in yaml. Your write up implied
> that your native data structure is something like std::vector<Shape*>. That
> is, a heterogeneous list of Shapes. Here are a couple of ways to express
> that in yaml.
>
> A) Have a key/value that specifies the class type; only keys for that
> type are then legal.
>
> shapes:
>   - type: rectangle
>     width: 10
>     height: 15
>   - type: circle
>     radius: 5
>   - type: rectangle
>     width: 10
>     height: 12
>
> This is what we currently do for Atoms (and I'm not happy with it). This is
> hard to use with YAML I/O because the semantic legality of some keys depends
> on other keys (for instance, the radius: key is only valid if it is in a map
> that also contains the type: circle key/value).
>
>
> B) Have a key that specifies the class type, and whose value is itself a
> mapping.
>
> shapes:
>   - rectangle:
>       width: 10
>       height: 15
>   - circle:
>       radius: 5
>   - rectangle:
>       width: 10
>       height: 12
>
> This makes it easier for YAML I/O to enforce that the nested keys are valid
> combinations (e.g. error if radius and width are used together). But, since
> in any sequence element every key is optional (rectangle: and circle: are
> both optional), it is hard to enforce that exactly one of those keys is
> used.
>
>
> C) Do some more normalization such that you no longer have a heterogeneous
> sequence, but instead have a couple of homogeneous lists:
>
> rectangles:
>   - width: 10
>     height: 15
>   - width: 10
>     height: 12
> circles:
>   - radius: 5
>
> This format is the most compact. It is also the format that is easiest for
> YAML I/O to validate, since the legal keys at any point are well defined.
> What is lost, though, is the original order of shapes. For lld, that does
> not matter. In fact the lld::File model does not have one list of all
> Atoms. It already has four lists. One for each Atom kind.
>
>
> In summary, the dynamic subclass problem can be fixed by normalizing.
>
> -Nick
>
>
>
> On Thu, Aug 23, 2012 at 9:38 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
>
> On Aug 22, 2012, at 6:30 PM, Sean Silva wrote:
>
>
> template <>
> struct llvm::yaml::ScalarTrait<Color> {
>   static void doScalar(IO &io, Color &value) {
>     io.beginEnumScalar();
>     io.enumScalarMatch(value, "red", cRed);
>     io.enumScalarMatch(value, "blue", cBlue);
>     io.enumScalarMatch(value, "green", cGreen);
>     io.endEnumScalar();
>   }
> };
>
>
> To be honest, I was quite fond of the static table based approach that
> the original patch was using. What happened to that? I prefer it
> because the current approach ends up emitting a bunch of code, whereas
> the static tables are much more compact and simple (IMO).
>
> I was never happy with the look/complexity of the UniqueValue<> template
> and the syntax for the list of pairs and its NULL termination. I figured
> most enums have only a couple of cases, so the code overhead is not that
> much. Has anyone seen a clean syntax for constructing a static list of
> pairs?
>
>
>
> If you do go for this more "code" approach, then I would prefer to get
> rid of the explicit begin/end here. Instead, I would have an ancillary
> type whose constructor does the "begin" stuff and destructor does the
> "end" stuff, and on which you call the methods. Something like:
>
> Helper h(io);
> h.enumScalarMatch(value, "red", cRed);
> h.enumScalarMatch(value, "blue", cBlue);
> h.enumScalarMatch(value, "green", cGreen);
> // ~Helper() does the "end" stuff.
>
> I think I can do better and get rid of the begin/end by having the code that
> calls doScalar() do the begin/end call.
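That refactor might look roughly like this (a mock-up with stand-in types; `processScalar` is an invented name for whatever framework code invokes the trait):

```cpp
#include <string>

enum Color { cRed, cBlue, cGreen };

// Stand-in for the YAML I/O IO object, just enough to show the shape.
struct IO {
  std::string inputWord;          // pretend this scalar was parsed from yaml
  bool began, ended;
  IO() : began(false), ended(false) {}
  void beginEnumScalar() { began = true; }
  void endEnumScalar()   { ended = true; }
  template <typename T>
  void enumScalarMatch(T &value, const char *name, T constant) {
    if (inputWord == name)
      value = constant;
  }
};

template <typename T> struct ScalarTrait;

// The user-written trait no longer needs begin/end calls.
template <> struct ScalarTrait<Color> {
  static void doScalar(IO &io, Color &value) {
    io.enumScalarMatch(value, "red", cRed);
    io.enumScalarMatch(value, "blue", cBlue);
    io.enumScalarMatch(value, "green", cGreen);
  }
};

// The framework brackets every doScalar() call exactly once.
template <typename T>
void processScalar(IO &io, T &value) {
  io.beginEnumScalar();
  ScalarTrait<T>::doScalar(io, value);
  io.endEnumScalar();
}
```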
>
>
>
>
>
> There is no trait for yaml sequences. Instead, if your data type is a class
> with begin, end, and push_back methods, it is assumed to be a sequence.
>
> This is a really good idea, and really simple!
>
> static void mapping(IO &io, const lld::Reference*& ref) {
>   MappingHelper<MyReference, const lld::Reference*> keys(io, ref);
>
>   io.reqKey("kind", keys->_kind);
>   io.optKey("offset", keys->_offset);
>   io.optKey("target", keys->_targetName);
>   io.optKey("addend", keys->_addend);
> }
>
>
> This approach seems insanely convoluted. Why not just use pointers to
> member functions? E.g.
>
>
>
> "convoluted"? You should have seen my previous iterations ;-)
>
>
> But let's put this in context. The POD case is simple. It is the non-POD
> case that gets weird. But there is a reason: you need code to
> "normalize" and "denormalize" the map values. Your native data structure
> could be anything, but you have decided that the cleanest, least redundant
> yaml is a mapping with a particular set of key-value pairs. So, when
> writing yaml, you need code to create those values out of your data
> structures, and when reading yaml you need a way to instantiate your data
> structures from those normalized values.
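As a toy illustration of this normalize/denormalize round trip (the `Person`/`NormalizedPerson` names are made up for the example; this is not YAML I/O code):

```cpp
#include <string>

// Native type: not shaped the way we want the yaml to look.
struct Person {
  std::string first, last;   // yaml should show a single "name" key
};

// Intermediate class in the pattern described above: one ivar per
// normalized field, a constructor that normalizes from the native type,
// and a denormalize step used after reading.
struct NormalizedPerson {
  std::string name;                       // maps to "name: First Last"

  NormalizedPerson() {}                   // used when parsing yaml
  NormalizedPerson(const Person &p)       // used when writing yaml
      : name(p.first + " " + p.last) {}

  Person denormalize() const {            // used after parsing yaml
    Person p;
    std::string::size_type sp = name.find(' ');
    p.first = name.substr(0, sp);
    p.last = (sp == std::string::npos) ? "" : name.substr(sp + 1);
    return p;
  }
};
```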
>
>
> For the normalization code, given that the mapping needs a set of fields
> which are initialized from some data structure, it is a clean fit to define
> an intermediate class with one ivar for each normalized field and have a
> constructor which takes your data structure as an argument.
>
>
> For the denormalization code, you start with the normalized fields (or just
> an instance of the intermediate class) and instantiate whatever internal
> data structures you need. In the case of lld, the intermediate class
> (MyReference) happens to also be able to stand in as the native data
> structure. But that won't be the case for other clients, so I should rework
> this so that the denormalization code can return anything.
>
>
>
> struct MapTraits<const lld::Reference*> {
>   static void mapping(IO &io, const lld::Reference *&ref) {
>     io.reqKey("kind", &lld::Reference::kind, &lld::Reference::setKind);
>     ...
>   }
> };
>
>
> However, presumably there needs to be some form of dynamic dispatch
> here as well, otherwise how will you serialize/deserialize an
> arbitrary lld::Reference (where you don't necessarily know the dynamic
> type)?
>
>
> There are a couple of problems with having getter/setter methods instead of
> an lvalue as parameter to reqKey().
>
> * This code is only needed in the non-POD case. Your native data structures
> probably do not already have getter and setter methods that happen to
> match the normalization model. So, now you would have to create wrapper
> objects just to add those methods.
>
> * The value for that field (e.g. "kind") may not be a scalar. It may be a
> sequence type or another mapping type.
>
> * Because reqKey() is defined to take an lvalue (reference to a variable),
> the same template expansions are used for both reading and writing. If you
> have
> separate parameters for reading and writing, then each expands separately
> and you can get a combinatorial expansion.
>
> * And the problem you pointed out of whether these are virtual or
> non-virtual methods.
>
>
> -Nick
>
>
>
>
> --Sean Silva
>
>
> On Wed, Aug 22, 2012 at 8:49 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
> Sean,
>
>
> I've been working on reimplementing YAML I/O to use a traits-based approach.
> I'm using lld's internal objects as a test bed. The File/Atom/Reference
> objects in lld have no public ivars. Everything is accessed through virtual
> methods. So, if I can do yaml I/O on those classes just by defining trait
> specializations, then the mechanism should be very adaptable.
>
>
> I have something working now, but it is all using C++11. I still need to
> discover what issues will arise when it is used with C++03.
>
> Here is a flavor of what I have working. I want to make sure this is the
> right direction:
>
>
> If you have an enum like:
>
> enum Color { cRed, cBlue, cGreen };
>
> You can write a trait like this:
>
> template <>
> struct llvm::yaml::ScalarTrait<Color> {
>   static void doScalar(IO &io, Color &value) {
>     io.beginEnumScalar();
>     io.enumScalarMatch(value, "red", cRed);
>     io.enumScalarMatch(value, "blue", cBlue);
>     io.enumScalarMatch(value, "green", cGreen);
>     io.endEnumScalar();
>   }
> };
>
>
> Which describes how to convert the in-memory enum value to a yaml scalar and
> back. I'm also working on a way that you can do arbitrary conversion of
> scalars.
>
>
>
> If you have a simple POD struct like this:
>
> struct MyInfo {
>   int hat_size;
>   int age;
>   Color hat_color;
> };
>
>
> You can write a trait like this:
>
> template <>
> struct llvm::yaml::MapTraits<MyInfo> {
>   static void mapping(IO &io, MyInfo& info) {
>     io.reqKey("hat-size", info.hat_size);
>     io.optKey("age", info.age, 21);
>     io.optKey("hat-color", info.hat_color, cBlue);
>   }
> };
>
>
> Which is used to both read and write yaml. The "age" and "hat-color" keys
> are optional in yaml. If not specified (in yaml), they default to 21 and
> cBlue. The "hat-size" key is required, and you will get an error if it is
> not present in the yaml.
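Given that trait, both of the following documents would produce the same MyInfo values (a hypothetical illustration of the defaulting behavior just described):

```yaml
---
# all keys spelled out
hat-size: 7
age: 21
hat-color: blue
---
# same result: age defaults to 21, hat-color defaults to cBlue
hat-size: 7
```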
>
>
> There is no trait for yaml sequences. Instead, if your data type is a class
> with begin, end, and push_back methods, it is assumed to be a sequence.
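One plausible way to implement that detection is a SFINAE test for a usable push_back member; this is only a sketch of the idea, not the actual YAML I/O code:

```cpp
#include <utility>
#include <vector>

// Classify T as "sequence-like" if T::value_type exists and
// t.push_back(value) is well-formed. The first test() overload is chosen
// only when the decltype expression compiles; otherwise the varargs
// fallback wins and value is false.
template <typename T>
class has_push_back {
  template <typename U>
  static char test(decltype(std::declval<U&>().push_back(
                       std::declval<typename U::value_type>()))*);
  template <typename U>
  static long test(...);
public:
  static const bool value = sizeof(test<T>(0)) == sizeof(char);
};
```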
>
>
>
> Now, the interesting case is the handling of non-POD data types. The
> reqKey() and optKey() methods need an lvalue so they can be read (when
> creating yaml) and written (when parsing yaml). It may also be the case
> that your existing data structure is not a container of structs, but a
> container of pointers to structs. But in both those cases, you want to be
> able to have the same yaml representation. Lastly, in the parsing yaml
> case, you need to be able to instantiate an internal object, whereas the
> writing yaml case needs to examine an existing object.
>
>
> Here is an example of the lld Reference type and the trait for converting it
> to and from yaml:
>
> template <>
> struct MapTraits<const lld::Reference*> {
>
>   class MyReference : public lld::Reference {
>   public:
>     MyReference()
>       : _target(nullptr), _targetName(), _offset(0), _addend(0), _kind(0) {
>     }
>     MyReference(const lld::Reference* ref)
>       : _target(nullptr),
>         _targetName(ref->target() ? ref->target()->name() : ""),
>         _offset(ref->offsetInAtom()),
>         _addend(ref->addend()),
>         _kind(ref->kind()) {
>     }
>
>     virtual uint64_t offsetInAtom() const { return _offset; }
>     virtual Kind kind() const { return _kind; }
>     virtual const lld::Atom *target() const { return _target; }
>     virtual Addend addend() const { return _addend; }
>     virtual void setKind(Kind k) { _kind = k; }
>     virtual void setAddend(Addend a) { _addend = a; }
>     virtual void setTarget(const lld::Atom *a) { _target = a; }
>
>     const lld::Atom *_target;
>     StringRef _targetName;
>     uint32_t _offset;
>     Addend _addend;
>     Kind _kind;
>   };
>
>   static void mapping(IO &io, const lld::Reference*& ref) {
>     MappingHelper<MyReference, const lld::Reference*> keys(io, ref);
>
>     io.reqKey("kind", keys->_kind);
>     io.optKey("offset", keys->_offset);
>     io.optKey("target", keys->_targetName);
>     io.optKey("addend", keys->_addend);
>   }
> };
>
>
> Some salient points:
> * The trait is on "const lld::Reference*" because only pointers to
>   References are passed around inside lld.
> * The lld class Reference is an abstract base class, so a concrete instance
>   must be defined (MyReference).
> * There are two constructors for MyReference. The default constructor is
>   used when parsing yaml to create the initial object, which is then
>   overwritten as key/values are found in the yaml. The other constructor is
>   used when writing yaml to create a temporary (stack) instance which
>   contains the fields needed for mapping() to access.
> * MappingHelper<> is a utility which detects if you are reading or writing
>   and constructs the appropriate object. It is only needed for non-POD
>   structs.
>
>
> -Nick
>
>
> On Aug 8, 2012, at 6:34 PM, Sean Silva wrote:
>
>
> Your suggestion is to remove the intermediate data structures and instead
> define the schema via external trait templates. I can see how this would
> seem easier (not having to write glue code to copy to and from the
> intermediate data types). But that copying also does normalization. For
> instance, your native object may have two ivars that together make one yaml
> key-value, or one ivar is best represented as a couple of yaml key-values.
> Or your sequence may have a preferred sort order in yaml, but that is not
> the actual list order in memory.
>
>
>
> I don't get what you're saying here. A traits class can easily handle
> all those conversions.
>
>
> It would look something like:
>
> template<>
> class YamlMapTraits<Person> {
>   void yamlMapping(IO &io, Person *P) {
>     requiredKey(io, &P->name, "name");
>     optionalKey(io, &P->hatSize, "hat-size");
>   }
> };
>
>
> Here I was just trying to mimic one of the examples from your
> documentation, so the feel should be similar. However, the door is open
> to specifying the correspondence however you really want in the traits
> class.
>
>
> I think the hard part of a traits approach is figuring out how clients will
> write the normalization code. And how to make the difficulty of that code
> scale to how denormalized the native objects are.
>
>
>
> One possibility I can think of off the top of my head is to have the
> traits class declare a private intermediate struct which it
> deserializes to (similar to the intermediate that the current API
> _forces_ you to have), and then just construct the object from the
> intermediate. It's so much more flexible to do this with a traits
> class.
>
>
> --Sean Silva
>
>
> On Wed, Aug 8, 2012 at 5:34 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
>
>
> On Aug 8, 2012, at 12:46 PM, Sean Silva wrote:
>
>
>
> But EnumValue is not quite right because it can be used with #defines too.
>
> Do we really want to encourage people to use #defines? Is there any
> set of constants in the LLVM tree which are defined with #defines and
> not in an enum?
>
>
>
>
> I'm not sure what you mean by traits-based in this context.
>
> A traits-based design means that you have a class template which
> provides a collection of type-specific information which is provided
> by specializing the class template for a particular type. For example,
> see include/llvm/ADT/GraphTraits.h, which uses GraphTraits<T> to
> specify how to adapt T to a common interface that graph algorithms can
> use. This is noninvasive (maybe needing a friend declaration at most).
> Your current approach using inheritance and virtual functions is
> invasive, forces the serializable class to inherit (causing multiple
> inheritance in the case that the serializable class already has a
> base), and forces the serializable class to suddenly have virtual
> functions.
>
> Overall, I think a traits-based design would be simpler, more loosely
> coupled, and seems to fit the use case more naturally.
>
>
> As I wrote in the documentation, this was not intended to allow you to go
> directly from existing data structures to yaml and back. Instead the schema
> "language" is written in terms of new data structure declarations
> (subclasses of YamlMap and specializations of Sequence<>).
>
>
>
> Your suggestion is to remove the intermediate data structures and instead
> define the schema via external trait templates. I can see how this would
> seem easier (not having to write glue code to copy to and from the
> intermediate data types). But that copying also does normalization. For
> instance, your native object may have two ivars that together make one yaml
> key-value, or one ivar is best represented as a couple of yaml key-values.
> Or your sequence may have a preferred sort order in yaml, but that is not
> the actual list order in memory.
>
>
>
> I think the hard part of a traits approach is figuring out how clients will
> write the normalization code. And how to make the difficulty of that code
> scale to how denormalized the native objects are.
>
>
>
> I'll play around with this idea and see what works and what does not.
>
>
>
> -Nick
>
>
>
>
>
> On Tue, Aug 7, 2012 at 4:57 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
>
> On Aug 7, 2012, at 2:07 PM, Sean Silva wrote:
>
>
> Thanks for writing awesome docs!
>
>
>
> +Sometime sequences are known to be short and the one entry per line is too
> +verbose, so YAML offers an alternate syntax for sequences called a "Flow
> +Sequence" in which you put comma separated sequence elements into square
> +brackets. The above example could then be simplified to :
>
> It's probably worth mentioning here that the "Flow" syntax is
> (exactly?) JSON. Also, noting that JSON is a proper subset of YAML
> in general is probably worth mentioning.
>
>
>
> + .. code-block:: none
>
> pygments (and hence Sphinx) supports `yaml` highlighting
> <http://pygments.org/docs/lexers/>
>
>
>
> +the following document:
> +
> + .. code-block:: none
>
> The precedent for code listings is generally that the `..
> code-block::` is at the same level of indentation as the paragraph
> introducing it.
>
>
>
> +You can combine mappings and squences by indenting. For example a sequence
> +of mappings in which one of the mapping values is itself a sequence:
>
> s/squences/sequences/
>
>
>
> +of a new document is denoted with "---". So in order for Input to handle
> +multiple documents, it operators on an llvm::yaml::Document<>.
>
> s/operators/operates/
>
>
>
> +can set values in the context in the outer map's yamlMapping() method and
> +retrive those values in the inner map's yamlMapping() method.
>
> s/retrive/retrieve/
>
>
>
> +of a new document is denoted with "---". So in order for Input to handle
>
> For clarity, I would put the --- in monospace (e.g. "``---``"), here
> and in other places.
>
>
> Thanks for the Sphinx tips. I've incorporated them and ran a spell checker
> too ;-)
>
>
>
>
>
> +UniqueValue
> +-----------
>
> I think that EnumValue would be more self-documenting than UniqueValue.
>
>
> I'm happy to give UniqueValue a better name. But EnumValue is not quite
> right because it can be used with #defines too. The real constraint is that
> there be a one-to-one mapping of strings to values. I want it to contrast
> with BitValue which maps a set (sequence) of strings to a set of values
> OR'ed together.
>
>
>
>
>
> At a design level, what are the pros/cons of this approach compared
> with a traits-based approach? What made you choose this design versus
> a traits-based approach?
>
>
>
> I'm not sure what you mean by traits-based in this context. The back
> story is that for lld I've been writing code to read and write yaml
> documents. Michael's YAMLParser.h certainly makes reading more robust, but
> there is still tons of (semantic level) error checking you have to hand
> code. It seemed like most of my code was checking for errors. Also it was
> a pain to keep the yaml reading code in sync with the yaml writing code.
>
>
>
> What we really needed was a way to describe the schema of the yaml documents
> and have some tool generate the code to read and write. There is a tool
> called Kwalify which defines a way to express a yaml schema and can check
> it. But it has a number of limitations.
>
>
>
> Last month I wrote up a proposal for defining a yaml schema language and a
> tool that would use that schema to generate C++ code to read/validate and
> write yaml conforming to the schema. The best feedback I got (from Daniel
> Dunbar) was that rather than create another language (yaml schema language)
> and tools, to try to see if you could express the schema in C++ directly,
> using meta-programming or whatever. I looked at Boost serialization for
> inspiration and came up with this Yaml I/O library.
>
>
>
> -Nick
>
>
>
>
>
> On Mon, Aug 6, 2012 at 12:17 PM, Nick Kledzik <kledzik at apple.com> wrote:
>
>
> Attached is a patch for review which implements the Yaml I/O library I
> proposed on llvm-dev July 25th.
>
> The patch includes the implementation, test cases, and documentation.
> I've included a PDF of the documentation, so you don't have to install the
> patch and run sphinx to read it.
>
> There are probably more aspects of yaml we can support in YAML I/O, but the
> current patch is enough to support my needs for encoding mach-o as yaml for
> lld test cases.
>
> I was initially planning on just adding this code to lld, but I've had two
> requests to push it down into llvm.
>
> Again, here are examples of the mach-o schema and an example mach-o
> document:
>
> -Nick
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
More information about the llvm-commits mailing list