[cfe-dev] RFC: abstract serialization

Thu Sep 19 10:45:46 PDT 2019

On Thu, Sep 19, 2019 at 1:29 PM John McCall <rjmccall at apple.com> wrote:
>
> On 19 Sep 2019, at 9:20, Aaron Ballman wrote:
> > On Wed, Sep 18, 2019 at 6:50 PM John McCall via cfe-dev
> > <cfe-dev at lists.llvm.org> wrote:
> >>
> >> Swift’s AST is largely self-contained, but it occasionally needs to
> >> refer to entities from Clang’s AST. Up until now, we’ve only
> >> needed to embed the occasional clang::Decl*, but we’ve recently
> >> found a reason why it’d be useful to embed a clang::Type*. That
> >> creates a problem for us, because while we know how to serialize a
> >> reference to an external Clang declaration (or at least a subset of
> >> them), we don’t have a way to serialize a reference to an external
> >> Clang type. Now, obviously we could reproduce the structure of that
> >> Clang type in our serialization and deserialization code, but the
> >> reason we want to use Clang’s AST in the first place is that C
> >> types can have a surprising amount of structure; for example,
> >> function types can have calling conventions, regparm attributes, ARC
> >> parameter conventions, and all sorts of other things that have been
> >> added over the years by various extensions. Including all of that
> >> structure, across the entire AST, would be a significant ongoing
> >> maintenance burden. Therefore, we’d rather find some way to take
> >> advantage of Clang’s own serialization logic.
> >>
> >> At the same time, Clang has a longstanding problem with debugging
> >> dumps. We have several different debugging-dump formats, and
> >> they’re all pretty much destined to be incomplete because anybody
> >> augmenting the AST has to remember to include the new information in
> >> all the dumping code. Exhaustiveness checking lets us verify that we
> >> haven’t forgotten an entire node class, but it doesn’t tell us
> >> whether we’ve forgotten a field of that class. We only have one
> >> piece of code that has to get that information right, and that’s
> >> the serialization logic.
> >>
> >> I’d like to propose solving both of these problems in one pass by
> >> introducing a new level of abstraction into the serializer and
> >> deserializer. The basic idea is that we’d write the node-specific
> >> serialization and deserialization code as if it were generating and
> >> consuming some simple JSON-like structured format; it would be
> >> templated to make calls against some abstract physical serialization
> >> layer.
> >>
> >> That is, for code today that looks like this:
> >>
> >> void ASTTypeWriter::VisitVariableArrayType(const VariableArrayType
> >> *T) {
> >>   VisitArrayType(T);
> >>   Record.AddSourceLocation(T->getLBracketLoc());
> >>   Record.AddSourceLocation(T->getRBracketLoc());
> >>   Record.AddStmt(T->getSizeExpr());
> >>   Code = TYPE_VARIABLE_ARRAY;
> >> }
> >>
> >> We’d instead write something more like:
> >>
> >> void AbstractTypeWriter<Serializer>::VisitVariableArrayType(const
> >> VariableArrayType *T) {
> >>   VisitArrayType(T);
> >>   S.addSourceLocation(TYPE_VARIABLE_ARRAY_LBRACKET_LOC,
> >> T->getLBracketLoc());
> >>   S.addSourceLocation(TYPE_VARIABLE_ARRAY_RBRACKET_LOC,
> >> T->getRBracketLoc());
> >>   S.addStmt(TYPE_VARIABLE_ARRAY_SIZE_EXPR, T->getSizeExpr());
> >>   S.setNodeKind(TYPE_VARIABLE_ARRAY);
> >> }
> >>
> >> And the Serializer type would be expected to implement a dozen or so
> >> of these addFoo methods: bool, int, string, begin/end array,
> >> begin/end substructure, SourceLocation, types, sub-statements,
> >> declaration references, maybe some cases I’m forgetting.
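
To make the shape of the proposal concrete, here is a minimal sketch of
one node-specific writer driving two different Serializer backends: a
flat record that ignores field names, and a JSON-like dump that uses
them. Everything in it is an assumption made for illustration: the Field
enum, ToyVariableArrayType, FlatRecordSerializer, DumpSerializer and the
single addInt method are hypothetical stand-ins, not Clang's actual
types or API.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical field identifiers (the "attribute arguments" in the thread).
enum class Field { LBracketLoc, RBracketLoc, SizeExpr };

inline const char *fieldName(Field F) {
  switch (F) {
  case Field::LBracketLoc: return "lbracket_loc";
  case Field::RBracketLoc: return "rbracket_loc";
  case Field::SizeExpr:    return "size_expr";
  }
  return "unknown";
}

// Toy stand-in for an AST node; locations and the expression are plain IDs.
struct ToyVariableArrayType {
  uint32_t LBracketLoc;
  uint32_t RBracketLoc;
  uint32_t SizeExprID;
};

// Backend 1: flat record. Field names are ignored; only call order matters.
struct FlatRecordSerializer {
  std::vector<uint64_t> Record;
  void addInt(Field, uint64_t V) { Record.push_back(V); }
};

// Backend 2: labelled, JSON-like dump built from the same calls.
struct DumpSerializer {
  std::ostringstream OS;
  bool First = true;
  void addInt(Field F, uint64_t V) {
    if (!First)
      OS << ", ";
    First = false;
    OS << '"' << fieldName(F) << "\": " << V;
  }
  std::string str() const { return "{" + OS.str() + "}"; }
};

// Node-specific logic is written once, templated over the backend.
template <typename Serializer>
void writeVariableArrayType(Serializer &S, const ToyVariableArrayType &T) {
  S.addInt(Field::LBracketLoc, T.LBracketLoc);
  S.addInt(Field::RBracketLoc, T.RBracketLoc);
  S.addInt(Field::SizeExpr, T.SizeExprID);
}
```

Calling writeVariableArrayType with a FlatRecordSerializer appends three
integers to Record, while the same call with a DumpSerializer produces a
labelled dump; a field added to the writer shows up in both backends.
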
> >>
> >> On the deserialization side, we would promise to make deserialization
> >> calls in the same order that we make serialization calls so that we
> >> can continue to use a flat representation in our main serialization
> >> path.
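
The ordering promise can be sketched as a round trip through a flat
record: the reader issues its calls in exactly the order the writer did,
so no field tags need to be stored. This is only an illustration under
assumed names (Field, ToyVariableArrayType, FlatWriter, FlatReader,
addInt and readInt are all hypothetical), and returning 0 past the end
of the record is a choice made here so that, like the code described in
the thread, the reader does not report failure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical field identifiers; the flat backends below ignore them.
enum class Field { LBracketLoc, RBracketLoc, SizeExpr };

// Toy stand-in for an AST node.
struct ToyVariableArrayType {
  uint32_t LBracketLoc = 0;
  uint32_t RBracketLoc = 0;
  uint32_t SizeExprID = 0;
};

// Flat writer: values are appended in call order, with no tags.
struct FlatWriter {
  std::vector<uint64_t> Record;
  void addInt(Field, uint64_t V) { Record.push_back(V); }
};

// Flat reader: values are consumed in call order.
struct FlatReader {
  const std::vector<uint64_t> &Record;
  std::size_t Next = 0;
  explicit FlatReader(const std::vector<uint64_t> &R) : Record(R) {}
  // Assumption: reading past the end yields 0 rather than an error.
  uint64_t readInt(Field) {
    return Next < Record.size() ? Record[Next++] : 0;
  }
};

template <typename Serializer>
void writeNode(Serializer &S, const ToyVariableArrayType &T) {
  S.addInt(Field::LBracketLoc, T.LBracketLoc);
  S.addInt(Field::RBracketLoc, T.RBracketLoc);
  S.addInt(Field::SizeExpr, T.SizeExprID);
}

// Reads must mirror the order of the writes in writeNode.
template <typename Deserializer>
ToyVariableArrayType readNode(Deserializer &D) {
  ToyVariableArrayType T;
  T.LBracketLoc = static_cast<uint32_t>(D.readInt(Field::LBracketLoc));
  T.RBracketLoc = static_cast<uint32_t>(D.readInt(Field::RBracketLoc));
  T.SizeExprID = static_cast<uint32_t>(D.readInt(Field::SizeExpr));
  return T;
}
```

A tagged backend could use the Field argument to look values up by name
instead, which is why the reader still passes it even though the flat
form never inspects it.
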
> >>
> >> The current deserialization code does not actually check for failure
> >> in deserializing components, and I would probably continue that for
> >> now.
> >>
> >> I haven’t thought very carefully about what these attribute
> >> arguments would be. They could be strings, but an enum might allow
> >> clever metaprograms. Maybe some of this could be tblgen’ed.
> >>
> >> Thoughts?
> >
> > I think the idea has a lot of merit and is definitely worth exploring.
> > Thank you for bringing the idea up! One caveat: if the plan is to
> > implement AST dumping through this interface, be aware that both the
> > default and JSON dumpers have some odd quirks that may make it
> > difficult to get identical output through another interface.
>
> Thank you.  Do people consider the existing dumper output stable?
> I certainly wouldn’t.

Neither format is stable in the usual sense, but when we did the
refactoring for the AST dumping functionality to split it into text
and JSON dumpers, we took special care to only perturb existing tests
if it improved the readability of the output (provided more
information, was contextually in a better location, etc). The
difficulty we ran into was that small changes to the way some nodes
were visited would make some output better while making some output
worse. In those situations, we tried to keep the output the same as
the original.

That said, we don't need identical output as a result in every situation.

~Aaron

>
> John.