[PATCH] Introduction of ASTMatcher sub-expression references

Tue May 7 08:23:26 PDT 2013

  Let me start by saying the patch is based on what I observed in the code since I was trying to fit in with the existing algorithm as much as possible. If I had access to what I call the "code philosophy", the underlying ideas and abstractions used in the design which are not necessarily expressed, perhaps the design would have been different. This is where your feedback is so helpful so thanks!

  > -----Original Message-----
  > From: Manuel Klimek [mailto:klimek at google.com]
  > Sent: Monday, May 06, 2013 4:16 AM
  > To: klimek at google.com; dgregor at apple.com; gribozavr at gmail.com; Vane,
  > Edwin
  > Cc: cfe-commits at cs.uiuc.edu
  > Subject: Re: [PATCH] Introduction of ASTMatcher sub-expression references
  >
  >
  >   First, thanks a lot for working on this - I think this is going to be awesome!
  >   Some minor comments inside, but first let's talk about the design. I'm not
  > convinced yet we can't solve this in a much simpler way :)
  >
  >   The main problem I currently see is that we build up BoundNodesTrees through
  > BoundNodesBuilders without their parent context, and only add them after they
  > match.
  >
  >   Now there are 2 things about the design of the proposed solution I'm unhappy
  > with:
  >   1. the ref-counted pointer design: the BoundNodesTreeBuilders are always on
  > the stack, and always in a hierarchy that mirrors the call hierarchy - thus, I think
  > ownership is very clear and straight forward, and I'd think we can solve this
  > problem without introducing less clear ownership semantics
  >   2. the child BoundNodesTrees are always built in the cause of one function
  > (they're on the stack), and knowing their parents (as they always will be added to
  > that parent in the end); so I think adding a simple parent pointer should be
  > enough to solve the immediate problem

  I started with this design actually. A single parent pointer is fine for handling roll-backs and getting access to the root. However, more information is required: each parent needs to know the 'child' in-progress so BoundNodesTrees can be created from the root. This relates to your comment about forEach* boundaries. I had in mind that equalsBoundNode() should be able to mention any previously bound node anywhere in the expression. This seems the most usable and understandable option to me. I think the use of the matcher would be diminished if we started saying that 'forEach*' matchers become boundaries across which equalsBoundNode() won't search.

  So given that, we need to create BoundNodesTrees from the root so all previously bound nodes are visible. To do this, a two-way link is necessary between parents and children-in-progress. Now the problem is keeping the links up to date. Good engineering principles say we should design interfaces that are hard to use incorrectly so to keep both links up-to-date with minimal chance of misuse I created the Grafter class. Trying to be as least invasive as possible in existing code I thought if we made the Grafter class an RAII object then letting it go out of scope is equivalent to discarding it which is what happens with a stack-based Builder goes out of scope. The copyTo() functions that were used to apply a Builder after a successful match were replaced by Grafter::apply() calls.

  And it's the RAII idea that unfortunately lead to use of ref-counting. I had this idea that Builders should create Grafter objects. But to pass ownership of the object to the caller where it really belongs I needed something like auto_ptr. OwningPtr isn't the same so I turned to the more heavy-weight ref-counting pointers in LLVM. However, I think the stack-based option is not really that much more complex or easy to misuse. Code can create Grafters directly passing in the parent builder.

  >   Then, there's some fundamental design decisions that I think we should discuss
  > first. The basic stuff is pretty straight forward, as there are actually no branches
  > in the BoundNodesTree.
  >   For branches in the BoundNodesTree there are 2 different cases:
  >   a) branches due to has() and hasDescendant(). The interesting point here is that
  > there is only one match in the end, and the branching is mainly there so we can
  > roll back sub-matches in case we need to follow a different branch higher up
  >   b) branches due to forEach() and forEachDescendant(). Those are the really
  > hard cases :) For example:
  >   stmt(
  >     forEachDescendant(stmt().bind("x")),
  >     forEachDescendant(stmt(equalsBoundNode("x"))))
  >   Which combinations do we want to test here? You're basically defining it to
  > "first" in the first match (probably as you want to avoid the combinatorial
  > explosion), but I'm not sure yet that's the right approach. If we want that, I feel
  > like there should be a much simpler way to implement it. Anyway, I'd like to have
  > that strategy pinned down in one place in the code if at all possible, so we can
  > change it if it turns out that users find it non-intuitive.

  By combinatorial explosion are you referring to the fact that every stmt(equalsBoundNode()) will cause a search through all bound nodes? I agree with you that I don't really like the current state of equalsBoundNode() comparing against only the first found node with the same name. The combinatorial explosion case should be preferred for usability I think. If performance is found to be an issue we can look at solutions then? The existing algorithm employed by BoundNodesTree::visitMatches() is really copy heavy and the matcher you gave above would certainly drag it to a halt. I'm sure something better can be achieved just for iterating through nodes. There are plenty of other options to explore in any case.

  >   My current intuition is: checking against the *first* in a previous forEach branch
  > is bad - this goes against the "imperative" intuition that the last iteration in a
  > loop overwrites a value.
  >   I think we should either do the full combinatorial explosion when going across
  > forEach* boundaries, or we should not support it at all - I'd like to keep as much
  > as possible "functional style".
  >
  >
  > ================
  > Comment at: include/clang/ASTMatchers/ASTMatchers.h:3343
  > @@ +3342,3 @@
  > +                                              llvm::StringRef>
  > +equalsBoundNode(StringRef id) {
  > +  return internal::PolymorphicMatcherWithParam1<
  > ----------------
  > s/id/ID/
  >
  > ================
  > Comment at: include/clang/ASTMatchers/ASTMatchersInternal.h:71
  > @@ -67,1 +70,3 @@
  > +
  > +public:
  >    /// \brief Adds \c Node to the map with key \c ID.
  > ----------------
  > Why the extra public:?

  Not sure if this is against LLVM style but I tend to split public types, public functions, and public data into their own sections repeating the access modifier. Same for private. If it's against style I'll fix it and use comments instead.

  > ================
  > Comment at: include/clang/ASTMatchers/ASTMatchersInternal.h:174
  > @@ -159,2 +173,3 @@
  >  public:
  > -  BoundNodesTreeBuilder();
  > +  class Grafter : public llvm::RefCountedBase<Grafter> {
  > +  public:
  > ----------------
  > I really dislike the name Grafter. First, for a non-native speaker it's a rather rare
  > word to be used, and second, even after looking up its meaning, the name
  > doesn't tell me anything about what this class does.

  Names can be changed. I was coming from a botanical viewpoint here: grafting branches onto trees. How about TempBuilder or UnmatchedBuilder?

  > ================
  > Comment at: unittests/ASTMatchers/ASTMatchersTest.cpp:3946
  > @@ +3945,3 @@
  > +
  > +TEST(SameAsMatcher, UsingForEachDescendant) {
  > +  const std::string Fragment = "\
  > ----------------
  > General notes on the tests:
  > 1. Please use one string literal per line, as that makes it consistent with the rest
  > of the file 2. I am doubtful the tests always have to be that complex (especially
  > they contain too many different things). In general, try to write tests that are as
  > small as possible to test one thing, and use as few C++ features as possible to
  > achieve that (minimize number of types / constructs used).

  I will simplify tests and fix string literals. The test MatchesFirstBoundNode is probably going to be replaced completely anyway based on the discussion above about full combinatorial matches.

  >
  > http://llvm-reviews.chandlerc.com/D740

http://llvm-reviews.chandlerc.com/D740