[PATCH v4 5/8] Start adding the infrastructure for handling TypoExprs.

Tue Oct 7 11:43:21 PDT 2014

On Mon, Oct 6, 2014 at 3:46 PM, Kaelyn Takata <rikka at google.com> wrote:

> On Fri, Oct 3, 2014 at 11:45 AM, David Blaikie <dblaikie at gmail.com> wrote:
>
>> Given that these are parts of a patch series, some of which have already
>> been signed off on/I haven't looked at, it's a bit hard to review the
>> overall design - but I believe you & Richard talked that over, so I'm not
>> too worried about that (just explaining why my review feedback's going to
>> be a bit narrow)
>>
>> +  if (NumTypos > 0 && !ExprEvalContexts.empty())
>> +    ExprEvalContexts.back().NumTypos += NumTypos;
>> +  else
>> +    assert(NumTypos == 0 && "There are outstanding typos after popping the "
>> +                            "last ExpressionEvaluationContextRecord");
>>
>> Could this be simplified? I think it's equivalent to:
>>
>>   if (!ExprEvalContexts.empty())
>>     ExprEvalContexts.back().NumTypos += NumTypos;
>>   else
>>     assert(NumTypos == 0 ... );
>>
>> or:
>>
>>   assert(ExprEvalContexts.empty() || NumTypos != 0 && ...)
>>   if (!ExprEvalContexts.empty())
>>     ...
>>
>> Yes, the "NumTypos > 0" was unnecessary and has been removed. (FYI, the
> second form you suggested doesn't quite work because the assert would fail
> on the common case where ExprEvalContexts is not empty and there are no
> typos.)
>

Hmm, right, maybe:

  assert(ExprEvalContexts.empty() ^ NumTypos != 0 && ... )
  if (!ExprEvalContexts.empty())
    ExprEvalContexts.back().NumTypos += NumTypos;

but perhaps that's just too confusing. I just find an assertion as the only
thing inside a block of code to be a bit off... but it's certainly not
unprecedented in LLVM and is probably more legible - just my own hangup to
get over :)

> There's an unnecessary semicolon at the end of the "while (true)" loop in TransformTypos::Transform
>>
>> Removed.
>
>>
>>
>>
>> I'd roll this up a bit:
>>
>>
>> +      auto &State = SemaRef.getTypoExprState(TE);
>> +      if (State.DiagHandler)
>> +        (*State.DiagHandler)(State.Consumer->getCurrentCorrection());
>>
>> into:
>>
>>   if (auto *Handler = SemaRef.getTypoExprState(TE).DiagHandler)
>>     (*DiagHandler)(State.Consumer->getCurrentCorrection());
>>
>> That rollup doesn't work because State would be unknown when accessing
> State.Consumer in the argument to the DiagHandler.
>

Oh, right!

> count in the condition, plus insert in the body of the if in TransformTypos::TransformTypoExpr would be more efficient as a single insert-and-check (by avoiding two separate lookups into the set):
>>   if (TypoExprs.insert(E).second) // means this element previously didn't exist in the set, and now does
>>
>> I've changed the count-then-insert to just an insert since
> SmallPtrSetImpl::insert returns true when the item wasn't already in the
> set.
>
>> This:
>>
>>   TransformCache.find(E) != TransformCache.end()
>>
>> might commonly be written as:
>>
>>   TransformCache.count(E)
>>
>> Optionally with "!= 0", but often count is used directly as a boolean test, especially when it's a single set/map (not a multi set/map).
>>
>> Changed.
>
>> This comment didn't quite make sense to me:
>>
>>
>> +      // If corrections for the first TypoExpr have been exhausted, retry them
>> +      // against the next combination of substitutions for all of the other
>> +      // TypoExprs.
>>
>> When you say "retry them" which them do you mean? The corrections for the first TypoExpr? But what does it mean for them to be exhausted then, if they can still be considered?
>>
>> Maybe I need a bit of help with the higher level algorithm here - I assume the idea is:
>>
>> Call transform.
>> As it visits TypoExpr nodes, it tries their first typo candidate available.
>> If we get to the end and have invalid code, reset the state of all the typo corrections except the first one (on its next query, it'll give back its second alternative).
>>
>> Here is where the confusion is at. As the transform visits each TypoExpr
> node it tries the first candidate available. If the first candidates of all
> the TypoExprs don't work, it then tries the next (and subsequent)
> candidates of only the first TypoExpr that was seen. When it exhausts the
> candidates from the first TypoExpr, it then resets the stream of candidates
> for the first TypoExpr and fetches the next candidate of the TypoExpr,
> continuing in this way until all combinations of candidates of all
> TypoExprs have been tried.
>

So I'm still a bit confused about which parts of the retry logic are in
which parts of the code.

OK, I think I've got it. Transform is called and we don't even know which
TypoExprs are in the Expr. So we transform.

As transform runs it comes across TypoExprs, and attempts to transform
them. For each one, it tries candidates until it finds a good one (based on
the immediate context - what are the cases where BuildDeclarationNameExpr
fails here? (when does NE.isInvalid fail?)).

So, in your example below, maybe the right answer is BEH, but AF is valid,
with no correct candidate for the 3 choice.

Consumers start at "before the first" element, calling "getNextCorrection"
on a new consumer gives you the first correction.

On the first call to TransformExpr, the first call to TransformTypoExpr
chooses A, records it, continues.
Gets to the second TransformTypoExpr, tries D and E, succeeds at F.
On the third TransformTypoExpr, it tries H and I, both fail, so it returns
from the last line ("return TransformCache[E] = ExprError]).

We get back to Transform with a failure. So we walk the TypoExprs (in
CheckAndAdvanceTypoExprCorrectionStreams) we reset the 3rd consumer
(because it was finished).

We transform again:

First typo) We are at A so we call next and get B
Second typo) From F we continue and try G, fail, then we're finished for
the second typo.
Third typo) Perhaps we find some solution for this, or not, it doesn't
matter.

We get back to Transform with a failure (second typo), So we walk the
TypoExprs again, resetting the second (and maybe the third).

First typo) We are at B so we call next and get to C, let's say we fail
here - C isn't even remotely valid.
Now it really matters whether we find/fail on the second and third typo. If
we fail on the second and third... we're going to give up. We'll find that
all the corrections are finished and we'll conclude that we've tried
everything.

Am I following this right?

> So for 3 TypoExprs... let's say 1 has the candidates (A, B, C), 2 has the
> candidates (D, E, F, G) and 3 has the candidates (H, I) the iterations
> would test the combinations in order:
>
> A D H
> B D H
>

I suppose a simpler form of my above example is: if, on the first pass, we
get ADH, on the next pass the code as-is would result in BCI, wouldn't it?
It'll "next" each of the corrections.

And I'm still not quite understanding why there's the  E != TypoExprs[0]
condition... maybe that's part of what I'm missing. It looks like that code
would cause, in a single Transform, multiple instances of the first typo to
be computed with different/subsequent typo results... which seems
confusing/strange. (though I don't fully understand how/where multiple
instances of the same TypoExpr will be visited in a single Transform)

(some nitpicks of the code - there's a duplicate lookup of TransformCache
(count then [], or count then insert further down the algorithm - it should
probably just be:

ExprResult &CacheEntry = TransformCache[E];
if (CacheEntry)
  return CacheEntry;
...
    return CacheEntry = NE;
}
return CacheEntry = ExprError();

)

> C D H
> A E H
> B E H
> C E H
> A F H
> B F H
> C F H
> A G H
> B G H
> C G H
> A D I
> B D I
> C D I
> A E I
> B E I
> C E I
> A F I
> B F I
> C F I
> A G I
> B G I
> C G I
>
>> If we get to the end, have invalid code, and the first typo has no alternatives left to try - I guess we just want to print errors for everything, without typo corrections?
>>
>>
> If all of the combinations of typo corrections failed to create an
> acceptable Expr, then the current correction for all of TypoExprs would be
> an empty TypoCorrection which would cause the corresponding diagnostic
> handlers to print out errors without suggestions.
>
>> It might be helpful to split out some of these chunks (the state resetting code and maybe the diagnostic printing at the end) into separate functions to give them logical names & make the higher level algorithm more clear?
>>
>> I've split the diagnostics emission and the state handling into separate
> helpers, and expanded their description comments. I'm not keen on the name
> for the state handler but it is the best name I could come up with that was
> even remotely descriptive.
>

Roughly works, though I'm still quite confused, it seems. (alternative name
for that function would be "reset" rather than "advance" - but once I
understand the algorithm better I might have better suggestions)

> The loop at the end ("for (auto E : TypoExprs) {") is over a hashing data structure, right? Is that going to cause unstable output of diagnostics? (eg: different machines, different memory layout, the same program might get the typos printed in a different order) or is there something else that will fix that?
>>
>>
>> If you have to change the set to be a SetVector, or similar (to stabilize output) - then you won't need the FirstTypoExpr anymore - it'll just be the first one in the set, since you iterate in insertion order at that point.
>>
>> Switched it from a SmallPtrSet to a SmallSetVector like it should have
> been. Doing so also allowed me to simplify the state handling code by being
> able to assume a deterministic order, and eliminating FirstTypoExpr.
>
>>
>>
>> "while (TypoCorrection TC = State.Consumer->getNextCorrection()) {" doesn't seem to be a loop - it has an unconditional return at the end of the body. Should it just be an 'if'?
>>
>> There was actually a missing check on whether the result of
> Sema::BuildDeclarationNameExpr was valid.
>

Were there missing/added test cases to cover this situation?

> "+ (FullExpr.get()->isTypeDependent() ||
>>
>> +       FullExpr.get()->isValueDependent() ||
>> +       FullExpr.get()->isInstantiationDependent())) {"
>>
>> What are those checks for? I don't know what condition we would be in if we had typo corrections but it wasn't type, value, or instantiation dependent. Perhaps a comment could explain that?
>>
>> The check is to avoid running the tree transform on Expr trees that are
> known to not have a TypoExpr in them, regardless of whether the current
> expression evaluation context still indicates there are uncorrected typos.
> I've added a comment to that effect.
>

Hmm - I suppose my question might then be: When would we have NumTypos > 0
but be in a situation where the full expression cannot have any typos in it?

> To simplify Sema::clearDelayedTypo you could add MapVector::erase(const KeyType&) (given that std::map has such an operation, it seems like a reasonable one for MapVector to have).
>>
>> Done (simple patch attached).
>

Great - if you'd like to commit this separately, it should have a unit
test. Otherwise it's probably simple enough to commit along with your new
usage if you prefer.

Sema::getTypoExprState looks like it could be a const function (it looks
>> like it's never meant to be called when the element isn't already in the
>> map) - using 'lookup' instead of operator[] will be a const operation on
>> the map (& assert if the element doesn't exist in the map).
>>
>
> Sadly I cannot do that because the operator[] cannot be used if the method
> is made const, and .lookup() can't be used because it returns by-value
> instead of by-reference and TypoExprState is not copyable:
>
> ../tools/clang/lib/Sema/SemaLookup.cpp:4600:10: warning: returning
> reference to local temporary object [-Wreturn-stack-address]
>   return DelayedTypos.lookup(TE);
>
> ../include/llvm/ADT/MapVector.h:88:30: error: call to implicitly-deleted
> copy constructor of 'const clang::Sema::TypoExprState'
>     return Pos == Map.end()? ValueT() : Vector[Pos->second].second;
>

Oh, weird - my mistake in assuming/misremembering what lookup actually
does. I'd still make getTypoExprState const and use find+assert just to
constrain the interface a bit more rather than leaving it possible to
accidentally create a new entry in the map when the API is only intended to
query existing entries.

>
>
>> Otherwise this all looks like rather neat stuff - though I'm not as able
>> to review in detail some of the refactoring of existing typo correction
>> stuff. It looks plausible.
>>
>
> I've attached the updated patch. Thanks for reviewing this, David!
>
>
>>
>
>>
> On Thu, Sep 25, 2014 at 2:00 PM, Kaelyn Takata <rikka at google.com> wrote:
>>
>>> ping
>>>
>>> On Tue, Aug 26, 2014 at 11:05 AM, Kaelyn Takata <rikka at google.com>
>>> wrote:
>>>
>>>>
>>>> +dblaikie
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 1:20 PM, Kaelyn Takata <rikka at google.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Part of the infrastructure is a map from a TypoExpr to the
>>>>> Sema-specific
>>>>> state needed to correct it, along with helpers to ease dealing with the
>>>>> state.
>>>>>
>>>>> The the typo count is propagated up the stack of
>>>>> ExpressionEvaluationContextRecords when one is popped off of to
>>>>> avoid accidentally dropping TypoExprs on the floor. For example,
>>>>> the attempted correction of g() in test/CXX/class/class.mem/p5-0x.cpp
>>>>> happens with an ExpressionEvaluationContextRecord that is popped off
>>>>> the stack prior to ActOnFinishFullExpr being called and the tree
>>>>> transform for TypoExprs being run.
>>>>> ---
>>>>>  include/clang/Sema/Sema.h           |  44 +++++
>>>>>  include/clang/Sema/SemaInternal.h   |  15 +-
>>>>>  include/clang/Sema/TypoCorrection.h |   2 +-
>>>>>  lib/Sema/SemaExpr.cpp               |   7 +
>>>>>  lib/Sema/SemaExprCXX.cpp            | 108 ++++++++++++
>>>>>  lib/Sema/SemaLookup.cpp             | 316
>>>>> ++++++++++++++++++++++++------------
>>>>>  6 files changed, 384 insertions(+), 108 deletions(-)
>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20141007/af896328/attachment.html>