[cfe-commits] PATCH: Add support for C++ namespace-aware typo correcting

Kaelyn Uhrain rikka at google.com
Fri May 20 16:17:13 PDT 2011


Here's a WIP version of my rewritten namespace-aware typo correction patch,
based on your feedback and our previous discussion.  Right now there are two
versions of Sema::CorrectTypo: one kept for compatibility with the original
signature of the method, and a new one that only lib/Sema/SemaExpr.cpp has
been modified to call directly so far.  With just SemaExpr.cpp hooked up,
the number of errors in the new unit test dropped from 58 to 48, and
currently the only failure in existing tests that I still need to iron out
is test/CXX/basic/basic.lookup/basic.lookup.argdep/p4.cpp (I expect there
will be more in the C++ tests as new corrections become possible).

I still have a fair bit of work to do, such as making the rest of the
callers of CorrectTypo read their results from the new TypoCorrection object
instead of from LookupResults, and streamlining Sema::CorrectTypo a bit
more.  I'm sending what I have so far not for submission, but to get your
feedback on the new approach before I put a lot of effort into polishing it.

I also have a couple of questions regarding unqualified typo correction.
The current behavior is that if the same typo shows up again, one of two
things happens:
 - if the cached value indicated that no correction was found last time,
the same null result is returned this time
 - otherwise, CorrectTypo checks for the best match using the cached
identifier plus all keywords valid in the current context (so it is
theoretically possible for the same typo to yield a different correction,
if the typo is a closer match to a now-valid keyword than to the
non-keyword identifier that was returned last time).
Do we want to stick with this semi-convoluted behavior, or shall I simplify
the caching to always return the cached correction as long as it is still in
scope or is a null match?  And with the new probing into namespaces, do we
want to always retry the search when the cache indicates a null match (or
possibly for a subset of instances, such as when the current DeclContext is
different)?
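
For concreteness, the simplified caching I have in mind would behave roughly
like this (a standalone sketch, not the actual Sema code; the StillInScope
and FullSearch callbacks stand in for real lookups, and note that it always
trusts a cached null match, which is exactly what my second question is
about):

#include <map>
#include <string>

// An empty string means "no correction was found for this typo last time".
static std::map<std::string, std::string> TypoCache;

std::string correctTypoCached(const std::string &Typo,
                              bool (*StillInScope)(const std::string &),
                              std::string (*FullSearch)(const std::string &)) {
  auto Cached = TypoCache.find(Typo);
  if (Cached != TypoCache.end() &&
      (Cached->second.empty() || StillInScope(Cached->second)))
    return Cached->second;  // reuse the previous (possibly null) result
  std::string Best = FullSearch(Typo);  // otherwise redo the whole search
  TypoCache[Typo] = Best;
  return Best;
}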

Cheers,
Kaelyn

On Sat, Mar 26, 2011 at 4:51 AM, Douglas Gregor <dgregor at apple.com> wrote:

>
> On Mar 25, 2011, at 10:15 PM, Kaelyn Uhrain wrote:
>
>
>
> On Mon, Mar 14, 2011 at 10:28 AM, Douglas Gregor <dgregor at apple.com> wrote:
>
>>
>>        1) Don't walk the namespaces at all; let the check against all of
>> the identifiers in the translation unit winnow the list down to the best
>> candidates based on name alone. We don't know which of these candidates is
>> visible, or how to get there.
>>
>
> The problem with first checking by name alone is that you can mistakenly
> discard the best match--I had that problem in early iterations of my patch.
> Take for example:
>
> namespace foo {
>   namespace bar {
>      void tree_branches() { /* ... */ }
>   }
> }
>
> void three_branches() { /* ... */ }
>
> int main() {
>   tree_branches();
> }
>
> Currently my patch will suggest three_branches, as it has an edit distance
> of 1 versus an edit distance of 2 for foo::bar::tree_branches.  Checking
> against all identifiers by name alone first would cause the
> TypoCorrectionConsumer to discard three_branches in favor of tree_branches,
> since it would think tree_branches has an edit distance of 0... but it
> would later realize tree_branches is unreachable by unqualified lookup from
> the current context and discard it (as the code would do now), or possibly
> suggest a sub-optimal correction (if the code waited to compute the
> qualifier).  So the challenge is to figure out the qualifier needed for
> each identifier before judging its edit distance, without doing an
> unqualified lookup (and potentially a search for a valid qualified lookup)
> for every identifier.  In other words, the trick is getting the right info
> at the right time without walking through the namespaces in the AST, given
> that the IdentifierInfo objects stored in Context.Idents lack any reference
> to a declaration context.
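>
> For concreteness, the ranking I'm describing charges each correction its
> name edit distance plus the length of the qualifier needed to reach it.
> A standalone model of the scoring (illustrative only, not the patch itself;
> in-tree the name part can come from llvm::StringRef::edit_distance):
>
> #include <algorithm>
> #include <string>
> #include <vector>
>
> // Classic dynamic-programming Levenshtein distance over two names.
> unsigned editDistance(const std::string &A, const std::string &B) {
>   std::vector<unsigned> Row(B.size() + 1);
>   for (unsigned J = 0; J <= B.size(); ++J) Row[J] = J;
>   for (unsigned I = 1; I <= A.size(); ++I) {
>     unsigned Prev = Row[0];
>     Row[0] = I;
>     for (unsigned J = 1; J <= B.size(); ++J) {
>       unsigned Cur = std::min({Row[J] + 1, Row[J - 1] + 1,
>                                Prev + (A[I - 1] != B[J - 1])});
>       Prev = Row[J];
>       Row[J] = Cur;
>     }
>   }
>   return Row[B.size()];
> }
>
> // Each namespace in the required qualifier costs one extra edit.
> unsigned correctionDistance(const std::string &Typo, const std::string &Name,
>                             unsigned QualifierDepth) {
>   return editDistance(Typo, Name) + QualifierDepth;
> }
>
> With that scoring, three_branches costs 1 while foo::bar::tree_branches
> costs 0 + 2 = 2, so the suggestion that is actually in scope wins.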
>
>
> Okay, that makes sense.  Since we have to go through all of the identifiers
> anyway, we could certainly just keep several different buckets (best edit
> distance, best edit distance + 1, best edit distance + 2) so we can catch
> these cases.  Then, if nothing in the first bucket matches, we go on to the
> next-best bucket, and so on.
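>
> A standalone sketch of the bucket idea (illustrative only; the names and
> containers here are mine, not the TypoCorrectionConsumer's):
>
> #include <map>
> #include <string>
> #include <vector>
>
> // Candidates grouped by edit distance; the map orders buckets best-first.
> std::map<unsigned, std::vector<std::string>> Buckets;
>
> void addCandidate(const std::string &Name, unsigned ED) {
>   Buckets[ED].push_back(Name);
> }
>
> // Walk the buckets best-first; the first bucket containing a name that is
> // actually reachable (unqualified or via a qualifier) supplies the winner.
> std::string pickCorrection(bool (*IsReachable)(const std::string &)) {
>   for (const auto &Bucket : Buckets)
>     for (const auto &Name : Bucket.second)
>       if (IsReachable(Name))
>         return Name;
>   return std::string();  // no usable candidate in any bucket
> }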
>
>>        2) When we're performing lookup for each candidate (e.g., the calls
>> to LookupPotentialTypoResult), first try unqualified lookup. If that fails,
>> save the candidate in a separate "try qualified name lookup" set.
>>
>
> With my above comments in mind, the unqualified lookup would have to be
> done for potentially every identifier in the translation unit (any
> subsequent identifier that has a greater edit distance than the closest
> correction so far that was successfully looked up can be skipped, but the
> worst-case scenario is that all identifiers need to be looked up),
> followed by qualifier computation (involving walking the decls in the AST
> for child namespaces, unless I'm missing some handy method for building
> qualified identifiers) on every identifier that failed the unqualified
> lookup but does not have a greater edit distance than the closest valid
> unqualified identifier.
>
>
> We can't look up all of the identifiers. It's much, much too expensive;
> that's why we have the system we have, so that we can quickly narrow down to
> a small set of entities that we perform lookups on.
>
> It's better to guarantee acceptable performance and miss a few typo
> correction suggestions than it is to slow compilation down to a crawl due
> to, e.g., a missing header inclusion.
>
>>        3) If none of the candidates can be found by unqualified lookup,
>> start looking in each of the namespaces we know about to see if qualified
>> name lookup would work to find each of the candidates. You'll likely need a
>> side structure containing all of the (unique, canonicalized) NamespaceDecls
>> in the translation unit to make this efficient.
>>
>
> How would building the side structure containing all of those
> NamespaceDecls be more efficient than ScanNamespaceForCorrections(), which
> stops walking through namespaces once the distance between a namespace and
> the current context is greater than the best edit distance known to the
> TypoCorrectionConsumer?  And how would it walk any less of the AST than
> ScanNamespaceForCorrections does?
>
>
> If we're only looking up a small number of identifiers, then we can perform
> targeted lookups of those identifiers, which (in the PCH case) boils down to
> a fairly quick search in an on-disk hash table.  Since the number of
> namespaces is generally much smaller than the total number of declarations,
> we're doing less work.
>
>
>> Implemented this way, there's no additional penalty for typo correction of
>> visible names. Moreover, we skip the AST walk entirely, and only perform
>> targeted lookups for names that we know exist (because they were in the
>> identifier table). This also permits a few more optimizations:
>>
>>        - We could (roughly) order namespaces based on their depth, so that
>> we'll tend to visit the more top-level (and, therefore, more likely to have
>> short qualifiers) namespaces first. If the qualifier for a namespace would
>> be longer than the shortest qualifier we've found so far, we don't have to
>> look in that namespace.
>>
>
> ScanNamespaceForCorrections already skips namespaces that would require
> overly long qualifiers.
>
>
> Sure, but in my experience, most large namespaces (std, boost, llvm, clang)
> are top-level namespaces, so pruning based on overly long qualifiers is
> likely to prune away only the small namespaces.
>
> I still advocate doing that pruning, but I want quick hash-table lookups
> based on the candidate identifiers rather than a walk of all of the
> contents of these namespaces.
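>
> Something like the following, as a minimal standalone sketch of that
> depth-ordered pruning (strings stand in for NamespaceDecls here):
>
> #include <algorithm>
> #include <string>
> #include <vector>
>
> struct NamespaceInfo {
>   std::string Qualifier;  // e.g. "foo::bar::"
>   unsigned Depth;         // number of nested namespaces in the qualifier
> };
>
> // Visit shallow namespaces first, and stop as soon as the qualifier alone
> // would cost more than the best correction found so far.  Takes the vector
> // by value so it can sort a private copy.
> void visitNamespaces(std::vector<NamespaceInfo> Namespaces,
>                      unsigned BestDistanceSoFar,
>                      void (*LookupCandidatesIn)(const NamespaceInfo &)) {
>   std::sort(Namespaces.begin(), Namespaces.end(),
>             [](const NamespaceInfo &A, const NamespaceInfo &B) {
>               return A.Depth < B.Depth;
>             });
>   for (const NamespaceInfo &NS : Namespaces) {
>     if (NS.Depth > BestDistanceSoFar)
>       break;  // every remaining namespace is at least as deep
>     LookupCandidatesIn(NS);
>   }
> }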
>
>>        - For a given namespace, we could look for all of the candidates in
>> that namespace at the same time, improving locality (which is particularly
>> important when those lookups hit the disk in the PCH case).
>>
>
> ScanNamespaceForCorrections currently feeds candidates to the
> TypoCorrectionConsumer by namespace.  Admittedly, it could store child
> NamespaceDecls to be searched after the current DeclContext has been
> searched, instead of searching them as they are found, so that the overall
> search for candidates is breadth-first instead of depth-first.  (And now
> that I've thought of this, I'm going to update ScanNamespaceForCorrections
> to do so, since it will further limit the number of namespaces searched.)
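>
> The breadth-first version would queue child namespaces instead of recursing
> into them immediately; roughly (a sketch of the traversal shape only, with
> strings standing in for DeclContexts):
>
> #include <deque>
> #include <string>
> #include <utility>
> #include <vector>
>
> void scanBreadthFirst(
>     const std::string &Root, unsigned MaxDepth,
>     std::vector<std::string> (*ChildrenOf)(const std::string &),
>     void (*Search)(const std::string &)) {
>   std::deque<std::pair<std::string, unsigned>> Worklist;
>   Worklist.push_back({Root, 0});
>   while (!Worklist.empty()) {
>     auto [NS, Depth] = Worklist.front();
>     Worklist.pop_front();
>     Search(NS);  // shallower namespaces are always searched first
>     if (Depth < MaxDepth)
>       for (const std::string &Child : ChildrenOf(NS))
>         Worklist.push_back({Child, Depth + 1});
>   }
> }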
>
>
> As I see it, the main issue for scalability is to avoid having to walk
> through the declarations in the AST, because that's going to cause massive
> deserialization from the PCH file. By relying primarily on the walk we have
> to perform through all of the known identifiers, and then doing targeted
> lookups into namespaces we know about, we can bound the problem to O(# of
> candidate identifiers * # of namespaces that are close enough).
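>
> (To put hypothetical numbers on it: with, say, 10 candidate identifiers and
> 50 close-enough namespaces, that is at most 500 targeted hash-table
> lookups, versus deserializing every declaration an AST walk would touch.)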
>
> - Doug
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clang-namespace-aware-typo-corrections2.diff
Type: text/x-diff
Size: 46533 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20110520/2bec1e7f/attachment.diff>

