[cfe-commits] cfe-commits Digest, Vol 40, Issue 115

Douglas Gregor dgregor at apple.com
Wed Oct 20 17:06:34 PDT 2010


On Oct 20, 2010, at 1:13 PM, Matthieu Monrocq <matthieu.monrocq at gmail.com> wrote:

> Date: Tue, 19 Oct 2010 19:39:10 -0000
> From: Douglas Gregor <dgregor at apple.com>
> Subject: [cfe-commits] r116849 - /cfe/trunk/lib/Sema/SemaLookup.cpp
> To: cfe-commits at cs.uiuc.edu
> Message-ID: <20101019193911.0338D2A6C12C at llvm.org>
> Content-Type: text/plain; charset="utf-8"
> 
> Author: dgregor
> Date: Tue Oct 19 14:39:10 2010
> New Revision: 116849
> 
> URL: http://llvm.org/viewvc/llvm-project?rev=116849&view=rev
> Log:
> Improve the performance of typo correction, by using a simple
> computation to compute the lower bound of the edit distance, so that
> we can avoid computing the edit distance for names that will clearly
> be rejected later. Since edit distance is such an expensive algorithm
> (M x N), this leads to a 7.5x speedup when correcting NSstring ->
> NSString in the presence of a Cocoa PCH.
> 
> Modified:
>    cfe/trunk/lib/Sema/SemaLookup.cpp
> 
> Modified: cfe/trunk/lib/Sema/SemaLookup.cpp
> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Sema/SemaLookup.cpp?rev=116849&r1=116848&r2=116849&view=diff
> ==============================================================================
> --- cfe/trunk/lib/Sema/SemaLookup.cpp (original)
> +++ cfe/trunk/lib/Sema/SemaLookup.cpp Tue Oct 19 14:39:10 2010
> @@ -2730,6 +2730,12 @@
>  }
> 
>  void TypoCorrectionConsumer::FoundName(llvm::StringRef Name) {
> +  // Use a simple length-based heuristic to determine the minimum possible
> +  // edit distance. If the minimum isn't good enough, bail out early.
> +  unsigned MinED = abs((int)Name.size() - (int)Typo.size());
> +  if (MinED > BestEditDistance || (MinED && Typo.size() / MinED < 3))
> +    return;
> +
>   // Compute the edit distance between the typo and the name of this
>   // entity. If this edit distance is not worse than the best edit
>   // distance we've seen so far, add it to the list of results.
> 
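
The early-out added in this revision can be sketched in isolation as follows. This is a minimal sketch, not the Clang code: `worthComparing` is a hypothetical helper name, and `std::string_view` stands in for `llvm::StringRef`. It only mirrors the two conditions in the patch: the length difference is a lower bound on the edit distance, and a candidate whose length differs by more than roughly a third of the typo's length is dropped.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of Clang): returns true if `name` is still
// worth a full edit-distance computation against `typo`.
bool worthComparing(std::string_view name, std::string_view typo,
                    std::size_t bestEditDistance) {
  // Each insertion or deletion changes the length by one, so the length
  // difference can never exceed the edit distance.
  const std::size_t minED = name.size() > typo.size()
                                ? name.size() - typo.size()
                                : typo.size() - name.size();

  // Cannot beat (or tie) the best candidate seen so far.
  if (minED > bestEditDistance)
    return false;

  // Mirrors the second condition in the patch: reject when the typo's
  // length divided by the minimum distance is below 3.
  if (minED != 0 && typo.size() / minED < 3)
    return false;

  return true;
}
```

Computing the difference on unsigned sizes avoids the signed casts used in the patch; the behavior is otherwise intended to be the same.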
> Hi Doug,
> 
> Another simple optimization could be to count the number of occurrences of each character in both names, then sum the absolute differences of the counts over all characters. If that sum is greater than twice the best edit distance so far, then no combination of additions, deletions, and substitutions within that limit can morph one string into the other (the factor of two is needed because a single substitution can change two counts at once).
> 
> I've not measured it though, so it might slow down the general case.
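
The character-count idea can be sketched as below. This is an illustration under stated assumptions, not code from Clang or LLVM: `histogramLowerBound` is a hypothetical name, strings are treated as sequences of 8-bit characters, and substitutions are assumed to count as a single edit. Because one substitution can lower one character's count and raise another's, the sketch halves the summed differences (rounding up) so that the result is a safe lower bound.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical helper: lower bound on the edit distance between `a` and `b`
// derived only from per-character occurrence counts.
std::size_t histogramLowerBound(std::string_view a, std::string_view b) {
  std::array<int, 256> diff{}; // zero-initialized count differences

  for (unsigned char c : a)
    ++diff[c];
  for (unsigned char c : b)
    --diff[c];

  std::size_t sum = 0;
  for (int d : diff)
    sum += static_cast<std::size_t>(d < 0 ? -d : d);

  // An insertion or deletion changes the sum by at most 1, a substitution
  // by at most 2, so ceil(sum / 2) never exceeds the true edit distance.
  return (sum + 1) / 2;
}
```

A candidate could be skipped when `histogramLowerBound(Name, Typo)` exceeds the best edit distance seen so far. The bound is not tight: `"abc"` and `"cba"` have identical counts, so it yields 0 there. The two passes over the strings plus the 256-entry scan are the cost that would need measuring.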

We would need to gather more data to know if this would help. My gut reaction is that it's too expensive in the general case.

> I was also wondering whether this optimization would be better placed in the `StringRef::edit_distance` method (so that all users may benefit from it)?

Checking the lower bound in edit_distance makes sense. The real solution is to implement a subquadratic edit-distance algorithm.
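
As a rough illustration of checking the lower bound inside the distance routine itself, here is a standalone sketch. It is not `StringRef::edit_distance`: the function name `boundedEditDistance` and the `maxDistance` parameter are assumptions made for this example. It is also still the quadratic dynamic-programming algorithm, so it only adds the cheap early exit and does not address the subquadratic point.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical sketch: Levenshtein distance between `a` and `b`, returning
// maxDistance + 1 whenever the true distance exceeds maxDistance.
std::size_t boundedEditDistance(std::string_view a, std::string_view b,
                                std::size_t maxDistance) {
  // Cheap lower bound: the difference in lengths.
  const std::size_t lowerBound =
      a.size() > b.size() ? a.size() - b.size() : b.size() - a.size();
  if (lowerBound > maxDistance)
    return maxDistance + 1; // skip the M x N computation entirely

  // Single-row dynamic programming table; row[j] holds the distance
  // between the current prefix of `a` and the first j characters of `b`.
  std::vector<std::size_t> row(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j)
    row[j] = j;

  for (std::size_t i = 1; i <= a.size(); ++i) {
    std::size_t diagonal = row[0]; // value from the previous row, column j-1
    row[0] = i;
    for (std::size_t j = 1; j <= b.size(); ++j) {
      const std::size_t above = row[j]; // value from the previous row
      const std::size_t substitution =
          diagonal + (a[i - 1] == b[j - 1] ? 0 : 1);
      row[j] = std::min({above + 1, row[j - 1] + 1, substitution});
      diagonal = above;
    }
  }
  return std::min(row[b.size()], maxDistance + 1);
}
```

With this shape, a caller such as the typo corrector would pass its current best distance as `maxDistance` and treat any result above it as a rejection.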