[LLVMdev] Spell Correction Efficiency

Wed Feb 2 17:12:14 PST 2011

On Jan 15, 2011, at 8:31 AM, Matthieu Monrocq wrote:

> Hello Doug,
> 
> *putting llvmdev in copy since they are concerned too*
> 
> I've finally got around to finish a working implementation of the typical Levenshtein Distance with the diagonal optimization.
> 
> I've tested it against the original llvm implementation and checked it on a set of ~18k by randomly generating a variation of each word and checking that both implementations would return the same result... took me some time to patch the diagonal version, but they now do.

Okay. Did you do any performance comparisons? The point of introducing a new algorithm is to improve performance.

> I have left the older method in place, especially because I saw that it was used in FileCheck and was a bit wary of any side-effect I might introduce there.
> 
> I have thus created a new method: unsigned StringRef::levenshtein_distance(StringRef other, unsigned maxEditDistance) and replace the two uses of edit_distance in Sema\SemaLookup.cpp with it.

I would much rather just replace edit_distance completely. Compiler writers aren't string-savvy enough to pick between the two algorithms :)

> I have ditched the "AllowReplace" argument since it further complicated the logic of an already ~150 lines method for no obvious gain (as far as I could see) and I have also explicited the computations a bit by using named temporaries.

Okay. I can't imagine why anyone would want to turn off replacement anyway.

> Unfortunately I have only a Windows machine at hand for the moment and the tests don't work that well... so I'd appreciate if anyone could check my changes.

A bunch of Clang tests failed when I applied your patch, so it's not quite ready yet. These are the failures:

Failing Tests (10):
    Clang :: CXX/class/class.friend/p1.cpp
    Clang :: FixIt/fixit.cpp
    Clang :: Parser/objc-quirks.m
    Clang :: Parser/recovery.c
    Clang :: Sema/MicrosoftExtensions.c
    Clang :: SemaCXX/unknown-type-name.cpp
    Clang :: SemaObjC/category-1.m
    Clang :: SemaObjC/super.m
    Clang :: SemaObjC/synth-provisional-ivars.m
    Clang :: SemaObjCXX/category-lookup.mm

I looked at the first one, and for the strings "UndeclaredSoFar" and "_Imaginary"  and a maxEditDistance of 5 (= the difference in their lengths, if that matters), the algorithm returns an edit distance of 2.

One of the other tests (Parser/objc-quirks.m) is actually a crash.

+  // maxDistance (aka D) directly influences the width of the strip (2*D+1)
+  // we arbitrarily restrict it in case the caller did not see fit to do it
+  // (from clang\lib\Sema\SemaLookup.cpp in TypoCorrectionConsumer::FoundName)
+  unsigned const maxDistance = std::min(unsigned((otherLength+2)/3),
+                                        maxEditDistance);

I don't think this belongs here. We can count on the client to supply a reasonable maxEditDistance. If our performance is going to degrade horribly when an unreasonable maxEditDistance is provided (e.g., > the sum of the string lengths), we could just fall back to the simplistic algorithm we currently have.

Oh, and please look at the LLVM Coding Standards (http://llvm.org/docs/CodingStandards.html), and reformat your patch accordingly. Thanks!

	- Doug