[cfe-dev] Spell Correction Efficiency

Matthieu Monrocq matthieu.monrocq at gmail.com
Sat Feb 5 11:30:19 PST 2011


Hello Doug,

thanks for the review :)

I went ahead and reworked the patches according to your remarks, namely:
- I modified edit_distance directly, rather than introducing a new method,
dropping the "AllowReplacement" parameter and refactoring the calls
- I have changed the "artificial" lowering, a levenshtein cannot be higher
than the longer of the two strings (perform "short.length()" replacements
and the rest as additions to turn the short into the long)
- I checked the 80-columns line
- I realized that the K&R brace style was used

If there is anything I missed in the Coding Standard, please let met know; I
have notably interpreted the "camelCase" guideline as lower camelCase, since
otherwise "PascalCase" would have been clearer I guess.


The tests still don't work well on windows (exception in clang's lit.cfg,
I've tried to fix it but it yielded a fatal error instead so... :/). But I
did succeeded in compiling the whole of llvm and clang, from scratch, after
wiping out the build directory and reruning CMake (and I do hate VC++
sluggishness, btw).


I've tested it in isolation (against the original algorithm), it looks like
I had made an error when transposing my implementation to the body of
StringRef last time (lost in translation...) so this time I worked directly
on StringRef to avoid further issues. It looks like the algo had a bug ('='
instead of '+') and would not report distances greater than 1 or 2... *hum*


I have benchmarked the performances using the following method (compiled in
Release mode in VC++2010):
> pickup a corpus of about ~18,000 distinct words, containing letters,
digits and underscore (to match C++ identifiers), with length varying from 1
to 18.
> randomly select 100 words in the corpus, and perform up to 5 random
alterations on them
> for each altered word
  - initialize max to the altered word length
  - loop over the corpus, compute the distance between the current word and
the altered one (bounded by max), update max if necessary and stop if it
reaches 0
  - report the "best" score achieved (check that both algorithm report the
same score)
  - rerun 5000 times this search (over the whole corpus) first for the
"classic" distance then for the "diagonal" distance
> record each "time" (measured from outside the 5000 loop) depending on the
"best" score achieved
> display the average, per "best" score, as well as the speed-up achieved

    Distance         Brute      Diagonal  Speedup
    Distance: 0     0.00234   0.00057   x4.11
    Distance: 1     0.00550   0.00150   x3.67
    Distance: 2     0.00785   0.00337   x2.33
    Distance: 3     0.01003   0.00636   x1.57
    Distance: 4     0.01303   0.00993   x1.31
    Distance: 5     0.01701   0.01087   x1.56

I don't have any explanation of why 5 would perform better than 4, it looks
like an artifact of the tests (and may be due to the distance tightening
sooner). The trend should, logically, be decreasing, since the number of
computations is directly impacted by the "best distance so far" in the loop.
Anyway, it looks like it performs rather well.


Note: there are two patches, one for llvm and one for clang. The patches are
correlated and cannot be applied separately, because the signature of
edit_distance changed (we lost a parameter).

2011/2/3 Douglas Gregor <dgregor at apple.com>

>
> On Jan 15, 2011, at 8:31 AM, Matthieu Monrocq wrote:
>
> > Hello Doug,
> >
> > *putting llvmdev in copy since they are concerned too*
> >
> > I've finally got around to finish a working implementation of the typical
> Levenshtein Distance with the diagonal optimization.
> >
> > I've tested it against the original llvm implementation and checked it on
> a set of ~18k by randomly generating a variation of each word and checking
> that both implementations would return the same result... took me some time
> to patch the diagonal version, but they now do.
>
> Okay. Did you do any performance comparisons? The point of introducing a
> new algorithm is to improve performance.
>
> > I have left the older method in place, especially because I saw that it
> was used in FileCheck and was a bit wary of any side-effect I might
> introduce there.
> >
> > I have thus created a new method: unsigned
> StringRef::levenshtein_distance(StringRef other, unsigned maxEditDistance)
> and replace the two uses of edit_distance in Sema\SemaLookup.cpp with it.
>
> I would much rather just replace edit_distance completely. Compiler writers
> aren't string-savvy enough to pick between the two algorithms :)
>
> > I have ditched the "AllowReplace" argument since it further complicated
> the logic of an already ~150 lines method for no obvious gain (as far as I
> could see) and I have also explicited the computations a bit by using named
> temporaries.
>
> Okay. I can't imagine why anyone would want to turn off replacement anyway.
>
> > Unfortunately I have only a Windows machine at hand for the moment and
> the tests don't work that well... so I'd appreciate if anyone could check my
> changes.
>
> A bunch of Clang tests failed when I applied your patch, so it's not quite
> ready yet. These are the failures:
>
> Failing Tests (10):
>    Clang :: CXX/class/class.friend/p1.cpp
>    Clang :: FixIt/fixit.cpp
>    Clang :: Parser/objc-quirks.m
>    Clang :: Parser/recovery.c
>    Clang :: Sema/MicrosoftExtensions.c
>    Clang :: SemaCXX/unknown-type-name.cpp
>    Clang :: SemaObjC/category-1.m
>    Clang :: SemaObjC/super.m
>    Clang :: SemaObjC/synth-provisional-ivars.m
>    Clang :: SemaObjCXX/category-lookup.mm
>
> I looked at the first one, and for the strings "UndeclaredSoFar" and
> "_Imaginary"  and a maxEditDistance of 5 (= the difference in their lengths,
> if that matters), the algorithm returns an edit distance of 2.
>
> One of the other tests (Parser/objc-quirks.m) is actually a crash.
>
> +  // maxDistance (aka D) directly influences the width of the strip
> (2*D+1)
> +  // we arbitrarily restrict it in case the caller did not see fit to do
> it
> +  // (from clang\lib\Sema\SemaLookup.cpp in
> TypoCorrectionConsumer::FoundName)
> +  unsigned const maxDistance = std::min(unsigned((otherLength+2)/3),
> +                                        maxEditDistance);
>
> I don't think this belongs here. We can count on the client to supply a
> reasonable maxEditDistance. If our performance is going to degrade horribly
> when an unreasonable maxEditDistance is provided (e.g., > the sum of the
> string lengths), we could just fall back to the simplistic algorithm we
> currently have.
>
> Oh, and please look at the LLVM Coding Standards (
> http://llvm.org/docs/CodingStandards.html), and reformat your patch
> accordingly. Thanks!
>
>        - Doug
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110205/c7f889df/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: edit_distance_clang.diff
Type: application/octet-stream
Size: 1154 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110205/c7f889df/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: edit_distance_llvm.diff
Type: application/octet-stream
Size: 19816 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110205/c7f889df/attachment-0001.obj>


More information about the cfe-dev mailing list