[cfe-dev] Spell Correction Efficiency

Sat Jan 15 08:31:38 PST 2011

Hello Doug,

*putting llvmdev in copy since they are concerned too*

I've finally got around to finish a working implementation of the typical
Levenshtein Distance with the diagonal optimization.

I've tested it against the original llvm implementation and checked it on a
set of ~18k by randomly generating a variation of each word and checking
that both implementations would return the same result... took me some time
to patch the diagonal version, but they now do.

I have left the older method in place, especially because I saw that it was
used in FileCheck and was a bit wary of any side-effect I might introduce
there.

I have thus created a new method: unsigned
StringRef::levenshtein_distance(StringRef other, unsigned maxEditDistance)
and replace the two uses of edit_distance in Sema\SemaLookup.cpp with it.

I have ditched the "AllowReplace" argument since it further complicated the
logic of an already ~150 lines method for no obvious gain (as far as I could
see) and I have also explicited the computations a bit by using named
temporaries.

Unfortunately I have only a Windows machine at hand for the moment and the
tests don't work that well... so I'd appreciate if anyone could check my
changes.

I have done two separate patches: one for llvm (stringref-path.diff) and one
for clang (sema-patch.diff), the llvm patch should be applied first of
course.

Since it is my first time proposing a patch, I hope I have done things right
:)

Regards,
Matthieu.

2010/12/1 Douglas Gregor <dgregor at apple.com>

>
> On Nov 21, 2010, at 11:58 AM, Matthieu Monrocq wrote:
>
> Hi Doug,
>
> 2010/11/9 Douglas Gregor <dgregor at apple.com>
>
>> Hi Matthieu,
>>
>> On Nov 8, 2010, at 3:05 PM, Matthieu Monrocq wrote:
>>
>> Hi Doug,
>>
>> We had a discussion a month ago I think about the Spell Correction
>> algorithm which was used in CLang, notably for auto-completion, based on the
>> Levenstein distance.
>>
>> It turns out I just came upon this link today:
>> http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-spelling-correction-1.html
>>
>> The idea is to use bigrams (2-letters parts of a word) to build an index
>> of the form (bigram > all words containing this bigram), then use this index
>> to retrieve all the words with enough bigrams in common with the word you
>> are currently trying to approximate.
>>
>> This drastically reduces the set of identifiers on which computing the
>> Levenstein distance, especially if we directly trim those which have a
>> length incompatible anyway from the beginning. Furthermore, the result may
>> be ordered by the number of common bigrams, and thus the Levenstein distance
>> may be computed first on the most promising candidates in order to have the
>> maximum edit distance drop quickly.
>>
>> I have implemented a quick algorithm in python (~70 lines, few comments
>> though), to test it out, and I find the results quite promising.
>>
>> I have used the following corpus: http://norvig.com/big.txt  (~6 MB)
>> which can be found on Peter Norvig's page:
>> http://norvig.com/spell-correct.html
>>
>> For example, looking for the word "neaer":
>>
>> Number of distinct Words from the corpus: 48911
>> Number of distinct Bigrams from the corpus: 1528
>> Number of results: [(1, 34), (2, 1133)]
>> Most promising results: (1, ['Kearney', 'NEAR', 'nearing', 'linear',
>> 'learner', 'uneasy', 'nearest', 'neat', 'lineage', 'leaned', 'Nearer',
>> 'Learned', 'Nearly', 'cleaner', 'cleaned', 'neatly', 'nearer', 'earned', 'n
>> eared', 'nearly', 'learned', 'nearby', 'Nearest', 'near', 'meanest',
>> 'earnest', 'Near', 'beneath', 'gleaned', 'Beneath', 'kneaded', 'weaned',
>> 'Freneau', 'guineas'])
>>
>> Or for the word "sppose":
>>
>> Number of distinct Words from the corpus: 48911
>> Number of distinct Bigrams from the corpus: 1528
>> Number of results: [(1, 14), (2, 136), (3, 991)]
>> Most promising results: (1, ['disposal', 'opposed', 'opposing', 'Oppose',
>> 'suppose', 'Dispose', 'dispose', 'apposed', 'disposed', 'oppose',
>> 'supposed', 'opposite', 'supposes', 'Suppose'])
>>
>>
>> Do you think the method worth investigating ?
>>
>>
>> Yes, I do. Typo correction performance is very important (and has
>> historically been very poor).
>>
>> Note that there are also more-efficient algorithms for computing the
>> Levenstein distance (e.g., d*min(m, n) rather than m*n). We should also
>> consider those.
>>
>> - Doug
>>
>
> The diagonal optimization you are talking about seems to bring a *2 to *3
> speed-up, which is quite nice, however the algorithm is tricky and I am not
> sure yet that I have the correct distance in every case. I am going to
> settle and implement some fuzzy testers this week to help find bugs in the
> various algorithms.
>
>
> Okay. Replacing our current implementation with one of the faster
> algorithms would be great, since that would not require any infrastructure
> changes at all. Just replace the existing StringRef::edit_distance and we're
> done.
>
> Anyway I've been exploring a few options regarding typo correction these
> past weeks and I now have a little ecosystem of various solutions, however I
> currently lack 3 things:
> - a real lexicon (mine is about ~18k words) --> I would appreciate if
> someone pointed me to a more representative lexicon, otherwise I may try and
> index a subset of the boost library and see what I can come up with
>
>
> I tend to use Cocoa.h on Mac OS, which is great for Objective-C. Boost is
> reasonable for C++ code.
>
> - a real benchmark: I have no idea how to "select" good (typo-ed) words for
> testing / benchmarking the typo correction algorithms. I have started by
> always using the same word (at a distance of 1) but it clearly isn't
> representative and my knowledge of statistics is rather lacking so if you
> have an approach to suggest...
>
>
> Random permutations of randomly-selected words from the lexicon would be
> fine, I think.
>
> - a good Trie implementation, or hints in how to implement one. A Trie
> seems to be the best index (quite compact yet ordered), however I only have
> a naive implementation at the moment.
>
> My best concurrent so far seems to be based on CS Language theory and
> involves extracting the set of words recognized by two automatons, one being
> the Trie, which recognize all the known words, and one being crafted for the
> word we are looking for, and recognize all words in a certain distance.
> However for maximal efficiency, this requires some structures to be
> precomputed, and this structure grows exponentially with the distance. It
> works for distance 1 to 3, is already 2.8MB for distance 4 (and takes about
> 70s to be computed on my machine, though I suppose it would be faster to
> deserialize a precomputed version), and I haven't dared trying further as it
> seems to be a geometrical serie with a factor of ~28 !
>
>
> The issue with using a trie is that you have to actually build the trie. It
> can't be something that is maintained online, because typo correction must
> be zero cost unless it is needed (since most compilation runs never need
> typo correction). Also, performance in the presence of precompiled headers
> is also important: we don't want to have to construct the trie when building
> the precompiled header, for example.
>
> Would it be considered a loss to limit ourselves to words within distance
> min(word.length()/4+1, 3) ?
>
>
> I could live with that. Clang already has limits on the maximum edit
> distance it will accept for a typo correction.
>
> It occured to me that one transformation (the transposition of adjacent
> characters) was quite a common typo. I have toyed with the
> Damereau-Levenshtein distance but this is quite harder to implement the
> optmizations with it. Without it a simple transposition costs 2
> transformations (one addition and one deletion) so in this case 3 is rather
> limiting, but still I don't know if looking much further is very
> interesting.
>
>
> Making transposition a 1-unit transformation would be an improvement. It
> can be added to the current implementation or come in with a new
> implementation.
>
> - Doug
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110115/f304f400/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sema-patch.diff
Type: application/octet-stream
Size: 3422 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110115/f304f400/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stringref-patch.diff
Type: application/octet-stream
Size: 15364 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110115/f304f400/attachment-0001.obj>