[cfe-dev] Spell Correction Efficiency

Mon Nov 8 13:05:08 PST 2010

Hi Doug,

We had a discussion a month ago I think about the Spell Correction algorithm
which was used in CLang, notably for auto-completion, based on the
Levenstein distance.

It turns out I just came upon this link today:
http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-spelling-correction-1.html

The idea is to use bigrams (2-letters parts of a word) to build an index of
the form (bigram > all words containing this bigram), then use this index to
retrieve all the words with enough bigrams in common with the word you are
currently trying to approximate.

This drastically reduces the set of identifiers on which computing the
Levenstein distance, especially if we directly trim those which have a
length incompatible anyway from the beginning. Furthermore, the result may
be ordered by the number of common bigrams, and thus the Levenstein distance
may be computed first on the most promising candidates in order to have the
maximum edit distance drop quickly.

I have implemented a quick algorithm in python (~70 lines, few comments
though), to test it out, and I find the results quite promising.

I have used the following corpus: http://norvig.com/big.txt  (~6 MB) which
can be found on Peter Norvig's page: http://norvig.com/spell-correct.html

For example, looking for the word "neaer":

Number of distinct Words from the corpus: 48911
Number of distinct Bigrams from the corpus: 1528
Number of results: [(1, 34), (2, 1133)]
Most promising results: (1, ['Kearney', 'NEAR', 'nearing', 'linear',
'learner', 'uneasy', 'nearest', 'neat', 'lineage', 'leaned', 'Nearer',
'Learned', 'Nearly', 'cleaner', 'cleaned', 'neatly', 'nearer', 'earned', 'n
eared', 'nearly', 'learned', 'nearby', 'Nearest', 'near', 'meanest',
'earnest', 'Near', 'beneath', 'gleaned', 'Beneath', 'kneaded', 'weaned',
'Freneau', 'guineas'])

Or for the word "sppose":

Number of distinct Words from the corpus: 48911
Number of distinct Bigrams from the corpus: 1528
Number of results: [(1, 14), (2, 136), (3, 991)]
Most promising results: (1, ['disposal', 'opposed', 'opposing', 'Oppose',
'suppose', 'Dispose', 'dispose', 'apposed', 'disposed', 'oppose',
'supposed', 'opposite', 'supposes', 'Suppose'])

Do you think the method worth investigating ?

Regards,
Matthieu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20101108/a3572fb8/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spellcorrection.py
Type: application/octet-stream
Size: 2022 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20101108/a3572fb8/attachment.obj>